this post was submitted on 17 Nov 2023

Machine Learning

Researchers at Meta AI announced Emu Edit today. It can edit images precisely based on text instructions. It's a big advance for "instructable" image editing.

Existing systems often misinterpret instructions, making imprecise edits or changing the wrong parts of an image. Emu Edit tackles this through multi-task training.

They trained it on 16 diverse image-editing and vision tasks, such as object removal, style transfer, and segmentation.

Emu Edit learns a distinct "task embedding" for each task, which steers the model toward the right kind of edit for a given instruction, for example a "texture change" versus an "object removal".
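
To make the idea concrete, here is a minimal, hypothetical sketch of how a learned per-task embedding could be combined with the instruction text to condition an editing model. This is not Meta's implementation; the text encoder, denoiser, dimensions, and names are all assumptions for illustration.

```python
# Minimal sketch (assumption, not Meta's code): condition an image-editing
# denoiser on both the instruction text and a learned per-task embedding.
# `text_encoder` and `denoiser` are placeholder modules.
import torch
import torch.nn as nn

NUM_TASKS = 16   # the paper trains on 16 editing/vision tasks
COND_DIM = 768   # assumed conditioning width shared by text and task features

class TaskConditionedEditor(nn.Module):
    def __init__(self, text_encoder: nn.Module, denoiser: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder                        # instruction encoder
        self.task_embeddings = nn.Embedding(NUM_TASKS, COND_DIM)
        self.denoiser = denoiser                                # editing backbone

    def forward(self, noisy_image, timestep, instruction_tokens, task_id):
        text_feats = self.text_encoder(instruction_tokens)      # (B, T, COND_DIM)
        task_feat = self.task_embeddings(task_id).unsqueeze(1)  # (B, 1, COND_DIM)
        # Prepend the task embedding so the denoiser sees both "what to do"
        # (the instruction) and "what kind of edit this is" (the task).
        cond = torch.cat([task_feat, text_feats], dim=1)
        return self.denoiser(noisy_image, timestep, cond)
```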

In evaluations, Emu Edit significantly outperformed prior systems like InstructPix2Pix on following instructions faithfully while preserving unrelated image regions.

With just a few examples, it can adapt to entirely new tasks such as image inpainting by learning a new task embedding rather than retraining the full model.
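
Here is a hedged sketch of what that few-shot adaptation could look like, assuming the editing model accepts a task-embedding vector directly and that a simple pixel reconstruction loss is used; the `model` interface and loss are placeholders, not the paper's actual training objective.

```python
# Hedged sketch of few-shot adaptation (assumed interface, not the paper's
# training code): freeze the editor and optimize only a new task embedding
# on a handful of (source, instruction, target) examples.
import torch

def learn_new_task_embedding(model, examples, cond_dim=768, steps=200, lr=1e-2):
    new_task = torch.nn.Parameter(torch.empty(1, cond_dim))
    torch.nn.init.normal_(new_task, std=0.02)
    for p in model.parameters():
        p.requires_grad_(False)          # the editor itself stays frozen
    optimizer = torch.optim.Adam([new_task], lr=lr)
    for _ in range(steps):
        for source_image, instruction, target_image in examples:
            optimizer.zero_grad()
            # `task_embedding=` is a hypothetical hook for injecting the vector.
            prediction = model(source_image, instruction, task_embedding=new_task)
            loss = torch.nn.functional.mse_loss(prediction, target_image)
            loss.backward()
            optimizer.step()
    return new_task.detach()
```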

There's still room for improvement on complex instructions, but Emu Edit demonstrates how multi-task training can substantially improve AI editing abilities, bringing it much closer to human-level performance at translating natural language into precise visual edits.

TLDR: Emu Edit combines multi-task training across diverse editing and vision tasks with learned task embeddings to achieve big improvements in instruction-based image editing fidelity.

Full summary is here. Paper here.

top 3 comments
crantob@alien.top

Looks like too much work to recreate easily.

Xanian123@alien.top

I was just talking to a friend yesterday about how AI images won't take off unless tweaks can be done using natural language. If the paper's claims are true, this is going to be revolutionary.

evanthebouncy@alien.top

We'll need a finer definition of what counts as an edit.

Currently, everything from flipping an image vertically, to swapping out sub-regions, to truly semantic edits like "make the person stand up" gets lumped together and called an "edit".

Something like the tiers of autonomous driving will be needed: tier-1 edits all the way up to tier-5.

The proposed method is around tier 2: capable of swapping out sub-regions via style transfer, but not of meaningfully changing the structure of the scene, e.g. "make the man stand up".