
(Reposting since the previous post was removed, I think because non-arXiv posts are only allowed on weekends; it's the weekend now.)

Materials discovery is critical but tough. New materials enable big innovations like batteries or LEDs. But there is a near-infinite number of combinations to try, and testing them experimentally is slow and expensive.

So scientists and engineers want to simulate and screen materials on computers first. This lets them check far more candidates before running real-world experiments. However, models have historically struggled to accurately predict whether a material is stable.

Researchers at DeepMind made a system called GNoME that uses graph neural networks and active learning to push past these limits.

GNoME models materials' crystal structures as graphs and predicts formation energies. It actively generates and filters candidates, evaluating the most promising with simulations. This expands its knowledge and improves predictions over multiple cycles.
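
To make the loop concrete, here's a minimal, runnable sketch of the generate-screen-verify cycle described above. The "GNN" and "DFT" functions below are trivial stand-ins (random numbers), not GNoME's actual models or pipeline; only the active-learning structure follows the summary.

```python
import random

def gnn_predict_energy(structure: str) -> float:
    """Stand-in for the graph-network formation-energy prediction (cheap)."""
    return random.random()

def dft_evaluate(structure: str) -> float:
    """Stand-in for the expensive first-principles (DFT) calculation."""
    return random.random()

def generate_candidates(n: int) -> list[str]:
    """Stand-in candidate generator (the paper uses symmetry-aware substitutions)."""
    return [f"candidate_{random.randint(0, 10**6)}" for _ in range(n)]

training_set: dict[str, float] = {}
for cycle in range(3):                      # each cycle expands the training data
    candidates = generate_candidates(1000)
    # Screen everything cheaply with the learned model, keep the most promising few...
    promising = sorted(candidates, key=gnn_predict_energy)[:10]
    # ...verify those with simulation, then fold the results back in for retraining.
    training_set.update({s: dft_evaluate(s) for s in promising})
print(len(training_set))  # roughly 30 newly labeled structures after 3 cycles
```

The real system retrains its networks on the newly simulated labels each round, which is what steadily improves the stability predictions over multiple cycles.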

The authors introduced new ways to generate derivative structures that respect symmetries, further diversifying discoveries.

The results:

  1. GNoME found 2.2 million new stable materials - roughly 800 years' worth of discoveries at previous rates.
  2. Of those, ~380k are the most stable, making them strong candidates for experimental validation.
  3. 736 have already been validated in external labs. These include a totally new diamond-like optical material and another that may be a superconductor.

Overall this demonstrates how scaling up deep learning can massively speed up materials innovation. As data and models improve together, this should accelerate progress on big problems that need new engineered materials.

TLDR: DeepMind made an AI system that uses graph neural networks to discover possible new materials. It found 2.2 million candidates, roughly 380k of which rank as the most stable. Over 700 have already been synthesized.

Full summary available here. Paper is here.

 

A new paper focuses on improving arithmetic skills in LLMs, primarily GPT-like models. It identifies three main challenges LLMs face with arithmetic:

  1. Complex Calculations: LLMs struggle with intricate arithmetic, especially with large numbers, because the intermediate steps have to be carried out internally.
  2. Length Limitations: LLMs can only reliably handle numbers within the range of lengths seen in their training data, which limits practicality.
  3. Integration with Language: Mixing arithmetic and natural language data is difficult because the two have different surface formats, which leads to conflicting position-dependent representations.

To address these challenges, the article introduces techniques to enhance multiplication:

  • Padding: Number factors are padded to a fixed 15-digit length, ensuring uniformity and position invariance.
  • Reordering: The digits of the product are written in reverse order (least significant digit first) to align with the natural progression of multiplication. A small formatting sketch follows this list.
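
Here's a rough, runnable sketch of what that data formatting could look like. The zero-padding choice and the exact string layout are my assumptions for illustration; only the fixed 15-digit width and the reversed product come from the summary.

```python
def format_multiplication_example(a: int, b: int, width: int = 15) -> str:
    """Build one training string: fixed-width factors, digit-reversed product."""
    a_str = str(a).zfill(width)   # pad factor to a fixed 15-digit length
    b_str = str(b).zfill(width)   # pad factor to a fixed 15-digit length
    product = str(a * b)[::-1]    # reverse the product (least significant digit first)
    return f"{a_str} * {b_str} = {product}"

print(format_multiplication_example(1234, 5678))
# 000000000001234 * 000000000005678 = 2566007   (7,006,652 reversed)
```

Writing the product least-significant-digit first matches the order in which long multiplication actually produces digits, so each generated digit depends only on information the model has already worked out.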

The outcomes are impressive. In testing, their approach achieves 99% accuracy in calculating products for numbers up to 12 digits. By comparison, simply asking GPT-4 to multiply two 4-digit numbers yields less than 1% accuracy.

To overcome length limitations, the paper explores data formats and positional encodings, including random spacing and alternative encodings. These changes let LLMs generalize addition to numbers with more digits than they saw in training.
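
For a sense of what "random spacing" could mean in practice, here's an illustrative sketch: spaces are inserted at random positions inside the operands so the model can't tie a digit's meaning to its absolute position. The probability and placement here are assumptions, not the paper's exact recipe.

```python
import random

def random_space(number: int, p: float = 0.3) -> str:
    """Return the number's digits with spaces randomly inserted between them."""
    out = []
    for digit in str(number):
        out.append(digit)
        if random.random() < p:
            out.append(" ")
    return "".join(out).strip()

print(f"{random_space(123456)} + {random_space(7890)} =")
# e.g. "12 3456 + 78 90 ="  (spacing varies per sample)
```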

The article also addresses integrating arithmetic and language data by randomizing formats and using alternative positional encodings.

TLDR: As the paper title says, positional description matters for transformer arithmetic.

Full summary is here. Paper is here.

 

[–] Successful-Western27@alien.top 1 points 11 months ago (1 children)

Those are just **four** links **I myself** have posted and they have ~75 comments on them. I pointed that out in the original comment but you skipped over that.

[–] Successful-Western27@alien.top 1 points 11 months ago

The premise of this post is so weird... we have TONS of technical convos in this sub

 

When generating videos from text prompts, directly mapping language to high-res video tends to produce inconsistent, blurry results. The high dimensionality of video overwhelms the models.

Researchers at Meta took a different approach - first generate a high-quality image from the text, then generate a video conditioned on both image and text.

The image acts like a "starting point" that the model can imagine moving over time based on the text prompt. This stronger conditioning signal produces way better videos.

They built a model called Emu Video using diffusion models. It sets a new SOTA for text-to-video generation:

  • "In human evaluations, our generated videos are strongly preferred in quality compared to all prior work– 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video."
  • "Our factorizing approach naturally lends itself to animating images based on a user’s text prompt, where our generations are preferred 96% over prior work."

The key was "factorizing" into image and then video generation.

Being able to condition on both text AND a generated image makes the video task much easier. The model just has to imagine how to move the image, instead of hallucinating everything.
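
A minimal, runnable sketch of that factorization is below. The two "models" are placeholder stubs that return dummy arrays, not Emu Video's real diffusion models or API; only the two-stage structure mirrors the paper.

```python
import numpy as np

def text_to_image(prompt: str) -> np.ndarray:
    """Stub for the first-stage model: text -> a single image."""
    return np.random.rand(512, 512, 3)           # placeholder 512x512 RGB image

def image_text_to_video(image: np.ndarray, prompt: str, frames: int = 16) -> np.ndarray:
    """Stub for the second-stage model: (image, text) -> video frames."""
    # Conditioning on the generated image means this stage only has to decide
    # how the fixed appearance moves, not invent everything from text alone.
    return np.stack([image] * frames)             # placeholder static clip

def generate_video(prompt: str) -> np.ndarray:
    first_frame = text_to_image(prompt)                  # stage 1: text -> image
    return image_text_to_video(first_frame, prompt)      # stage 2: image + text -> video

clip = generate_video("a corgi surfing a wave at sunset")
print(clip.shape)  # (16, 512, 512, 3)
```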

They can also animate user-uploaded images by providing the image as conditioning. Again, reported to be way better than previous techniques.

It's cool to see research pushing text-to-video generation forward. Emu Video shows how stronger conditioning through images sets a new quality bar. This is a nice complement to the Emu Edit model they released as well.

TLDR: By first generating an image conditioned on text, then generating video conditioned on both image and text, you can get better video generation.

Full summary is here. Paper site is here.

 

Researchers at Meta AI announced Emu Edit today. It can edit images precisely based on text instructions. It's a big advance for "instructable" image editing.

Existing systems struggle to interpret instructions correctly - making imprecise edits or changing the wrong parts of images. Emu Edit tackles this through multi-task training.

They trained it on 16 diverse image editing and vision tasks like object removal, style transfer, segmentation etc.

Emu Edit learns unique "task embeddings" that guide it towards the right kind of edit for a given instruction, e.g. a "texture change" vs. an "object removal".
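
Here's a small, hedged PyTorch sketch of the task-embedding idea: one learned vector per task gets injected into the conditioning. The class, dimensions, and additive injection are illustrative assumptions; Emu Edit's actual diffusion architecture isn't reproduced here.

```python
import torch
import torch.nn as nn

NUM_TASKS = 16   # the paper trains on 16 editing/vision tasks
COND_DIM = 768   # assumed conditioning width, purely for illustration

class TaskConditioner(nn.Module):
    def __init__(self, num_tasks: int = NUM_TASKS, dim: int = COND_DIM):
        super().__init__()
        self.task_embedding = nn.Embedding(num_tasks, dim)  # one learned vector per task

    def forward(self, text_features: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # Add the task vector to the encoded instruction so the model knows
        # whether "make it wooden" means a texture change, an object swap, etc.
        return text_features + self.task_embedding(task_id).unsqueeze(1)

cond = TaskConditioner()
text_feats = torch.randn(2, 77, COND_DIM)   # a batch of 2 encoded instructions
task_ids = torch.tensor([3, 7])             # which task each instruction belongs to
print(cond(text_feats, task_ids).shape)     # torch.Size([2, 77, 768])
```

In a setup like this, adapting to a new task means optimizing only a fresh task vector while the rest of the model stays frozen, which matches the few-example adaptation described below.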

In evaluations, Emu Edit significantly outperformed prior systems like InstructPix2Pix on following instructions faithfully while preserving unrelated image regions.

With just a few examples, it can adapt to wholly new tasks like image inpainting by updating the task embedding rather than full retraining.

There's still room for improvement on complex instructions. But Emu Edit demonstrates how multi-task training can majorly boost AI editing abilities. It's now much closer to human-level performance on translating natural language to precise visual edits.

TLDR: Emu Edit uses multi-task training on diverse edits/vision tasks and task embeddings to achieve big improvements in instruction-based image editing fidelity.

Full summary is here. Paper here.

Hey this is really cool. Would love to hear if you felt my writeup was good or if there's anything I can improve/change :)

 

Enabling AI to navigate and interact with smartphone UIs is hard, requiring a model that goes beyond mere text processing to handle intricate visual and interactive tasks. A new paper proposes MM-Navigator, an agent based on GPT-4V that can use an iPhone and make purchases on the Amazon app.

The agent can "understand" and interact with smartphone interfaces in a much more human-like manner than previous attempts.

The key innovation lies in GPT-4V's ability to process both text and image inputs. The agent takes a user's text instructions and the current screen image, then outputs a description of the next action, including precise screen locations. The researchers improved interaction accuracy by adding numeric tags to interactive elements on the screen, which GPT-4V references to indicate specific actions.
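
Below is a minimal sketch of that interaction loop. `query_gpt4v` is a stub standing in for the real multimodal model call, and the tag format and parsing are illustrative assumptions rather than the paper's exact implementation.

```python
import re

def query_gpt4v(tagged_screenshot: bytes, instruction: str) -> str:
    """Stub for the GPT-4V call; returns a natural-language action description."""
    return "Tap element [12] to open the search bar."

def parse_action(model_output: str) -> int | None:
    """Pull out the numeric tag of the element the model wants to act on."""
    match = re.search(r"\[(\d+)\]", model_output)
    return int(match.group(1)) if match else None

def step(tagged_screenshot: bytes, instruction: str) -> int | None:
    # 1. The screenshot arrives with numeric tags drawn on interactive elements.
    # 2. The model describes the next action, referencing one of those tags.
    # 3. The agent maps the tag back to screen coordinates and executes the tap.
    return parse_action(query_gpt4v(tagged_screenshot, instruction))

print(step(b"<png bytes>", "Buy a milk frother under $30 on Amazon"))  # -> 12
```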

Testing on iOS and Android datasets showed promising results. GPT-4V's actions were correct 75% of the time for iOS screens, a notable achievement in visual grounding. A standout example was its successful navigation through various apps to purchase a milk frother on Amazon within a set budget.

There are limitations:

  1. False negatives often arose from dataset or annotation issues. In some cases, GPT-4V's predictions are correct but are marked as incorrect due to inaccuracies in Set-of-Mark annotation parsing or because the dataset annotations are imperfect.
  2. True negatives highlighted limitations in the model's zero-shot testing approach. Without examples to guide its understanding of user action patterns, the model tends to prefer clicking over scrolling, leading to decisions that don't align with typical human actions.

If these limitations can be reduced, I could see this being useful for automating QA testing or assisting individuals with disabilities. This research underscores the complexities of developing AI for such sophisticated tasks and emphasizes the importance of accurate data and adaptable testing methods.

TLDR: MM-Navigator is an agent that can navigate a smartphone, combining text and image processing to interact with GUIs. Promising but still has plenty of flaws.

Full summary here. Paper is here.

 

Urban planning is tricky - governments push top-down changes while locals want bottom-up ideas. It's hard to find compromises that make everyone happier.

A new research paper proposes using Multi-Agent Reinforcement Learning (MARL) to vote on land use. Some agents represent government officials, others represent residents.

The AI is trained to balance competing interests. It learns to optimize for "consensus rewards" that keep all sides content, acting like an impartial mediator to find win-win solutions.
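
For intuition, here's a hedged sketch of what a "consensus reward" can look like: combine the stakeholders' scores so a plan only does well when nobody is left badly off. The paper's actual reward formulation differs; this just illustrates the balancing idea.

```python
import numpy as np

def consensus_reward(official_score: float, resident_scores: list[float],
                     balance: float = 0.5) -> float:
    """Mix the average outcome with the worst-off group's outcome."""
    scores = np.array([official_score] + list(resident_scores))
    # A plan that maximizes one side at the others' expense drags down the min
    # term, so the planner is pushed toward win-win compromises.
    return balance * scores.mean() + (1 - balance) * scores.min()

print(consensus_reward(0.8, [0.6, 0.7, 0.4]))  # 0.5125
```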

Testing on a real neighborhood showed the AI model:

  • Created more sustainable land use per city goals
  • Improved the variety of housing/shops to liven up the area
  • Made the end results more fair for lower/middle/upper income folks

There are more details on how the model was evaluated in the paper, including the different metrics used to score its results.

I like how they turned urban planning into a spatial graph that the AI can process. It seems like a pretty interesting approach, although there are some limits, like relying on a lot of land-parcel data that may be hard to find for larger communities.

TLDR: AI helps find compromises in urban planning that balance government and community interests more fairly.

Full summary is here. Paper is here.