this post was submitted on 12 Nov 2023

Machine Learning


Dataset: https://huggingface.co/datasets/allenai/MADLAD-400

"Note that the English subset in this version is missing 18% of documents that were included in the published analysis of the dataset. These documents will be incorporated in an update coming soon."

arXiv paper: https://arxiv.org/abs/2309.04662

Models: https://github.com/google-research/google-research/tree/master/madlad_400

u/jbochi's work on getting the models to run: https://www.reddit.com/r/LocalLLaMA/comments/17qt6m4/translate_to_and_from_400_languages_locally_with/


Found 2 relevant code implementations for "MADLAD-400: A Multilingual And Document-Level Large Audited Dataset".

If you have code to share with the community, please add it here 😊🙏

--

To opt out from receiving code links, DM me.

APaperADay@alien.top 1 points 1 year ago

Credit to u/jbochi for getting the models to run + telling Google to fix their model checkpoints.

jbochi@alien.top 1 points 1 year ago
maizeq@alien.top 1 points 1 year ago

Their use of "monolingual" and "multilingual" to describe the same dataset is unusual.

I get that they're probably trying to say "monolingual at the document-level", but the back and forth is quite confusing.

E.g.

"We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset"

"We use both supervised parallel data with a machine translation objective and the monolingual MADLAD-400 dataset"

"Through MADLAD-400, we introduce a highly multilingual, general web-domain, document-level text dataset"

Unless I am missing something obvious, these are either typos or poor wording decisions.
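
For what it's worth, the "monolingual at the document level" reading can be made concrete with a toy sketch. Everything here is hypothetical (the four-document `corpus`, the `lang` field, the helper names) and says nothing about how MADLAD-400 is actually stored. It only shows that a corpus can be multilingual overall while each document is in one language and none is paired with a translation, which is the usual sense of "monolingual" as opposed to "parallel" data in machine translation.

```python
from collections import Counter

# Hypothetical toy corpus: each document carries exactly one language tag.
corpus = [
    {"lang": "en", "text": "The cat sat on the mat."},
    {"lang": "pt", "text": "O gato sentou no tapete."},
    {"lang": "de", "text": "Die Katze sitzt auf der Matte."},
    {"lang": "en", "text": "Dogs bark at night."},
]


def is_document_level_monolingual(docs):
    """True when every document is labelled with a single language code."""
    return all(isinstance(d["lang"], str) for d in docs)


def languages_in(docs):
    """Count documents per language across the whole corpus."""
    return Counter(d["lang"] for d in docs)


def is_multilingual(docs):
    """True when the corpus as a whole covers more than one language."""
    return len(languages_in(docs)) > 1


def is_parallel(docs):
    """True when every record pairs a source text with its translation."""
    return all("source" in d and "target" in d for d in docs)


print("document-level monolingual:", is_document_level_monolingual(corpus))
print("multilingual corpus:", is_multilingual(corpus))
print("parallel data:", is_parallel(corpus))
```

Under that reading the three quotes are consistent: "monolingual" contrasts with the supervised parallel data, and "multilingual" describes the corpus as a whole.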