Machine Learning

[R] MADLAD-400 - 4.6 / 2.6 trillion token dataset covering 419 languages + translation models up to 10.8B parameters (alien.top)

submitted 2 years ago by APaperADay@alien.top to c/machinelearning@academy.garden

Dataset: https://huggingface.co/datasets/allenai/MADLAD-400

"Note that the English subset in this version is missing 18% of documents that were included in the published analysis of the dataset. These documents will be incorporated in an update coming soon."

arXiv paper: https://arxiv.org/abs/2309.04662

Models: https://github.com/google-research/google-research/tree/master/madlad_400

u/jbochi's work on getting the models to run: https://www.reddit.com/r/LocalLLaMA/comments/17qt6m4/translate_to_and_from_400_languages_locally_with/

CatalyzeX_code_bot@alien.top 1 points 2 years ago

Found 2 relevant code implementations for "MADLAD-400: A Multilingual And Document-Level Large Audited Dataset".

If you have code to share with the community, please add it here 😊🙏

To opt out from receiving code links, DM me.