Found 2 relevant code implementations for "MADLAD-400: A Multilingual And Document-Level Large Audited Dataset".
If you have code to share with the community, please add it here.
--
To opt out from receiving code links, DM me.
Credit to u/jbochi for getting the models to run + telling Google to fix their model checkpoints.
thanks
Their use of "monolingual" and "multilingual" to describe the same dataset is unusual.
I get that they're probably trying to say "monolingual at the document-level", but the back and forth is quite confusing.
E.g.
"We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset"
"We use both supervised parallel data with a machine translation objective and the monolingual MADLAD-400 dataset"
"Through MADLAD-400, we introduce a highly multilingual, general web-domain, document-level text dataset"
Unless I am missing something obvious, these are either typos or poor wording decisions.