this post was submitted on 12 Nov 2023
Machine Learning
Their use of "monolingual" and "multilingual" to describe the same dataset is unusual.
I get that they're probably trying to say "monolingual at the document level", but the back-and-forth is quite confusing.
E.g.
"We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset"
"We use both supervised parallel data with a machine translation objective and the monolingual MADLAD-400 dataset"
"Through MADLAD-400, we introduce a highly multilingual, general web-domain, document-level text dataset"
Unless I am missing something obvious, these are either typos or poor wording decisions.