jbochi

[–] jbochi@alien.top 1 points 10 months ago

Good question. The ALMA paper compares itself against NLLB and GPT-3.5, and the 13B model barely surpasses GPT-3.5. MADLAD-400 probably beats GPT-3.5 only on lower-resource languages.

[–] jbochi@alien.top 1 points 10 months ago

Got it. Can you please share the full prompt?

[–] jbochi@alien.top 1 points 10 months ago

Thanks!

- I'm not familiar with ALMA, but it seems to be similar to MADLAD-400. Both are smaller than NLLB-54B, but competitive with it. Because ALMA is an LLM and not a seq2seq model with encoder-decoder cross-attention, I'd guess it's faster.
- You can translate up to 128 tokens at a time (see the sketch after this list).
- You can only specify the target language, not the source language.
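A minimal sketch of working within that limit, assuming the Hugging Face conversion of the checkpoint (the `jbochi/madlad400-3b-mt` id and the naive ". " sentence split are illustrative assumptions):

```python
# Greedily pack sentences so each chunk stays within the 128-token limit.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("jbochi/madlad400-3b-mt")

def split_for_translation(text: str, limit: int = 128) -> list[str]:
    chunks, current = [], ""
    for sentence in text.split(". "):  # naive sentence splitting, for illustration
        candidate = (current + " " + sentence).strip()
        if len(tokenizer(candidate).input_ids) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # note: a single sentence longer than the limit still becomes its own chunk
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```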

[–] jbochi@alien.top 1 points 10 months ago (2 children)

How are you running it? Did you prepend a "<2xx>" token for the target language? For example, "<2fr> hello" will translate "hello" to French. If you are using this space, you can select the target language in the dropdown.
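Here's a minimal sketch of that prompt format with transformers (the `jbochi/madlad400-3b-mt` id refers to the safetensors conversion; treat the exact id as an assumption):

```python
# Translate to French by prepending the <2fr> target-language token.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("jbochi/madlad400-3b-mt")
tokenizer = T5Tokenizer.from_pretrained("jbochi/madlad400-3b-mt")

inputs = tokenizer("<2fr> hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "bonjour"
```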

[–] jbochi@alien.top 1 points 10 months ago

One approach is to install Rust and candle, and then run one of the cargo commands from here.

You can also try oobabooga, which has a one-click installer and should support this model, but I haven't tested it.

[–] jbochi@alien.top 1 points 10 months ago (4 children)

Sorry, but what is not working?

[–] jbochi@alien.top 1 points 10 months ago

> [...] languages, such as en, cn, jp. If there are multiple combination versions, I will use it to develop my own translation application.

Check the OPUS models by Helsinki-NLP: https://huggingface.co/Helsinki-NLP?sort_models=downloads#models
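For a pair-specific model, usage is close to a one-liner with the transformers pipeline; en-zh is just an example pair (other pairs follow the same `opus-mt-{src}-{tgt}` naming):

```python
# Load a single-pair OPUS-MT model and translate English to Chinese.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
print(translator("Hello, world!")[0]["translation_text"])
```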

[–] jbochi@alien.top 1 points 10 months ago

Sorry to be pedantic, but the translation models they released are not LLMs. They are T5 seq2seq models with encoder-decoder cross-attention, as in the original Transformer paper. They also released a language model that's a decoder-only T5. They tried few-shot learning with it, but it performs much worse than the MT models.

I think that the first multilingual Neural Machine Translation model is from 2016: https://arxiv.org/abs/1611.04558. However, specialized models for pairs of languages are still popular. For example: https://huggingface.co/Helsinki-NLP/opus-mt-de-en

[–] jbochi@alien.top 1 points 10 months ago

The MADLAD-400 paper has a bunch of comparisons with NLLB. MADLAD beats NLLB on some benchmarks, is quite close on others, and loses on some. But the largest MADLAD model is 5x smaller than the original NLLB, and it supports more than 2x as many languages.

 

Google released T5X checkpoints for MADLAD-400 a couple of months ago, but nobody could figure out how to run them. It turns out the vocabulary was wrong; they uploaded the correct one last week.

I've converted the models to the safetensors format and created this space where you can try the smaller model.

I also published quantized GGUF weights you can use with candle. It decodes at ~15 tokens/s on an M2 Mac.

It seems that NLLB is the most popular machine translation model right now, but its license only allows non-commercial usage. MADLAD-400 is CC BY 4.0.
