Hey folks!

I'm diving into a project and could really use some insights from this awesome community. I'm aiming to build a chatbot on top of a top-performing large language model that can chat in Turkish as smoothly as ChatGPT does in English.

Here's the deal: I'm super curious about adding Turkish to an existing open-source model. But, let's be real, adapting a model to a language with its own quirks, like Turkish, is not a walk in the park.

I'm also thinking about this interesting approach: wrapping the model in a translation layer that first translates the Turkish input to English, processes it in English, and then translates the response back to Turkish. What do you think about this?
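To make that concrete, here's a minimal sketch of the translate → generate → translate idea, assuming Hugging Face `transformers` pipelines. The translation checkpoint IDs are assumptions on my part; check the Hub for Turkish↔English models you trust:

```python
# Sketch of a translation wrapper around an English-strong LLM.
# Model IDs below are assumptions -- substitute your own checkpoints.
from transformers import pipeline

tr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")
en_to_tr = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-tr")
generate = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def turkish_chat(prompt_tr: str) -> str:
    # Turkish -> English
    prompt_en = tr_to_en(prompt_tr)[0]["translation_text"]
    # Generate in English, where the base model is strongest
    out = generate(prompt_en, max_new_tokens=256)[0]["generated_text"]
    reply_en = out[len(prompt_en):]  # strip the echoed prompt
    # English -> Turkish
    return en_to_tr(reply_en)[0]["translation_text"]

print(turkish_chat("Merhaba, bana kısaca kendini tanıtır mısın?"))
```

The obvious trade-offs: translation errors compound in both directions, and you pay the latency of two extra model calls per turn.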

So, I’m reaching out to see if anyone's tread this path before:

  1. If you've tried adding/fine-tuning a new language to a model, I’d love to hear about your adventure. What were the big challenges, or any "aha!" moments?
  2. Tech tips: Any advice, tools, or resources you know of would be awesome, especially about datasets or methods for this kind of linguistic gymnastics.
  3. Join Forces? If you’re working on something similar or know the ropes and are up for collaborating, let’s chat!

Based on my observations of Orca 2, Mistral, and Llama 2, they do have some kind of Turkish data in them, but their Turkish is not comparable to their English, ofc. Sometimes the outputs don't even make sense :(

Can’t wait to hear your thoughts or any advice you've got!

AutomataManifold@alien.top · 11 months ago

I know there are several projects for fine-tuning Llama for Chinese. I haven't worked on them, but it might be worth looking into what they did.

nefarkederki@alien.top · 11 months ago

Hey there! Thanks for the tip. I did some research and found this: https://github.com/ymcui/Chinese-LLaMA-Alpaca

So for those who are interested, here's my understanding of what needs to be done:

  1. If the Llama 2 tokenizer doesn't support your language well, you need to expand its vocabulary first (see the sketch after this list).
  2. You'll need data for further fine-tuning, and also for instruction tuning.
  3. And you will need money :D for the training.
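For step 1, here's a minimal sketch of what vocabulary expansion could look like, loosely following the Chinese-LLaMA-Alpaca approach (train a SentencePiece model on your target-language corpus, add its pieces to the base tokenizer, resize the embeddings). The corpus path, vocab size, and model IDs are assumptions:

```python
# Sketch only: extend the Llama 2 tokenizer with Turkish pieces,
# then resize the model embeddings so the new tokens can be trained.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer, LlamaForCausalLM

# 1. Train a SentencePiece BPE model on a Turkish corpus
#    ("turkish_corpus.txt" and vocab_size=20000 are assumptions).
spm.SentencePieceTrainer.train(
    input="turkish_corpus.txt",
    model_prefix="turkish_sp",
    vocab_size=20000,
    model_type="bpe",
)

# 2. Add the new Turkish pieces to the base Llama 2 tokenizer.
base_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tr_sp = sp_pb2.ModelProto()
tr_sp.ParseFromString(open("turkish_sp.model", "rb").read())

new_pieces = [p.piece for p in tr_sp.pieces]
base_tok.add_tokens(new_pieces)  # pieces already in the vocab are skipped
base_tok.save_pretrained("llama2-turkish-tokenizer")

# 3. Grow the embedding matrix to match. The new rows start out
#    randomly initialized, which is exactly why step 2 (further
#    training on Turkish text) is needed before the model is usable.
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.resize_token_embeddings(len(base_tok))
model.save_pretrained("llama2-turkish-base")
```

After this, the Chinese-LLaMA-Alpaca recipe continues with pre-training on raw target-language text and then instruction tuning, which is where the data (step 2) and the money (step 3) come in.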