this post was submitted on 23 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

I'm planning to fine-tune a Mistral model with my own dataset (a full fine-tune, not LoRA).

The dataset is not that large, around 120 MB in JSONL format.

My questions are:

  1. Will I be able to fine-tune the model on four 40GB A100s? (Rough VRAM math is sketched below the list.)
  2. If not, is using RunPod the easiest approach?
  3. I'm trying to instill knowledge in a certain language, for a field in which the model doesn't have sufficient knowledge in that language. Is fine-tuning my only option? RAG is not viable in my case.
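Regarding question 1, here is a rough back-of-the-envelope VRAM estimate. The figures are assumptions, not from this thread: ~7.2B parameters for Mistral 7B, bf16 weights and gradients, and 8-bit Adam optimizer states; activations and framework overhead are not counted.

```python
# Back-of-the-envelope VRAM estimate for a full fine-tune of a ~7B model.
# Assumed: bf16 weights and gradients (2 bytes each), 8-bit Adam optimizer
# states (~2 bytes/param for the two moments). Activations not included.
params = 7.2e9
bytes_per_param = 2 + 2 + 2          # weights + gradients + 8-bit optimizer states
total_gb = params * bytes_per_param / 1024**3
print(f"~{total_gb:.0f} GB before activations")
# Roughly 40 GB, so the states would need to be sharded (e.g. DeepSpeed ZeRO
# or FSDP) across the four 40GB cards to leave room for 4096-token activations.
```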

Thanks in advance!

[–] thereisonlythedance@alien.top 1 points 2 years ago (3 children)
  1. I think you should be okay. I've been doing full fine-tunes on Mistral using either two 80GB A100s with a total batch size of 8, or three 80GB A100s with a batch size of 12. This is with the 8-bit Adam optimizer and a max sequence length of 4096, with lots of long samples. I think a full fine-tune should be possible on a single 80GB A100, but I haven't tried. This is without DeepSpeed; I've done a few DeepSpeed runs and that significantly lowers VRAM usage. (A sketch of this kind of setup is below the list.)
  2. RunPod is what I've been using and it's straightforward.
  3. Instilling new knowledge is possible with a full fine-tune, much less so with a LoRA. I can't comment on languages; people seem to have had mixed success creating multilingual models.
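The comment doesn't say which training framework is being used, so the following is only a minimal sketch of a comparable setup with the Hugging Face Trainer. The model ID, file names, batch size, and DeepSpeed config path are illustrative assumptions, not details from the thread.

```python
# Minimal full fine-tune sketch (assumed setup, not the commenter's actual script):
# Hugging Face Trainer + 8-bit Adam + 4096-token sequences, optional DeepSpeed.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "mistralai/Mistral-7B-v0.1"          # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# JSONL dataset with a "text" field per line (field name is an assumption).
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=4096),
    remove_columns=dataset.column_names,
)

args = TrainingArguments(
    output_dir="mistral-full-ft",
    num_train_epochs=5,
    per_device_train_batch_size=4,              # x 2-3 GPUs matches the batch sizes above
    bf16=True,
    optim="adamw_bnb_8bit",                     # 8-bit Adam via bitsandbytes
    gradient_checkpointing=True,
    deepspeed="ds_zero2.json",                  # optional; lowers VRAM as noted above
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Multi-GPU runs like the ones described would typically be launched with torchrun or the deepspeed launcher rather than plain python.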
[–] herozorro@alien.top 1 points 2 years ago (2 children)

How much does it cost to do these fine-tunes on RunPod? How much compute time is used?

Like $1000+?

[–] thereisonlythedance@alien.top 1 points 2 years ago

An A100 (80GB) costs between $1.70 and $1.99 per hour on RunPod. How long you need depends on dataset size, sequence length, the optimizer you choose, and how many epochs you train for. I can get a full fine-tune of Mistral (5 epochs) with the 8-bit Adam optimizer done on my small (1,300 samples) but long-sequence (most samples are 4,096 tokens) dataset in around an hour with 3x A100s.
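For scale, plugging the figures quoted in this comment into a quick estimate (using the upper end of the hourly range and ignoring any setup time):

```python
# Rough cost of one run using the figures quoted above:
# 3 x A100 80GB at ~$1.99/hr each, ~1 hour of training.
gpus, rate_per_hour, hours = 3, 1.99, 1.0
print(f"~${gpus * rate_per_hour * hours:.2f} per run")  # about $6, nowhere near $1000
```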
