vasileer

joined 1 year ago
[–] vasileer@alien.top 1 points 11 months ago

you need to have a knowledge base (with your legal or financial data) and use semantic search to retrieve the relevant pieces, then feed them as context to a model fine-tuned to follow instructions and answer from that context,

for small (7B) models there are OpenHermes-2.5-Mistral-7B, Mistral-7B-OpenOrca, dolphin-2.1-mistral-7b,

for bigger models there is Nous-Capybara-34B
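
A minimal sketch of the retrieval step, assuming sentence-transformers for the embeddings and whichever instruction-tuned model you pick from the list above for the answer (the documents and question here are made up):

```python
from sentence_transformers import SentenceTransformer, util

# toy "knowledge base" with a couple of made-up policy snippets
docs = [
    "Invoices above 10,000 EUR require approval from two managers.",
    "All contracts must be reviewed by the legal team before signing.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, convert_to_tensor=True)

question = "Who has to approve a 12,000 EUR invoice?"
q_emb = embedder.encode(question, convert_to_tensor=True)

best = util.cos_sim(q_emb, doc_emb).argmax().item()  # semantic search: most similar chunk

prompt = (
    "Answer the question using only the context below.\n"
    f"Context: {docs[best]}\n"
    f"Question: {question}\nAnswer:"
)
# pass `prompt` to the instruction-tuned model (e.g. OpenHermes-2.5-Mistral-7B)
```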

[–] vasileer@alien.top 1 points 11 months ago

file size, which impacts load time:

with load_in_4bit it downloads and parses the full-precision file (which is 4x bigger than the 4-bit quants if it is bfloat16, or 8x bigger if it is float32) and then quantizes on the fly,

with pre-quantized files it downloads only the quants, so expect a 4x to 8x faster load time for 4-bit quants
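
A rough sketch of the difference in transformers (repo names are just examples; the pre-quantized GPTQ path also needs the optimum/auto-gptq extras installed):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# on-the-fly: downloads the full bfloat16 weights (~14 GB for a 7B model),
# then quantizes to 4-bit while loading
model_otf = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# pre-quantized: downloads only the ~4 GB of 4-bit weights
model_pre = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GPTQ",
    device_map="auto",
)
```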

[–] vasileer@alien.top 1 points 11 months ago

are the finetuned models published somewhere?

[–] vasileer@alien.top 1 points 11 months ago

> it also hallucinates people

and Mistral doesn't?

keep in mind that the demo is for the 3B model, while the post is about the 7B one, which I expect to be way better

[–] vasileer@alien.top 1 points 11 months ago (6 children)

I tested the 3B model and it looks good, especially the multilingual part (demo https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2)

[–] vasileer@alien.top 1 points 11 months ago (2 children)

the question was whether multiple small models can beat a single big model while also keeping a speed advantage, and the answer is yes; an example of that is MoE (Mixture of Experts), which is a collection of small expert models packed inside a single big model,

https://huggingface.co/google/switch-c-2048 is one such example
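
Roughly what a switch-style MoE layer does internally (a toy PyTorch sketch, not the actual Switch Transformer code):

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    # several small FFN "experts" plus a router that sends each token to one expert
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        choice = scores.argmax(dim=-1)           # top-1 routing, as in Switch
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():                        # only the chosen expert runs for those tokens
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

moe = ToyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Only the selected expert's weights are used per token, which is how a model like switch-c-2048 can have a huge total parameter count while keeping per-token compute close to a small dense model.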

[–] vasileer@alien.top 1 points 11 months ago (4 children)

yes, this is done by Mixture of Experts (MoE)

and we already have examples of this:

coding - deepseek-coder-7B is better at coding than many 70B models

answering from context - llama2-7B is better than llama-2-13B on the openbookqa test

https://preview.redd.it/1gexvwd83i2c1.png?width=1000&format=png&auto=webp&s=cda1ee16000c2e89410091c172bf4756bc8a427b

etc.
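
A hedged sketch of the "specialists plus a router" idea at the whole-model level (model names are only examples, and the routing rule is deliberately naive):

```python
from transformers import pipeline

# one small coding specialist and one general instruction-tuned model
coder = pipeline("text-generation", model="deepseek-ai/deepseek-coder-1.3b-instruct")
chat = pipeline("text-generation", model="teknium/OpenHermes-2.5-Mistral-7B")

def answer(prompt: str) -> str:
    # naive keyword router; a real setup would use a classifier or embeddings
    code_words = ("def ", "class ", "python", "function", "compile", "bug")
    model = coder if any(w in prompt.lower() for w in code_words) else chat
    return model(prompt, max_new_tokens=128)[0]["generated_text"]

print(answer("Write a Python function that reverses a string."))
```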

[–] vasileer@alien.top 1 points 11 months ago (3 children)

> Multiple passes at lower learning rates aren't supposed to produce different results.

> Overfitting is not a technical challenge, it's a mathematical property which undeniably exists whenever the training data is smaller than the full problem domain and, simultaneously, the learning rate (importantly, multiplied by the number of epochs!) would result in a higher specialization ratio on the learned vs. unobserved data than would be expected based on the ratio of the learned to unobserved size.

> Basically, if you learn 1-digit addition but half your training examples have 1 as the left number and none have 5 as the left number, then your model will likely treat 5 and 1 the same (since it's so overtrained on examples with 1s).

GPT-4:

The statement contains several inaccuracies:

  1. Multiple passes at lower learning rates: It's not entirely true that multiple passes with lower learning rates will produce identical results. Different learning rates can lead to different convergence properties, and multiple passes with lower learning rates can help in fine-tuning the model and potentially avoid overfitting by making smaller, more precise updates to the weights.
  2. Overfitting as a mathematical property: Overfitting is indeed more of an empirical observation than a strict mathematical property. It is a phenomenon where a model learns the training data too well, including its noise and outliers, which harms its performance on unseen data. It's not strictly due to the size of the training data but rather the model's capacity to learn from it relative to its complexity.
  3. Learning rate multiplied by the number of epochs: The learning rate and the number of epochs are both factors in a model's training process, but their product is not a direct measure of specialization. Instead, it's the learning rate's influence on weight updates over time (across epochs) that can affect specialization. Moreover, a model's capacity and the regularization techniques applied also significantly influence overfitting.
  4. Example of learning 1 digit addition: The example given is somewhat simplistic and does not fully capture the complexities of overfitting. Overfitting would mean the model performs well on the training data (numbers with 1) but poorly on unseen data (numbers with 5). However, the example also suggests a sampling bias in the training data, which is a separate issue from overfitting. Sampling bias can lead to a model that doesn't generalize well because it hasn't been exposed to a representative range of the problem domain.

Overall, while the intention of the statement is to describe overfitting and the effects of learning rates, it conflates different concepts and could benefit from clearer differentiation between them.

[–] vasileer@alien.top 1 points 11 months ago (1 children)

what models did you try?

[–] vasileer@alien.top 1 points 11 months ago (1 children)

since the Mistral release there are (almost) no 13B models better than Mistral finetunes, and this can be seen on the Open LLM Leaderboard: first is Qwen-14B, second is a Mistral finetune (Intel/neural-chat), and Orca-13B comes 6th

https://preview.redd.it/ddmvw3un172c1.png?width=1525&format=png&auto=webp&s=d1fb52530c48ed74cfd915b273de7cc3c92e12b2

[–] vasileer@alien.top 1 points 11 months ago (3 children)

Mistral OpenHermes 2.5

[–] vasileer@alien.top 1 points 11 months ago (1 children)

2 ideas:

- use deepseek-coder-1.3b-instruct, not the base model

- check that you use the correct prompting template for the model (a sketch follows below)
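
A minimal sketch of the second point, assuming the deepseek-ai/deepseek-coder-1.3b-instruct repo and a recent transformers with apply_chat_template (which reads the template shipped with the model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Write a quicksort function in Python."}]
# formats the prompt exactly the way the instruct model was fine-tuned to expect
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```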

 

I am not able to reproduce the 2x speedup that I read others achieved with a 70B model and a 1B draft model, when using a 7B model with a 1B draft model

model: dolphin-llama2-7b.Q4_K_S.gguf

model draft: tinyllama-1.1b-1t-openorca.Q4_K_S.gguf

here are the results

-------------------------

main -m ../models/tinyllama-1.1b-1t-openorca.Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

llama_print_timings: load time = 278.28 ms

llama_print_timings: sample time = 110.42 ms / 400 runs ( 0.28 ms per token, 3622.56 tokens per second)

llama_print_timings: prompt eval time = 641.88 ms / 20 tokens ( 32.09 ms per token, 31.16 tokens per second)

llama_print_timings: eval time = 15281.09 ms / 399 runs ( 38.30 ms per token, 26.11 tokens per second)

llama_print_timings: total time = 16221.94 ms

Log end

------------------

main -m ../models/dolphin-llama2-7b.Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

llama_print_timings: load time = 1429.41 ms

llama_print_timings: sample time = 108.39 ms / 400 runs ( 0.27 ms per token, 3690.24 tokens per second)

llama_print_timings: prompt eval time = 3139.63 ms / 20 tokens ( 156.98 ms per token, 6.37 tokens per second)

llama_print_timings: eval time = 79913.13 ms / 399 runs ( 200.28 ms per token, 4.99 tokens per second)

llama_print_timings: total time = 83348.57 ms

Log end

------------------

speculative -m ../models/dolphin-llama2-7b.Q4_K_S.gguf -md ../models/tinyllama-1.1b-1t-openorca.Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

encoded 19 tokens in 3.412 seconds, speed: 5.568 t/s

decoded 402 tokens in 115.028 seconds, speed: 3.495 t/s

n_draft = 16

n_predict = 402

n_drafted = 301

n_accept = 198

accept = 65.781%

draft:

llama_print_timings: load time = 213.69 ms

llama_print_timings: sample time = 1597.32 ms / 1 runs ( 1597.32 ms per token, 0.63 tokens per second)

llama_print_timings: prompt eval time = 421.24 ms / 19 tokens ( 22.17 ms per token, 45.11 tokens per second)

llama_print_timings: eval time = 19697.97 ms / 505 runs ( 39.01 ms per token, 25.64 tokens per second)

llama_print_timings: total time = 118450.52 ms

target:

llama_print_timings: load time = 1342.55 ms

llama_print_timings: sample time = 107.07 ms / 402 runs ( 0.27 ms per token, 3754.48 tokens per second)

llama_print_timings: prompt eval time = 78435.09 ms / 431 tokens ( 181.98 ms per token, 5.49 tokens per second)

llama_print_timings: eval time = 17902.46 ms / 92 runs ( 194.59 ms per token, 5.14 tokens per second)

llama_print_timings: total time = 117198.22 ms
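
My reading of the numbers above (rounded, so this is only a rough reconstruction): verifying the drafted tokens on the 7B target costs almost as much per token as plain decoding on this machine, so the draft model only adds overhead:

```python
# numbers taken from the speculative logs above, rounded
draft_eval    = 505 * 0.0390   # draft model: 505 evals at ~39 ms/token   -> ~19.7 s
target_verify = 431 * 0.1820   # target "prompt eval" (verification) at ~182 ms/token -> ~78.4 s
target_decode =  92 * 0.1946   # target single-token evals at ~195 ms/token -> ~17.9 s

total = draft_eval + target_verify + target_decode
print(total, 402 / total)      # ~116 s for 402 tokens -> ~3.5 t/s, matching the log

# plain 7B decoding was ~200 ms/token, i.e. ~80 s for ~400 tokens (~5 t/s),
# so speculative decoding is a net loss here: batched verification on the 7B target
# runs at only ~5.5-6.4 t/s, barely faster than its ~5 t/s normal decoding speed
```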
