FullOf_Bad_Ideas

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (1 children)
  1. Grant of Copyright License. Subject to the terms and conditions of this License, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Complementary Material, the Model, and Derivatives of the Model.

I really really enjoy seeing perpetual irrevocable licenses.

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago

Remember to try the 8-bit cache if you haven't yet, it should get you to 5.5k tokens of context length.

You can get around 10-20k context length with 4bpw yi-34b 200k quants on a single 24GB card.
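For a rough sanity check, here's a back-of-the-envelope sketch of that budget. The Yi-34B config values (60 layers, 8 KV heads, head dim 128, ~34.4B params) are from the published config; the flat 4.0 bits/weight and the 2 GB of runtime overhead are just my guesses.

```python
# Back-of-the-envelope VRAM budget for a 4bpw Yi-34B 200K quant on a 24 GB card.
# Yi-34B config (from the published config): 60 layers, 8 KV heads (GQA), head_dim 128,
# ~34.4B params. The overhead and bits/weight figures are rough guesses.
card_bytes = 24e9
n_params   = 34.4e9
bits_per_w = 4.0
overhead   = 2e9                     # CUDA context, activations, fragmentation (guess)
n_layers, n_kv_heads, head_dim = 60, 8, 128

weights_bytes = n_params * bits_per_w / 8
free_bytes    = card_bytes - weights_bytes - overhead

kv_per_token_fp16 = 2 * n_layers * n_kv_heads * head_dim * 2   # K and V, 2 bytes each
print(f"fp16 KV cache : ~{free_bytes / kv_per_token_fp16:,.0f} tokens of context")
print(f"8-bit KV cache: ~{free_bytes / (kv_per_token_fp16 / 2):,.0f} tokens of context")
```

That lands in the same ballpark as the 10-20k figure; real numbers will be a bit lower once the quant's actual overhead is counted.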

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (3 children)

I've only seen merging of same-upstream-pretrained-model-at-same-size.

Not anymore.

Here's a merge of llama 2 13B and llama 1 33B https://huggingface.co/chargoddard/llama2-22b

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago

Have you checked out DeepSeek Coder Instruct 33B already? I don't know about its knowledge of PyTorch, but it's pretty much the best local coding model you can run, so it's your best shot.

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (1 children)

Instead, it uses an Amazon platform known as Bedrock, which connects several A.I. systems together, including Amazon’s own Titan as well as ones developed by Anthropic and META.

It's a llama! :D I wonder how they can comply with the Llama license, I think they have more than 700M customers.

Good to see more competitors at least. Enterprise office people are totally in MS hands, so that's not an area where open-source end-to-end solutions have much chance of competing; the only way to get them there is if a big corp like Amazon adopts them in its infrastructure for a product like this.

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago

Are you talking about base Yi-34B or a fine-tuned one? The base model will be hard to use but will score pretty high. Benchmarks are generally written with completion in mind, so they work really well on base models; instruct tuning may make a model much easier to work with, but not necessarily score higher on benchmarks.
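As a rough illustration of "written with completion in mind", here's a minimal sketch of scoring a multiple-choice question against a base model by comparing log-likelihoods of the answer letters, with no chat template involved. The model name and the question are placeholders, and this is a simplification of what real evaluation harnesses do.

```python
# Completion-style benchmark scoring: plain-text prompt, pick whichever answer letter
# the base model assigns the highest log-likelihood. Model and question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap for a base model like Yi-34B if you have the VRAM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    "Question: Which planet is closest to the Sun?\n"
    "A. Venus\nB. Mercury\nC. Mars\nD. Jupiter\n"
    "Answer:"
)

@torch.no_grad()
def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)      # predictions for tokens 1..L-1
    targets = full_ids[:, 1:]
    per_token = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_cont = full_ids.shape[1] - prompt_ids.shape[1]           # continuation token count
    return per_token[0, -n_cont:].sum().item()

scores = {c: continuation_logprob(prompt, f" {c}") for c in "ABCD"}
print(max(scores, key=scores.get), scores)
```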

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (1 children)

I can't corroborate those results for Pascal cards. They have very limited FP16 performance, usually 1:64 of FP32. Switching from a GTX 1080 to an RTX 3090 Ti got me around 10-20x gains in QLoRA training, keeping the exact same batch size and context length and changing only the calculations from fp16 to bf16.
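For reference, a minimal sketch (not a tuned recipe) of how that dtype choice might look with Hugging Face TrainingArguments, falling back to fp16 where bf16 isn't supported; the batch size and accumulation values are placeholders.

```python
# Pick the mixed-precision dtype for (Q)LoRA fine-tuning based on the GPU:
# Ampere (RTX 30xx) and newer support bf16, Pascal is stuck with slow fp16/fp32.
import torch
from transformers import TrainingArguments

use_cuda = torch.cuda.is_available()
use_bf16 = use_cuda and torch.cuda.is_bf16_supported()   # RTX 3090 Ti path

args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=1,    # kept identical across cards for a fair comparison
    gradient_accumulation_steps=16,
    bf16=use_bf16,
    fp16=use_cuda and not use_bf16,   # GTX 1080 path: works, but Pascal fp16 is ~1:64 of fp32
    logging_steps=10,
)
print("mixed-precision dtype:", "bf16" if use_bf16 else "fp16" if use_cuda else "none (cpu)")
```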

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago

Yeah, it will be the sum of the tokens that the next token is generated on. I don't know how often the KV cache is updated.
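A toy, shapes-only sketch of what that looks like during decoding (single head, no real model, just the cache bookkeeping):

```python
# Toy decode loop: the KV cache holds one key/value entry per token seen so far
# (per layer and per KV head in a real model; a single head here), and each new
# token is generated by attending over that whole cache.
import torch

head_dim, n_prompt, n_new = 64, 5, 3
k_cache = torch.zeros(n_prompt, head_dim)   # filled once while processing the prompt
v_cache = torch.zeros(n_prompt, head_dim)

for step in range(n_new):
    print(f"step {step}: next token is generated over {k_cache.shape[0]} cached tokens")
    k_new = torch.zeros(1, head_dim)        # K/V of the freshly generated token
    v_new = torch.zeros(1, head_dim)
    k_cache = torch.cat([k_cache, k_new])   # the cache grows by exactly one entry
    v_cache = torch.cat([v_cache, v_new])
```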

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago

Yeah, if that's the case, I can see GPT-4 requiring about 220-250B of loaded parameters to do token decoding.

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago

Here's the formula:

batch_size * seqlen * n_layers * n_kv_heads * (d_model / n_heads) * 2 (K and V) * 2 (bytes per float16)

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
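As a worked example, here's the same formula as a small function, plugged with Llama-2-7B's published config (4096 hidden dim, 32 heads, no GQA, 32 layers):

```python
# KV cache size from the formula above; config values are Llama-2-7B's published ones.
def kv_cache_bytes(batch_size, seqlen, d_model, n_heads, n_kv_heads, n_layers, bytes_per_el=2):
    # 2 for K and V, bytes_per_el = 2 for float16
    return batch_size * seqlen * (d_model // n_heads) * n_kv_heads * n_layers * 2 * bytes_per_el

size = kv_cache_bytes(batch_size=1, seqlen=4096, d_model=4096,
                      n_heads=32, n_kv_heads=32, n_layers=32)
print(size / 2**30, "GiB")   # exactly 2.0 GiB for a full 4k context at fp16
```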

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (2 children)

The formula to calculate the KV cache, i.e. the space used by the context:

batch_size * seqlen * n_layers * n_kv_heads * (d_model / n_heads) * 2 (K and V) * 2 (bytes per float16)

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

This blog post is really good, I recommend reading it.

Usually bigger models have more layers, heads and dimensions, but I am not sure whether heads or dimensions grow faster. It's something you can look up though.
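To make that concrete, here's the per-token cache from the same formula for two published configs; note that Llama-2-70B's grouped-query attention (8 KV heads) actually shrinks its cache per token despite the larger dimensions.

```python
# Per-token KV cache for two published configs (fp16, K and V).
configs = {
    #              d_model  n_heads  n_kv_heads  n_layers
    "Llama-2-7B":  (4096,   32,      32,         32),
    "Llama-2-70B": (8192,   64,       8,         80),
}
for name, (d_model, n_heads, n_kv_heads, n_layers) in configs.items():
    per_token = (d_model // n_heads) * n_kv_heads * n_layers * 2 * 2   # K+V, 2 bytes each
    print(f"{name}: {per_token / 1024:.0f} KiB per token of context")
```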

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (3 children)

Jondurbin made something like this with qlora.

The explanation that GPT-4 is a MoE model doesn't make sense to me. The GPT-4 API is 30x more expensive than gpt-3.5-turbo. GPT-3.5-turbo is 175B parameters, right? So if they had 8 experts of 220B each, with only a fraction of them active per token, it wouldn't need to cost 30x more for API use, more like 20-50% more. There was also some speculation that 3.5-turbo is 22B. In that case it also doesn't make sense to me that GPT-4 would be 30x as expensive.
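Spelling out the arithmetic behind that objection (the 175B and 220B figures are the speculated sizes from above; top-1 vs top-2 routing is my assumption):

```python
# Under MoE, per-token compute scales with the *active* parameters, not the total.
# All sizes below are speculation repeated from the comment, not known values.
dense_3_5_turbo = 175e9        # speculated gpt-3.5-turbo size
expert_size     = 220e9        # speculated size of each GPT-4 expert
price_ratio     = 30           # observed GPT-4 vs gpt-3.5-turbo API price gap

for k in (1, 2):               # experts active per token (routing assumption)
    active = k * expert_size
    print(f"top-{k}: ~{active / dense_3_5_turbo:.1f}x the active params of a 175B dense model "
          f"(vs the {price_ratio}x price gap)")
```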
