Great work! Would you mind sharing the datasets you used and/or how you augmented the data for training?
VRAM scales quadratically as sequence length increases, and I'm not aware of any solutions. Even efficient long-context fine-tuning methods such as LongLoRA only improve speed and quality; memory usage stays the same as standard LoRA.
I recommend ensuring you're reducing memory in other ways (see the sketch after this list):
- Ensure you're using 4-bit QLoRA.
- Ensure batch size is 1.
- Ensure you're using FlashAttention-2.
- Ensure your optimizer state is in CPU memory by using a paged optimizer.
- Use gradient checkpointing.
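As a concrete starting point, here is a minimal sketch that combines all of the above with transformers, peft, and bitsandbytes; the model name and hyperparameters are placeholders, not recommendations:

```python
# Illustrative only: 4-bit QLoRA + FlashAttention-2 + paged optimizer + gradient checkpointing.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                      # placeholder base model
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                           # 4-bit QLoRA
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    attn_implementation="flash_attention_2",         # FlashAttention-2
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)       # prep quantized model for training
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32))

training_args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=1,                   # batch size 1
    gradient_accumulation_steps=16,                  # recover a useful effective batch size
    gradient_checkpointing=True,                     # trade compute for activation memory
    optim="paged_adamw_8bit",                        # paged optimizer state
    bf16=True,
)
```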
You could also do something more experimental, like employing Mistral with a sliding window of 1024 tokens to capture 2048 tokens of context while only using the memory of 1024 tokens.
Or you could just summarize or prune your long examples.
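For the sliding-window idea, here is a minimal sketch assuming the Hugging Face Mistral implementation; 1024 is just the window size from the suggestion above, and whether the memory savings materialize depends on the attention backend:

```python
# Illustrative only: shrink Mistral's sliding attention window so each token attends
# to at most the previous 1024 tokens. The window is enforced efficiently in the
# FlashAttention-2 path.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
config.sliding_window = 1024                      # the released config defaults to 4096

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    config=config,
    attn_implementation="flash_attention_2",
)
```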
Say I've got a paragraph about something and the text block contains some other unrelated comment
Have you considered creating text embeddings, calculating their distance matrix, and applying pagerank?
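A minimal sketch of that approach (assuming sentence-transformers and networkx; the embedding model and similarity weighting are placeholder choices):

```python
# Illustrative only: embed each text block, build a similarity graph, and use
# PageRank centrality to spot blocks that are unrelated to the rest.
import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer

blocks = [
    "First paragraph about the main topic ...",
    "Second paragraph continuing the main topic ...",
    "An unrelated comment about something else entirely.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder embedding model
emb = model.encode(blocks, normalize_embeddings=True)
sim = np.clip(emb @ emb.T, 0.0, None)                    # cosine similarity, negatives clipped
np.fill_diagonal(sim, 0.0)                               # ignore self-similarity

graph = nx.from_numpy_array(sim)                         # weighted graph over blocks
scores = nx.pagerank(graph, weight="weight")

# Blocks with the lowest centrality are the best candidates to prune.
for idx in sorted(scores, key=scores.get):
    print(f"{scores[idx]:.3f}  {blocks[idx]}")
```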
Here are some factors that may help induce repetition:
- Llama 2 7B, Mistral 7B, or Yi 6B variant
- Use a lossy quantization such as Q2_K (2-bit), Q4_0 (4-bit), or GPTQ (4-bit)
- Use a sequence length of at least 1024 tokens, if not 2048
- Use a text corpus with a lot of repetition, e.g. https://github.com/Lyrics/lyrics
Additionally, you should use lm-evaluation-harness to test for any degradation in performance on common benchmarks.
Very interesting idea. If you can create a simple benchmark (really just any prompt applied to a noisy 7B model) and demonstrate a reduction in repetition compared to baseline, this method will proliferate across the open LLM development ecosystem.
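As a starting point, a crude repetition score of my own devising (not an established metric) could be the fraction of duplicated n-grams per completion, compared between the treated model and the baseline on the same prompts:

```python
# Illustrative only: fraction of duplicated 4-grams in a completion. A highly
# repetitive generation scores close to 1, a non-repetitive one close to 0.
def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(repeated_ngram_fraction("the cat sat on the mat the cat sat on the mat"))  # ≈ 0.33
```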
Looking forward to seeing your implementation!
I agree, and the Open LLM Leaderboard's newly added DROP metric is a step in the right direction.
34B Model Memory Requirements (inference)
Sequence Length (tokens) vs. Bit Precision (bits)
Seq Len | 4-bit  | 6-bit  | 8-bit  | 16-bit
-----------------------------------------------------------
512 | 15.9GB | 23.8GB | 31.8GB | 63.6GB
1024 | 16.0GB | 23.9GB | 31.9GB | 63.8GB
2048 | 16.1GB | 24.1GB | 32.2GB | 64.3GB
4096 | 16.3GB | 24.5GB | 32.7GB | 65.3GB
8192 | 16.8GB | 25.2GB | 33.7GB | 67.3GB
16384 | 17.8GB | 26.7GB | 35.7GB | 71.3GB
32768 | 19.8GB | 29.7GB | 39.7GB | 79.3GB
65536 | 23.8GB | 35.7GB | 47.7GB | 95.3GB
131072 | 31.8GB | 47.7GB | 63.7GB | 127.3GB
262144 | 47.8GB | 71.7GB | 95.7GB | 191.3GB
- It depends on your inference engine. It will probably be much higher in TGI or vLLM than in what you're presumably using, Transformers. You also need to measure input and output token rates separately. Additionally, longer contexts take more time than shorter ones.
- No, it's mostly bound by memory bandwidth (see the rough arithmetic after this list).
- Your fine-tuned model, assuming it uses the same output format (fp16, GGUF, AWQ, etc.) as the base model, will have the same inference speed as the base model.
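To put a rough number on the memory-bandwidth point, here is a back-of-the-envelope sketch with assumed values (a 34B model at 4-bit precision and ~1 TB/s of GPU memory bandwidth):

```python
# Illustrative only: at batch size 1, every generated token requires streaming the
# full set of weights through the GPU once, so bandwidth sets an upper bound on speed.
params = 34e9
bytes_per_param = 0.5                                 # 4-bit quantization
weights_gb = params * bytes_per_param / 1e9           # ≈ 17 GB of weights
bandwidth_gb_per_s = 1000                             # assumed ~1 TB/s (A100-class GPU)
upper_bound_tok_per_s = bandwidth_gb_per_s / weights_gb
print(round(upper_bound_tok_per_s))                   # ≈ 59 tokens/s, ignoring KV cache reads
```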
You might rent a GPU from runpod or another cloud provider.
Memory requirements:
34B Model Memory Requirements (inference)
Sequence Length (tokens) vs. Bit Precision (bits)
Seq Len | 4-bit  | 6-bit  | 8-bit  | 16-bit
-----------------------------------------------------------
512 | 15.9GB | 23.8GB | 31.8GB | 63.6GB
1024 | 16.0GB | 23.9GB | 31.9GB | 63.8GB
2048 | 16.1GB | 24.1GB | 32.2GB | 64.3GB
4096 | 16.3GB | 24.5GB | 32.7GB | 65.3GB
8192 | 16.8GB | 25.2GB | 33.7GB | 67.3GB
16384 | 17.8GB | 26.7GB | 35.7GB | 71.3GB
Here are my notes:
Overview:
As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference.
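Here's my own simplified sketch of the mechanism (not the authors' code): the layer's neurons form a binary tree, each token evaluates only the nodes along one root-to-leaf path, and the sign of each node's pre-activation picks the branch.

```python
# Illustrative only: conditional execution over a binary tree of neurons.
# A depth-12 tree has 2**12 - 1 = 4095 nodes, but each token touches only 12 of them.
import torch

def fff_forward(x, w_in, w_out, depth=12):
    """x: (hidden,), w_in: (n_nodes, hidden), w_out: (n_nodes, hidden)."""
    y = torch.zeros_like(x)
    node = 0
    for _ in range(depth):
        pre = torch.dot(x, w_in[node])                        # this node's pre-activation
        y = y + torch.nn.functional.gelu(pre) * w_out[node]   # accumulate its contribution
        node = 2 * node + (1 if pre.item() > 0 else 2)        # sign picks the child
    return y

hidden, n_nodes = 768, 2**12 - 1
x = torch.randn(hidden)
w_in = torch.randn(n_nodes, hidden)
w_out = torch.randn(n_nodes, hidden)
print(fff_forward(x, w_in, w_out).shape)                      # torch.Size([768])
```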
Benchmark Averages:
- base implementation with no neurons ignored: 79.9
- ~60% of neurons ignored: 79.2
- ~95% of neurons ignored: 78.1
- ~99.7% of neurons ignored: 77.3
Benchmarks that don't degrade at all as more neurons are ignored:
- RTE ("Recognizing Textual Entailment", determining whether a statement can be inferred from a given text)
- MRPC (ability to measure semantic similarity)
- STSB (ability to measure semantic similarity)
Benchmarks that degrade:
- SST-2: (sentiment analysis)
- MNLI (determining whether a given statement is true, false, or unknown given the provided context)
- QNLI (determine whether a sentence has an answer to a question)
- QQP (determine whether one question is a paraphrase of another)
Benchmarks that degrade substantially:
CoLA, which is addressed in the paper:
Note, however, that the majority of the performance decrease due to the increasing depth is caused by only a single task – CoLA. This behaviour has previously been observed in the literature and is in line with other work trying to compress BERT behaviour into smaller models ... If we disregard CoLA, at least 98.6% of the predictive performance is preserved by all UltraFastBERT models.
Corpus of Linguistic Acceptability (CoLA): sentences annotated as grammatically acceptable or not by experts.
Applicability to causal LMs such as Llama 2
We also observe that the performance decreases with the increasing depth of the FFFs.
With substantially more FF layers in Llama 2, this is concerning. Additionally, it's not obvious to me that this works for a 7B to 70B parameter causal language model just because it works for a ~100M parameter bidirectional encoder. It would be great to see it tested, however!
Other
- Only works on CPU due to GPUs not supporting "conditional matrix multiplication"
Great work, this is impressive, especially for a 13B model!