kpodkanowicz

joined 10 months ago
[–] kpodkanowicz@alien.top 1 points 9 months ago (2 children)

really cool! what do you think about using gpt3.5 as the worst output, in the hope of resurfacing some extra edge?

[–] kpodkanowicz@alien.top 1 points 10 months ago

started asking it as well - it seems to be very hard for 34B models to get it fully right at pass@1

[–] kpodkanowicz@alien.top 1 points 10 months ago

HumanEval is 164 function declarations with corresponding docstrings, and evaluation happens via a set of unit tests with the code running in Docker. The "Extra" comes from HumanEval+ (EvalPlus), which adds several unit tests per problem on top.
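
For reference, this is roughly how such pass@1 numbers are produced with OpenAI's human-eval package (minimal sketch; `generate_completion` is a placeholder for whatever model call you use, and the unit tests themselves should run sandboxed, e.g. in Docker):

```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # the 164 HumanEval tasks

samples = [
    {"task_id": task_id, "completion": generate_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Then run the bundled scorer, which executes the unit tests and reports pass@k:
#   evaluate_functional_correctness samples.jsonl
# HumanEval+ (EvalPlus) repeats the same procedure with extra tests per problem.
```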

Merging models might improve their capabilities, but this one was not able to spot an out-of-bounds access on a wrongly declared vector - there is no chance it magically becomes able to complete complex Python code at what is basically GPT-4 level.

 

I noticed I never posted this before - while experimenting with various merges, after merging Phind v2, the Speechless finetune, and WizardCoder-Python-34B at 33% each (averaged) and then adding the Airoboros PEFT on top, I consistently get:
{'pass@1': 0.7926829268292683}
Base + Extra:
{'pass@1': 0.7073170731707317}

Instruct prompt, greedy decoding, seed=1, 8-bit.
Phind and Wizard score around 72%, Speechless 75%, Airoboros around 60%.

(That would have been SOTA back then; it is also the current score of Deepseek-33B.)
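
For anyone curious, the merge itself is nothing exotic - roughly an equal-weight average of the three checkpoints' parameters, with the Airoboros adapter loaded on top afterwards. A minimal sketch (paths are placeholders; in practice a tool like mergekit does this far more memory-efficiently):

```python
import torch
from transformers import AutoModelForCausalLM

paths = ["phind-v2", "wizardcoder-python-34b", "speechless-34b"]  # placeholder local paths
models = [AutoModelForCausalLM.from_pretrained(p, torch_dtype=torch.float16) for p in paths]
param_dicts = [dict(m.named_parameters()) for m in models]

merged = models[0]
with torch.no_grad():
    for name, param in merged.named_parameters():
        # 33% / 33% / 33%: plain average of the corresponding tensors
        param.copy_(torch.stack([pd[name] for pd in param_dicts]).mean(dim=0))

merged.save_pretrained("merged-34b")
# afterwards: peft.PeftModel.from_pretrained(merged, "<airoboros-adapter-path>") on top
```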

The model is otherwise rather broken - it has not passed any of my regular questions. In my opinion, that would mean that by a lucky stroke I broke the model in a way that let some of the former training data resurface. Let me know what you think.

If someone is very interested I can push it to HF, but it's a waste of storage.

[–] kpodkanowicz@alien.top 1 points 10 months ago

You are on fire - yet another great post from you. Btw, I changed my perplexity scripts to only measure the response after the instruction, using, for example, the Evol dataset, with the prompt preset configured to match the model. I got completely different results than with normal perplexity. Interestingly, when running code instructions on a general model and, for instance, roleplay instructions on a coding model, not only is the perplexity around 1 vs. 3, but the models also degrade differently.
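
Roughly the idea, as a minimal sketch (not my exact script): mask the instruction tokens out of the loss so only the response contributes.

```python
import torch

def response_perplexity(model, tokenizer, instruction: str, response: str) -> float:
    """Perplexity of `response` given `instruction`, ignoring the instruction tokens."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the HF loss
    with torch.no_grad():
        loss = model(full_ids.to(model.device), labels=labels.to(model.device)).loss
    return torch.exp(loss).item()
```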

[–] kpodkanowicz@alien.top 1 points 10 months ago

It's not just great. It's a piece of art.

[–] kpodkanowicz@alien.top 1 points 10 months ago

I have never gotten Flash Attention to work despite testing both paddings, but I'm due to do a clean installation sometime next month. Currently, I use padding Right without FA.

Afaik you need to run model = get_peft_model(...), since you need to pass the PEFT-wrapped model as the argument to SFTTrainer (see the sketch below).
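
Something along these lines (a sketch; it assumes `model`, `tokenizer`, and `dataset` are already set up, and argument names follow the trl versions from around that time):

```python
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

lora_config = LoraConfig(
    r=128,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # wrap first, then hand the PEFT model over

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=768,
)
trainer.train()
```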

[–] kpodkanowicz@alien.top 1 points 10 months ago

Guided output was already mentioned, but maybe I will describe how this can be done even with a very weak model:

You use the text-completion endpoint, where you construct your prompts yourself.
You specify the context and make it stand out as a separate block.
Then in the prompt you ask it to fill in one specific detail (just one per call) of the JSON.
In the completion part (i.e. after "assistant") you already pre-write the output in JSON format up to the first value.
You stop streaming after the " sign.
Then change the prompt to ask for the next value, append it as the next attribute of the JSON you are building, and start generation again, stopping at ".

Very, very fast - you barely generate any tokens; it's mostly prompt eval.

Test it manually; once you have good results, ask GPT-4 to write you a Python wrapper to do it (rough sketch below).
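
Something like this (a sketch; the endpoint URL, prompt template, and payload format are placeholders for whatever OpenAI-compatible text-completion server you run):

```python
import requests

API_URL = "http://localhost:5000/v1/completions"  # placeholder endpoint

def extract_json(context: str, fields: list[str]) -> dict:
    """Fill a JSON object one field per call, stopping generation at the closing quote."""
    result = {}
    for field in fields:
        # Pre-write the output JSON up to the opening quote of the next value
        written = ", ".join(f'"{k}": "{v}"' for k, v in result.items())
        partial = "{" + written + (", " if written else "") + f'"{field}": "'
        prompt = (
            f"### Context:\n{context}\n\n"
            f"### Instruction:\nExtract the value of \"{field}\" from the context "
            f"and continue the JSON below.\n\n"
            f"### Response:\n{partial}"
        )
        r = requests.post(API_URL, json={
            "prompt": prompt,
            "max_tokens": 64,
            "temperature": 0,
            "stop": ['"'],  # stop at the closing quote of the value
        })
        result[field] = r.json()["choices"][0]["text"]
    return result

# e.g. extract_json(document_text, ["name", "date", "total_amount"])
```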

[–] kpodkanowicz@alien.top 1 points 10 months ago

This is really interesting work!!! I'm doing research on Contrastive Decoding and have pretty good results so far; moreover, reading this post I realized it might fix my issues with picking the right alpha.
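
For those unfamiliar, the alpha I mean is the plausibility cutoff from the original Contrastive Decoding formulation; per decoding step it looks roughly like this (sketch; expert = big model, amateur = small model):

```python
import torch

def contrastive_step(expert_logits, amateur_logits, alpha: float = 0.1):
    """Pick the next token by maximizing log p_expert - log p_amateur,
    restricted to tokens the expert itself finds plausible (the alpha cutoff)."""
    expert_lp = torch.log_softmax(expert_logits, dim=-1)
    amateur_lp = torch.log_softmax(amateur_logits, dim=-1)
    # keep only tokens with p_expert >= alpha * max(p_expert)
    cutoff = expert_lp.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(alpha))
    scores = (expert_lp - amateur_lp).masked_fill(expert_lp < cutoff, float("-inf"))
    return scores.argmax(dim=-1)
```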

I have a suggestion for OP and people reading this post - could we start collecting the "go-to" questions this community uses for testing? It would be easier to automate, then publish all outputs at once and let people rank whether they like each output or not.

This way it will be much easier for small teams and individuals to make meaningful progress.

[–] kpodkanowicz@alien.top 1 points 10 months ago

wow, and they also use the ChatML format... I know some top ppl here started to use it, but I wonder if they know something the rest of us don't :D

[–] kpodkanowicz@alien.top 1 points 10 months ago (1 children)

lol, I will stop wasting my time now - I spent roughly 3 hours today trying to get it to work :D Mostly around GGUF

[–] kpodkanowicz@alien.top 1 points 10 months ago (1 children)

I have some issues with Flash Attention, and with 48GB I can go up to rank 512 with batch size 1 and max len 768. My last run was max len 1024, batch 2, gradient accumulation 32, rank 128, and it gives pretty nice results (rough mapping below).
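
Roughly how that last run maps onto the usual HF/PEFT arguments (sketch only; everything else omitted):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(r=128, lora_alpha=16, task_type="CAUSAL_LM")  # rank 128

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,    # batch 2
    gradient_accumulation_steps=32,   # "gradient 32"
    bf16=True,
)
# max_seq_length=1024 goes to the trainer itself (e.g. SFTTrainer)
```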

[–] kpodkanowicz@alien.top 1 points 10 months ago

I think there is room for everyone - text-generation-webui is a piece of art; it's the only thing in the whole space that always works and is reliable. However, if I'm building an agent and shipping a Docker build, I cannot afford to keep changing text-gen, etc.
