kpodkanowicz

joined 10 months ago
[–] kpodkanowicz@alien.top 1 points 9 months ago (2 children)

really cool! what do you think about using gpt3.5 as the worst output, in the hope of resurfacing some extra edge?

[–] kpodkanowicz@alien.top 1 points 10 months ago

started asking it as well - it seems to be very hard for 34B models to get it fully right at pass@1

[–] kpodkanowicz@alien.top 1 points 10 months ago

HumanEval is 164 function declarations with corresponding docstrings, and evaluation happens via a set of unit tests with the code running in Docker. The "Extra" comes from HumanEval+ (EvalPlus), which adds several unit tests per problem on top.
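
For reference, this is roughly how such pass@1 numbers are produced with OpenAI's human-eval package (minimal sketch; `generate_completion` is a placeholder for whatever model call you use, and the unit tests themselves should run sandboxed, e.g. in Docker):

```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # the 164 HumanEval tasks

samples = [
    {"task_id": task_id, "completion": generate_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Then run the bundled scorer, which executes the unit tests and reports pass@k:
#   evaluate_functional_correctness samples.jsonl
# HumanEval+ (EvalPlus) repeats the same procedure with extra tests per problem.
```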

Merging models might improve their capabilities, but this one was not able to spot an out-of-bounds access on a wrongly declared vector - there is no chance it magically becomes able to complete complex Python code at what is basically GPT-4 level.

 

I noticed I never posted this before - while experimenting with various merges, after merging Phind v2, the Speechless finetune, and WizardCoder-Python-34B at 33% each (averaged) and then adding the Airoboros PEFT on top, I consistently get:
{'pass@1': 0.7926829268292683}
Base + Extra:
{'pass@1': 0.7073170731707317}

Instruct prompt, greedy decoding, seed=1, 8-bit.
Phind and Wizard score around 72%, Speechless 75%, Airoboros around 60%.

(That would have been SOTA back then; it is also the current score of Deepseek-33B.)
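
For anyone curious, the merge itself is nothing exotic - roughly an equal-weight average of the three checkpoints' parameters, with the Airoboros adapter loaded on top afterwards. A minimal sketch (paths are placeholders; in practice a tool like mergekit does this far more memory-efficiently):

```python
import torch
from transformers import AutoModelForCausalLM

paths = ["phind-v2", "wizardcoder-python-34b", "speechless-34b"]  # placeholder local paths
models = [AutoModelForCausalLM.from_pretrained(p, torch_dtype=torch.float16) for p in paths]
param_dicts = [dict(m.named_parameters()) for m in models]

merged = models[0]
with torch.no_grad():
    for name, param in merged.named_parameters():
        # 33% / 33% / 33%: plain average of the corresponding tensors
        param.copy_(torch.stack([pd[name] for pd in param_dicts]).mean(dim=0))

merged.save_pretrained("merged-34b")
# afterwards: peft.PeftModel.from_pretrained(merged, "<airoboros-adapter-path>") on top
```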

The model is otherwise rather broken - it has not passed any of my regular questions. In my opinion, that would mean that by a lucky stroke I broke the model in a way that let some of the former training data resurface. Let me know what you think.

If someone is very interested I can push it to HF, but it's a waste of storage.

[–] kpodkanowicz@alien.top 1 points 10 months ago

You are on fire - yet another great post from you. Btw, I changed my perplexity scripts to only measure the response after the instruction, using, for example, the Evol dataset, with the prompt preset configured to match the model. I got completely different results than with normal perplexity. Interestingly, when running code instructions on a general model and, for instance, roleplay instructions on a coding model, not only is the perplexity around 1 vs. 3, but the models also degrade differently.
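
Roughly the idea, as a minimal sketch (not my exact script): mask the instruction tokens out of the loss so only the response contributes.

```python
import torch

def response_perplexity(model, tokenizer, instruction: str, response: str) -> float:
    """Perplexity of `response` given `instruction`, ignoring the instruction tokens."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the HF loss
    with torch.no_grad():
        loss = model(full_ids.to(model.device), labels=labels.to(model.device)).loss
    return torch.exp(loss).item()
```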

[–] kpodkanowicz@alien.top 1 points 10 months ago

It's not just great. It's a piece of art.

[–] kpodkanowicz@alien.top 1 points 10 months ago

I have never gotten Flash Attention to work despite testing both paddings, but I'm due to do a clean installation sometime next month. Currently, I use padding Right without FA.

Afaik you need to run model = get_peft_model(...), since you need to pass the PEFT-wrapped model as the argument to SFTTrainer (see the sketch below).
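
Something along these lines (a sketch; it assumes `model`, `tokenizer`, and `dataset` are already set up, and argument names follow the trl versions from around that time):

```python
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

lora_config = LoraConfig(
    r=128,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # wrap first, then hand the PEFT model over

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=768,
)
trainer.train()
```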

[–] kpodkanowicz@alien.top 1 points 10 months ago

Guided output was already mentioned, but maybe I will describe how this can be done even with a very weak model:

You use the text-completion endpoint, where you construct your prompts yourself.
You specify the context and make it stand out as a separate block.
Then in the prompt you ask it to fill in one specific detail (just one per call) of the JSON.
In the completion part (i.e. after "assistant") you already pre-write the output in JSON format up to the first value.
You stop streaming after the " sign.
Then change the prompt to ask for the next value, append it as the next attribute of the JSON you are building, and start generation again, stopping at ".

Very, very fast - you barely generate any tokens; it's mostly prompt eval.

Test it manually; once you have good results, ask GPT-4 to write you a Python wrapper to do it (rough sketch below).
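
Something like this (a sketch; the endpoint URL, prompt template, and payload format are placeholders for whatever OpenAI-compatible text-completion server you run):

```python
import requests

API_URL = "http://localhost:5000/v1/completions"  # placeholder endpoint

def extract_json(context: str, fields: list[str]) -> dict:
    """Fill a JSON object one field per call, stopping generation at the closing quote."""
    result = {}
    for field in fields:
        # Pre-write the output JSON up to the opening quote of the next value
        written = ", ".join(f'"{k}": "{v}"' for k, v in result.items())
        partial = "{" + written + (", " if written else "") + f'"{field}": "'
        prompt = (
            f"### Context:\n{context}\n\n"
            f"### Instruction:\nExtract the value of \"{field}\" from the context "
            f"and continue the JSON below.\n\n"
            f"### Response:\n{partial}"
        )
        r = requests.post(API_URL, json={
            "prompt": prompt,
            "max_tokens": 64,
            "temperature": 0,
            "stop": ['"'],  # stop at the closing quote of the value
        })
        result[field] = r.json()["choices"][0]["text"]
    return result

# e.g. extract_json(document_text, ["name", "date", "total_amount"])
```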

[–] kpodkanowicz@alien.top 1 points 10 months ago

This is really interesting work!!! I'm doing research on Contrastive Decoding and have pretty good results so far; moreover, reading this post I realized it might fix my issues with picking the right alpha.
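
For those unfamiliar, the alpha I mean is the plausibility cutoff from the original Contrastive Decoding formulation; per decoding step it looks roughly like this (sketch; expert = big model, amateur = small model):

```python
import torch

def contrastive_step(expert_logits, amateur_logits, alpha: float = 0.1):
    """Pick the next token by maximizing log p_expert - log p_amateur,
    restricted to tokens the expert itself finds plausible (the alpha cutoff)."""
    expert_lp = torch.log_softmax(expert_logits, dim=-1)
    amateur_lp = torch.log_softmax(amateur_logits, dim=-1)
    # keep only tokens with p_expert >= alpha * max(p_expert)
    cutoff = expert_lp.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(alpha))
    scores = (expert_lp - amateur_lp).masked_fill(expert_lp < cutoff, float("-inf"))
    return scores.argmax(dim=-1)
```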

I have a suggestion for OP and people reading this post - could we start collecting the "go-to" questions this community uses for testing? It would be easier to automate, then publish all outputs at once and let people rank whether they like each output or not.

This way it will be much easier for small teams and individuals to make meaningful progress.

[–] kpodkanowicz@alien.top 1 points 10 months ago

wow, and they also use the ChatML format... I know some top ppl here started to use it, but I wonder if they know something the rest of us don't :D

[–] kpodkanowicz@alien.top 1 points 10 months ago (1 children)

lol, I will stop wasting my time now - I spent roughly 3 hours today trying to get it to work :D Mostly around GGUF

[–] kpodkanowicz@alien.top 1 points 10 months ago (1 children)

I have some issues with Flash Attention, and with 48GB I can go up to rank 512 with batch size 1 and max len 768. My last run was max len 1024, batch 2, gradient accumulation 32, rank 128, and it gives pretty nice results (rough mapping below).
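
Roughly how that last run maps onto the usual HF/PEFT arguments (sketch only; everything else omitted):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(r=128, lora_alpha=16, task_type="CAUSAL_LM")  # rank 128

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,    # batch 2
    gradient_accumulation_steps=32,   # "gradient 32"
    bf16=True,
)
# max_seq_length=1024 goes to the trainer itself (e.g. SFTTrainer)
```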

[–] kpodkanowicz@alien.top 1 points 10 months ago

I think there is room for everyone - text-generation-webui is a piece of art; it's the only thing in the whole space that always works and is reliable. However, if I'm building an agent and shipping a Docker build, I cannot afford to keep changing text-gen, etc.
