this post was submitted on 28 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

Hi, everyone. Xwin-Math aims to improve the mathematical reasoning capabilities of LLMs. We are now releasing the first version: a series of Llama 2 SFT models trained with chain-of-thought (CoT) prompting.

GitHub link: Xwin-LM/Xwin-Math at main · Xwin-LM/Xwin-LM (github.com)

Model link: Xwin-LM (Xwin-LM) (huggingface.co)

Gradio Demo: Gradio (70B model)
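For anyone who wants to try the models locally, here is a minimal inference sketch using Hugging Face transformers. The model ID and the CoT-style instruction below are assumptions based on the links above; see the GitHub repo for the exact prompt format.

```python
# Minimal inference sketch (illustrative; check the repo for the exact prompt format).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Xwin-LM/Xwin-Math-7B-V1.0"  # assumed model ID on the Hugging Face hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

problem = (
    "James writes a 3-page letter to 2 different friends twice a week. "
    "How many pages does he write a year?"
)
# Generic chain-of-thought style instruction, not necessarily the official template.
prompt = f"{problem}\nLet's think step by step.\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```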

Math capability on the GSM8K and MATH benchmarks

The Xwin-Math-70B-V1.0 model achieves 31.8 pass@1 on the MATH benchmark and 87.0 pass@1 on the GSM8K benchmark. This performance places it first among all open-source CoT models.

The Xwin-Math-7B-V1.0 and Xwin-Math-13B-V1.0 models achieve 66.6 and 76.2 pass@1 on the GSM8K benchmark, ranking first among all Llama 2-based 7B and 13B open-source models, respectively.

We also evaluate Xwin-Math on other benchmarks such as SVAMP and MAWPS. Xwin-Math-70B-V1.0 approaches or surpasses the performance of GPT-3.5-Turbo (8-shot) on most of these benchmarks.
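For readers unfamiliar with the metric: with a single greedy sample per problem, pass@1 is simply exact-match accuracy on the final answer. A rough sketch of GSM8K-style scoring, assuming the standard "#### <number>" gold format and a naive last-number extraction heuristic (not our actual evaluation code):

```python
import re

def extract_gsm8k_gold(solution: str) -> str:
    # GSM8K reference solutions end with "#### <number>".
    return solution.split("####")[-1].strip().replace(",", "")

def extract_model_answer(generation: str) -> str:
    # Naive heuristic for illustration: take the last number in the generation.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else ""

def pass_at_1(generations: list[str], gold_solutions: list[str]) -> float:
    # With one greedy sample per problem, pass@1 reduces to plain accuracy.
    correct = sum(
        extract_model_answer(g) == extract_gsm8k_gold(s)
        for g, s in zip(generations, gold_solutions)
    )
    return correct / len(gold_solutions)
```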

In addition, the release includes an evaluation toolkit that more reliably converts LaTeX formulas into SymPy objects, enabling more accurate assessment of mathematical ability. We found that, due to evaluation limitations, GPT-4's results were previously underestimated.
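To illustrate the idea behind the toolkit (a simplified sketch, not the toolkit itself): parse both the predicted and the reference answer from LaTeX into SymPy expressions and compare them symbolically instead of as strings, so equivalent forms are not marked wrong.

```python
# Requires: pip install sympy antlr4-python3-runtime  (parse_latex needs the ANTLR runtime)
from sympy import simplify
from sympy.parsing.latex import parse_latex

def latex_answers_equal(pred: str, gold: str) -> bool:
    """Return True if two LaTeX answers are mathematically equivalent."""
    try:
        return simplify(parse_latex(pred) - parse_latex(gold)) == 0
    except Exception:
        # Fall back to a plain string comparison if parsing fails.
        return pred.strip() == gold.strip()

# "\frac{2}{4}" and "\frac{1}{2}" compare equal symbolically,
# even though a string match would count them as different.
print(latex_answers_equal(r"\frac{2}{4}", r"\frac{1}{2}"))  # True
```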

More information can be found in our GitHub repo. We perform SFT on Llama 2 with a standard setup, using GPT-4 to augment the training sets of MATH and GSM8K to approximately 100K examples in total. Our paper is still in progress, so more training details and further results will follow soon.
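For a rough picture of what one augmented SFT example might look like (the question below is a public GSM8K training problem; the CoT solution and the JSON schema are purely illustrative, not our actual data format):

```python
import json

# Illustrative (prompt, response) pair for SFT; the schema is hypothetical.
example = {
    "prompt": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether "
        "in April and May?\nLet's think step by step.\n"
    ),
    "response": (
        "In May she sold 48 / 2 = 24 clips.\n"
        "Altogether she sold 48 + 24 = 72 clips.\n"
        "The answer is 72."
    ),
}

with open("sft_sample.jsonl", "w") as f:  # hypothetical file name
    f.write(json.dumps(example) + "\n")
```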

Any suggestions or comments are greatly welcome! Thanks! =)

[–] leelweenee@alien.top 1 points 11 months ago (2 children)

It cannot solve it, not by a long shot.

I tried your prompt, and interestingly enough it got the correct answer (16 mins), but the reasoning was very weird, using logs and whatnot.

[–] pseudonerv@alien.top 1 points 11 months ago (1 children)

because 80/5=16 and the rest are noise

[–] uti24@alien.top 1 points 11 months ago (1 children)

> rest are noise

But why? If you increase the health-restoring parameter, it would affect the result.

Also, why did it use logs then? That doesn't seem right.

[–] pseudonerv@alien.top 1 points 11 months ago

Of course not. I meant the LLM only needs to compute based on that equation alone; the rest, meh, it may hallucinate as it pleases.

[–] uti24@alien.top 1 points 11 months ago

Well, it gave me very weird results on my updated prompt.

Like, it said the result is 4 or something, also using logs.