Hi, Xwin-Math is a project aimed at advancing the mathematical reasoning capabilities of LLMs. We are now releasing the first version: a series of Llama 2 SFT models trained with CoT (chain-of-thought) prompts.

GitHub link: Xwin-LM/Xwin-Math at main · Xwin-LM/Xwin-LM (github.com)

Model link: Xwin-LM (Xwin-LM) (huggingface.co)

Gradio Demo: Gradio
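
For anyone who wants to try the models locally, here is a minimal sketch of querying one of them with a CoT-style prompt through Hugging Face transformers. The model id and prompt template below are assumptions for illustration; please check the model card for the exact format.

```python
# Minimal sketch (not the official inference script): load an Xwin-Math
# checkpoint and generate a step-by-step solution for a word problem.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Xwin-LM/Xwin-Math-7B-V1.0"  # assumed Hub id, see the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

problem = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
# A generic CoT-style prompt; the template used during SFT may differ.
prompt = f"Question: {problem}\nLet's think step by step.\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```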

Math capability on the GSM8K and MATH benchmarks

The Xwin-Math-70B-V1.0 model achieves 31.8 pass@1 on the MATH benchmark and 87.0 pass@1 on the GSM8K benchmark, placing it first among all open-source CoT models.

The Xwin-Math-7B-V1.0 and Xwin-Math-13B-V1.0 models achieve 66.6 and 76.2 pass@1 on GSM8K, respectively, ranking first among all LLaMA-2-based open-source models at the 7B and 13B scales.
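
For context, pass@1 is the fraction of problems solved by a single sampled solution. The sketch below is the standard unbiased pass@k estimator, shown only to clarify the metric; it is not our evaluation script, and with one sample per problem it reduces to plain accuracy.

```python
# Generic unbiased pass@k estimator for n samples per problem, c of them correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem (n=1, k=1) this is just accuracy.
print(pass_at_k(n=1, c=1, k=1))  # 1.0
```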

We also evaluate Xwin-Math on other benchmarks such as SVAMP and MAWPS. Xwin-Math-70B-V1.0 approaches or surpasses the performance of GPT-3.5-Turbo (8-shot) on most of them.

In addition, the release includes an evaluation toolkit that more reliably converts LaTeX formulas into SymPy objects, enabling a more accurate assessment of mathematical ability. Using it, we found that the results of GPT-4 had previously been underestimated due to limitations in earlier evaluation methods.
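
To illustrate the idea (this is not the toolkit's actual code, and the function name is made up for the example), the sketch below parses two LaTeX answers into SymPy expressions and checks them for symbolic equivalence, which catches cases where a plain string match would wrongly mark a correct answer as wrong.

```python
# Minimal sketch: compare a predicted LaTeX answer against a reference answer
# by parsing both into SymPy objects. Requires sympy plus the
# antlr4-python3-runtime package that parse_latex depends on.
from sympy.parsing.latex import parse_latex

def latex_answers_match(predicted: str, reference: str) -> bool:
    """Return True if the two LaTeX expressions are symbolically equivalent."""
    try:
        pred_expr = parse_latex(predicted)
        ref_expr = parse_latex(reference)
    except Exception:
        # Fall back to a plain string comparison if parsing fails.
        return predicted.strip() == reference.strip()
    # equals() attempts symbolic simplification; it may return None when
    # equivalence cannot be decided, which we treat as a mismatch.
    return pred_expr.equals(ref_expr) is True

# Example: "\frac{1}{2}" and "0.5" are treated as the same answer.
print(latex_answers_match(r"\frac{1}{2}", r"0.5"))
```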

More information can be found in our GitHub repo. Training details and further progress will also be updated there continuously.

Any suggestions or comments are greatly welcome! Thanks!
