llamaShill

That's the prevailing idea based on all the info we have so far:

  • Llama 1 training was from around July 2022 to January 2023, Llama 2 from January 2023 to July 2023, so Llama 3 could plausibly be from July 2023 to January 2024.
  • In August, a credible rumor from an OpenAI researcher claimed that Meta talked about having the compute to train Llama 3 and Llama 4, with Llama 3 being as good as GPT-4.
  • In an interview with Lex Fridman published Sept. 28, Mark Zuckerberg said that, when it comes to Llama, they're always training another model and are already working on the next generation.
  • At Meta Connect on Sept. 27 - 28, they said more news about Llama would come next year.

WSJ published an exclusive on Sept. 10 saying Meta's next LLM wouldn't start training until early 2024, which would push a release out much later, but that report may be mistaken since it seems to contradict Mark's recent comments. Meta could also have accelerated its plans to stay relevant in the LLM race, especially since leaks about its LLM development show a heavier emphasis on productizing Llama and incorporating it into their apps.

 

There are two interesting things covered in this paper:

  1. Skywork-13B, a new bilingual foundation model for English and Chinese. They also announce Skywork-13B-Chat, specialized for creative writing, Skywork-13B-Math, specialized for math, Skywork-13B-MM for multimodal capability, and a segment of their SkyPile Corpus comprising 150 billion tokens of Chinese web text.
  2. Research into pretraining on in-domain data. Specifically, they show that some recent foundation models may be heavily overfitted to benchmarks and may have had test data leaked into training. I'll cover this second.

First things first, the models and the technical report.

GitHub and models: https://github.com/SkyworkAI/Skywork/blob/main/README_EN.md

Tech report: https://arxiv.org/abs/2310.19341

Abstract

In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLMs of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves state of the art performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.

Training loss and validation loss

Trajectory of important monitoring metrics during Stage-1 pre-training. Stage-1 pre-training consists of two sequential training sessions, represented by different colors in the loss curves (red for session 0 ∼ 2T and blue for session 2 ∼ 3T).

Benchmark evaluation

https://preview.redd.it/38dzg72pihxb1.png?width=786&format=png&auto=webp&s=72c23176d1731f94427e0b6adb785fbc3f3e1e6d

Pre-training on in-domain data: a common practice?

Important points at a glance from the report:

We evaluate an LLM’s language modeling loss on three datasets drawn from the same distribution: 1) The official GSM8K training set, 2) The official GSM8K test set, 3) A set composed of GSM8K-like samples generated by GPT-4. The corresponding losses are denoted as Ltrain, Ltest, and Lref, respectively. Theoretically, if a language model has not been exposed to any of the three datasets during pre-training, the three losses Ltrain, Ltest, and Lref should be approximately equivalent. However, if the model has been pre-trained on the training set or if the test data has been inadvertently exposed during the pre-training process, we would anticipate a notable discrepancy between Ltrain, Ltest, and Lref.

Models such as ChatGLM3-6B, Baichuan2-13B, Qwen-7B/14B, and Aquila2-34B display markedly lower loss on the training split than on the test split. Consequently, we postulate that these models may have been considerably pre-trained on the GSM8K training split or similar data.

We believe there is a valid risk in the practice of targeted pre-training, in that it compromises fairness in benchmarking. While a model may excel at specific tasks through pre-training on in-domain data, it remains uncertain how well it would perform on unseen tasks. Its capabilities may be overestimated based on the benchmark alone, which can lead to unfair comparisons between models and mislead users or stakeholders about the model's true capabilities.
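
To make the loss-comparison check concrete, here's a minimal sketch of how it could be reproduced with Hugging Face transformers and datasets. This is my own illustration, not the authors' code: the model id is just an example, the prompt formatting is a guess, and loading the GPT-4-generated reference set is left as a stub since that data isn't included here.

```python
# Minimal sketch (not the authors' code) of the loss-based contamination check:
# compare the average LM loss on the GSM8K train split, the GSM8K test split,
# and a GPT-4-generated set of GSM8K-like samples (L_train, L_test, L_ref).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Skywork/Skywork-13B-base"  # illustrative; any causal LM under test
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

def mean_lm_loss(texts, max_samples=200):
    """Average per-token cross-entropy of the model over a list of texts."""
    losses = []
    for text in texts[:max_samples]:
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=1024).input_ids.to(model.device)
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

fmt = lambda ex: ex["question"] + "\n" + ex["answer"]  # question + reference solution
gsm8k = load_dataset("gsm8k", "main")
l_train = mean_lm_loss([fmt(ex) for ex in gsm8k["train"]])
l_test = mean_lm_loss([fmt(ex) for ex in gsm8k["test"]])
# l_ref would be computed the same way on the GPT-4-generated GSM8K-like samples.

print(f"L_train = {l_train:.3f}, L_test = {l_test:.3f}")
# If the model never saw any of these sets, the three losses should be roughly
# equal; L_train (or L_test) sitting well below L_ref is the red flag.
```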

Regular vs irregular results:

https://preview.redd.it/rnei2lv5nhxb1.png?width=775&format=png&auto=webp&s=1e7b77cda38c40e6033ad93656853cd73be02362

Some thoughts:

To put this into perspective, QwenLM reports GSM8K 8-shot scores of 16.7 for Llama 2 7B, 29.6 for Llama 2 13B, and 42.2 for Code Llama 34B. From the same chart, Qwen-7B scores 51.7, Baichuan-13B 52.7, and Qwen-14B a whopping 61.3.

These irregular results remind me of the Skill-Mix paper from researchers at Google DeepMind and Princeton, where they found a wide discrepancy between popular benchmarks and their own evaluation.

https://arxiv.org/abs/2310.17567

A variant of the contamination issue is “cramming for the leaderboard.” It is possible to deliberately train a model on data similar to those used in the leaderboard evaluations. Such datasets are easy to generate from a small number of examples using existing strong models. If “cramming” happens during pre-training, it becomes hard to detect.

Several open models show signs of being over-trained for leaderboards at the expense of general-purpose language capabilities (“cramming”).

Falcon-180B-Chat and Tigerbot-70B-Chat rank higher than LLaMA-2-70B-Chat on the Open LLM Leaderboard, but perform worse on SKILL-MIX for both GPT-4 and LLaMA-2 grading. Tigerbot-70B-Chat performs even worse than LLaMA-2-13B-Chat.

Qwen-14B-Chat outperforms LLaMA-2-70B-Chat on MMLU, HumanEval and GSM8K (Cobbe et al., 2021), but performs worse than LLaMA-2-70B-Chat for k = 2, 3, 4 with both GPT-4 and LLaMA-2 grading.

Mistral-7B-v0.1 outperforms LLaMA-2 13B on all benchmarks that the Mistral AI team tested. Mistral-7B-Instruct-v0.1 (the model after instruction tuning) outperforms LLaMA-2-13B-Chat on MT-Bench (Zheng et al., 2023). Yet, the situation is reversed on SKILL-MIX.
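
For context on what SKILL-MIX actually measures, here's a rough sketch of the evaluation loop as I understand it from the paper: sample k skills and a topic at random, ask the model under test to produce a short text that exhibits all k skills, then have a strong grader model check each skill. Everything below (the skill and topic lists, the prompts, the generate_fn wrapper) is an illustrative stand-in, not the paper's released code.

```python
# Hedged sketch of a Skill-Mix-style evaluation item (my paraphrase, not the
# paper's code). generate_fn wraps the model under test: prompt -> completion.
import random
from openai import OpenAI

grader = OpenAI()  # the paper grades with GPT-4 or LLaMA-2; GPT-4 used here

SKILLS = ["metaphor", "red herring", "modus ponens", "self-serving bias"]  # toy subset
TOPICS = ["sewing", "gardening", "dueling"]  # toy subset

def skill_mix_item(generate_fn, k=3):
    skills = random.sample(SKILLS, k)
    topic = random.choice(TOPICS)
    prompt = (f"Write a short piece of text (a few sentences) about {topic} that "
              f"naturally illustrates all of these skills: {', '.join(skills)}.")
    candidate = generate_fn(prompt)
    grading = grader.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Text:\n{candidate}\n\nFor each of these skills, answer yes/no "
                        f"on whether the text correctly illustrates it, and say whether "
                        f"it stays on the topic '{topic}': {', '.join(skills)}."),
        }],
    )
    return skills, topic, candidate, grading.choices[0].message.content
```

Because the skill-topic combinations are drawn at random from a large pool, there's far less opportunity to cram for this kind of evaluation than for a fixed benchmark like GSM8K, which is presumably why the rankings diverge.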

Textbooks are all you need? More like pretraining on the test set is all you need.