Table of Links
3.2 Measuring Learning with Coding and Math Benchmarks (target domain evaluation)
3.3 Forgetting Metrics (source domain evaluation)
4 Results
4.1 LoRA underperforms full finetuning in programming and math tasks
4.2 LoRA forgets less than full finetuning
4.3 The Learning-Forgetting Tradeoff
4.4 LoRA’s regularization properties
4.5 Full finetuning on code and math does not learn low-rank perturbations
4.6 Practical takeaways for optimally configuring LoRA
Appendix
D. Theoretical Memory Efficiency Gains with LoRA for Single and Multi-GPU Settings
3 Experimental Setup
We train on code and math datasets that have been shown to increase downstream performance. We motivate the training datasets and evaluation benchmarks below.
3.1 Datasets for Continued Pretraining (CPT) and Instruction Finetuning (IFT)
Coding CPT - Starcoder-Python This dataset (Li et al., 2023) consists of permissively licensed repositories from GitHub, including Git commits, in 80+ programming languages. We chose the Python subset and sub-sampled it to 20B tokens.
Math CPT - OpenWebMath We trained on a subset of up to 8.59B of this dataset's 14.7B tokens. The dataset (Paster et al., 2023) includes mathematical web pages from Common Crawl, formatted so as to preserve mathematical content such as LaTeX equations.[2] We note that this dataset contains a considerable amount of full English sentences.[3]
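For concreteness, the sketch below shows one way to sub-sample a continued-pretraining corpus to a fixed token budget by streaming it with the Hugging Face datasets library. The dataset ID ("bigcode/starcoderdata" with data_dir="python"), its "content" text field, and the tokenizer are illustrative assumptions rather than the authors' exact preprocessing; the same loop applies to "open-web-math/open-web-math" (footnote [2], whose text lives in a "text" field) with a budget of roughly 8.59B tokens.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer; the paper finetunes Llama-2 models, but any
# tokenizer works for counting against a token budget.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Stream the Python subset of the StarCoder data so the full corpus is
# never materialized on disk. Dataset ID and field name are assumptions.
stream = load_dataset(
    "bigcode/starcoderdata", data_dir="python", split="train", streaming=True
)

TOKEN_BUDGET = 1_000_000  # illustration only; the paper sub-samples ~20B tokens
subsample, seen_tokens = [], 0
for example in stream:
    seen_tokens += len(tokenizer(example["content"])["input_ids"])
    subsample.append(example["content"])
    if seen_tokens >= TOKEN_BUDGET:
        break

print(f"kept {len(subsample)} documents, ~{seen_tokens:,} tokens")
```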
Coding IFT - Magicoder-Evol-Instruct-110k This dataset (Wei et al., 2023) contains 72.97M tokens of programming questions and answers. It reproduces the “Evol-Instruct” dataset of WizardCoder (Luo et al., 2023): an LLM (GPT-4) is iteratively prompted to increase the difficulty of a set of question-answer pairs (from Code Alpaca; Chaudhary (2023)).
Math IFT - MetaMathQA This dataset (Yu et al., 2023) was built by bootstrapping mathematical word problems from the training sets of GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) by rewriting the questions with variations using GPT-3.5. This dataset contains 395K question-answer pairs and roughly 103M tokens.[4]
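As a minimal illustration of how such question-answer pairs can be flattened into instruction-finetuning text, the sketch below loads MetaMathQA (footnote [4]) and joins each question and answer under a simple prompt template. The field names ("query", "response") and the template are assumptions for illustration, not the authors' preprocessing; the same pattern applies to Magicoder-Evol-Instruct-110k.

```python
from datasets import load_dataset

# MetaMathQA question-answer pairs (see footnote [4]). The field names
# ("query", "response") and the prompt template below are assumptions.
metamath = load_dataset("meta-math/MetaMathQA", split="train")

PROMPT = "Below is a math problem. Write a step-by-step solution.\n\n{question}\n\n"

def to_ift_text(row):
    # Concatenate prompt and answer into a single training string.
    return {"text": PROMPT.format(question=row["query"]) + row["response"]}

ift_dataset = metamath.map(to_ift_text, remove_columns=metamath.column_names)
print(ift_dataset[0]["text"][:300])
```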
We quantify learning and forgetting via benchmarks reported on the Open LLM Leaderboard[5] for state-of-the-art open-source LLMs such as Llama (Touvron et al., 2023).
3.2 Measuring Learning with Coding and Math Benchmarks (target domain evaluation)
Coding - HumanEval This benchmark (Chen et al., 2021) contains 164 problems, each requiring the generation of a Python program from a docstring and a function signature. A generation is considered correct if it passes all supplied unit tests. We use the Code Generation LM Evaluation Harness (Ben Allal et al., 2022), configured to output 50 generations per problem, sampling with softmax temperature=0.2, and computing “pass@1.”
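Because 50 samples are drawn per problem, pass@1 is conventionally computed with the unbiased estimator of Chen et al. (2021): pass@k = 1 − C(n−c, k)/C(n, k) for n samples of which c are correct. A short sketch of that estimator, following the reference numpy implementation, is shown below.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n - c, k) / C(n, k), evaluated as a stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With 50 generations per problem, a problem with 10 passing generations
# contributes pass@1 = 10 / 50 = 0.2.
print(pass_at_k(n=50, c=10, k=1))  # 0.2
```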
Math - GSM8K This benchmark (Cobbe et al., 2021) includes a collection of 8.5K grade-school math word problems. We evaluate on the test split of GSM8K (1,319 samples) as implemented in the LM Evaluation Harness (Gao et al., 2023), with default generation parameters (temperature=0, five few-shot examples, pass@1).
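A minimal sketch of this evaluation through the harness's Python entry point is shown below; the model path and batch size are placeholders, and the exact harness version and arguments the authors used may differ.

```python
import lm_eval  # EleutherAI LM Evaluation Harness (Gao et al., 2023)

# Five-shot, greedy-decoding GSM8K evaluation on the 1,319-example test
# split. The pretrained model path is a placeholder assumption.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```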
3.3 Forgetting Metrics (source domain evaluation)
HellaSwag This benchmark (Zellers et al., 2019) includes 70K problems, each describing an event with multiple possible continuations. The task is to pick the most plausible continuation, which requires making inferences about nuanced everyday situations.
WinoGrande This benchmark (Sakaguchi et al., 2019) also assesses commonsense reasoning. It includes 44K problems, each a sentence that requires resolving an ambiguous pronoun.
ARC-Challenge This benchmark (Clark et al., 2018) consists of 7,787 grade-school level, multiple-choice science questions, testing capabilities in complex reasoning and understanding of scientific concepts.
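The same harness can score all three source-domain benchmarks in one pass; the task identifiers below follow its naming conventions and are given as assumptions, with a placeholder model path.

```python
import lm_eval

# HellaSwag, WinoGrande, and ARC-Challenge under their assumed
# lm-evaluation-harness task names; the model path is a placeholder.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["hellaswag", "winogrande", "arc_challenge"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```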
Authors:
(1) Dan Biderman, Columbia University and Databricks Mosaic AI ([email protected]);
(2) Jose Gonzalez Ortiz, Databricks Mosaic AI ([email protected]);
(3) Jacob Portes, Databricks Mosaic AI ([email protected]);
(4) Mansheej Paul, Databricks Mosaic AI ([email protected]);
(5) Philip Greengard, Columbia University ([email protected]);
(6) Connor Jennings, Databricks Mosaic AI ([email protected]);
(7) Daniel King, Databricks Mosaic AI ([email protected]);
(8) Sam Havens, Databricks Mosaic AI ([email protected]);
(9) Vitaliy Chiley, Databricks Mosaic AI ([email protected]);
(10) Jonathan Frankle, Databricks Mosaic AI ([email protected]);
(11) Cody Blakeney, Databricks Mosaic AI (cody.blakeney);
(12) John P. Cunningham, Columbia University ([email protected]).
This paper is available on arXiv.
[2] https://huggingface.co/datasets/open-web-math/open-web-math
[3] Out of a random selection of 100K examples, a regex search shows that 75% of the examples contain LaTeX. The data is classified as 99.7% English and "overwhelmingly English" by the langdetect and fasttext tools.
[4] https://huggingface.co/datasets/meta-math/MetaMathQA
[5] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard