Experimental Setup and Datasets for Continued Pretraining (CPT) and Instruction Finetuning (IFT)

by Large Models (dot tech), June 17th, 2025

Too Long; Didn't Read

This section outlines the datasets and benchmarks used to pretrain and fine-tune a code/math LLM, and how its performance and retention were measured across domains.


Abstract and 1 Introduction

2 Background

3 Experimental Setup and 3.1 Datasets for Continued Pretraining (CPT) and Instruction Finetuning (IFT)

3.2 Measuring Learning with Coding and Math Benchmarks (target domain evaluation)

3.3 Forgetting Metrics (source domain evaluation)

4 Results

4.1 LoRA underperforms full finetuning in programming and math tasks

4.2 LoRA forgets less than full finetuning

4.3 The Learning-Forgetting Tradeoff

4.4 LoRA’s regularization properties

4.5 Full finetuning on code and math does not learn low-rank perturbations

4.6 Practical takeaways for optimally configuring LoRA

5 Related Work

6 Discussion

7 Conclusion and References

Appendix

A. Experimental Setup

B. Learning rate searches

C. Training Datasets

D. Theoretical Memory Efficiency Gains with LoRA for Single and Multi-GPU Settings

3 Experimental Setup

We train on code and math datasets that have been shown to increase downstream performance. We motivate the training datasets and evaluation benchmarks below.

3.1 Datasets for Continued Pretraining (CPT) and Instruction Finetuning (IFT)

Coding CPT - Starcoder-Python This dataset (Li et al., 2023) consists of permissively licensed repositories from GitHub, including Git commits, in 80+ programming languages. We chose the Python subset and sub-sampled it to 20B tokens.
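
The paper does not describe its subsampling pipeline, so the sketch below is just one plausible way to stream the Python subset down to a fixed token budget. The dataset id (bigcode/starcoderdata), the content field name, and the tokenizer are assumptions for illustration, not details from the paper:

```python
import json

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed dataset id, field name, and tokenizer; the paper does not
# specify how the 20B-token Python subset was drawn.
ds = load_dataset("bigcode/starcoderdata", data_dir="python",
                  split="train", streaming=True)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

TOKEN_BUDGET = 20_000_000_000  # 20B tokens, as in the paper
seen = 0
with open("starcoder_python_20b.jsonl", "w") as f:
    for ex in ds:
        n = len(tok(ex["content"]).input_ids)
        if seen + n > TOKEN_BUDGET:
            break
        f.write(json.dumps({"text": ex["content"]}) + "\n")
        seen += n
```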


Math CPT - OpenWebMath We trained on a subset of up to 8.59B out of 14.7B tokens. The dataset (Paster et al., 2023) includes mathematical web pages from Common Crawl, correctly formatted to preserve mathematical content such as LaTeX equations.[2] We note that this dataset contains a considerable amount of full English sentences. [3]


Coding IFT - Magicoder-Evol-Instruct-110k This dataset (Wei et al., 2023) contains 72.97M tokens of programming questions and answers. It reproduces the “Evol-Instruct” dataset of WizardCoder (Luo et al., 2023): an LLM (GPT-4) is iteratively prompted to increase the difficulty of a set of question-answer pairs (from Code Alpaca; Chaudhary (2023)).
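
The evolution step amounts to a simple prompting loop; the sketch below is a minimal illustration, not the exact WizardCoder procedure, and the prompt wording, model name, and client are all assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the actual Evol-Instruct prompts are more elaborate.
EVOLVE_PROMPT = (
    "Rewrite the following programming question to make it more "
    "challenging (e.g., add constraints or require a less common "
    "algorithm), keeping it self-contained:\n\n{question}"
)

def evolve(question: str, rounds: int = 3, model: str = "gpt-4") -> list[str]:
    """Iteratively ask an LLM to increase a question's difficulty."""
    versions = [question]
    for _ in range(rounds):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": EVOLVE_PROMPT.format(question=versions[-1])}],
        )
        versions.append(resp.choices[0].message.content)
    return versions
```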


Math IFT - MetaMathQA This dataset (Yu et al., 2023) was built by bootstrapping mathematical word problems from the training sets of GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) by rewriting the questions with variations using GPT-3.5. This dataset contains 395K question-answer pairs and roughly 103M tokens.[4]
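
For reference, the released dataset can be loaded directly (see footnote [4]); the field names below reflect the public Hugging Face release and are assumptions rather than details from the paper:

```python
from datasets import load_dataset

# meta-math/MetaMathQA: ~395K bootstrapped question-answer pairs.
ds = load_dataset("meta-math/MetaMathQA", split="train")
print(len(ds))            # number of question-answer pairs
print(ds[0]["query"])     # rewritten (bootstrapped) question
print(ds[0]["response"])  # step-by-step answer
```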


We quantify learning and forgetting via benchmarks reported on the Open LLM Leaderboard[5] for state-of-the-art open-source LLMs such as Llama (Touvron et al., 2023).

3.2 Measuring Learning with Coding and Math Benchmarks (target domain evaluation)

Coding - HumanEval This benchmark (Chen et al., 2021) contains 164 problems that involve generating a Python program given a docstring and a function signature. A generation is considered correct if it passes all supplied unit tests. We use the Code Generation LM Evaluation Harness (Ben Allal et al., 2022), configured to output 50 generations per problem, sampling with softmax temperature=0.2, and calculating “pass@1”.
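
With n samples per problem, “pass@k” is computed with the unbiased estimator from Chen et al. (2021); for k=1 it reduces to the fraction of passing samples. A minimal sketch (the per-problem counts are illustrative, not results from the paper):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total generations sampled per problem (here, 50)
    c: generations that pass all unit tests
    k: sampling budget (here, k=1, where this reduces to c / n)
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Average over problems: each entry is (n, c) for one HumanEval task.
results = [(50, 12), (50, 0), (50, 50)]  # illustrative counts only
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.3f}")
```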


Math - GSM8K This benchmark (Cobbe et al., 2021) includes a collection of 8.5K grade-school math word problems. We evaluate on the test split of GSM8K (1,319 samples) as implemented in the LM Evaluation Harness (Gao et al., 2023), with default generation parameters (temperature=0, 5-shot, pass@1).
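
For reference, a minimal sketch of this evaluation via the harness's Python API; the model id is a placeholder, and the exact entry point and argument names vary across harness versions:

```python
import lm_eval

# Evaluate GSM8K, 5-shot, as described above (model id is a placeholder).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```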

3.3 Forgetting Metrics (source domain evaluation)

HellaSwag This benchmark (Zellers et al., 2019) includes 70K problems, each describing an event with multiple possible continuations. The task is to pick the most plausible continuation, which requires making inferences about nuanced everyday situations.


WinoGrande This benchmark (Sakaguchi et al., 2019) also assesses commonsense reasoning. It includes 44K problems with sentences that require ambiguous pronoun resolution.


ARC-Challenge This benchmark (Clark et al., 2018) consists of 7,787 grade-school level, multiple-choice science questions, testing capabilities in complex reasoning and understanding of scientific concepts.
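
All three source-domain benchmarks are multiple-choice, and harnesses such as the LM Evaluation Harness typically score them by comparing the log-likelihood the model assigns to each candidate continuation. A minimal sketch of that scoring rule, assuming a Hugging Face causal LM (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given
    `context`. Simplification: assumes the tokenization of
    context + continuation splits cleanly at the context boundary."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # Logits at position i-1 predict the token at position i.
    return sum(logprobs[0, i - 1, ids[0, i]].item()
               for i in range(ctx_len, ids.shape[1]))

def pick_answer(context: str, choices: list[str]) -> int:
    """Index of the highest-likelihood continuation (accuracy counts
    how often this matches the gold answer)."""
    return max(range(len(choices)),
               key=lambda i: continuation_logprob(context, choices[i]))
```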


Authors:

(1) Dan Biderman, Columbia University and Databricks Mosaic AI ([email protected]);

(2) Jose Gonzalez Ortiz, Databricks Mosaic AI ([email protected]);

(3) Jacob Portes, Databricks Mosaic AI ([email protected]);

(4) Mansheej Paul, Databricks Mosaic AI ([email protected]);

(5) Philip Greengard, Columbia University ([email protected]);

(6) Connor Jennings, Databricks Mosaic AI ([email protected]);

(7) Daniel King, Databricks Mosaic AI ([email protected]);

(8) Sam Havens, Databricks Mosaic AI ([email protected]);

(9) Vitaliy Chiley, Databricks Mosaic AI ([email protected]);

(10) Jonathan Frankle, Databricks Mosaic AI ([email protected]);

(11) Cody Blakeney, Databricks Mosaic AI (cody.blakeney);

(12) John P. Cunningham, Columbia University ([email protected]).


This paper is available on arxiv under CC BY 4.0 DEED license.

[2] https://huggingface.co/datasets/open-web-math/open-web-math


[3] Out of a random selection of 100K examples, a regex search shows that 75% of the examples contain LaTeX. The data is classified as 99.7% English and "overwhelmingly English" by the langdetect and fasttext tools.


[4] https://huggingface.co/datasets/meta-math/MetaMathQA


[5] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
