mlabonne / llm-datasets
- четверг, 2 мая 2024 г. в 00:00:05
High-quality datasets, tools, and concepts for LLM fine-tuning.
🐦 Follow me on X • 🤗 Hugging Face • 💻 Blog • 📙 Hands-on GNN
High-quality datasets, tools, and concepts for LLM fine-tuning.
Data is the most valuable asset in LLM development. While datasets can't be directly evaluated like models, high-quality datasets have the following characteristics:
Measuring accuracy can be easy in the case of mathematical problems using a Python interpreter, or near-impossible with open-ended, subjective questions. On the other hand, clustering datasets by topic is a good way of measuring diversity. Finally, complexity is difficult to assess without involving frontier models.
Once a model has been pre-trained on a next-token prediction task, supervised fine-tuning is used to turn it into an assistant capable of answering questions and achieving tasks. These datasets contain pairs of instructions and outputs to train LLMs to go beyond their pre-training objective. All the datasets listed here should be under permissive licensing (Apache 2.0, MIT, cc-by-4.0, etc.).
The goal of general-purpose datasets is to transform base models into versatile and capable assistants by exposing them to a wide range of high-quality data. These datasets often include a diverse mix of real-world and synthetic data, commonly generated using models like GPT-4.
Dataset | # | Authors | Date | Notes |
---|---|---|---|---|
Bagel | >2M? | Jon Durbin | Jan 2024 | Collection of datasets decontaminated with cosine similarity. |
Hercules v4.5 | 1.72M | Sebastian Gabarain | Apr 2024 | Large-scale general-purpose dataset with math, code, RP, etc. See v4 for the list of datasets. |
Dolphin-2.9 | 1.39M | Cognitive Computations | Apr 2023 | Large-scale general-purpose dataset used by the Dolphin models. |
OpenHermes-2.5 | 1M | Teknium | Nov 2023 | Another large-scale dataset used by the OpenHermes models. |
SlimOrca | 518k | Lian et al. | Sep 2023 | Curated subset of OpenOrca using GPT-4-as-a-judge to remove wrong answers. |
Tulu V2 Mix | 326k | Ivison et al. | Nov 2023 | Mix of high-quality datasets. See Tulu 2 paper. |
UltraInteract SFT | 289k | Yuan et al. | Apr 2024 | Focus on math, coding, and logic tasks with step-by-step answers. See Eurus paper. |
NeurIPS-LLM-data | 204k | Jindal et al. | Nov 2023 | Winner of NeurIPS LLM Efficiency Challenge, with an interesting data preparation strategy. |
UltraChat 200k | 200k | Tunstall et al., Ding et al. | Oct 2023 | Heavily filtered version of the UItraChat dataset, consisting of 1.4M dialogues generated by ChatGPT. |
WizardLM_evol_instruct_V2 | 143k | Xu et al. | Jun 2023 | Latest version of Evol-Instruct applied to Alpaca and ShareGPT data. See WizardLM paper. |
sft_datablend_v1 | 128k | NVIDIA | Jan 2024 | Blend of publicly available datasets: OASST, CodeContests, FLAN, T0, Open_Platypus, and GSM8K and others (45 total). |
Synthia-v1.3 | 119k | Migel Tissera | Nov 2023 | High-quality synthetic data generated using GPT-4. |
FuseChat-Mixture | 95k | Wan et al. | Feb 2024 | Selection of samples from high-quality datasets. See FuseChat paper. |
oasst1 | 84.4k | Köpf et al. | Mar 2023 | Human-generated assistant-style conversation corpus in 35 different languages. See OASST1 paper and oasst2. |
WizardLM_evol_instruct_70k | 70k | Xu et al. | Apr 2023 | Evol-Instruct applied to Alpaca and ShareGPT data. See WizardLM paper. |
airoboros-3.2 | 58.7k | Jon Durbin | Dec 2023 | High-quality uncensored dataset. |
ShareGPT_Vicuna_unfiltered | 53k | anon823 1489123 | Mar 2023 | Filtered version of the ShareGPT dataset, consisting of real conversations between users and ChatGPT. |
lmsys-chat-1m-smortmodelsonly | 45.8k | Nebulous, Zheng et al. | Sep 2023 | Filtered version of lmsys-chat-1m with responses from GPT-4, GPT-3.5-turbo, Claude-2, Claude-1, and Claude-instant-1. |
Open-Platypus | 24.9k | Lee et al. | Sep 2023 | Collection of datasets that were deduplicated using Sentence Transformers (it contains an NC dataset). See Platypus paper. |
databricks-dolly-15k | 15k | Conover et al. | May 2023 | Generated by Databricks employees, prompt/response pairs in eight different instruction categories, including the seven outlined in the InstructGPT paper. |
LLMs often struggle with mathematical reasoning and formal logic, which has led to the creation of specialized datasets. These datasets extend beyond pure mathematics, encompassing a wide range of problems that require systematic thinking and step-by-step reasoning, ultimately enabling LLMs to tackle complex real-world challenges that involve logical deduction and quantitative analysis.
Dataset | # | Authors | Date | Notes |
---|---|---|---|---|
OpenMathInstruct-1 | 5.75M | Toshniwal et al. | Feb 2024 | Problems from GSM8K and MATH, solutions generated by Mixtral-8x7B |
MetaMathQA | 395k | Yu et al. | Dec 2023 | Bootstrap mathematical questions by rewriting them from multiple perspectives. See MetaMath paper. |
MathInstruct | 262k | Yue et al. | Sep 2023 | Compiled from 13 math rationale datasets, six of which are newly curated, and focuses on chain-of-thought and program-of-thought. |
Orca-Math | 200k | Mitra et al. | Feb 2024 | Grade school math world problems generated using GPT4-Turbo. See Orca-Math paper. |
Code is another challenging domain for LLMs that lack specialized pre-training. Code datasets, containing diverse programming language examples, are used to fine-tune LLMs and enhance their ability to understand, generate, and analyze code, enabling them to serve as effective coding assistants.
Dataset | # | Authors | Date | Notes |
---|---|---|---|---|
CodeFeedback-Filtered-Instruction | 157k | Zheng et al. | Feb 2024 | Filtered version of Magicoder-OSS-Instruct, ShareGPT (Python), Magicoder-Evol-Instruct, and Evol-Instruct-Code. |
Tested-143k-Python-Alpaca | 143k | Vezora | Mar 2024 | Collection of generated Python code that passed automatic tests to ensure high quality. |
glaive-code-assistant | 136k | Glaive.ai | Sep 2023 | Synthetic data of problems and solutions with ~60% Python samples. Also see the v2 version. |
Magicoder-Evol-Instruct-110K | 110k | Wei et al. | Nov 2023 | A decontaminated version of evol-codealpaca-v1. Decontamination is done in the same way as StarCoder (bigcode decontamination process). See Magicoder paper. |
dolphin-coder | 109k | Eric Hartford | Nov 2023 | Dataset transformed from leetcode-rosetta. |
synthetic_tex_to_sql | 100k | Gretel.ai | Apr 2024 | Synthetic text-to-SQL samples (~23M tokens), covering diverse domains. |
sql-create-context | 78.6k | b-mc2 | Apr 2023 | Cleansed and augmented version of the WikiSQL and Spider datasets. |
Magicoder-OSS-Instruct-75K | 75k | Wei et al. | Nov 2023 | OSS-Instruct dataset generated by gpt-3.5-turbo-1106 . See Magicoder paper. |
Code-Feedback | 66.4k | Zheng et al. | Feb 2024 | Diverse Code Interpreter-like dataset with multi-turn dialogues and interleaved text and code responses. See OpenCodeInterpreter paper. |
self-oss-instruct-sc2-exec-filter-50k | 50.7k | Lozhkov et al. | Apr 2024 | Created in three steps with seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. See the blog post. |
Many datasets focus on pairs of instructions and outputs, but chat models are often used in conversational settings. Conversational and role-play datasets expose LLMs to the patterns, nuances, and context-dependent nature of real conversations, allowing them to generate more natural, and engaging dialogues.
Dataset | # | Authors | Date | Notes |
---|---|---|---|---|
Bluemoon | 290k | Squish42 | Jun 2023 | Posts from the Blue Moon roleplaying forum cleaned and scraped by a third party. |
PIPPA | 16.8k | Gosling et al., kingbri | Aug 2023 | Deduped version of Pygmalion's PIPPA in ShareGPT format. |
Capybara | 16k | LDJnr | Dec 2023 | Strong focus on information diversity across a wide range of domains with multi-turn conversations. |
RPGPT_PublicDomain-alpaca | 4.26k | practical dreamer | May 2023 | Synthetic dataset of public domain character dialogue in roleplay format made with build-a-dataset |
Pure-Dove | 3.86k | LDJnr | Sep 2023 | Highly filtered multi-turn conversations between GPT-4 and real humans |
Opus Samantha | 1.85k | macadelicc | Apr 2024 | Multi-turn conversations with Claude 3 Opus. |
LimaRP-augmented | 804 | lemonilia, grimulkan | Jan 2024 | Augmented and cleansed version of LimaRP, consisting of human roleplaying conversations. |
Function calling allows large language models (LLMs) to execute predefined functions with parameters inferred from user prompts, rather than generating standard text responses. This enables LLMs to seamlessly integrate with external systems, perform complex operations, and provide more accurate and contextually relevant responses.
Dataset | # | Authors | Date | Notes |
---|---|---|---|---|
glaive-function-calling-v2 | 113k | Sahil Chaudhary | Sep 2023 | High-quality dataset with pairs of instructions and answers in different languages. See Locutusque/function-calling-chatml for a variant without conversation tags. |
Agent-FLAN | 34.4k | internlm | Mar 2024 | Mix of AgentInstruct, ToolBench, and ShareGPT datasets. |
W.I.P.
To create a high-quality dataset, focus on carefully curating a diverse set of relevant, accurate and informative examples rather than simply maximizing dataset size.
Start by aggregating available data from various sources (open-source or not) and apply filters like data deduplication and data quality. If the initial dataset is small or insufficient, consider synthetically generating additional data that mirrors its quality and style. Iteratively explore and refine the dataset by assessing model performance, identifying gaps and collecting or generate data to address those shortcomings.
Special thanks to geronimi73 for the PR.
Please let me know if a dataset is not properly credited.