How much data is enough data?

This short article gives you a framework for working out how much data you need to fine-tune an AI model.

Intro

Wondering how much data you need to fine-tune an AI model? In this article, I aim to provide a simple way to determine how many data points are enough.

Before giving you the answer, though, it is important to understand what evaluations (commonly referred to as evals) are.

What are evals and why are they important?

Before fine-tuning ever begins, it is essential to be able to measure exactly what you are looking to improve. Depending on your use case, this can be a simple (yet time-consuming) task.

The first step is to know what the perfect output from an AI would be for a given input. Once you know this, you can write an input/output pair for each edge case that you would like to measure.

Once you have all of your edge cases paired with their perfect AI output for a given use case, you can take any AI model and measure how well it performs on all of your input/output pairs. This measurement should include accuracy, speed, and cost.
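To make this concrete, here is a minimal sketch of an eval run in Python. It is an illustration under assumptions rather than a definitive implementation: the names EvalCase, EvalReport and run_evals are my own, the model is assumed to be any function that returns its output text together with the cost of that call, and correctness is judged by a simple exact match, which you may well need to replace with a scorer that suits your use case.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    prompt: str    # the input given to the model
    expected: str  # the "perfect" output for that input


@dataclass
class EvalReport:
    accuracy: float            # fraction of cases answered correctly (0.0 to 1.0)
    total_seconds: float       # wall-clock time for the whole run
    total_cost: float          # sum of per-call costs, in whatever currency you track
    failures: List[EvalCase]   # cases the model got wrong, kept for human review


def run_evals(
    model: Callable[[str], Tuple[str, float]],
    cases: List[EvalCase],
) -> EvalReport:
    """Run every eval case through `model` and record accuracy, speed and cost.

    `model` is a hypothetical callable returning (output_text, cost_of_call);
    wrap your real API or local model so that it matches this shape.
    """
    if not cases:
        raise ValueError("at least one eval case is required")

    failures: List[EvalCase] = []
    total_cost = 0.0
    start = time.perf_counter()
    for case in cases:
        output, cost = model(case.prompt)
        total_cost += cost
        # Assumption: exact match after trimming whitespace counts as correct.
        if output.strip() != case.expected.strip():
            failures.append(case)
    elapsed = time.perf_counter() - start

    accuracy = (len(cases) - len(failures)) / len(cases)
    return EvalReport(accuracy, elapsed, total_cost, failures)
```

Running the same cases through two different models gives you two reports that can be compared directly, and the failures list is what a human would later review.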

So how much data is enough data?

The short answer is: as much data as it takes to surpass the evals of your previous model. Let's see an example:

Imagine you have successfully compiled a set of inputs and their perfect outputs for your specific AI use case. You then call GPT-4-Turbo on every single edge case and record the results. GPT-4-Turbo achieves a respectable 60% accuracy and costs £3 to complete the evals.

Next, you take a smaller model, say Llama 3 8B, and run the same evals. The results from Llama 3 8B show 30% accuracy but a drastically reduced cost of just £0.20 to complete the evals.

Now you begin curating a dataset of extremely high-quality data, sticking to between 50 and 100 data points. Once you have them, you fine-tune Llama 3 8B and run the evals on the new model. The cost remains the same, and the accuracy has jumped to 50%.
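The arithmetic behind this example can be laid out in a few lines of Python. This is only a sketch using the illustrative figures from the paragraphs above; the helper names gap_to_target and cost_ratio are my own, and costs are stored in pence so the division stays exact.

```python
# Figures from the worked example: accuracy in percent, cost per eval run in pence.
runs = {
    "GPT-4-Turbo": {"accuracy_pct": 60, "cost_pence": 300},
    "Llama 3 8B (base)": {"accuracy_pct": 30, "cost_pence": 20},
    "Llama 3 8B (fine-tuned, round 1)": {"accuracy_pct": 50, "cost_pence": 20},
}

# The larger model's score is the target the smaller model has to reach.
baseline = runs["GPT-4-Turbo"]


def gap_to_target(run: dict, target: dict) -> int:
    """Percentage points of accuracy still missing compared with the target."""
    return target["accuracy_pct"] - run["accuracy_pct"]


def cost_ratio(target: dict, run: dict) -> float:
    """How many times more the target model costs per eval run."""
    return target["cost_pence"] / run["cost_pence"]


for name, run in runs.items():
    print(
        f"{name}: {gap_to_target(run, baseline)} points behind the target, "
        f"target costs {cost_ratio(baseline, run):g}x as much per eval run"
    )
```

In this example the first round of fine-tuning closes 20 of the 30 missing points, leaving 10 to go, while the larger model remains 15 times more expensive per eval run.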

This is an iterative process. After each eval run of a newly fine-tuned model, the failed evals are reviewed (by a human) to see why the fine-tuned model got the output wrong. Then, new data points are added to the dataset to cover the model's knowledge gaps, and Llama 3 8B is fine-tuned again with the new dataset.
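The loop described above can be sketched as follows. Everything here is hypothetical scaffolding: fine_tune, evaluate and review_failures are placeholders you would supply yourself (review_failures stands in for the human review step), and the max_rounds cap is my own addition so the loop cannot run forever.

```python
from typing import Any, Callable, List, Tuple

Example = Tuple[str, str]  # (input, perfect output)


def iterate_until_target(
    dataset: List[Example],
    fine_tune: Callable[[List[Example]], Any],
    evaluate: Callable[[Any], Tuple[float, List[Example]]],
    review_failures: Callable[[List[Example]], List[Example]],
    target_accuracy: float,
    max_rounds: int = 10,
) -> Tuple[List[Example], List[Tuple[int, float]]]:
    """Fine-tune, evaluate, review failures, extend the dataset, and repeat.

    All three callables are hypothetical placeholders:
      fine_tune(dataset)        -> a newly fine-tuned model
      evaluate(model)           -> (accuracy between 0 and 1, failed eval cases)
      review_failures(failures) -> new training examples written after review
    Returns the final dataset and a history of (dataset size, accuracy) per round.
    """
    dataset = list(dataset)  # work on a copy so the caller's list is untouched
    history: List[Tuple[int, float]] = []

    for _ in range(max_rounds):
        model = fine_tune(dataset)
        accuracy, failures = evaluate(model)
        history.append((len(dataset), accuracy))

        # Stop once the fine-tuned model reaches the score you set out to match.
        if accuracy >= target_accuracy:
            break

        new_examples = review_failures(failures)
        if not new_examples:
            break  # nothing new to learn from; adding rounds would not help
        dataset.extend(new_examples)

    return dataset, history
```

The history that comes back shows how accuracy moved as the dataset grew, which is a quick way to see whether each batch of new data points is still paying off. Note that the new examples should address the gaps the failures reveal rather than copy the eval cases themselves, otherwise the evals stop measuring anything useful.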

Once the accuracy reaches or surpasses (depending on your goal) the previous model's eval scores, you have enough data.

Conclusion

In conclusion, the key to determining how much data is needed to fine-tune AI lies in the iterative process of evaluation and refinement. By setting clear performance benchmarks and continuously comparing your fine-tuned AI model's output against these targets, you can effectively gauge the adequacy of your data.

This method involves a cycle of testing, analyzing failures, and enhancing the dataset until the AI's performance meets or surpasses that of its predecessors. This tailored approach prevents the wasteful accumulation of irrelevant data, making your fine-tuning efforts precise and goal-oriented.

Please get in touch if you have any questions about this or about fine-tuning in general. I'd love to have a chat.

Best,
Dan Austin