10/13/2024
Is it feasible to train an AI using only data produced by another AI? It might sound like a foolish notion. But it's an idea that has been around for a while, and it's gaining traction as genuinely new, accurate data becomes harder to come by.
Anthropic trained Claude 3.5 Sonnet, one of its flagship models, partly on synthetic data. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is reportedly using o1, its "reasoning" model, to produce synthetic training data for its upcoming Orion.
But why does AI need data in the first place, and what kind of data does it need? And can synthetic data really take its place?
Why annotations matter to AI systems
Artificial intelligence systems are statistical machines. Trained on a great many examples, they learn the patterns in those examples, such as the fact that "to whom" in an email usually precedes "it may concern."
Key components in these examples are annotations, which are typically text labels that describe the contents or meaning of the data these systems consume. They act as markers, "teaching" a model how to discriminate between objects, locations, and concepts.
A photo-classifying model shown many pictures of kitchens labeled with the word "kitchen" will, as it trains, begin to associate "kitchen" with general characteristics of kitchens (for example, that they contain fridges and counters). After training, shown a photo of a kitchen that wasn't among the original examples, the model should be able to identify it as such. (Of course, if the kitchen photos were labeled "cow," it would identify them as cows, which underscores how much accurate annotation matters.)
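To make the idea concrete, here is a minimal, hypothetical sketch of that labeling-and-training loop in Python. The feature vectors are random stand-ins for real photos, and the class names, sizes, and model choice are illustrative assumptions rather than anyone's actual pipeline.

```python
# Illustrative sketch: how text labels ("annotations") steer a classifier.
# The feature vectors below are random stand-ins; in a real system they would
# be derived from actual photos. Class names and sizes are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend each image is summarized by a 64-dimensional feature vector, with
# "kitchen" images clustering around one mean and "bedroom" images around another.
kitchen_feats = rng.normal(loc=1.0, scale=1.0, size=(200, 64))
bedroom_feats = rng.normal(loc=-1.0, scale=1.0, size=(200, 64))

X = np.vstack([kitchen_feats, bedroom_feats])
y = np.array(["kitchen"] * 200 + ["bedroom"] * 200)  # the annotations

clf = LogisticRegression(max_iter=1000).fit(X, y)

# A new, unseen kitchen-like image should now be recognized as such...
new_kitchen = rng.normal(loc=1.0, scale=1.0, size=(1, 64))
print(clf.predict(new_kitchen))  # expected: ['kitchen']

# ...but if the same examples had been annotated "cow", the model would
# just as confidently call them cows.
```

Swap the labels and the same code happily learns the wrong association, which is the "cow" failure mode described above.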
The appetite for AI, and the need for labeled data to fuel its development, has dramatically expanded the market for annotation services. Dimension Market Research estimates it is worth $838.2 million today and will grow to $10.34 billion within the next ten years. While there are no precise figures for how many people do labeling work, a 2022 paper puts the number in the "millions."
Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay reasonably well, particularly when the labeling requires specialized knowledge (math skills, for instance). Others can be backbreaking. Annotators in poorer countries are often paid only a few dollars per hour, without benefits or any guarantee of future work.
A drying-up data supply
So there are humanitarian reasons to seek out alternatives to human-made labels. But there are also practical ones.
Humans can only label so much, so fast. Annotators also have biases that can show up in their annotations and, in turn, in any models trained on them. They make mistakes and stumble over labeling instructions. And paying people to do the work is expensive.
Data in general is expensive, for that matter. Reddit has made hundreds of millions of dollars licensing data to Google, OpenAI, and other companies, while Shutterstock is charging AI vendors tens of millions of dollars to access its archives.
Finally, obtaining data is likewise getting more difficult.
Most models are trained on massive collections of public data, which owners are increasingly choosing to fence off out of concern that their work will be copied or that they won't receive credit for it. More than 35% of the world's top 1,000 websites now block OpenAI's web scraper. And a recent study found that roughly 25% of data from "high-quality" sources has been restricted from the major datasets used to train models.
Should the current trend of restricting access continue, the research firm Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032. That, along with worries about toxic content finding its way into public datasets and about copyright disputes, has forced a reckoning for AI vendors.
Artificial substitutes
At first glance, synthetic data would appear to be the answer to all of these problems. Need annotations? Generate them. Need more example data? No problem. The sky's the limit.
And this is true to some degree.
"If 'data is the new oil,' synthetic data pitches itself as biofuel, creatable without the negative externalities of the real thing," said Os Keyes, a PhD candidate at the University of Washington who studies the ethical implications of emerging technologies. "You can simulate and extrapolate new entries from a small starting set of data."
The AI sector has embraced the idea wholeheartedly.
This month, Writer, an enterprise-focused generative AI startup, debuted Palmyra X 004, a model trained almost entirely on synthetic data. Developing it cost just $700,000, Writer says, compared with estimates of $4.6 million for a comparably sized OpenAI model.
A portion of the training data for Microsoft's Phi open models was synthetic. So was some of the data for Google's Gemma models. Nvidia announced a model family this summer designed to generate synthetic training data, and the AI firm Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.
The creation of synthetic data has grown into a separate industry that, by 2030, may be valued at $2.34 billion. According to Gartner, synthetically generated data will account for 60% of the data utilized in AI and analytics initiatives this year.
Luca Soldaini, a senior research scientist at the Allen Institute for AI, pointed out that synthetic data techniques can be used to generate training data in a form that is difficult to obtain through scraping (or even content licensing). For example, in training Movie Gen, its video generator, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, such as descriptions of the lighting.
In a similar vein, OpenAI says it used synthetic data to fine-tune GPT-4o to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said it generates synthetic data to supplement the real-world data it uses to train Alexa's speech recognition models.
"Human intuition of what data is needed to achieve a specific model behavior can be quickly expanded upon with synthetic data models," Soldaini added.
Artificial dangers
But synthetic data is no cure-all. It suffers from the same "garbage in, garbage out" problem as all AI. Models create synthetic data, and if the data used to train those models contains biases and limitations, their outputs will be similarly tainted. For instance, groups underrepresented in the original data will still be underrepresented in the synthetic data.
"The issue is, there's only so much you can do," Keyes said. "Say your dataset contains just 30 Black individuals. Extrapolating out might help, but if those 30 people are all middle-class or fair-skinned, that's what the 'representative' data will all look like."
To that point, a 2023 study by researchers at Stanford and Rice Universities found that over-reliance on synthetic data during training can create models whose "quality or diversity progressively decrease." Sampling bias, meaning poor representation of the real world, causes a model's diversity to worsen after a few generations of training, the researchers found (though they also found that mixing in some real-world data helps mitigate this).
Keyes sees additional risks in complex models such as OpenAI's o1, which could produce harder-to-spot hallucinations in their synthetic data. These, in turn, could reduce the accuracy of models trained on that data, especially if the sources of the hallucinations are hard to pin down.
"Complex models have hallucinations; the data generated by complex models contains hallucinations," Keyes added. "And with a model like o1, the developers themselves can't always explain why artefacts appear."
Compounding hallucinations can produce models that spew gibberish. A study published in the journal Nature shows how models trained on error-ridden data generate even more error-ridden data, and how this feedback loop degrades future generations of models. Over generations, models lose their grasp of more sophisticated information, the researchers observed, becoming more generic and often producing answers that have nothing to do with the questions asked.
Additional research shows that other kinds of models, such as image generators, are also susceptible to this kind of collapse.
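To illustrate the dynamic these studies describe, here is a small toy simulation, a construction for this article rather than anything taken from the papers: each "generation" fits a simple one-dimensional Gaussian model to its training data, and the next generation trains only on samples drawn from that fit. The spread of the data, a crude stand-in for diversity, tends to shrink over generations, while mixing in real data slows the collapse. All parameters are arbitrary illustrative choices.

```python
# Toy simulation of the feedback loop: each generation fits a simple Gaussian
# "model" to its training data, then the next generation trains only on data
# sampled from that fit. The data's spread (a stand-in for diversity) tends to
# shrink. Mixing in real data each generation slows the collapse.
import numpy as np

rng = np.random.default_rng(42)

def run(generations=100, n=25, real_fraction=0.0):
    real = rng.normal(loc=0.0, scale=1.0, size=5000)   # the "real world"
    data = rng.choice(real, size=n)                     # generation 0: real data
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()             # "train" the model
        synthetic = rng.normal(mu, sigma, size=n)       # model-generated data
        n_real = int(real_fraction * n)                 # optionally mix in real data
        data = np.concatenate([synthetic[: n - n_real],
                               rng.choice(real, size=n_real)])
    return data.std()

print("spread after 100 generations, pure synthetic:", round(run(real_fraction=0.0), 3))
print("spread after 100 generations, 20% real data:", round(run(real_fraction=0.2), 3))
# Typically the pure-synthetic run ends with a far smaller spread than the run
# that keeps mixing in real data.
```

This contraction is only a cartoon of the sampling-bias effect the researchers describe; real model collapse involves far more complex models and data, but the direction of the effect is the same.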
Soldaini agrees that "raw" synthetic data isn't to be trusted, at least if the goal is to avoid training forgetful chatbots and homogeneous image generators. Using it "safely," he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real data, just as you would with any other dataset.
Failing to do so could eventually lead to model collapse, a situation in which a model's outputs become increasingly biased and less "creative," ultimately jeopardizing its usefulness. There is a chance the process can be caught and halted before it becomes dangerous, but it is a risk.
"Researchers need to review the generated data, improve the generation process, and find safeguards to eliminate low-quality data points," Soldaini said. "Synthetic data pipelines are not a self-improving machine; their output needs to be thoroughly examined and refined before being used for training."
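As a rough illustration of the kind of review-and-filter step Soldaini describes, the sketch below applies a few simple heuristics (deduplication, a length floor, and a quality-score threshold) to hypothetical synthetic text examples. The Example class, the score field, and the thresholds are assumptions made for illustration, not any vendor's actual pipeline.

```python
# Rough sketch of a review-and-filter pass over synthetic training examples.
# The Example class, the quality 'score' field, and all thresholds are
# hypothetical; they stand in for whatever review process a real pipeline uses.
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    score: float  # assumed to come from a separate quality or reward model

def filter_synthetic(examples, min_score=0.7, min_words=5):
    seen = set()
    kept = []
    for ex in examples:
        key = ex.text.strip().lower()
        if key in seen:                        # drop exact duplicates
            continue
        if len(ex.text.split()) < min_words:   # drop degenerate, too-short outputs
            continue
        if ex.score < min_score:               # drop low-quality generations
            continue
        seen.add(key)
        kept.append(ex)
    return kept

synthetic = [
    Example("The kitchen has a fridge and counters.", 0.9),
    Example("The kitchen has a fridge and counters.", 0.9),  # duplicate
    Example("kitchen kitchen kitchen", 0.2),                 # low quality
]
print(len(filter_synthetic(synthetic)))  # expected: 1
```

In practice the filtering criteria would be far more elaborate, and, as Soldaini notes, the surviving data would ideally still be blended with fresh, real examples.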
OpenAI CEO Sam Altman has said that AI will one day generate synthetic data good enough to effectively train itself. But, assuming that's even possible, the technology isn't there yet. No major AI lab has released a model trained solely on synthetic data.
It appears that human intervention will be necessary in some capacity to ensure that a model's training proceeds smoothly, at least in the near future.