OpenAI and other firms are using synthetic data to train AI models

Advanced AI models now being trained using computer-made 'synthetic' data

Major tech firms developing generative AI models are actively exploring a new approach to acquiring the vast amounts of information they need for their advanced models: creating it from scratch using computer-generated data.

Players like Microsoft, OpenAI, and Cohere are employing synthetic data to train their large language models (LLMs), primarily because of constraints on the availability of human-created data.

Microsoft-backed OpenAI's introduction of ChatGPT in November 2022 sparked a wave of rival products this year from companies including Google and Anthropic. These products can generate plausible text, code or images when provided with simple prompts.

The large language models (LLMs) that drive chatbots like ChatGPT and Google's Bard primarily depend on web scraping techniques to gather data from digitised books, news articles, social media, search queries, images, videos and other online sources.

After gathering the data, human reviewers refine the models further by rating and correcting their responses, a process commonly referred to as "reinforcement learning from human feedback" (RLHF).
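To make the idea concrete, the sketch below shows the kind of preference data RLHF rests on: human labellers compare two candidate answers to the same prompt, and those comparisons are used to fit a reward model. The example prompt, answers and the toy_reward function are invented for illustration; a real reward model is a learned network, not a word count.

```python
# Minimal, illustrative sketch of RLHF-style preference data.
# toy_reward is a stand-in (hypothetical), not a real reward model.
import math

# Human labellers mark which of two candidate answers they prefer;
# these pairs are the raw material for training a reward model.
preference_pairs = [
    {
        "prompt": "Explain what synthetic data is in one sentence.",
        "chosen": "Synthetic data is information generated by algorithms rather than collected from the real world.",
        "rejected": "Data is data.",
    },
]

def toy_reward(text: str) -> float:
    """Stand-in for a learned reward model: longer, more specific answers score higher."""
    return math.log(1 + len(text.split()))

def pairwise_loss(chosen: str, rejected: str) -> float:
    """Bradley-Terry style loss used when fitting a reward model on human preferences."""
    margin = toy_reward(chosen) - toy_reward(rejected)
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

for pair in preference_pairs:
    print(round(pairwise_loss(pair["chosen"], pair["rejected"]), 4))
```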

The diverse range of data enables chatbots to provide comprehensive and contextually relevant responses to user queries and prompts.

However, as generative AI tools become more advanced, AI companies are starting to encounter significant challenges around data access and privacy. Developers have also observed that generic data scraped from the web is no longer sufficient on its own to push AI model performance further.

In order to continue improving, AI models will need access to unique and sophisticated datasets, which may need to be curated and generated by domain experts such as scientists, doctors, and engineers.

Alternatively, AI companies may seek to acquire proprietary data from large corporations, which could provide valuable insights and unique information.

But human-created data can be very costly and time-consuming to collect.

Computer-generated synthetic data offers a cost-effective solution to this challenge.

The CEO of AI firm Cohere, Aidan Gomez, told the Financial Times that synthetic data is already actively employed in training AI models, but that its use is not widely publicised.

A notable example he cited involves training a model in advanced mathematics. In this approach, two AI models take on the roles of a teacher and a student, engaging in a discussion about a specific topic, such as trigonometry.

During the exchange, a human observer intervenes to correct any inaccuracies in the conversation. This interactive process helps the AI models learn and improve their understanding of complex mathematical concepts effectively.
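As a rough illustration of that loop, the Python sketch below has two model "roles" exchange turns on a topic, with the transcript kept for training only after a human has vetted it. The call_model function is a placeholder for a real LLM API call, and the canned replies exist purely so the example runs on its own.

```python
# Illustrative sketch of teacher/student self-talk for generating synthetic
# training data. call_model is a placeholder; in practice it would wrap a
# real LLM API, and a human reviewer would correct mistakes before the
# transcript is added to the training set.

def call_model(role_prompt: str, conversation: list[str]) -> str:
    """Stand-in for an LLM call; returns a canned reply for demonstration."""
    canned = {
        "teacher": "The sine of an angle is the ratio of the opposite side to the hypotenuse.",
        "student": "So for a 30-degree angle in a right triangle, sin(30) = 0.5?",
    }
    return canned["teacher" if "teacher" in role_prompt else "student"]

def generate_dialogue(topic: str, turns: int = 2) -> list[dict]:
    """Alternate teacher and student turns, collecting the transcript."""
    transcript = []
    conversation: list[str] = []
    for i in range(turns):
        role = "teacher" if i % 2 == 0 else "student"
        reply = call_model(f"You are a {role} discussing {topic}.", conversation)
        conversation.append(reply)
        transcript.append({"role": role, "text": reply})
    return transcript

dialogue = generate_dialogue("trigonometry")
# A human reviewer would vet each turn here before the dialogue is used as training data.
for turn in dialogue:
    print(f"{turn['role']}: {turn['text']}")
```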

Last year, a collaborative team of researchers from MIT, the MIT-IBM Watson AI Lab, and Boston University developed a synthetic dataset containing 150,000 video clips, capturing diverse human actions. This dataset served as training material for their machine-learning models.

Subsequently, the researchers exposed these models to six datasets comprising real-world videos to assess their ability to recognise actions in those clips.

Surprisingly, the models trained on synthetic data outperformed those trained on real data, especially in videos with fewer background objects.

The finding suggested that synthetic data can be a valuable resource for training machine-learning models.

The Financial Times highlights a research paper by Microsoft Research titled 'Textbooks Are All You Need.'

The paper showed how training a coding model on high-quality, textbook-style data resulted in impressive performance on coding tasks.

A recent Microsoft study has demonstrated the efficacy of synthetic data in training smaller and less complex models. In one example, a synthetic dataset of short stories, generated using GPT-4, was successfully used to train a simple language model.

The trained model was able to generate coherent and grammatically correct stories, showcasing the potential of synthetic data in training efficient AI models.
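A simplified sketch of how such a corpus might be assembled is shown below. The generate_story function stands in for a call to a large model such as GPT-4, and the themes, prompt and output filename are assumptions made for illustration; the pipeline Microsoft describes is more elaborate.

```python
# Rough sketch of building a synthetic short-story corpus for training a
# small language model. generate_story is a placeholder for a call to a
# large model such as GPT-4; themes and the output filename are assumptions.
import json
import random

STORY_THEMES = ["a lost kitten", "a rainy day", "a magic pebble"]

def generate_story(theme: str) -> str:
    """Stand-in for a large-model call that writes a simple story on a theme."""
    return f"Once upon a time there was {theme}. Everyone learned something, and the day ended happily."

def build_corpus(n_stories: int, path: str = "synthetic_stories.jsonl") -> None:
    """Write generated stories to a JSONL file, one training example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(n_stories):
            story = generate_story(random.choice(STORY_THEMES))
            f.write(json.dumps({"text": story}) + "\n")

build_corpus(3)
# The resulting file would then feed a standard small-model training loop.
```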

Startups have emerged to serve the market for synthetic data. Companies like Scale AI and Gretel.ai say their synthetic datasets address concerns around privacy and help eliminate bias in training data.

However, critics warn that training on raw AI-generated data could lead to the degradation of AI models over time, as falsehoods and inaccuracies accumulate.

Nevertheless, some AI researchers view synthetic data as a promising path towards achieving superintelligent AI.