Reddit versus OpenAI bodes well for content owners

Owners are driven to act when scraping hits their bottom line

Scraping web data is an easy way to train AI models, but content owners object to the practice. With no legislative solution in sight, Reddit's deal with OpenAI marks a way forward, says Travers Smith's James Longster.

Generative AI systems need huge volumes of data to train the large language models underpinning their offerings. This isn't news to anyone with an understanding of the technology, but a critical question remains: where do you find the data?

Synthetic data may be an option for certain models, but the obvious place to look for data is the internet. The use of data that is publicly available from the internet, however, is not straightforward. Far from it, in fact.

Problems around the "scraping" of data from websites are not new. I wrote an article for this publication on this very topic in 2015 in light of a spate of cases relating to, amongst other things, the scraping of flight schedule data from airline websites by price aggregators.

The problem with unauthorised scraping, whatever its purpose, is that it will likely breach the terms of the website from which the data is taken, and it may also constitute an IP infringement (with the main focus on copyright, but potentially other IP rights too). There's a risk, too, that the output from AI models trained on the data will infringe the content creators' copyright if the output is a substantial copy of that content.

This all has the potential to expose organisations to claims from the relevant website owners. Often, the website owners decide not to take action, but this isn't always the case, and they are significantly more likely to act if the scraping is hitting their bottom line. For example, Getty is currently suing Stability AI in the UK (with a corresponding action in the US), claiming that Stability AI infringed its IP rights by scraping Getty's library of images to train, develop and generate outputs from 'Stable Diffusion', a system that uses AI to generate images.

There's no legislative solution on the horizon in the UK. The text and data mining (TDM) exception to copyright does not cover commercial use, and the government has confirmed that it is not proceeding with a broader exception for TDM. A voluntary code of practice, which aimed to protect rights holders while avoiding stunting AI growth due to data mining issues, has also been shelved. A House of Lords committee has heavily criticised the government's lack of progress on the matter.

Pending legislative reform or the buck being passed to the courts, what's the way through the impasse?

OpenAI's CEO, Sam Altman, called for "new economic models" to compensate content owners at Davos in January 2024. A few weeks ago, Reddit's COO, Jen Wong, warned AI firms scraping Reddit's platform that they would face legal action if they were doing so without authorisation.

The ultimatum appears to have paid dividends, with both OpenAI and Google subsequently striking deals to use content from the platform. The OpenAI deal in particular had a significant impact on Reddit's share price, which rose by as much as 15% in subsequent trading.

This deal with Reddit is by no means unique, with similar deals having recently been agreed with the Financial Times, the Associated Press and News Corp (owner of the Wall Street Journal, The Times and The Sunday Times).

Whilst many will agree that generative AI solutions will have significant power and influence over our lives and the economy in the coming years, these commercial deals clearly demonstrate the value of content and the importance of guarding that value: there are significant opportunities for the gatekeepers of that content to monetise their data assets.

James Longster is a partner in Travers Smith's Technology & Commercial Transactions Department