The Inevitability of Data Usage Rights in Generative Models
2023 Apr 21

One of the least-discussed topics in all the conversations around Artificial Intelligence, generative models, and Large Language Models (LLMs) is the inevitability of the discussion around usage rights, fair use, and royalty payments for platforms built on user-generated data.
For example, Stack Overflow is set to demand financial compensation for the right to use its database of programming questions and answers, which has been used as training input for various LLMs — ChatGPT in particular:
OpenAI, Google, and other companies building large-scale AI projects have traditionally paid nothing for much of their training data, scraping it from the web. But Stack Overflow, a popular internet forum for computer programming help, plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says. The site has more than 20 million registered users.
Reddit is already changing its API terms of service, and it appears it will become much harder to use their database as input for model training:
Reddit has not specified the cost, but said in its news release that it will introduce a “new premium access point for third parties who require additional capabilities, higher usage limits, and broader usage rights.” The company says it will update its Terms and Conditions to clarify what cases are acceptable to utilize Reddit’s data, saying as of Tuesday, developers and third parties will be notified of the new terms which will take effect within 60 days of receiving the notice.
Steve Huffman, Reddit’s founder and chief executive, told The New York Times: “Crawling Reddit, generating value, and not returning any of that value to our users is something we have a problem with. It’s a good time for us to tighten things up.” He added, “We think that’s fair.”
Three conversations that will inevitably have to happen in these uncharted waters of generative models and LLMs are:
- (I) What will the definition of fair use of data be, who has the right to use and reproduce it, and what compensation models will exist for creators (with special attention to Stack Overflow’s CEO saying he will package all community responses and sell them);
- (II) How these generative models will handle data poisoning at scale, given that very few people have the resources and expertise to detect and combat it; and
- (III) Who will bear accountability in cases of mischaracterization, errors, omissions, and general harm. Examples such as defamation, false accusations of sexual harassment, and even false corruption accusations are already starting to surface via ChatGPT.