It’s time to zoom out from using AI and ask a basic question: How can “training” large language models (LLMs), the technology behind ChatGPT (and, via related methods, image generators like DALL-E), on anything and everything on the internet, including all of our content, possibly be legal? Especially when that training feeds other people’s commercial products? I’ve heard from countless local news leaders this year who are concerned, even convinced, that they’re seeing the end of their news business, with no protections to fall back on.
We reached out to Danielle Coffey, CEO of the News/Media Alliance, the American Press Institute’s parent corporation, to learn more about the legal fight for news organizations’ rights in the age of AI.
DANIELLE: We filed comments with the U.S. Patent and Trademark Office a few years ago about data retention and mining. It was already going on, but nobody cared. Then in January or February, all of a sudden every article in my inbox and every call was about AI, and that was because of ChatGPT. Once it was commercialized, it spread like wildfire, and it’s a wildfire that has sustained a level of buzz like nothing I’ve ever seen.
News organizations are uniquely situated because our content feeds the training of AI tools that are then used to create more journalism. It’s almost cannibalistic. How do we feed into the training of models, built on datasets of our work, that are then fed into commercial products?
The very first thing we [the News/Media Alliance] did was a report showing the business landscape and how these players use our information. We surveyed our membership to determine what licenses exist, who’s using what and how, and whether they have permission.
We have copyright-protected content and exclusive rights. If our content is being used in a way we haven’t granted permission for, that use is unauthorized and requires compensation and/or a license. You need permission, essentially.
ELITE: There are some privacy concerns here. When you put something into ChatGPT, you’re giving the company rights to that content, so don’t use your embargoed investigations for an outline. The bigger concern, which N/MA is uniquely positioned to address, is that you don’t know whether your content can be accessed, or already has been accessed, for machine learning. Currently, it’s not legal for companies like OpenAI to use it that way, right? What rights do publishers have, and how do you know if they’ve used your content?
D: If it’s being used, how do we stop them? And whether or not it’s even legal is a deep question.
There are two developments that [tech companies] are worried about. First, lawmakers and the public have been made aware of how LLMs work, so the companies don’t know if they’re on solid legal ground anymore. Second, OpenAI has now said you can opt out, and Google, in a fuzzy way (via a Guardian article), said it will respect opt-outs. But the problem remains: how do we know if they’re using our content?
That’s where the technology piece comes in. We realized that preventing crawlers like the nonprofit Common Crawl is easy: you know what its bot looks like when it comes to your site, so you can deny it. But do you want to, and how will it affect your traffic? Most of our member companies have blocked it.
OpenAI is more complicated. What it uses will end up in a lot of commercial products. If [my content could be] used at a later stage and I could be compensated for it, do I want to opt out? Hence the litigation against OpenAI: don’t tell me I can opt out, because opting out may mean I don’t get compensated. It would amount to waiving your right to payment, and I don’t think companies are ready to do that. Hence Sarah Silverman and others filing lawsuits against these tools.
With Google’s SGE and its other products, it becomes much more complicated. Google currently crawls and indexes our content for search, which you want it to do, but that exacerbates the problem. If you’re being crawled for search, and that crawl is meshed with grounding for AI datasets used to put together super snippets, how do you stop one activity without stopping the other? And do you want to? I want to be found.
I think ultimately, it’s a business question. Even if we can use a tech solution to opt out or grant permissions for specific uses of our content, will we?
Editor’s note: After Danielle and Elite’s conversation, The New York Times blocked OpenAI’s web crawler.
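The opt-out mechanics Danielle describes operate at the crawler level: each bot identifies itself with a user-agent string, and a publisher can deny it in the site’s robots.txt file. A minimal sketch follows; `CCBot` and `GPTBot` are the publicly documented user agents for Common Crawl and OpenAI’s training crawler, but names and policies change, so verify current documentation before relying on this.

```text
# robots.txt — sketch of blocking AI-training crawlers only

# Deny Common Crawl's crawler (user agent "CCBot")
User-agent: CCBot
Disallow: /

# Deny OpenAI's training crawler (user agent "GPTBot")
User-agent: GPTBot
Disallow: /

# All other crawlers (e.g., ordinary search bots) remain allowed
User-agent: *
Allow: /
```

Note that robots.txt is honored voluntarily by well-behaved crawlers; it cannot remove content a model has already been trained on, which is the archives problem Danielle raises.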
E: Your eyes lit up when we mentioned the archives issue: ChatGPT retroactively using our content, for free, to learn and improve its own product. It’s a mature product because of everything that’s been fed into it. It’s already been trained. What does archive use look like, legally, in the fight you’re waging?
D: Sam Han at The Washington Post says you can’t untrain a model. If you were to argue they’ve illegally used our information, there’s no taking it back; payment is the only remedy. When you go to court and say this company used my material illegally, the court will ask whether there’s market harm. When it comes to archives, one company alone might make millions off them. Think about the sites where you have to log in or pay for access, like Newspapers.com, ProQuest or LexisNexis: these companies already exist, and there is a market for us to be paid for our archival content. So when AI crawlers take it for free, a court can find that we’re being deprived of what we could charge.
E: That seems like the easiest path forward: showing that what AI has already done is a problem. That’s promising to me, because it’s a huge concern for people just now exploring their options.
I’m curious: OpenAI hired Tom Rubin, a lawyer who used to work for Microsoft, to engage with publishers on this issue. Is that weird? Has that happened before? Why would they do that, and how would it benefit us?
D: It’s certainly not unusual. Tom cares about the industry and is a familiar face, which will be helpful. It’s a win-win: the company has someone who is familiar to and welcomed by our publishers, and publishers have someone they trust on the inside. That’s why they do it. They are having conversations, and OpenAI seems open to payment, as shown by its deals with the AP and the American Journalism Project.
E: Is there anything that makes it difficult for the media industry to tackle AI and licensing?
D: When I first started working, I noticed every other content industry is protective of “my content”: FBI warnings when you’re watching movies, intense music copyrights. They’re very conscious of protecting their property and not letting others use it. But when I started in this industry, the mentality was, go ahead and take it, it will come back and we’ll make money through advertising. That was our downfall. We now get cents on the dollar. We let everyone have it, with faith that it would come back, but because of the distribution model, it went out the door and never came back. It was hard to shift that mentality, to hold on tighter and put up paywalls, but people came to recognize the need for exclusive use and for requiring permission to use content.

What I saw when AI came was a quick embrace of the mentality that my content is being used, it has value, it’s being used without permission, and that’s not allowed. That was a big conceptual shift: our companies recognizing the value of our content out of the gate.
E: That’s an interesting mentality, and something that’s tricky to deal with: we think news should be for everyone, we want the biggest reach possible, but it does bite us. Anything else you want to talk about for API readers?
D: AI can be incredibly empowering. It can create productivity and drive innovation of all kinds. AI tools, natural language processing — these can be amazing things for newsrooms. But the two are connected: if you don’t fix the part where we’re compensated, or at least grant permission, for the use of our content to train machines that could become our replacement, what happens when the journalism goes away and there’s nothing left to feed on? What happens over time to the system that relies on this content in the first place? It’s not a good thing for society and democracy.
- Check out N/MA’s work on artificial intelligence
- During our conversation, Danielle said this about the legal cases to watch and learn from as we navigate this:
- Read about the Getty cases in the U.K. and U.S. Getty has always been aggressive, but it’s easier with images, where there’s strong foundational precedent in the courts and a lot of clarity around the law. In their complaint you can see the [Getty] watermark in the AI-generated images! But how do you attach metadata to a string of words?
- Judge Orrick’s comments [in a lawsuit brought by artists against text-to-image AI developers] look at the input stage only, but I think the output stage is essential to determining how the output harms my product.
- In U.S. legislation, Senate Majority Leader Chuck Schumer has taken this issue up, which is interesting because majority leaders typically don’t run with legislation themselves. Senator Chris Coons is handling a piece that has to do with copyright, so there will be forum meetings this fall and N/MA will be weighing in. As for legislation in Congress, I don’t think it will happen before the end of the year, but the legislative proposals are exploring all of the above: disclosures, transparency, compensation, and IP protections around both the use of our content and the outputs.
- Are you looking for help creating an AI strategy for your newsroom? To work with API on creating a clear, transparent and audience-centered AI strategy, reply to this email or reach out to firstname.lastname@example.org with the subject line “AI Strategy.”