- Content owners are wising up to their work being freely used by Big Tech to build new AI tools.
- Bots like Common Crawl are scraping and storing billions of pages of content for AI training.
- With less incentive to share online freely, the web could become a series of paywalled gardens.
AI is undermining the web’s grand bargain, and a decades-old handshake agreement is the only thing standing in the way.
A single bit of code, robots.txt, was proposed in the mid-1990s as a way for websites to tell bot crawlers they don’t want their data scraped and collected. It was widely accepted as one of the unofficial rules supporting the web.
At the time, the main purpose of these crawlers was to index information so results in search engines would improve. Google, Microsoft’s Bing and other search engines have crawlers. They index content so it can be later served up as links to billions of potential consumers. This is the essential deal that created the flourishing web we know today: Creators share abundant information and exchange ideas online freely because they know consumers will visit and either see an ad, subscribe, or buy something.
Now, though, generative AI and large language models are changing the mission of web crawlers radically and rapidly. Instead of working to support content creators, these tools have been turned against them.
The bots feeding Big Tech
Web crawlers now collect online information to feed into giant datasets that are used for free by wealthy tech companies to develop AI models. CCBot feeds Common Crawl, one of the biggest AI datasets. GPTBot feeds data to OpenAI, the company behind ChatGPT and GPT-4, currently the most powerful AI model. Google just calls its LLM training data “Infiniset,” without mentioning where the vast majority of the data comes from, although 12.5% comes from C4, a cleaned-up version of Common Crawl.
The models use all this free information to learn how to answer user questions immediately. That’s a long way from indexing a web site so users can be sent through to the original work.
Without a supply of potential consumers, there’s little incentive for content creators to let web crawlers continue to suck up free data online. GPTBot is already being blocked by Amazon, Airbnb, Quora, and hundreds of other websites. Common Crawl’s CCBot is beginning to be blocked more, too.
‘A crude tool’
What hasn’t changed is how to block these crawlers. Implementing robots.txt on a web site, and excluding specific crawlers, is the only option. And it’s not very good.
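As a minimal sketch of what such a block looks like, the snippet below parses a hypothetical robots.txt that excludes GPTBot and CCBot and checks it with Python's standard-library urllib.robotparser. The rules, the example.com URL and the is_allowed helper are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out two AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "CCBot", "Googlebot"):
    print(agent, "allowed" if is_allowed(agent) else "blocked")
```

The check only has an effect if the crawler chooses to run it before fetching a page. Nothing on the website's side enforces the answer.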
“It’s a bit of a crude tool,” said Joost de Valk, a former WordPress executive, tech investor and founder of digital marketing firm Yoast. “It has no basis in law, and is basically maintained by Google, although they say they do that together with other search engines.”
It’s also open to manipulation, especially given the voracious appetite for quality AI data. The only thing a company like OpenAI has to change is the name of its bot crawler to bypass all the disallow rules people put in place using robots.txt, de Valk explained.
Because robots.txt is voluntary, web crawlers can also simply ignore the blocking instructions and siphon the information from a site anyway. Some operators, such as Brave, a newer search engine, don’t disclose the name of their crawler, making it impossible to block by name.
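The name-based weakness de Valk describes can be illustrated with a short sketch, again using Python's standard-library urllib.robotparser. The robots.txt content, the example.com URL and the crawler name "RenamedBot" are made up for illustration. The point is only that a rule written for one user-agent name says nothing about any other name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that names only one crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def may_crawl(user_agent: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, "https://example.com/")


print(may_crawl("GPTBot"))      # the named crawler is refused
print(may_crawl("RenamedBot"))  # the same crawler under a made-up new name passes
```

A site could instead disallow every user agent with a wildcard rule, but that would also turn away the search engine crawlers that send it visitors.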
“Everything online is being sucked up into a vacuum for the models,” said Nick Vincent, a computer science professor who studies the relationship between human-generated data and AI. “There’s so much going on under the hood. In the next six months, we will look back and want to evaluate these models differently.”
AI bot backlash
De Valk warns that owners and creators of online content may already be too late in understanding the risks of allowing these bots to scoop up their data for free and use it indiscriminately to develop AI models.
“Right now, doing nothing means, ‘I’m ok with my content being in every AI and LLM in the world,’” de Valk said. “That’s just plain wrong. A better version of robots.txt could be created, but it’d be very weird if that was done by the search engines and the large AI parties themselves.”
Several major companies and websites have responded recently, with some starting to deploy robots.txt for the first time.
As of August 22, 70 of the 1,000 most popular websites have used robots.txt to block GPTBot since OpenAI revealed the crawler about three weeks ago, according to Originality.ai, a company that checks content to see if it’s AI-generated or plagiarized.
The company also found that 62 of the 1,000 most popular websites are blocking Common Crawl’s CCBot, with an increasing number doing so only this year as awareness of data crawling for AI has grown.
Still, it is not enforceable. Any crawler could ignore a robots.txt file and collect every last bit of data it found on a webpage, with the owner of the page more than likely having no idea it even happened. Even if robots.txt had any basis in law, its original purpose has little to do with information on the internet being used to create AI models.
“Robots.txt is unlikely to be seen as a legal prohibition on use of data,” according to Jason Schultz, director of NYU’s Technology Law & Policy Clinic. “It was primarily meant to signal that one did not want one’s website to be indexed by search engines, not as a signal that one did not want one’s content used for machine learning and AI training.”
‘This is a minefield’
This activity has been going on for years. OpenAI revealed its first GPT model in 2018, having trained it on BookCorpus, a dataset of thousands of indie or self-published books. Common Crawl started in 2008 and its dataset became publicly available in 2011 through cloud storage provided by AWS.
Although GPTBot is now more widely blocked, Common Crawl is a larger threat to any business that is concerned about its data being used to train another company’s AI model. What Google did for internet search, Common Crawl is doing for AI.
“This is a minefield,” said Catherine Stihler, CEO of Creative Commons. “We updated our strategy only a few years ago, and now we’re in a different world.”
Creative Commons started in 2001 as a way for creators and owners to license works for use on the internet through an alternative to a strict copyright framework, known as “copyleft.” Creators and owners maintain their rights, while a Commons license lets people access the content and create derivative works. Wikipedia operates through a Creative Commons license, as do Flickr, Stack Overflow and ProPublica, along with many other well-known websites.
Under its new five-year strategy, which notes the “problematic use of open content” to train AI technologies, Creative Commons is looking to make the sharing of work online more “equitable,” through a “multifrontal, coordinated, broad-based approach that transcends copyright.”
The 160 billion-page gorilla
Common Crawl, via CCBot, holds what is perhaps the largest repository of data ever collected from the internet. Since 2011, it has crawled and saved information from 160 billion web pages and counting. Typically it crawls and saves around 3 billion web pages each month.
Its mission statement says the undertaking is an “open data” project aimed at allowing anyone to “indulge their curiosities, analyze the world, and pursue brilliant ideas.”
The reality has become very different today. The massive amount of data it holds and continues to collect is being used by some of the world’s largest corporations to create mostly proprietary models. If a big tech company isn’t already making money off of its AI output (OpenAI has many paid services), there’s a plan to do so in the future.
Some big tech companies have stopped disclosing where they get this data. However, Common Crawl has been and continues to be used to develop many powerful AI models. It helped Google create Bard. It helped Meta train Llama. It helped OpenAI build ChatGPT.
Common Crawl also feeds The Pile, which hosts more curated datasets pulled from the work of other bot crawlers. It has been used extensively on AI projects, including Llama and an LLM from Microsoft and Nvidia, called MT-NLG.
Not comical
One of The Pile’s most recent downloads from June is a massive collection of comic books, including the entire works of Archie, Batman, X-Men, Star Wars and Superman. Created by DC Comics, now owned by Warner Brothers, and Marvel, now owned by Disney, all of the works remain under copyright. The Pile also hosts a large set of copyrighted books, as The Atlantic recently reported.
“There’s a difference between the intent of crawlers and how they are used,” said NYU’s Schultz. “It is very hard to police or insist that data be used in a particular way.”
As far as The Pile is concerned, while it admits its data is full of copyrighted material, it claimed in its founding technical paper that “there is little acknowledgment of the fact that the processing and distribution of data owned by others may also be a violation of copyright law.”
Beyond that, the group, part of EleutherAI, argued its use of the material is considered “transformative” under the fair use doctrine, despite the data sets holding relatively unaltered work. It also admitted that it needs to use full-length copyrighted content “in order to produce the best results” when training LLMs.
Such arguments of fair use by crawlers and AI projects are already being put to the test. Authors, visual artists and even source code developers are suing the likes of OpenAI, Microsoft and Meta because their original work has been used without their consent to train something they get no benefit from.
“There’s no universe where putting something on the internet grants free, unlimited, commercial use of someone’s labor w/o consent,” Steven Sinofsky, a former Microsoft executive who’s a partner at VC firm Andreessen Horowitz, wrote recently on X.
No resolution in sight
For the moment, there’s no clear resolution in sight.
“We are grappling with all of this now,” said Stihler, the CEO of Creative Commons. “There are so many issues that keep cropping up: compensation, consent, credit. What does all of that look like with AI? I don’t have an answer.”
De Valk said Creative Commons, with its method of facilitating broader copyright licenses that allow owned works to be used on the internet, has been suggested as a possible model for consent when it comes to AI model development.
Stihler is not so sure. When it comes to AI, perhaps there is no single solution. Licensing and copyright, even a more flexible Commons-style agreement, likely won’t work. How do you license the whole of the internet?
“Every lawyer that I speak to says a license is not going to solve the problem,” Stihler said.
She is talking about this regularly to stakeholders, from authors to executives of AI companies. Stihler met with representatives of OpenAI earlier this year and said the company is discussing how to “reward creators.”
Still, it’s unclear “what the commons really looks like in the age of AI,” she added.
‘If we’re not careful, we’ll end up closing the commons’
Considering just how much data web crawlers have already scraped and handed over to big tech companies, and how little power is in the hands of the creators of that content, the internet as we know it could change dramatically.
If posting information online means giving data for free to an AI model that will compete with you for users, then this activity may simply stop.
There are already signs of this: Fewer human software coders are visiting Q&A web site Stack Overflow to answer questions. Why? Because their previous work was used to train AI models that now answer many of these questions automatically.
Stihler said the future of all created work online could soon look like the current state of streaming, with content locked behind “Plus” subscription fiefdoms that get ever more costly.
“If we’re not careful, we’ll end up closing the commons,” Stihler said. “There’ll be more walled gardens, more things people can’t access. That is not a successful model for humanity’s future of knowledge and creativity.”