The text file that runs the internet

Really interesting writeup on the history of robots.txt (I’ve written previously about my feelings around robots.txt and AI bots).

the main focus of robots.txt was on search engines; you’d let them scrape your site and in exchange they’d promise to send people back to you. Now AI has changed the equation: companies around the web are using your site and its data to build massive sets of training data, in order to build models and products that may not acknowledge your existence at all.

The robots.txt file governs a give and take; AI feels to many like all take and no give.

That’s the problem: the incentives for an open web are quickly dwindling.

In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, have made high-quality training data one of the internet’s most valuable commodities. That has caused internet providers of all sorts to reconsider the value of the data on their servers, and rethink who gets access to what. Being too permissive can bleed your website of all its value; being too restrictive can make you invisible. And you have to keep making that choice with new companies, new partners, and new stakes all the time.

Great ending:

[the creators of robots.txt] believed that the internet was a good place, filled with good people, who above all wanted the internet to be a good thing. In that world, and on that internet, explaining your wishes in a text file was governance enough. Now, as AI stands to reshape the culture and economy of the internet all over again, a humble plain-text file is starting to look a little old-fashioned.