AI's insatiable thirst for data creates problems for those who produce and host that precious data.
According to its fans, artificial intelligence (AI) will fundamentally change the world. Having computer models which can write passable 500-word texts on practically any subject (unless it happened after 2021), and which only sometimes lie, is apparently the missing piece holding humanity back from its apotheosis. But we also have to be careful, as the models risk destroying humanity, according to Elon Musk. To say that AI is overhyped is the understatement of the decade. The idea of AI is fascinating, but the reality is mostly disappointing; no matter how impressive the technology is, and no matter what uses a better future version might have, right now it’s not all that it’s made out to be. That being said, it seems to have had quite a profound impact on one industry: social media.
To understand how recent changes at Reddit and Twitter are connected to AI, it helps to understand the basics of how AI works. Practically all modern consumer-facing AIs like ChatGPT and DALL-E are built on vast amounts of data. A recent article in The Economist reports that version four of the GPT model, on which ChatGPT is built, needed one trillion parameters to achieve the performance it does. Parameters are the numerical values a model adjusts as it learns from examples, and tuning that many of them takes truly astronomical quantities of data, which is usually gathered by scraping the internet. Where the data comes from is not of particular concern, and the producers of the data needed for “training” models like GPT-4 are often not compensated.
So if you need tens of billions of words of competently written work for use in AI development, where would you find it? My guess is on text-based social media platforms. In May Reddit kicked a hornet’s nest when it decided to start charging for use of its API. The prohibitive cost basically killed the cottage industry of third-party Reddit clients by diktat. The company's stated reason was that AI developers were exploiting the platform, scraping Reddit’s precious data through the API without paying a penny. The fact that the same stroke of a pen destroyed alternative clients which helped Reddit innovate, along with academic research projects which can’t pay the fee, is just a sad by-product for the company. Twitter, for its part, announced that it has implemented limits on how many tweets users can see per day, with different tiers depending on whether a given user is logged in or paying for Twitter Blue.
As an AI sceptic, I find it disheartening to see Silicon Valley firms consume user-generated data so rapaciously as to threaten the platforms on which that very data is hosted, when the end product is so underwhelming. It also pointedly illustrates the hypocrisy of internet businesses which exploit their users for data while jealously guarding their own. It’s my firm conviction that one of the main reasons why the internet as we know it is so full of data-harvesting schemes is that there is no agreed-upon price for a single piece of data. For all the upheaval AI development has caused, perhaps it can set us on the path to an agreed-upon price for the foundation of the internet economy: data.
I've always been interested in politics, economics, and the interplay between them. The blog is a place for me to explore different ideas and concepts relating to economics or politics, be they national or international. The goal for the blog is to make you think and to provide new perspectives.
Written by Karl Johansson
Sources:
Cover photo by cottonbro studio from Pexels, edited by Karl Johansson