Over the past 15 years or so, social media data have provided great insights into both human behaviour and the effects of social media itself on society, including understanding of voting patterns, mobility and movement, as well as responses to natural disasters, emergencies and pandemics.
Social media data have been an immensely valuable and appealing resource for researchers, as masses of information have been readily available via a platform’s application programming interface (API) – an official channel that enables individuals to pull and post social media content. However, numerous platforms, including Twitter (now called X), TikTok and Reddit, have recently made substantial changes to their APIs, drastically reducing and monetising access. These changes have triggered discussion within academic communities, as the increasing difficulty of accessing social media data has presented researchers with challenges that, in many cases, have made research impossible to perform.
Effects of restrictions to data sharing on reproducibility
One of the biggest changes is the implementation of highly restrictive data-sharing statements in social media platforms’ terms and conditions (henceforth “terms”). Such restrictions are problematic if researchers want to replicate datasets to validate data that were previously collected. For example, our work needed to replicate datasets to train and run machine-learning models to identify bots on Twitter. However, since 2016, Twitter’s restrictions on sharing raw data have meant that researchers can share only tweet and user IDs. So we had to re-collect the fields required for bot detection directly from the Twitter API. Some of those fields were no longer available, which undermined both reproducibility and replicability.
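To make the problem concrete, the following is a minimal sketch of the kind of check a team might run after re-collecting data from shared IDs: it counts how many re-collected records lack each field a model needs. The function name `missing_fields`, the field names and the sample records are all hypothetical and chosen for illustration; they are not taken from our work or from any platform’s API.

```python
from typing import Iterable


def missing_fields(required: Iterable[str], records: list[dict]) -> dict[str, int]:
    """Count, for each required field, how many records lack a value for it."""
    counts = {field: 0 for field in required}
    for record in records:
        for field in counts:
            # A field counts as missing if it is absent or explicitly None.
            if record.get(field) is None:
                counts[field] += 1
    # Report only the fields that are missing at least once.
    return {field: n for field, n in counts.items() if n > 0}


# Hypothetical example: fields a bot-detection model might expect,
# and two records re-collected from shared IDs.
required = ["id", "text", "followers_count"]
records = [
    {"id": "1", "text": "hello", "followers_count": 10},
    {"id": "2", "text": "world"},  # this field is no longer returned
]

gaps = missing_fields(required, records)
print(gaps)  # {'followers_count': 1}
```

Any field that appears in the result is one the original study had and a replication cannot recover, which is exactly where reproducibility breaks down.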
Twitter changed its terms again, in mid-2023, to allow for up to 50,000 tweets (including content) to be shared per day between two individuals for research, while also stating that one cannot infer anything on an individual level regarding, for example, health, political stance or demographics – only at an aggregate (grouped) level.
In a similar vein, Reddit terms state that its users (rather than Reddit) own the content they produce and that such content “cannot be used to train machine learning (ML) or AI models without the express permission of the rights holders”. From our understanding (at present), there seems to be no distinction between commercial and non-commercial (research) use. Moreover, we find these terms vague, with no definition of ML or AI, thus leaving swathes of computational research projects between a rock and a hard place.
Requirements that are incompatible with research practice
Other terms simply render research impossible. For example, unless you’re a US academic, you cannot use the TikTok API, and you cannot use any data from TikTok unless you obtain it through the API, thus cutting off the rest of the world. Similarly, TikTok has requirements to regularly update datasets that are incompatible with research practice. TikTok’s terms state that researchers must “refresh Research API data at least every fifteen days, and delete data [that is no longer available]”. Although TikTok shared in July 2023 that it was expanding its Research API to Europe (but still excluding several developing countries), its terms remain too restrictive to be compatible with research.
While acknowledging that users delete and edit their posts, remove their accounts or switch privacy settings – which should be honoured and protected – it is important to note that this fundamentally changes the original datasets. Hence, if researchers cannot share datasets, they are working with datasets that constantly shift over time. This has large implications for reproducing work in the future.
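As a rough illustration of how a refresh-and-delete requirement changes a dataset over time, the sketch below drops every locally stored record whose ID is no longer available. It assumes the researcher has already obtained the set of still-available IDs by some means; the function `refresh_dataset`, the IDs and the record contents are hypothetical, and no real platform API is called.

```python
def refresh_dataset(
    dataset: dict[str, dict], still_available: set[str]
) -> tuple[dict[str, dict], set[str]]:
    """Keep only records whose ID is still available; report the IDs removed."""
    kept = {post_id: post for post_id, post in dataset.items() if post_id in still_available}
    removed = set(dataset) - still_available
    return kept, removed


# Hypothetical dataset as it was originally collected.
original = {
    "a1": {"text": "first post"},
    "b2": {"text": "second post"},
    "c3": {"text": "third post"},
}

# Hypothetical result of a later refresh: one post has since been deleted.
available_now = {"a1", "c3"}

current, removed = refresh_dataset(original, available_now)
print(sorted(current))  # ['a1', 'c3']
print(sorted(removed))  # ['b2']
```

Each refresh honours users’ deletions, but it also means that an analysis run on `current` is no longer run on the data that produced the original results.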
It is worth noting that changes to API access can be well intentioned and necessary. For instance, the Cambridge Analytica scandal in 2018 provoked social media platforms to implement strict measures that prevented third-party users from gaining access to personal data without consent. They then enabled users to revoke app permissions, which gave users more control over their data to protect user privacy.
New routes to access data
We are at a point where either we accept that we cannot use or afford data like we used to, or we gather data outside official means (which falls into legal grey areas and almost always violates terms). We do not yet know what the ramifications are, as we are in uncharted territory.
However, in response to the current changes, we hold great interest in the new routes forming to access data, which appear to be more sustainable and affordable, and will protect users. For instance, new regulations coming into effect in the European Union, likely in 2024, aim to address this issue. The EU Digital Services Act (DSA) aims to give vetted researchers access to data from “very large online platforms”. Similarly, there are updates to GDPR Article 40. The details remain vague: there is not yet any understanding of who counts as a vetted researcher, the process for becoming one, the costs involved, the data and digital infrastructure needed, or the conditions for using such data. While all this remains unresolved, steps appear to be under way to level the playing field.
Brittany I. Davidson is an associate professor of analytics in Information, Decisions and Operations (IDO) in the School of Management; Joanne Hinds is an associate professor of information systems, both at the University of Bath. Daniel Racek is a doctoral candidate in statistics and machine learning at Ludwig Maximilians Universität, Munich (LMU Munich).