In the spirit of serving the public conversation, Twitter publishes datasets containing tweets and account history from accounts identified as being related to state-backed misinformation actors. The implication here is that these campaigns are backed by the People’s Republic of China.
This project aims to make the data accessible to the public to demonstrate examples of disinformation campaigns happening on social media today.
There's a lot of spam going on here. The top ~450 accounts tweet out 95% of the content in this dataset, much of which is spam or retweets of spammy content (e.g. recipe content, food videos).
The account with the highest volume, coolchefff666, mostly tweets about recipes. This is highly “retweetable” content, and it is an easy way to amass mindless followers.
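A concentration figure like the one above can be checked with a short cumulative count. This is a minimal sketch, not the code used for this project. It assumes you have one author identifier per tweet (plausibly the `userid` column of Twitter's release, which is an assumption here), and the function name is made up for illustration.

```python
from collections import Counter


def accounts_covering_share(authors, share=0.95):
    """Return how many of the most active accounts are needed
    to cover at least `share` of all tweets."""
    counts = Counter(authors)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk accounts from most to least active, accumulating their tweets.
    for n, (_, tweet_count) in enumerate(counts.most_common(), start=1):
        covered += tweet_count
        if covered >= share * total:
            return n
    return len(counts)


# Toy data: one entry per tweet, value = hypothetical author id.
toy_authors = ["a"] * 90 + ["b"] * 6 + ["c"] * 3 + ["d"]
print(accounts_covering_share(toy_authors, 0.95))  # 2
```

On the real data, the input would be the author column of the full tweet table.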
'Baby #Food #Recipes : #Beef Porridge with #Carrots http://t.co/ouclrZrhnw', 'Beef Wellington http://t.co/XNis6bFh', '#Vegetarian #sushi #recipe - deliciousness for Food Friday! http://t.co/ME3SzQpEer http://t.co/pUmjLnzR6g', 'How to Make Chocolate Truffles : Serving Tips for Homemade #Chocolate #Truffle #Recipe http://dld.bz/SbkS', 'Catching Up with Top Chef Winner Hung Huynh http://dld.bz/udWK', 'Authors at #Google: Matt Amsden and David Wolfe http://dld.bz/TSJj',
The account also intermittently comments on Chinese politics, including the China-US trade dispute and Guo Wengui, a Chinese businessman who has been the subject of a previous social media smear campaign by these same accounts (among others).
Accounts seem to spend much of their time trying to gain followers: #openfollow, #FF, and #PlsRetweet are among the most-used hashtags. #HongKong, #HK and #香港 are also on the list. This further points to a fairly unsophisticated strategy in which accounts aim to amass as many followers as possible to boost credibility in passing, without clear targeting.
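A hashtag ranking like this one can be approximated directly from tweet text. The sketch below is an illustration under assumptions, not the project's code: it extracts hashtags with a regular expression rather than a dedicated hashtag field, folds case so that `#FF` and `#ff` count together, and the function name is hypothetical.

```python
import re
from collections import Counter

# \w is Unicode-aware in Python 3, so tags such as #香港 are matched too.
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(tweets, n=10):
    """Count hashtags across tweet texts, case-insensitively."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)


sample = [
    "#FF #openfollow follow back!",
    "Good morning #ff",
    "#香港 #HK news",
]
print(top_hashtags(sample, 3))
```

A regex this simple will also pick up `#` fragments inside URLs, so a real run may want to strip links first.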
Many accounts tweet in 20+ languages. It is unlikely that they are organically engaging on Twitter in 20+ languages. Most seem to be retweeting spam from many sources.
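One way to find such accounts is to count distinct tweet languages per account. This is a hedged sketch: it assumes each tweet comes with an author id and a language code (the pairs below are invented), and it assumes `und` marks tweets whose language could not be determined, so those are skipped.

```python
from collections import defaultdict


def languages_per_account(rows, skip=("und", "", None)):
    """rows: iterable of (account, language_code) pairs, one per tweet.
    Returns {account: number of distinct languages tweeted in}."""
    langs = defaultdict(set)
    for account, lang in rows:
        if lang not in skip:
            langs[account].add(lang)
    return {account: len(codes) for account, codes in langs.items()}


def polyglot_accounts(rows, threshold=20):
    """Accounts tweeting in at least `threshold` distinct languages."""
    per_account = languages_per_account(rows)
    return sorted(a for a, n in per_account.items() if n >= threshold)


toy_rows = [("u1", "en"), ("u1", "zh"), ("u1", "und"), ("u2", "en"), ("u2", "en")]
print(languages_per_account(toy_rows))
print(polyglot_accounts(toy_rows, threshold=2))
```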
Many accounts self-report to be located in Canada, the United States, and Europe. Twitter is blocked in China, which may explain the noticeable lack of accounts claiming to be from China. Because people can claim to be located anywhere in the world, these accounts are probably reporting the locations they believe lend the highest credibility.
Some of these accounts have been active for a long time, and seem to have been vocal in earlier controversial political conversations on Twitter. There is significant activity from these accounts in 2016 and 2017, when democracy activists were organizing, notably in November 2016 (around the time of the Hong Kong pro-democracy march) and during the Xi Jinping protests in June 2017. This suggests that the network has been sowing discord on Twitter around Chinese political controversies for several years now.
Curiously, a significant share of the tweets in this spike are in Indonesian.
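To see which language dominates a given spike, tweets can be bucketed by month and language. The sketch below is illustrative only: it assumes timestamps are strings starting with `YYYY-MM`, and the rows, language codes, and function names are invented for the example.

```python
from collections import Counter, defaultdict


def monthly_language_counts(rows):
    """rows: iterable of (timestamp, language_code), with timestamps
    formatted like 'YYYY-MM-DD HH:MM'. Returns {month: Counter(lang)}."""
    by_month = defaultdict(Counter)
    for timestamp, lang in rows:
        by_month[timestamp[:7]][lang] += 1
    return dict(by_month)


def dominant_language(by_month, month):
    """Return (language, share of that month's tweets), or None if empty."""
    counts = by_month.get(month)
    if not counts:
        return None
    lang, n = counts.most_common(1)[0]
    return lang, n / sum(counts.values())


toy_rows = [
    ("2017-06-01 10:00", "in"),
    ("2017-06-02 11:00", "in"),
    ("2017-06-03 12:00", "en"),
    ("2017-07-01 09:00", "zh"),
]
by_month = monthly_language_counts(toy_rows)
print(dominant_language(by_month, "2017-06"))
```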
Attacking Miles Kwok (also known as Guo Wengui): These accounts appear to have been part of an ongoing smear campaign against controversial Chinese billionaire Guo Wengui, who has publicly accused the Chinese Communist Party of corruption and is currently exiled from China. He is the ninth most-tweeted entity by these accounts. His account has been suspended, though seemingly independently of this investigation.
For many accounts, there isn't a clear point where they turn into a propaganda-related account. It appears that accounts often tweet spammy content while occasionally commenting on Hong Kong politics, and also continue to push movie reviews and vague inspirational quotes (accounts in this category include LifeWord6, jidade0325, and laieuaet).
The averaged tweets-by-hour graph is stunning. There is seemingly a push to get content out early during the Chinese workday, and again to hit the evening social media wave around 9pm China Standard Time.
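An hour-of-day profile like this can be built by converting each timestamp to China Standard Time (UTC+8, no daylight saving) and counting per hour. This is a minimal sketch under assumptions: it treats timestamps as UTC strings in `YYYY-MM-DD HH:MM` form, which should be checked against the actual data, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# China Standard Time is a fixed UTC+8 offset.
CST = timezone(timedelta(hours=8))


def tweets_by_hour_cst(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return a 24-element list: tweet counts per hour of day in CST.
    Timestamps are assumed to be UTC strings in the given format."""
    hours = Counter()
    for ts in timestamps:
        utc_time = datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)
        hours[utc_time.astimezone(CST).hour] += 1
    return [hours.get(h, 0) for h in range(24)]


toy_times = ["2019-06-09 13:00", "2019-06-09 13:30", "2019-06-09 20:15"]
profile = tweets_by_hour_cst(toy_times)
print(profile[21], profile[4])  # 13:xx UTC is 21:xx CST; 20:15 UTC is 04:15 CST
```

Dividing each bucket by the number of days covered turns the raw counts into a per-day average.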
This data raised lots of unanswered questions for me personally, including:
Why is 45% of content in Indonesian, despite 96% of the accounts using Twitter in English or Simplified Chinese? A glance at the tweets in Indonesian suggests that they are mostly from bots.
What percentage of these accounts have been repurposed for undermining the Hong Kong protests? What does this process look like? The shadow market for spammy Twitter accounts is an area that I am not versed in, but there is probably further investigation to be done here.
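The mismatch in the first question, between the language of the tweets and the interface language of the accounts, can be computed as two separate shares: one over tweets and one over distinct accounts. The sketch below is illustrative; the row layout `(userid, account_language, tweet_language)` and the sample values are assumptions, not the dataset's confirmed schema.

```python
from collections import Counter


def share(values):
    """Return {value: fraction of all items} for an iterable of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}


def tweet_vs_account_language(rows):
    """rows: one (userid, account_language, tweet_language) per tweet.
    Returns (share of tweets by tweet language,
             share of accounts by account language)."""
    rows = list(rows)
    tweet_share = share(tweet_lang for _, _, tweet_lang in rows)
    # Keep one entry per account so prolific accounts are not over-counted.
    account_lang = {user: acc_lang for user, acc_lang, _ in rows}
    account_share = share(account_lang.values())
    return tweet_share, account_share


toy_rows = [
    ("u1", "en", "in"),
    ("u1", "en", "in"),
    ("u2", "zh-cn", "zh"),
    ("u3", "en", "en"),
]
tweet_share, account_share = tweet_vs_account_language(toy_rows)
print(tweet_share)
print(account_share)
```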
Things which didn't make it into this draft
Social network analysis: I broke down the “who tweets/retweets who” question (with text parsing rules which seem to capture a few more accounts than the Twitter field does) and mapped this out, but the viz has not been added.
Entity recognition: This worked to a medium degree of success on English data, and poorly on everything else, which wasn't surprising.
Things which didn't work include translating all the non-English tweets into English with the Google Translate API (lol), and clustering tweets using Multilingual Sentence Encoder embeddings.
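For the social network analysis mentioned above, a text-parsing rule for "who retweets whom" might look like the sketch below. The project's actual rules are not shown in this writeup, so this is only one plausible rule: it treats any tweet starting with `RT @name` as a retweet of `name` and counts directed edges. The pattern and function name are hypothetical.

```python
import re
from collections import Counter

# Twitter handles are up to 15 characters of letters, digits, or underscore.
RT_PATTERN = re.compile(r"^RT @([A-Za-z0-9_]{1,15})\b")


def retweet_edges(rows):
    """rows: iterable of (author_screen_name, tweet_text).
    Returns Counter of (retweeter, retweeted) edges parsed from text."""
    edges = Counter()
    for author, text in rows:
        match = RT_PATTERN.match(text.strip())
        if match:
            edges[(author, match.group(1))] += 1
    return edges


toy_rows = [
    ("a", "RT @b: beef wellington recipe"),
    ("a", "RT @b: chocolate truffles"),
    ("c", "hello @b, nice recipe"),
]
print(retweet_edges(toy_rows))
```

The resulting edge counts can be fed into a graph library for layout and visualisation.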
Other reporting on this
There's a gap in natural language processing tooling for some non-English languages. Many packages support English (and Chinese is gaining traction: spaCy doesn't have models for it, but there is a tokenizer out there). However, there was little infrastructure to support analysis of Indonesian content.
I'm not a journalist, professional researcher, or affiliated with any political entities. I'm a machine learning engineer with interests in misinformation, language, democracy, and our information ecosystem. I'm solely driven by curiosity and a desire to use my software skills to contribute to pressing problems of our time, misinformation being a key one. This project was a four-week endeavour undertaken during my batch at the Recurse Center, and is not currently under active development. There is lots of further work to be done.