The Hong Kong Protests: A Twitter Story

This is supplementary commentary for the Hong Kong Twitter disinformation dashboard.
﻿
In the spirit of serving the public conversation, Twitter publishes datasets containing tweets and account history from accounts identified as being related to state-backed misinformation actors. The implication here is that these campaigns are backed by the People’s Republic of China.
﻿
This project aims to make the data accessible to the public to demonstrate examples of disinformation campaigns happening on social media today.
﻿
Observations
There's a lot of spam going on here. The top ~450 accounts tweet out 95% of the content in this dataset, much of which is spam or retweets of spammy content (e.g. recipe content, food videos).
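A concentration figure like the one above can be checked by counting tweets per account and taking the share contributed by the most active N. The sketch below is a minimal illustration, not the dashboard's actual code: the function name top_account_share and the toy account IDs are made up for the example, and in the real dataset the input would come from whatever column identifies the posting account.

```python
from collections import Counter


def top_account_share(account_ids, top_n):
    """Return the fraction of all tweets posted by the top_n most active accounts.

    account_ids holds one entry per tweet, naming the account that posted it
    (a hypothetical input shape, assumed for this sketch).
    """
    counts = Counter(account_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(top_n) yields (account, tweet_count) pairs, busiest first.
    top_total = sum(n for _, n in counts.most_common(top_n))
    return top_total / total


# Toy data: account "a" posts 8 of the 10 tweets.
tweets = ["a"] * 8 + ["b", "c"]
print(top_account_share(tweets, top_n=1))  # 0.8
```

Running the same function with top_n=450 over the full list of per-tweet account IDs would give the share quoted above, assuming the dataset is loaded the same way.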
﻿
The account with highest volume, coolchefff666, mostly tweets about recipes—this is highly “retweetable” content, and it is an easy way to amass mindless followers.
﻿
'Baby #Food #Recipes : #Beef Porridge with #Carrots http://t.co/ouclrZrhnw',
 'Beef Wellington http://t.co/XNis6bFh',
 '#Vegetarian #sushi #recipe - deliciousness for Food Friday! http://t.co/ME3SzQpEer http://t.co/pUmjLnzR6g',
 'How to Make Chocolate Truffles : Serving Tips for Homemade #Chocolate #Truffle #Recipe http://dld.bz/SbkS',
 'Catching Up with Top Chef Winner Hung Huynh http://dld.bz/udWK',
 'Authors at #Google: Matt Amsden and David Wolfe http://dld.bz/TSJj',﻿
The account also intermittently comments on Chinese politics, including the China-US trade dispute and Guo Wengui, a Chinese businessman who has been the subject of a previous social media smear campaign by these exact accounts (amongst others).
﻿
中国自古以来就有“两国交兵不斩来使”的战争道义，但特朗普总统无视刘鹤副总理远道而去的谈判诚意，反而用粗暴加税的方式给中国来了一个“下马威”。作为大国外交，一举一动都是承诺，都会被世界人民看在眼里、记在心上。我们相信，美国人民更会洞若观火，中美新一轮贸易纠纷究竟谁是谁非自有公断。﻿
闹剧被拆穿无数的郭文贵，最近编造谎言确实不太走心，不仅创意是烂的可以，内容也是极为可笑，空剩下一副蹭热点的激情，但苦于军师不足，结果漏洞百出，五月的黑色依旧延续，估计郭文贵的六月同样也是黑色。﻿
Accounts seem to spend much of their time trying to gain followers: #openfollow, #FF, and #PlsRetweet are among the most-used hashtags. #HongKong, #HK and #香港 are also on the list. This further points to a fairly unsophisticated strategy in which accounts aim to amass as many followers as possible to look credible at a glance, without clear targeting.
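For readers who want to reproduce a hashtag ranking like this from raw tweet text, here is a rough sketch. It is an assumption about how such a count could be done, not the method used for the dashboard; the function name count_hashtags is hypothetical, and the dataset may already ship a parsed hashtag field that would be simpler to use.

```python
import re
from collections import Counter

# In Python 3, \w matches Unicode word characters, so tags like #香港 are captured.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(tweet_texts):
    """Count hashtags across tweets, folding case so #HK and #hk are one tag."""
    counts = Counter()
    for text in tweet_texts:
        for tag in HASHTAG.findall(text):
            counts[tag.lower()] += 1
    return counts


sample = [
    "Baby #Food #Recipes : #Beef Porridge with #Carrots",
    "#HongKong #HK",
    "news #hongkong #香港",
]
for tag, n in count_hashtags(sample).most_common(3):
    print(tag, n)
```

Case folding is a design choice here: it merges #HongKong and #hongkong, which is usually what a "top hashtags" list wants.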
Many accounts tweet in 20+ languages. It is unlikely that they are organically engaging on Twitter in 20+ languages. Most seem to be retweeting spam from many sources.
Many accounts self-report locations in Canada, the United States, and Europe. Twitter is banned in China, which may explain the noticeable lack of accounts claiming to be from China. Self-reported locations are unverified, of course, so these accounts may simply be reporting the locations they believe lend the most credibility.
Some of these accounts have been active for a long time, and seem to have been vocal in previous controversial political conversations on Twitter. There is significant activity from these accounts in 2017 when democracy activists were organizing, suggesting that this network has been active in sowing Twitter discord around Chinese political controversies for several years now, notably in November 2016 (around the time of the Hong Kong pro-democracy march) and during the Xi Jinping protests in June 2017.
Curiously, a significant share of the tweets in this spike are in Indonesian.
﻿
Late 2017 saw the imprisonment of three Hong Kong democracy activists. Several high-profile Western media outlets reported on the ongoing Hong Kong-China tensions in September 2018.
﻿
Attacking Miles Kwok (also known as Guo Wengui): These accounts appear to have been part of an ongoing smear campaign against controversial Chinese billionaire Guo Wengui, who has publicly accused the Chinese Communist Party of corruption and is currently exiled from China. He is the ninth most-tweeted entity by these accounts. His account has been suspended, though seemingly independently of this investigation.
For many accounts, there isn't a clear point where they turn into a propaganda-related account. It appears that accounts often tweet spammy content while occasionally commenting on Hong Kong politics, and continue to push movie reviews and vague inspirational quotes (accounts in this category include LifeWord6, jidade0325, and laieuaet).
The averaged tweets-by-hour graph is stunning. There is seemingly a push to get content out early during the Chinese workday, and again to hit the evening social media wave around 9pm China Standard Time.
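A tweets-by-hour profile like this one depends on shifting timestamps into China Standard Time (UTC+8) before bucketing. The sketch below shows one way to do that with the standard library. It assumes timestamps arrive as UTC strings in a "YYYY-MM-DD HH:MM" format; both that format and the function name tweets_by_cst_hour are assumptions for illustration, and the output is raw counts rather than the per-day average shown on the dashboard.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# China Standard Time is a fixed UTC+8 offset with no daylight saving.
CST = timezone(timedelta(hours=8))


def tweets_by_cst_hour(utc_timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return a 24-element list of tweet counts, indexed by hour of day in CST."""
    counts = Counter()
    for raw in utc_timestamps:
        # Parse as naive, mark as UTC (assumed), then convert to CST.
        utc_time = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        counts[utc_time.astimezone(CST).hour] += 1
    return [counts.get(hour, 0) for hour in range(24)]


histogram = tweets_by_cst_hour(["2019-06-09 13:00", "2019-06-09 01:30"])
print(histogram[21], histogram[9])  # 13:00 UTC is 21:00 CST; 01:30 UTC is 09:30 CST
```

Dividing each bucket by the number of distinct days in the data would turn these counts into an averaged profile.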
﻿
This overlay is unexpectedly clear.
﻿
Further questions
This data raised lots of unanswered questions for me personally, including:
Why is 45% of content in Indonesian, despite 96% of the accounts using Twitter in English or Simplified Chinese? A glance at the tweets in Indonesian suggests that they are mostly from bots.
What percentage of these accounts have been repurposed for undermining the Hong Kong protests? What does this process look like? The shadow market for spammy Twitter accounts is an area that I am not versed in, but there is probably further investigation to be done here.
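The mismatch in the first question comes from comparing two distributions: the language of each tweet versus the interface language of each account. A minimal sketch of that comparison follows. The function name language_shares and the toy language codes are placeholders, and which dataset fields hold the tweet language and account language is an assumption to verify against the data.

```python
from collections import Counter


def language_shares(language_codes):
    """Map each language code to its fraction of the input, most common first."""
    counts = Counter(language_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.most_common()}


# Toy data: one language code per tweet, then one per account.
tweet_languages = ["id", "id", "en", "zh"]
account_languages = ["en", "zh", "zh", "en"]

print(language_shares(tweet_languages))
print(language_shares(account_languages))
```

Feeding the per-tweet codes and the per-account codes through the same function makes the gap between the two easy to see side by side.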
﻿
Things which didn't make it into this draft
Social network analysis: I broke down the “who tweets/retweets who” question (with text parsing rules which seem to capture a few more accounts than the Twitter field does) and mapped this out, but the viz has not been added. 
Entity recognition: This worked to a medium degree of success on English data, and poorly on everything else, which wasn't surprising.
Things which didn't work include translating all the non-English tweets into English with the Google Translate API (lol) and tweet clustering with the Multilingual Sentence Encoder plus clustering methods.
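To give a flavour of the "who tweets/retweets who" breakdown mentioned under social network analysis, here is a sketch of text-based retweet parsing. The actual parsing rules used in this project are not shown in this write-up, so the regular expression below is only a guess at one common convention ("RT @handle" at the start of a tweet), and the function name retweet_edges is hypothetical.

```python
import re
from collections import Counter

# Old-style retweets begin with "RT @handle"; handles are letters, digits, underscores.
RETWEET = re.compile(r"^RT @([A-Za-z0-9_]+)")


def retweet_edges(tweets):
    """Count (author, retweeted_handle) pairs from (author, text) tuples."""
    edges = Counter()
    for author, text in tweets:
        match = RETWEET.match(text)
        if match:
            # Handles are case-insensitive, so fold case before counting.
            edges[(author, match.group(1).lower())] += 1
    return edges


sample = [
    ("account_a", "RT @Bob: first"),
    ("account_a", "RT @bob: second"),
    ("account_c", "hello @bob"),  # a mention, not a retweet
]
print(retweet_edges(sample))
```

The resulting weighted edge list can be loaded into any graph tool to map out the network.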
﻿
Other reporting on this 
The Australian Strategic Policy Institute did an excellent deep dive into this dataset as well, and Quartz reported on their findings.
﻿
Other notes
There's a gap in methodology for understanding some non-English languages in the natural language processing world. Many packages support English (and Chinese is gaining traction: spaCy doesn't have models for it, but there is a tokenizer out there). However, there was little infrastructure to support analysis of Indonesian content.
I'm not a journalist, professional researcher, or affiliated with any political entities. I'm a machine learning engineer with interests in misinformation, language, democracy, and our information ecosystem. I'm solely driven by curiosity and a desire to use my software skills to contribute to pressing problems of our time, misinformation being a key one. This project was a four-week endeavour undertaken during my batch at the Recurse Center, and is not currently under active development. There is lots of further work to be done.
