Global Connectivity and Multilinguals in the Twitter Network


This article analyzes the global connectivity of the Twitter retweet and mentions network and the role of multilingual users engaging with content in multiple languages. The network is heavily structured by language with most mentions and retweets directed to users writing in the same language. Users writing in multiple languages are more active, authoring more tweets than monolingual users. These multilingual users play an important bridging role in the global connectivity of the network. The mean level of insularity from speakers in each language does not correlate straightforwardly with the size of the user base as predicted by previous research. Finally, the English language does play more of a bridging role than other languages, but the role played collectively by multilingual users across different languages is the largest bridging force in the network.

View article

The official version (paywall) of the article is in the ACM Digital Library, and an identical open-access copy is freely available here or through this open-access link.


Hale, S. A. (2014). Global Connectivity and Multilinguals in the Twitter Network. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’14, ACM (Toronto, Canada).

 author = {Hale, Scott A.},
 title = {Global Connectivity and Multilinguals in the Twitter Network},
 booktitle = {Proceedings of the SIGCHI Conference on Human Factors in Computing Systems},
 series = {CHI '14},
 year = {2014},
 isbn = {978-1-4503-2473-1},
 location = {Toronto, Ontario, Canada},
 pages = {833--842},
 numpages = {10},
 url = {},
 doi = {10.1145/2556288.2557203},
 acmid = {2557203},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {cross-language, information diffusion, information discovery, micro-blogs, multilingual, social media, social network analysis},


Data and code

Recording a Twitter stream

The data in the paper was recorded with Phirehose, a PHP library. The Twitter API has changed a number of times since the data was collected, and I now recommend the example code in twitter-python, which is written in Python and uses the Tweepy library.

The output of the above is one file per day, named in the format yyyy-mm-dd.json. The files are not strictly JSON: each line is a single JSON object representing one tweet (or another message from Twitter, e.g., a delete message).
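A minimal sketch of reading such a line-delimited file in Python follows. It is not the code used for the paper. It assumes that tweets carry top-level "text" and "user" keys and that other stream messages (such as delete notices) do not; the function names iter_tweets and read_day are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per tweet from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        # Assumption: a tweet has both "text" and "user"; delete notices
        # and other stream messages lack them and are dropped here.
        if isinstance(message, dict) and "text" in message and "user" in message:
            yield message


def read_day(path: str) -> Iterator[dict]:
    """Read one daily file, e.g. a file named in the yyyy-mm-dd.json format."""
    with open(path, encoding="utf-8") as handle:
        yield from iter_tweets(handle)
```

Because the file is read line by line, a full day of data never has to fit in memory at once.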

Map-reduce and network formation

The files produced by recording the stream with the above code were fed into two Hadoop map-reduce jobs. One job produced node information (username, languages used, number of tweets, etc.). The other job produced a network edge list of mentions, retweets, and replies. The outputs of both jobs were combined into a network in GraphML format using a standard Java class. This code is available on GitHub in the twitter-mapred repository.

In more detail, I used TweetLanguage to get a list of the languages each user wrote tweets in and the relative frequency with which each language was used. I then used UserMentions to get the total number of times users mention or reply to each other. The output of this second job was filtered using UserMentionsFilter to include only users who had language data from the TweetLanguage job. Finally, the outputs of the UserMentionsFilter and TweetLanguage jobs were combined into a network graph using a standard Java class. The resulting network is the starting network analyzed in the article.
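The logic of these three steps can be sketched in plain Python on a small in-memory sample. This is an illustration only, not the Hadoop code in the twitter-mapred repository. The simplified record layout (keys "user", "lang", and "mentions") and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def language_profile(tweets):
    """Map each user to {language: relative frequency of use}."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        counts[tweet["user"]][tweet["lang"]] += 1
    return {
        user: {lang: n / sum(c.values()) for lang, n in c.items()}
        for user, c in counts.items()
    }


def mention_counts(tweets):
    """Count how often each (source, target) pair occurs as a mention."""
    edges = Counter()
    for tweet in tweets:
        for target in tweet["mentions"]:
            edges[(tweet["user"], target)] += 1
    return edges


def filter_edges(edges, profiles):
    """Keep only edges whose two endpoints both have language data."""
    return {
        (source, target): n
        for (source, target), n in edges.items()
        if source in profiles and target in profiles
    }
```

The filtered edge dictionary and the per-user language profiles together hold the information needed for a weighted, directed network with language attributes on the nodes.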

Dataset and R/igraph analysis code

Twitter’s Terms of Service do not permit the full dataset to be shared; however, an anonymized network is available for download (coming soon). I performed the bulk of the analysis in R using igraph and generated figures with ggplot2. This code is quite messy but available nonetheless. The code requires density_functions.R to draw the density plots in Figure 1.

Parallelized community detection

Community detection was done using the label propagation method described by Raghavan et al. I parallelized the algorithm and ran it on multiple threads using Java. The code is available on GitHub.
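For readers unfamiliar with the method, here is a minimal single-threaded sketch of asynchronous label propagation in Python. It is not the parallel Java implementation used for the paper. It assumes an undirected, unweighted graph given as an adjacency dictionary in which every neighbour also appears as a key; the function name and the seed and max_sweeps parameters are choices made for this example.

```python
import random
from collections import Counter


def label_propagation(adjacency, seed=0, max_sweeps=100):
    """Return {node: community label} for an undirected graph.

    adjacency maps each node to an iterable of its neighbours.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # every node starts alone
    nodes = list(adjacency)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # visit nodes in a fresh random order each sweep
        changed = False
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[n] for n in neighbours)
            top = max(counts.values())
            best = [label for label, c in counts.items() if c == top]
            # Update only if the current label is not already one of the
            # most frequent neighbour labels; break ties at random.
            if labels[node] not in best:
                labels[node] = rng.choice(best)
                changed = True
        if not changed:
            break  # every node holds a label shared by most neighbours
    return labels
```

Nodes that end with the same label form one community. Results can differ between runs on real networks because of the random visiting order and tie-breaking, which is why the seed is exposed as a parameter.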

