Abstract
This article analyzes one month of edits to Wikipedia in order to examine the role of users editing multiple language editions (referred to as multilingual users). Such multilingual users may serve an important function in diffusing information across different language editions of the encyclopedia, and prior work has suggested this could reduce the level of self-focus bias in each edition. This study finds multilingual users are much more active than their single-edition (monolingual) counterparts. They are found in all language editions, but smaller-sized editions with fewer users have a higher percentage of multilingual users than larger-sized editions. About a quarter of multilingual users always edit the same articles in multiple languages, while just over 40% of multilingual users edit different articles in different languages. When non-English users do edit a second language edition, that edition is most frequently English. Nonetheless, several regional and linguistic cross-editing patterns are also present.
View article
The official version (paywall) of the article is in the ACM Digital Library, and an identical open-access copy is freely available on arXiv or through this open-access link.
Citation
Hale, S. A. (2014). Multilinguals and Wikipedia Editing. In Proceedings of the 6th Annual ACM Web Science Conference, WebSci ’14, ACM.
@inproceedings{hale-websci2014,
  author = {Hale, Scott A.},
  title = {Multilinguals and {W}ikipedia Editing},
  booktitle = {Proceedings of the 2014 ACM Conference on Web Science},
  series = {WebSci '14},
  year = {2014},
  isbn = {978-1-4503-2622-3},
  location = {Bloomington, Indiana, USA},
  pages = {99--108},
  numpages = {10},
  url = {http://doi.acm.org/10.1145/2615569.2615684},
  doi = {10.1145/2615569.2615684},
  acmid = {2615684},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {cross-language, information diffusion, information discovery, multilingual, social media, social network analysis, Wikipedia}
}
Presentation
Data and code
Data collection: Recording the Wikipedia IRC stream
I wrote the Wikipedia-IRC-Logger code in Java to monitor the IRC streams for Wikipedia updates and store them in text files, one file per day. After collecting the data and writing the paper, I also found WikiMon, a Python tool that monitors the IRC stream and broadcasts the edits over WebSocket.
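The one-file-per-day storage can be illustrated with a minimal Python sketch. This is not the Wikipedia-IRC-Logger code (which is written in Java); the function names, the UTC day boundary, and the `YYYY-MM-DD.txt` naming scheme are all assumptions made for illustration, and the IRC connection itself is left out.

```python
from datetime import datetime, timezone
from pathlib import Path


def daily_log_path(log_dir: Path, when: datetime) -> Path:
    """Return the log file for the UTC day containing `when`.

    The YYYY-MM-DD.txt naming scheme is hypothetical.
    """
    return log_dir / f"{when.astimezone(timezone.utc):%Y-%m-%d}.txt"


def append_message(log_dir: Path, when: datetime, message: str) -> Path:
    """Append one raw IRC message to that day's file, creating it if needed."""
    path = daily_log_path(log_dir, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        # Store exactly one message per line, whatever line ending IRC sent.
        handle.write(message.rstrip("\r\n") + "\n")
    return path
```

A logger would call `append_message` once per received message with a timezone-aware receive time, so the file changes automatically when the day changes.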
Data cleaning and analysis
I then processed the data files to group all edits by the same username together and compute other statistics per user. The most important file in this code is RecentChange.java, which parses a line from the IRC log files stored above into a Java business object representing one edit. I used this within the Hadoop framework to parse all lines of the IRC files and group them by username. The Wikipedia-IRC-MapReduce repository has all the code I used with Hadoop.
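The parse-then-group step can be sketched in Python as follows. This is not a port of RecentChange.java or of the MapReduce job. It assumes that, once IRC colour codes are stripped, a line looks like `[[Title]] flags url * user * (+diff) comment`, and that the language edition is known from the channel the line was logged from. The `Edit` class and both function names are hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# mIRC colour codes (\x03 plus optional digits) and other formatting characters.
_CONTROL = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?|[\x02\x0f\x16\x1d\x1f]")

# Assumed layout after stripping: [[Title]] flags url * user * (+diff) comment
_LINE = re.compile(
    r"^\[\[(?P<title>.*?)\]\]\s+(?P<flags>\S*)\s+(?P<url>\S*)\s+\*\s+"
    r"(?P<user>.*?)\s+\*\s+(?:\((?P<diff>[+-]?\d+)\)\s*)?(?P<comment>.*)$"
)


@dataclass(frozen=True)
class Edit:
    edition: str          # e.g. "en", taken from the channel name (assumption)
    title: str
    flags: str
    url: str
    user: str
    diff: Optional[int]   # change in size, if the line reports one
    comment: str


def parse_line(edition: str, raw: str) -> Optional[Edit]:
    """Parse one logged line into an Edit, or return None if it does not match."""
    match = _LINE.match(_CONTROL.sub("", raw).strip())
    if match is None:
        return None
    diff = match.group("diff")
    return Edit(edition, match.group("title"), match.group("flags"),
                match.group("url"), match.group("user"),
                int(diff) if diff is not None else None,
                match.group("comment"))


def group_by_user(edits):
    """Group edits by username, as the reduce step of a MapReduce job would."""
    grouped = defaultdict(list)
    for edit in edits:
        grouped[edit.user].append(edit)
    return dict(grouped)
```

In a Hadoop setting the parsing would sit in the mapper, which emits the username as the key, and the grouping would happen in the shuffle before the reducer. The in-memory dictionary here only stands in for that step.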
The output of the Hadoop job was pre-processed with a simple Python script (coming soon) to produce a data frame that was loaded into R, where I performed all of the analysis (R code coming soon).
As described in the paper, the R code filters the dataset to remove users with no break in editing (this includes undeclared bots as well as users with a single editing burst) and then creates a list of usernames that edit multiple editions. These usernames were checked against the global account manager to ensure they really were global accounts connected across the language editions they edited. The list of non-global accounts was fed back into the Hadoop code mentioned earlier in order to separate common usernames that edited multiple editions but were not linked global accounts.
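The filtering logic can be sketched in Python, although the actual analysis was done in R and that code is not reproduced here. The sketch takes edits as `(username, edition, timestamp)` tuples. The `MIN_BREAK` threshold is a placeholder and not the rule used in the paper, and the set of non-global usernames is assumed to have been obtained separately from the global account check. All function names are hypothetical.

```python
from collections import defaultdict
from datetime import timedelta

# Placeholder threshold for what counts as a "break in editing" (assumption).
MIN_BREAK = timedelta(hours=6)


def has_break(timestamps, min_break=MIN_BREAK):
    """True if any two consecutive edits are at least `min_break` apart."""
    ordered = sorted(timestamps)
    return any(later - earlier >= min_break
               for earlier, later in zip(ordered, ordered[1:]))


def multilingual_users(edits, min_break=MIN_BREAK):
    """Return {username: editions} for users with a break who edit 2+ editions."""
    times = defaultdict(list)
    editions = defaultdict(set)
    for user, edition, when in edits:
        times[user].append(when)
        editions[user].add(edition)
    return {user: editions[user] for user in times
            if has_break(times[user], min_break) and len(editions[user]) >= 2}


def split_non_global(edits, non_global):
    """Re-key edits so each username in `non_global` counts as one user per edition."""
    for user, edition, when in edits:
        key = f"{user}@{edition}" if user in non_global else user
        yield key, edition, when
```

Running `multilingual_users` a second time on the output of `split_non_global` drops the usernames that merely coincide across editions, which mirrors the feedback step described above.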
Related work
- Hale, S. A. (2014). Global Connectivity and Multilinguals in the Twitter Network. In Proceedings of the 32nd International Conference on Human Factors in Computing Systems, CHI ’14, ACM.
- Hale, S. A. (2015). Cross-language Wikipedia Editing of Okinawa, Japan. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’15, ACM.