Multilinguals and Wikipedia Editing


This article analyzes one month of edits to Wikipedia in order to examine the role of users editing multiple language editions (referred to as multilingual users). Such multilingual users may serve an important function in diffusing information across different language editions of the encyclopedia, and prior work has suggested this could reduce the level of self-focus bias in each edition. This study finds multilingual users are much more active than their single-edition (monolingual) counterparts. They are found in all language editions, but smaller-sized editions with fewer users have a higher percentage of multilingual users than larger-sized editions. About a quarter of multilingual users always edit the same articles in multiple languages, while just over 40% of multilingual users edit different articles in different languages. When non-English users do edit a second language edition, that edition is most frequently English. Nonetheless, several regional and linguistic cross-editing patterns are also present.

View article

The official version of the article is in the ACM Digital Library (paywalled); an identical open-access copy is freely available on arXiv or through this open-access link.


Hale, S. A. (2014). Multilinguals and Wikipedia Editing. In Proceedings of the 2014 ACM Conference on Web Science (WebSci '14), pp. 99–108. ACM.

@inproceedings{hale2014multilinguals,
	author = {Hale, Scott A.},
	title = {Multilinguals and {W}ikipedia Editing},
	booktitle = {Proceedings of the 2014 ACM Conference on Web Science},
	series = {WebSci '14},
	year = {2014},
	isbn = {978-1-4503-2622-3},
	location = {Bloomington, Indiana, USA},
	pages = {99--108},
	numpages = {10},
	doi = {10.1145/2615569.2615684},
	acmid = {2615684},
	publisher = {ACM},
	address = {New York, NY, USA},
	keywords = {cross-language, information diffusion, information discovery, multilingual, social media, social network analysis, Wikipedia}
}


Data and code

Data collection: Recording the Wikipedia IRC stream

I wrote the Wikipedia-IRC-Logger code in Java to monitor the IRC streams of Wikipedia updates and store them to text files: one file per day. After collecting the data and writing the paper, I also found WikiMon, a Python tool that monitors the IRC stream and broadcasts the edits over WebSocket.
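The logger's core behaviour (append each incoming IRC message to a file named for the current day) can be sketched as below. This is a minimal Python sketch, not the original Java logger; the `logs/` directory and the `YYYY-MM-DD.txt` filename pattern are assumptions.

```python
import datetime
import os


def log_path(ts: datetime.datetime, log_dir: str = "logs") -> str:
    """Return the per-day log file for a message received at time ts.

    One file per day, named by date; the exact pattern is illustrative.
    """
    return os.path.join(log_dir, ts.strftime("%Y-%m-%d") + ".txt")


def append_line(line: str, ts: datetime.datetime, log_dir: str = "logs") -> None:
    """Append one raw IRC message to the day's log file."""
    os.makedirs(log_dir, exist_ok=True)
    with open(log_path(ts, log_dir), "a", encoding="utf-8") as f:
        f.write(line + "\n")
```

Writing one file per day keeps each file a manageable size and makes it easy to process the collection day by day later.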

Data cleaning and analysis

I then processed the data files to group all edits by the same username together and compute other per-user statistics. The most important file in this code is the line parser, which converts a line from the IRC file dumps stored above into a Java business object representing one edit. I used this parser within the Hadoop framework to parse all lines of the IRC files and group them by username. The Wikipedia-IRC-MapReduce repository contains all the code I used with Hadoop.
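A minimal Python sketch of such a line parser is below, assuming the simplified field layout `[[title]] flags url * user * (+diff) comment` that remains once mIRC colour codes are stripped from the feed. The original parser is a Java class; the `Edit` class, field names, and regular expression here are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed line shape after colour-code stripping; the live feed wraps
# each field in mIRC colour escapes.
LINE_RE = re.compile(
    r"\[\[(?P<title>.*?)\]\]\s+(?P<flags>\S*)\s+(?P<url>\S*)\s+"
    r"\*\s+(?P<user>.*?)\s+\*\s+\((?P<diff>[+-]?\d+)\)\s*(?P<comment>.*)"
)


@dataclass
class Edit:
    """One edit event from the recent-changes feed."""
    title: str
    flags: str
    url: str
    user: str
    diff: int
    comment: str


def parse_edit(line: str) -> Optional[Edit]:
    """Parse one feed line into an Edit, or None if it does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    return Edit(d["title"], d["flags"], d["url"], d["user"],
                int(d["diff"]), d["comment"])
```

Returning `None` for non-matching lines lets the grouping job simply skip join messages and other non-edit traffic on the channel.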

The output of the Hadoop jobs was pre-processed with a simple Python script (coming soon) to produce a dataframe, which was loaded into R, where I performed all of the analysis (R code coming soon).
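The shape of that pre-processing step can be sketched as follows: collapse per-edit records into one row per user, suitable for loading into R. The `(user, edition, title)` tuple input and the column names are assumptions for illustration, not the paper's actual schema.

```python
import csv
from collections import defaultdict


def user_rows(edits, out_path):
    """Collapse (user, edition, title) edit records into one CSV row per user.

    Columns: total edit count, number of distinct editions edited, and a
    0/1 multilingual flag (edited more than one edition).
    """
    editions = defaultdict(set)
    counts = defaultdict(int)
    for user, edition, _title in edits:
        editions[user].add(edition)
        counts[user] += 1
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["user", "n_edits", "n_editions", "multilingual"])
        for user in sorted(counts):
            n_ed = len(editions[user])
            w.writerow([user, counts[user], n_ed, int(n_ed > 1)])
```

The resulting CSV can be read directly into an R dataframe with `read.csv`.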

As described in the paper, the R code filters the dataset to remove users with no break in editing (this captures undeclared bots as well as users with only one editing burst) and then creates a list of usernames that edited multiple editions. These usernames were checked against the global account manager to ensure they really were global accounts linked across the language editions they edited. The list of non-global accounts was fed back into the Hadoop code mentioned earlier in order to separate out common usernames that edited multiple editions but were not linked global accounts.
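The "no break in editing" filter can be sketched as a simple gap test over a user's edit timestamps: keep only users who have at least one sufficiently long pause between consecutive edits. This is a Python sketch of the idea (the actual filter is in R); the one-hour threshold is illustrative, not the paper's value.

```python
def has_editing_break(timestamps, min_gap_seconds=3600):
    """True if any gap between consecutive edits exceeds min_gap_seconds.

    Users with no such gap (continuous editing, or a single burst) are
    filtered out as likely undeclared bots or one-off bursts; the
    threshold here is an illustrative assumption.
    """
    ts = sorted(timestamps)
    return any(b - a > min_gap_seconds for a, b in zip(ts, ts[1:]))
```

A user with only one edit (or an empty timestamp list) has no gaps at all, so the function correctly returns `False` and the user is filtered out.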
