Featured in Guardian article on language research

A recent article in the Guardian newspaper by Holly Young surveys a range of research on online language divides, including my work on multilingualism and cross-language bridging:

Translation technologies offer one solution to bridging online language divides, while also opening up new markets for businesses. Although currently only available in a few languages, last year Microsoft launched the Skype translator, and both Facebook and Twitter have also paired up with Bing to offer users translation services.

Scott Hale, data scientist at the Oxford Internet Institute, argues that more could also be done to unlock the power of multilinguals online. Internet platforms he believes could be modified to make it easier for multilingual users to find content in other languages, as well as encourage them to contribute in more than one language. “Many review sites, such as TripAdvisor and Google Play, prioritise reviews in a person’s selected user-interface language or even completely hide reviews not in the user-interface language,” says Hale. Platforms like Wikipedia, he says, could allow you to search a topic in multiple language editions at the same time.

Hale also found that although only 11% of people are multilingual on Twitter, and 15% on Wikipedia, these multilingual individuals are more active, writing more tweets and creating and editing more Wikipedia content. These people, he believes, could potentially challenge the Balkanisation of information and discussion online. Whether it is translating and bringing foreign concepts into different language editions on Wikipedia, or moving breaking local news stories to new language communities and different geographies, they have the power to be influential.

Full article is available at http://labs.theguardian.com/digital-language-divide/. Some of my research is covered in the section on “Bridging the divide”.

Posted in multilingual, news only, OII, Wikipedia

Design for multilinguals: Seemingly simple yet often missed

As I prepare my slides for CHI 2014, I’m struck by one implication I give for the research I will present on language and Twitter, “Allow each user to have a set of multiple preferred languages;” or, more simply:

consider bilingual and multilingual users when designing platforms

This seems super-simple and obvious as a bullet point on my slides. However, a long list of well-known platforms readily illustrates that this insight is often overlooked. Since the confines of my presentation won’t allow me to look at this in depth, I thought I would write here about one common product which I feel could be designed better for multilingual users: the Google Play Store. It is far from the only such platform, but it is a well-known one that I use often.


Consider the app shown on the left, which I recently installed on my tablet. This is a screenshot of the app as it appears in the Google Play Store on my tablet with my user-interface language set to English.

Play tells me it has 10,000+ downloads and 3.6 stars based on 27 reviews (highlighted with a red box I added at the top). However, where are those 27 reviews? There is absolutely no indication of how I could read them. The web interface is even more confusing: it invites me to “Be the first to review this application.” How could I be the first to review the application if it already has 27 reviews?

As the astute reader might guess, the missing reviews are written in Japanese, which is perfectly fine since I read Japanese (and even if I didn’t I could try machine translation). However, in order to see these reviews I must switch the user-interface language of my entire tablet (or change the Accept-Language header sent by my browser after logging out of my Google account and clearing all cookies). I also need to know somehow that the missing reviews are in Japanese. If I want to check whether a Korean user has reviewed the app, I have to change the entire user-interface language of my tablet yet again. If I don’t want all my other apps and the system software itself to be in Japanese, I then have to change the user-interface language back to English after reading the reviews.
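The browser language setting mentioned above is sent to websites as the HTTP Accept-Language header, which can already carry an ordered list of languages with weights rather than a single choice. As a minimal sketch of what a platform could do with it, the function below (a hypothetical helper written for this post, not part of any Google API) turns a header value into a ranked list of preferred languages:

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Return (language tag, weight) pairs, highest weight first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        weight = 1.0  # a tag without an explicit q-value counts as weight 1
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat an unreadable weight as "not wanted"
        prefs.append((tag.strip(), weight))
    # sorted() is stable, so equally weighted tags keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


# Example header for a reader of English, Japanese and Korean
prefs = parse_accept_language("en-GB,en;q=0.9,ja;q=0.8,ko;q=0.5")
```

Here `prefs` lists English first, then Japanese, then Korean, which is exactly the kind of multi-language preference a review page could honour.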

The image below and to the right is a screenshot of the app in the Google Play store after I’ve changed the user-interface language of my tablet to Japanese. Note that all the text that was previously in English (Apps, Open, Uninstall) is now in the new user-interface language of Japanese. The top area of the information is the same (3.6 stars, 10,000+ downloads), but now a new section with Reviews (レビュー) is given (which I’ve put a second red box around for emphasis).

On the one hand, Google Play is fully internationalized and localized: menus are translated, the design accommodates right-to-left languages, etc. The difficulty lies in how user-generated content is handled, specifically user reviews of apps. More niche apps are often reviewed in only one language in Google Play, yet reviews are grouped by the user-interface language settings of users’ devices. This means that a user cannot see reviews of an app in another language without temporarily changing the user-interface language of the entire Android operating system, a process that takes several minutes and affects all apps on the device.

Confusingly, apps are rated with a number of stars (1-5) averaged across reviews from all languages. This can lead to the rather odd case discussed above where the English interface shows an average rating of 3.6 stars from 27 user reviews, but then gives no indication of how the user could see these reviews. This can be particularly frustrating for multilingual users who could read the reviews in another language if the option were provided. Furthermore, some users write a review in a language different from their user-interface language (e.g., a user with a Japanese UI language writing a review of an app in English). These users may think they are helping potential users in the other language, but in fact those users are very unlikely to see the review, as it remains accessible only to users with the same user-interface language selection as the author of the review.
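To make the alternative concrete, here is a minimal sketch of review selection driven by a set of preferred languages. It is an illustration only, not how Google Play works: the `Review` class and both function names are hypothetical, and it assumes each review carries the language it is actually written in (for example from automatic language detection) rather than the author's interface language.

```python
from dataclasses import dataclass


@dataclass
class Review:
    language: str  # language the review text is written in, e.g. "ja"
    stars: int
    text: str


def visible_reviews(reviews, preferred_languages):
    """Return reviews in any preferred language, most preferred language first."""
    rank = {lang: i for i, lang in enumerate(preferred_languages)}
    matching = [r for r in reviews if r.language in rank]
    # sorted() is stable, so reviews within one language keep their order
    return sorted(matching, key=lambda r: rank[r.language])


def hidden_review_counts(reviews, preferred_languages):
    """Count reviews per language that are not shown, so the UI can say they exist."""
    counts = {}
    for r in reviews:
        if r.language not in preferred_languages:
            counts[r.language] = counts.get(r.language, 0) + 1
    return counts


# Made-up reviews, for illustration only
reviews = [
    Review("ja", 4, "便利です"),
    Review("ja", 3, "普通"),
    Review("en", 5, "Works well"),
    Review("ko", 4, "좋아요"),
]
shown = visible_reviews(reviews, ["en", "ja"])
hidden = hidden_review_counts(reviews, ["en", "ja"])
```

With preferences of English then Japanese, `shown` holds the English review followed by the two Japanese ones, and `hidden` reports one Korean review, which is enough for an interface to offer a "1 more review in Korean" link instead of silently hiding it.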

So, while the bullet point on my CHI slides to “design with multilingual users in mind” seems obvious and simple, there are many common platforms that do not follow this advice (Google Play is certainly not alone). This is particularly surprising given “multilingualism…[is] the norm for most of the world’s societies,” with over half of Europe and over a fifth of the US multilingual. My work on language and Twitter and Wikipedia shows that a non-trivial percentage of users on both platforms engage in multiple languages.

As many human-computer interaction researchers gather this week for CHI in a country with a strong tradition of multilingualism, I hope that those envisioning new platforms or redesigning existing platforms will consider multilingual users specifically in their designs.

Posted in design, multilingual, OII

Interactive map of Twitter mentions in geotagged tweets

I recently had the pleasure of building my first interactive map visualization using Leaflet with Joshua R. Melville and the Floatingsheep team, who have written more about the methodology. I’m drafting more about developing the visualization itself, but in the meantime thought I would simply share the results:

It is important to interpret this simply as showing where Twitter users who geotag their tweets mention different UK football teams. Twitter users are not representative of all humanity, and users who geotag are a small fraction (about 1%) of all Twitter users, so care is needed not to over-interpret the data. There was a very limited period of time to develop this, so some more advanced techniques to determine location and analyze sentiment were not employed, but I think the visualization suggests exciting possibilities for the future as users increase and data analysis techniques improve.

For those interested, the code behind the visualization is available freely on GitHub (CC BY-NC-SA license).

Posted in OII, Visualizations

Interactive Maps

Update: 6 November 2012 – US map featured in the Guardian.

I’ve not blogged for a while on this site, because I’ve been doing lots of blogging on the InteractiveVis project site. InteractiveVis is a project to create easy-to-use tools to build HTML5 interactive visualizations. These tools will be public very soon, but in the meantime, please enjoy some demonstrations of the types of visualizations users of the tools can create. The last demo, using mentions of US Presidential candidates on Twitter, is particularly timely (it remains to be seen how accurate!).

Demo 1: Visualization of followers at @OIITwitter

Background information about this visualization

Demo 2: Visualization of UK Central Government

Demo 3: Visualization of Literacy and Gender

Background information about this visualization

Demo 4: US Elections on Twitter
Background information about this visualization

Posted in OII, Visualizations

Language Bubbles

Eli Pariser has raised awareness of the role that personalization algorithms play in filtering and ranking results on the web. I think this work is very important, but another strand that is seemingly obvious yet surprisingly understudied is the role that language plays. A user searching for content by keyword on most services is likely to find only content written, tagged, or annotated in the language the user employs. This may make sense for some items, but for other content, say images, it is really an unexpected by-product of how the content is tagged and indexed.

I wrote an article looking at the role of language online, specifically with an example of image search on the Free Speech Debate website. The article initially had several comparisons of queries in Google Image search, but all but one were edited out. I’m including all of the images below and encourage you to check out the post as well.

I also gave a very general, accessible talk (only 10 minutes!) on the idea at a panel discussion for St Antony’s International Review earlier in the year. With some help from Kdenlive, I have now been able to edit the video and place this online as well.

Posted in multilingual, OII

Recent contacts working on cross-language problems

I’ve recently been able to meet some spectacular individuals who are working on various aspects of cross-language communication. This blog post won’t do justice to all of their work, so please click through to their websites and learn more.

  • Irene Eleta Mogollon at the University of Maryland College of Information Studies (iSchool) and the Human-Computer Interaction Lab (HCIL). Her work has looked at multilingual social tagging of museums’ image collections and her PhD research is about multilingual communication on Twitter.
  • Chris Salzberg is working on a tool, Cojiro, for cross-language curation that allows users to collect references/content around a particular topic and translate the relevant/interesting parts of that content. He’s focusing on the Japanese–English space, but the tool
  • John Dalton, a master’s student at the Computer Science Department of the University of Oxford supervised by Phil Blunsom, is completing fascinating research trying to develop an approach to identify human translations of content using techniques and theory from machine translation. I think this is a huge area of future research. If a (good) human translation of content has already been made, I would like Google Translate/Chrome to be able to identify that translation and recommend it to me in place of a machine translation. The work is also important for developing corpora of more informal writing to improve the performance of machine translation algorithms on general web text.
  • Clare Wardle is at Storyful, which works with professional news clients to identify and verify legitimate news “from the noise of the real-time web, 24/7”. A key part of identifying and verifying content deals with language issues.
  • An Xiao, last but not least, is a design strategist, researcher, and artist. She co-founded a Chinese-to-English Twitter translation site with nearly 10,000 followers and a dozen contributing members and also has some blog entries touching on cross-language issues such as this post noting similarities between topics trending in two different languages on Twitter.

Finally, a quick self-promotion (and thank you to those who voted!) to say that the Interactive Visualization project I blogged about earlier was successful in receiving funding and is plowing ahead. I hope to demo some very cool interactive visualizations very soon. We just need to get the final user-interface graphics into the demos.

Posted in Japan, multilingual, OII

Need your vote (if .ac.uk email)! — Interactive visualization development

Update: 25 June 2012 The project has been chosen by JISC to receive funding. Further information on the project and status updates will be communicated via the InteractiveVis project blog.

Update: 27 March 2012 We’ve been successful in getting 150 votes to be considered for funding. Thanks to everyone for the votes and support.

Since getting involved in some visualization development and research, I’ve often thought that interactivity was the best way to create great inviting visualizations that can be grasped quickly, yet allow in-depth exploration of data by users. Some interactivity is possible with existing tools, but most of these rely on users having additional software (e.g. Flash, Java) that, while common for desktops, limits the wider dissemination of these visualizations.

As web technologies have developed and become better supported, I’ve been working on extremely early-stage code with OII Information Officer Kunika Kono to create interactive visualizations that run entirely with native web technologies (HTML5, CSS3, SVG). We’ve applied for funding to further develop our ideas and create ways for all users to easily make interactive visualisations for geospatial and network data. To get considered for funding, however, we need 150 votes from .ac.uk email addresses.

Please vote, find out more information, and see some alpha code demonstrations at:

Posted in OII, Visualizations

Two new publications, new research project, looking to hire

A lot has happened since my last post, and the selected publications page has been updated to reflect this. I am very pleased to announce that my work looking at cross-language linking in the blogosphere following the 2010 Haitian earthquake, which I have blogged about previously, is now published and freely available to all in the Journal of Computer-Mediated Communication. The abstract for this publication follows:

This research analyzes linguistic barriers and cross-lingual interaction through link analysis of more than 100,000 blogs discussing the 2010 Haitian earthquake in English, Spanish, and Japanese. In addition, cross-lingual hyperlinks are qualitatively coded. This study finds English-language blogs are significantly less likely to link cross-lingually than Spanish or Japanese blogs. However, bloggers’ awareness of foreign language content increases over time. Personal blogs contain most cross-lingual links, and these links point to (primarily English-language) media. Finally, most cross-lingual links in the dataset signal a citation or reference relationship while a smaller number of cross-lingual links signal a translation. Although most bloggers link to other blogs in the same language, the dataset reveals a surprising level of human translation in the blogosphere.

Full paper…

In addition, a new publication examining the sharing of off-site links in Twitter and Wikipedia following the 2011 earthquake and tsunami in Tōhoku, Japan, has just been accepted to the International Conference on Human Factors in Computing Systems, CHI ’12, ACM, which will be held in Austin, Texas, in May. I’ll be blogging more about this research in the future in order to expand upon data and details that did not fit within the page limit.

This paper describes two case studies examining the impact of platform design on cross-language communications. The sharing of off-site hyperlinks between language editions of Wikipedia and between users on Twitter with different languages in their user descriptions are analyzed and compared in the context of the 2011 Tohoku earthquake and tsunami in Japan. The paper finds that a greater number of links are shared across languages on Twitter, while a higher percentage of links are shared between Wikipedia articles. The higher percentage of links being shared on Wikipedia is attributed to the persistence of links and the ability for users to link articles on the same topic together across languages.

Pre-print copy of the paper…

Finally, I am very excited to announce I have written my first grant proposal and that the proposal was funded. The resulting project, Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research, will perform research with a 30TB archive of Web data of the .uk country-code top-level domain collected from 1996 to 2010. We are now looking to hire a Big Data Research Officer, who will contribute to this new project and to two other large-scale data projects (Leaders and Followers in Online Activism and The Internet, Political Science and Public Policy). If you have strong computer science skills and an interest in the social aspects of online technologies, please consider applying or sharing this announcement with others who might be interested. Applications close 16 March, and further information, contact details, and application information are available on the University of Oxford’s Job Search website.

Posted in multilingual, OII, social networks, Uncategorized, Wikipedia

Collection of Graphics on Language and the Internet

Update (Nov. 2014): I’ve recently published two papers examining users who contribute content in multiple languages online. Please see Multilinguals and Wikipedia Editing and Global Connectivity and Multilinguals in the Twitter Network for further information and free, open-access copies of the articles.

Yesterday I learned that it is possible, although not recommended, to teach until 12:30 in Oxford and then have a meeting in central London at 2:00. Despite the travel challenges, I was happy to see a number of companies represented at my language session. I wish there had been more space for additional attendees and more time for discussion; however, I hope it was useful for those who attended.

The graphics we used in the language session at mindshare are listed below with links to PDFs and original sources. I’ve skipped analysis of them in this post: that has been done in some cases at the source, but please feel free to post comments or email questions and I’ll do my best to expand on anything.

Mentions of “beer” in various languages across Europe on Google Maps
Source: FloatingSheep

Geo-tagged photos on Flickr
Source: OII Visualization Gallery

Internet Penetration
Source: OII Visualization Gallery

Social Networking Sites over time
Source: Vincos Blog

Links between languages
Sources: Kovas Boguta (Twitter) and Scott Hale (Blogs)

More links between languages
Source: Hale, S. A. (Forthcoming) Net Increase? Cross-lingual Linking in the Blogosphere. Journal of Computer-Mediated Communication.

User-generated Content on Wikipedia and Google
Sources: OII Visualization Gallery (Wikipedia), OII Visualization Gallery (Google)

Top languages on the Internet
Source: Language Connect

Country code top-level domains and international top level domains
Source: Language Connect

Top news stories on Google News
Source: newsmap.jp Click for a live, interactive version

Posted in OII, Visualizations

Wikipedia coverage by language

Update (November 2014): I’ve recently published a related paper examining how many users edit multiple language editions of Wikipedia and how these multilingual users connect the editions together. Please see Multilinguals and Wikipedia Editing for further information and a free, open-access copy of the article.

My absence from blogging for a few months has been personal (I got married in July) but also work related: I have a number of great project outputs that have just been released. These include a draft paper on social influence and collective action, a presentation at the Oxford Martin School, and a publication of Internet-related maps, which has also resulted in an online visualization gallery.

I’ve put my new mapping skills to work on the latest Wikipedia dumps from 30 September 2011 to uncover some patterns in geotagged articles. My methods are not perfect and not all language editions of the encyclopedia have the same level of geo-tagging; nevertheless, I think the patterns revealed are quite telling:

The map above shows which language edition out of German, Portuguese, and Spanish has the most geotagged articles in each country. There are a few ties, but for the most part a clear pattern emerges: countries in the Spanish-speaking world have more Spanish articles, German-speaking regions more German articles, etc. I will parse more dumps and add these (in particular I’d like to add Arabic, French, and English), but I think this pattern will hold across these and other languages.
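The comparison behind such a map can be sketched in a few lines. This is a minimal illustration and not the code I used: it assumes the dumps have already been reduced to a count of geotagged articles per country and language edition, the function name is hypothetical, and the numbers are invented placeholders rather than real results.

```python
def leading_editions(counts):
    """Map each country to the edition(s) with the most geotagged articles.

    A list with more than one entry marks a tie between editions.
    """
    result = {}
    for country, by_edition in counts.items():
        top = max(by_edition.values(), default=0)
        if top == 0:
            continue  # no geotagged articles in any edition: leave the country blank
        result[country] = sorted(ed for ed, n in by_edition.items() if n == top)
    return result


# Invented placeholder counts, NOT real data from the Wikipedia dumps
counts = {
    "country_a": {"de": 120, "pt": 95, "es": 840},
    "country_b": {"de": 40, "pt": 40, "es": 12},
    "country_c": {"de": 0, "pt": 0, "es": 0},
}
leaders = leading_editions(counts)
```

In this toy input, `country_a` is led by the Spanish edition, `country_b` is a tie between German and Portuguese, and `country_c` is left off the map because no edition has a geotagged article there.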

I’ve been challenged on my language-related research about what specific benefits multilingual contributors might bring, and I think one answer lies in breadth of content. Better coverage for a particular language edition of Wikipedia might not lie in energizing those in the home regions of a language, but rather in mobilizing the diaspora and language learners. As Brian Hecht’s 2010 article shows, there is little overlap in content and articles between different editions of Wikipedia, and thus the possibility of greater coverage exists for all language editions.

Posted in crowd sourcing, multilingual, OII, Visualizations, Wikipedia