Interactive map of Twitter mentions in geotagged tweets

I recently had the pleasure of building my first interactive map visualization using Leaflet with Joshua R. Melville and the Floatingsheep team, who have written more about the methodology. I'm drafting a longer post about developing the visualization itself, but in the meantime I thought I would simply share the results:

It is important to interpret this map simply as showing where Twitter users who geotag their tweets mention different UK football teams. Twitter users are not representative of all humanity, and users who geotag are a small fraction (about 1%) of all Twitter users, so caution is needed not to read too much into the data. There was a very limited period of time to develop this, so some more advanced techniques to determine location and analyze sentiment were not employed, but I think the visualization suggests exciting possibilities for the future as users increase and data analysis techniques improve.
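For a sense of the kind of data preparation involved, here is a minimal sketch in Python (not the project code mentioned below) of aggregating geotagged tweets that mention a team into GeoJSON points that a Leaflet map could plot. The tweet structure and the team list are hypothetical.

```python
# Rough sketch only: turn geotagged tweets that mention a team into GeoJSON
# points suitable for plotting with a web map library such as Leaflet.
# The tweet fields and the team list below are illustrative, not the real data.
import json

TEAMS = ["Arsenal", "Chelsea", "Liverpool"]  # illustrative subset of UK teams

def tweets_to_geojson(tweets):
    """tweets: iterable of dicts like {"text": ..., "lon": ..., "lat": ...}."""
    features = []
    for tweet in tweets:
        for team in TEAMS:
            if team.lower() in tweet["text"].lower():
                features.append({
                    "type": "Feature",
                    "geometry": {"type": "Point",
                                 "coordinates": [tweet["lon"], tweet["lat"]]},
                    "properties": {"team": team},
                })
    return {"type": "FeatureCollection", "features": features}

if __name__ == "__main__":
    sample = [{"text": "Great win for Arsenal today!", "lon": -0.11, "lat": 51.55}]
    print(json.dumps(tweets_to_geojson(sample), indent=2))
```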

For those interested, the code behind the visualization is available freely on GitHub (CC BY-NC-SA license).

Posted in OII, Visualizations | Leave a comment

Interactive Maps

Update: 6 November 2012 – US map featured in the Guardian.

I've not blogged for a while on this site because I've been doing lots of blogging on the InteractiveVis project site. InteractiveVis is a project to create easy-to-use tools for building HTML5 interactive visualizations. These tools will be public very soon, but in the meantime please enjoy some demonstrations of the types of visualizations users of the tools can create. The last demo, using mentions of US presidential candidates on Twitter, is particularly timely (it remains to be seen how accurate!).

Demo 1: Visualization of followers at @OIITwitter

Background information about this visualization

Demo 2: Visualization of UK Central Government

Demo 3: Visualization of Literacy and Gender

Background information about this visualization

Demo 4: US Elections on Twitter

Background information about this visualization

Posted in OII, Visualizations | Leave a comment

Language Bubbles

Eli Pariser has raised awareness of the role that personalization algorithms play in filtering and ranking results on the web. I think this work is very important, but another strand, seemingly obvious yet surprisingly lacking study, is the role that language plays. A user searching for content by keyword on most services is only likely to find content written/tagged/annotated in the language the user employs. This may make sense for some items, but for other content, say images, it is really an unexpected by-product of how the content is tagged and indexed.

I wrote an article looking at the role of language online, specifically with an example of image search on the Free Speech Debate website. The article initially had several comparisons of queries in Google Image search, but all but one were edited out. I’m including all of the images below and encourage you to check out the post as well.

I also gave a very general, accessible talk (only 10 minutes!) on the idea at a panel discussion for St Antony’s International Review earlier in the year. With some help from Kdenlive, I have now been able to edit the video and place this online as well.

Posted in multilingual, OII | Leave a comment

Recent contacts working on cross-language problems

I've recently been able to meet some spectacular individuals who are working on various aspects of cross-language communication. This blog post won't do justice to all of their work, so please click through to their websites and learn more.

  • Irene Eleta Mogollon is at the University of Maryland College of Information Studies (iSchool) and the Human-Computer Interaction Lab (HCIL). Her work has looked at multilingual social tagging of museums' image collections and her PhD research is about multilingual communication on Twitter.
  • Chris Salzberg is working on Cojiro, a tool for cross-language curation that allows users to collect references/content around a particular topic and translate the relevant/interesting parts of that content. He's focusing on the Japanese–English space, but the tool itself is not tied to that particular language pair.
  • John Dalton, a master's student in the Computer Science Department of the University of Oxford supervised by Phil Blunsom, is completing fascinating research trying to develop an approach to identify human translations of content using techniques and theory from machine translation. I think this is a huge area of future research. If a (good) human translation of content has already been made, I would like Google Translate/Chrome to be able to identify that translation and recommend it to me in place of a machine translation. The work is also important for building corpora of more informal writing to improve the performance of machine translation algorithms on general web text.
  • Claire Wardle is at Storyful, which works with professional news clients to identify and verify legitimate news “from the noise of the real-time web, 24/7”. A key part of identifying and verifying content deals with language issues.
  • An Xiao, last but not least, is a design strategist, researcher, and artist. She co-founded a Chinese-to-English Twitter translation site with nearly 10,000 followers and a dozen contributing members and also has some blog entries touching on cross-language issues such as this post noting similarities between topics trending in two different languages on Twitter.

Finally, a quick self-promotion (and thank you to those who voted!) to say that the Interactive Visualization project I blogged about earlier was successful in receiving funding and is plowing ahead. I hope to demo some very cool interactive visualizations very soon. We just need to get the final user-interface graphics into the demos.

Posted in Japan, multilingual, OII | Leave a comment

Need your vote (if .ac.uk email)! — Interactive visualization development

Update: 25 June 2012 The project has been chosen by JISC to receive funding. Further information on the project and status updates will be communicated via the InteractiveVis project blog.

Update: 27 March 2012 We’ve been successful in getting 150 votes to be considered for funding. Thanks to everyone for the votes and support.

Since getting involved in some visualization development and research, I've often thought that interactivity was the best way to create inviting visualizations that can be grasped quickly, yet allow in-depth exploration of the data by users. Some interactivity is possible with existing tools, but most of these rely on users having additional software (e.g. Flash, Java) that, while common on desktops, limits the wider dissemination of these visualizations.

As web technologies have developed and become better supported, I've been working on extremely early-stage code with OII Information Officer Kunika Kono to create interactive visualizations that run entirely with native web technologies (HTML5, CSS3, SVG). We've applied for funding to further develop our ideas and create ways for all users to easily make interactive visualizations for geospatial and network data. To be considered for funding, however, we need 150 votes from .ac.uk email addresses.

Please vote, find out more information, and see some alpha code demonstrations at:
http://elevator.jisc.ac.uk/ideas/interactive-visualisations-teaching-research-and-dissemination

Posted in OII, Visualizations | 1 Comment

Two new publications, new research project, looking to hire

A lot has happened since my last post, and the selected publications page has been updated to reflect this. I am very pleased to announce that my work looking at cross-language linking in the blogosphere following the 2010 Haitian earthquake, which I have blogged about previously, is now published and freely available to all in the Journal of Computer-Mediated Communication. The abstract for this publication follows:

This research analyzes linguistic barriers and cross-lingual interaction through link analysis of more than 100,000 blogs discussing the 2010 Haitian earthquake in English, Spanish, and Japanese. In addition, cross-lingual hyperlinks are qualitatively coded. This study finds English-language blogs are significantly less likely to link cross-lingually than Spanish or Japanese blogs. However, bloggers’ awareness of foreign language content increases over time. Personal blogs contain most cross-lingual links, and these links point to (primarily English-language) media. Finally, most cross-lingual links in the dataset signal a citation or reference relationship while a smaller number of cross-lingual links signal a translation. Although most bloggers link to other blogs in the same language, the dataset reveals a surprising level of human translation in the blogosphere.

Full paper…
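The headline measure in the abstract above, how likely blogs in each language are to link cross-lingually, boils down to a per-language ratio. The sketch below illustrates the idea with toy data; it is not the code used for the paper.

```python
# Illustrative only (not the paper's code): given blog posts labelled with a
# language and the hyperlinks between them, compute what share of each
# language's outgoing links point to posts written in a different language.
from collections import Counter

def cross_lingual_share(post_language, links):
    """post_language: {post_id: "en"/"es"/"ja"}; links: iterable of (src, dst)."""
    total, cross = Counter(), Counter()
    for src, dst in links:
        src_lang = post_language.get(src)
        dst_lang = post_language.get(dst)
        if src_lang is None or dst_lang is None:
            continue  # ignore links to posts outside the collection
        total[src_lang] += 1
        if src_lang != dst_lang:
            cross[src_lang] += 1
    return {lang: cross[lang] / total[lang] for lang in total}

# Toy example: Spanish and Japanese posts linking to English posts
posts = {1: "en", 2: "es", 3: "ja", 4: "en"}
links = [(1, 4), (2, 1), (3, 1), (2, 3)]
print(cross_lingual_share(posts, links))  # {'en': 0.0, 'es': 1.0, 'ja': 1.0}
```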

In addition, a new publication examining the sharing of off-site links on Twitter and Wikipedia following the 2011 earthquake and tsunami in Tōhoku, Japan, has just been accepted to the ACM International Conference on Human Factors in Computing Systems (CHI ’12), which will be held in Austin, Texas, in May. I'll be blogging more about this research in the future in order to expand upon data and details that did not fit within the page limit.

This paper describes two case studies examining the impact of platform design on cross-language communications. The sharing of off-site hyperlinks between language editions of Wikipedia and between users on Twitter with different languages in their user descriptions is analyzed and compared in the context of the 2011 Tohoku earthquake and tsunami in Japan. The paper finds that a greater number of links are shared across languages on Twitter, while a higher percentage of links are shared between Wikipedia articles. The higher percentage of links being shared on Wikipedia is attributed to the persistence of links and the ability for users to link articles on the same topic together across languages.

Pre-print copy of the paper…

Finally, I am very excited to announce that I have written my first grant proposal and that the proposal was funded. The resulting project, Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research, will perform research with a 30TB archive of Web data of the .uk country-code top-level domain collected from 1996 to 2010. We are now looking to hire a Big Data Research Officer, who will contribute to this new project and to two other large-scale data projects (Leaders and Followers in Online Activism and The Internet, Political Science and Public Policy). If you have strong computer science skills and an interest in the social aspects of online technologies, please consider applying or sharing this announcement with others who might be interested. Applications close 16 March, and further information, contact details, and application information are available on the University of Oxford’s Job Search website.

Posted in multilingual, OII, social networks, Uncategorized, Wikipedia | Leave a comment

Collection of Graphics on Language and the Internet

Yesterday I learned that it is possible, although not recommended, to teach until 12:30 in Oxford and then have a meeting in central London at 2:00. Despite the travel challenges, I was happy to see a number of companies represented at my language session. I wish there had been more space for additional attendees and more time for discussion; however, I hope it was useful for those who attended.

The graphics we used in the language session at Mindshare are listed below with links to PDFs and original sources. I've skipped analysis of them in this post: that has been done in some cases at the source, but please feel free to post comments or email questions and I'll do my best to expand on anything.

Mentions of “beer” in various languages across Europe on Google Maps
Source: FloatingSheep

Geo-tagged photos on Flickr
Source: OII Visualization Gallery

Internet Penetration
Source: OII Visualization Gallery

Social Networking Sites over time
Source: Vincos Blog

Links between languages
Sources: Kovas Boguta (Twitter) and Scott Hale (Blogs)

More links between languages
Source: Hale, S. A. (Forthcoming) Net Increase? Cross-lingual Linking in the Blogosphere. Journal of Computer-Mediated Communication.

User-generated Content on Wikipedia and Google
Sources: OII Visualization Gallery (Wikipedia), OII Visualization Gallery (Google)

Top languages on the Internet
Source: Language Connect

Country-code top-level domains and international top-level domains
Source: Language Connect

Top news stories on Google News
Source: newsmap.jp (click for a live, interactive version)

Posted in OII, Visualizations | 1 Comment

Wikipedia coverage by language

My absence from blogging for a few months has been personal (I got married in July) but also work-related: I have a number of great project outputs that have just been released. These include a draft paper on social influence and collective action, a presentation at the Oxford Martin School, and a publication of Internet-related maps, which also resulted in an online visualization gallery.

I’ve put my new mapping skills to work on the latest Wikipedia dumps from 30 September 2011 to uncover some patterns in geotagged articles. My methods are not perfect and not all language editions of the encyclopedia have the same level of geo-tagging; nevertheless, I think the patterns revealed are quite telling:

The map above shows which language edition out of German, Portuguese, and Spanish has the most geotagged articles in each country. There are a few ties, but for the most part a clear pattern emerges: countries in the Spanish-speaking world have more Spanish articles, German-speaking regions more German articles, etc. I will parse more dumps and add these (in particular I’d like to add Arabic, French, and English), but I think this pattern will hold across these and other languages.
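The coloring logic is essentially a per-country tally of geotagged articles by language edition. The following is a minimal sketch of that aggregation, assuming the articles have already been reduced to (language edition, country) pairs; parsing the Wikipedia dumps and resolving coordinates to countries are omitted here.

```python
# Minimal sketch of the aggregation behind the map: count geotagged articles per
# country for each language edition and keep the edition with the most articles.
# Input data here is invented for illustration.
from collections import Counter, defaultdict

def leading_edition_per_country(geotagged_articles):
    """geotagged_articles: iterable of (language, country) tuples."""
    counts = defaultdict(Counter)             # country -> {language: article count}
    for language, country in geotagged_articles:
        counts[country][language] += 1
    leaders = {}
    for country, langs in counts.items():
        top = langs.most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            leaders[country] = "tie"           # a few countries tie, as noted above
        else:
            leaders[country] = top[0][0]
    return leaders

sample = [("es", "Argentina"), ("es", "Argentina"), ("de", "Argentina"),
          ("de", "Austria"), ("pt", "Brazil")]
print(leading_edition_per_country(sample))
# {'Argentina': 'es', 'Austria': 'de', 'Brazil': 'pt'}
```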

I've received some challenges to my language-related research about what specific benefits multilingual contributors might bring, and I think one answer lies in breadth of content. Better coverage for a particular language edition of Wikipedia might not lie in energizing those in the home regions of a language, but rather in mobilizing the diaspora and language learners. As Brent Hecht's 2010 article shows, there is little overlap in content and articles between different editions of Wikipedia, and thus the possibility of greater coverage exists for all language editions.

Posted in crowd sourcing, multilingual, OII, Visualizations, Wikipedia | 1 Comment

Visualizing English, Spanish, Japanese in the blogosphere

Update (2012-02-22): The paper is now published and freely available from the Journal of Computer-Mediated Communication: http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.2011.01568.x/full.

Edit (2011-12-06): The full paper from which this dataset comes will be published in the Journal of Computer-Mediated Communication in January 2012. The preprint copy of this paper is linked above. In addition, this visualization is now live on the new OII Visualization Gallery.

I recently revisited the data I collected last year following the January earthquake in Haiti. I found a new visualization package, Tulip, and was able to successfully visualize the largest connected component of my network. The result and a description follow:


This diagram represents 5,703 blog posts about the Haitian earthquake and the links between them in the largest connected component of the network. Blog posts are in English (yellow), Spanish (red), and Japanese (blue). The nodes are positioned using a force-directed GEM layout in Tulip.

The overall network consists of 113,117 blog posts collected in a 45-day period following the earthquake. Only about 5% of the links connect posts of different languages. Of these, most link from personal blogs in Japanese and Spanish to media and professional blogs in English. About 1% of links contain human translation of the blog content. Significantly fewer cross-lingual links originate in English posts than in Spanish or Japanese posts.
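The layout in the diagram was produced in Tulip; the step before that, extracting the largest connected component from the full link network, can be sketched roughly as follows with networkx (the posts, language labels, and links below are made up).

```python
# Rough sketch only: pull out the largest (weakly) connected component of a
# directed blog-link network and measure the share of cross-lingual links in it.
# The nodes, language labels, and edges here are invented for illustration.
import networkx as nx

G = nx.DiGraph()
G.add_node("post_en_1", lang="en")
G.add_node("post_en_2", lang="en")   # isolated post, not in the main component
G.add_node("post_es_1", lang="es")
G.add_node("post_ja_1", lang="ja")
G.add_edges_from([("post_es_1", "post_en_1"), ("post_ja_1", "post_en_1")])

# Connected components are taken on the undirected version of the graph.
largest = max(nx.weakly_connected_components(G), key=len)
component = G.subgraph(largest)

cross = sum(1 for u, v in component.edges()
            if component.nodes[u]["lang"] != component.nodes[v]["lang"])
print(len(component), "posts;",
      cross / component.number_of_edges(), "of links are cross-lingual")
```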

Posted in multilingual, OII, Visualizations | 2 Comments

Translating Twitter

I had the great opportunity to meet George Weyman, a project director at meedan, yesterday at an OII event. meedan has for many years been doing excellent work connecting English and Arabic speakers online through the translation of news.

My research has not included Arabic, unfortunately, but has found consistently that the English-language web is very insular. Other languages translate information from English and link to English sources, but English pages are significantly less likely to link to other languages. With the recent revolutions in the Arabic-speaking world, some English speakers have realized this insularity.

meedan has been using crowd sourced translation and machine translation to help bridge this gap. In addition to news coverage, meedan has also helped organize volunteers to translate tweets on Twitter. This presents a challenge as Twitter has no easy way to link translations to the original source content. (Indeed this is true of many social networking sites.) If tweets are not linked together, a conversation will fracture with every translation, work might be duplicated as multiple people translate the same tweet, and users may not easily find/know about translations.

As a great temporary fix, meedan is using Curated.by to organize the translations. This service was initially designed to organize and comment on tweets, but it can be adapted somewhat to the goal of curating tweets and their translations (each translation is posted as a comment on the original tweet).

While this is a great fix for now, it points to a longer-term need to think about the design of platforms with users in multiple languages. One option is to add a lot of structure in advance, creating separate bins for each language with links between them, as Wikipedia has done with its multiple language versions. However, I think Twitter is useful in many contexts because of its free-form, commons approach. All content, regardless of language, is posted into one commons. This could allow for wider conversations to develop. A few simple additions, such as allowing two tweets to be linked and marked as translations of one another (and which is the original), could add great connective power to the platform. Linking hashtags in different languages together could equally allow for wider conversations and better organization of tweets. Machine transliteration (converting non-Latin scripts to Latin characters) and machine translation might also be investigated.
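Twitter offers no such translation link today; as a purely hypothetical illustration of the kind of addition described above, the data the platform would need to carry is tiny:

```python
# Hypothetical sketch only: Twitter has no such feature. This shows the minimal
# data a platform would need to link a tweet to its translations and to record
# which tweet is the original.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tweet:
    tweet_id: str
    lang: str
    text: str
    translation_of: Optional[str] = None   # id of the original tweet, if any

def group_translations(tweets):
    """Group tweets so each original is listed together with its translations."""
    groups = {}
    for t in tweets:
        key = t.translation_of or t.tweet_id
        groups.setdefault(key, []).append(t)
    return groups

tweets = [
    Tweet("1", "ar", "مرحبا بالعالم"),
    Tweet("2", "en", "Hello, world", translation_of="1"),
]
for original_id, group in group_translations(tweets).items():
    print(original_id, [(t.lang, t.translation_of is None) for t in group])
# 1 [('ar', True), ('en', False)]
```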

I think the connective power of language tools other than machine translation is too often forgotten. In particular, for language pairs where machine translation performs poorly, tools that allow crowd translation, linking of content across languages, and transliteration can greatly help connect users.

Of course, the greatest connecting power is the will of the users who want to be connected. Recent events have drawn many Arabic- and English-speaking users together on Twitter, as the network diagram of Twitter following relationships shows. I hope that these connections persist and that English speakers will reach out beyond their language for information.


Please click the image for a larger version and explanation at Kovas Boguta’s website. Twitter users who post only in English are in blue; users who post only in Arabic are in red. Connections are following relationships.

Posted in multilingual, OII | Leave a comment