Visualizing English, Spanish, and Japanese in the blogosphere

Update (Feb. 2012): The paper is now published and freely available from the Journal of Computer-Mediated Communication: http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.2011.01568.x/full.

Update (Dec. 2011): The full paper from which this dataset comes will be published in the Journal of Computer-Mediated Communication in January 2012. The preprint copy of this paper is linked above. In addition, this visualization is now live on the new OII Visualization Gallery.

I recently revisited the data I collected last year following the January earthquake in Haiti. I found a new visualization package, Tulip, and was able to successfully visualize the largest connected component of my network. The result and a description follow:


This diagram represents 5,703 blog posts about the Haitian earthquake and the links between them in the largest connected component of the network. Blog posts are in English (yellow), Spanish (red), and Japanese (blue). The nodes are positioned using a force-directed GEM layout in Tulip.

The overall network consists of 113,117 blog posts collected in a 45-day period following the earthquake. Only about 5% of the links connect posts of different languages. Of these, most link from personal blogs in Japanese and Spanish to media and professional blogs in English. About 1% of links contain human translation of the blog content. Significantly fewer cross-lingual links originate in English posts than in Spanish or Japanese posts.
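
For anyone who wants to replicate this kind of analysis, the two core computations are easy to sketch in Python with networkx (the edge list and language labels below are placeholders, not my actual data; Tulip handled the layout itself):

```python
import networkx as nx

# Placeholder data: one directed edge per hyperlink between blog posts,
# plus a language label for each post.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
language = {"a": "en", "b": "en", "c": "es", "d": "ja", "e": "en"}

G = nx.DiGraph(edges)

# Largest weakly connected component of the link network.
largest = max(nx.weakly_connected_components(G), key=len)
component = G.subgraph(largest)
print(component.number_of_nodes(), "posts in the largest component")

# Share of links that connect posts in different languages.
cross = sum(1 for u, v in G.edges() if language[u] != language[v])
print(f"{cross / G.number_of_edges():.0%} of links are cross-lingual")
```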

Posted in Blog, multilingual, OII, research, Visualizations | 2 Comments

Translating Twitter

I had the great opportunity to meet George Weyman, a project director at meedan, yesterday at an OII event. For many years, meedan has been doing great work connecting English and Arabic speakers online through the translation of news.

My research has not included Arabic, unfortunately, but it has consistently found that the English-language web is very insular. Other languages translate information from English and link to English sources, but English pages are significantly less likely to link to other languages. With the recent revolutions in the Arabic-speaking world, some English speakers have come to realize this insularity.

meedan has been using crowdsourced translation and machine translation to help bridge this gap. In addition to news coverage, meedan has helped organize volunteers to translate tweets on Twitter. This presents a challenge, as Twitter has no easy way to link translations to the original source content. (Indeed, this is true of many social networking sites.) If tweets are not linked together, a conversation fractures with every translation, work may be duplicated as multiple people translate the same tweet, and users may not easily find or know about translations.

As a great temporary fix, meedan is using Curated.by to organize the translations. The service was initially designed for organizing and commenting on tweets, but it can be adapted to curating tweets and their translations (each translation is posted as a comment on the original tweet).

While this is a great fix for now, it points to a longer-term need to think about the design of platforms with users in multiple languages. One option is to add a lot of structure in advance, creating separate bins for each language with links between them, as Wikipedia has done with its multiple language versions. However, I think Twitter is useful in many contexts because of its free-form, commons approach: all content, regardless of language, is posted into one commons, which could allow wider conversations to develop. A few simple additions, such as allowing two tweets to be linked and marked as translations of one another (and which is the original), could add great connective power to the platform. Linking hashtags in different languages together could equally allow for wider conversations and better organization of tweets. Machine transliteration (converting non-Latin scripts to Latin characters) and machine translation might also be investigated.
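
To make the tweet-linking idea concrete, here is a minimal sketch of the kind of record such a platform could store. I should stress that everything here is hypothetical; Twitter offers nothing like it today:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TranslationLink:
    """Hypothetical record asserting one tweet translates another."""
    original_id: str    # ID of the source-language tweet
    translated_id: str  # ID of the tweet containing the translation
    source_lang: str    # e.g. "ar"
    target_lang: str    # e.g. "en"

links = [TranslationLink("100", "200", "ar", "en"),
         TranslationLink("100", "300", "ar", "ja")]

def translations_of(tweet_id, links):
    """All known translations of a tweet: enough to keep conversations
    together, surface existing translations, and avoid duplicated work."""
    return [l.translated_id for l in links if l.original_id == tweet_id]

print(translations_of("100", links))  # -> ['200', '300']
```

With such links stored, the platform could show a tweet and its translations as one unit, and hashtags could be linked across languages in the same way.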

I think the connective power of language tools other than machine translation is too often forgotten. In particular, for language pairs where machine translation performs poorly, tools that allow crowd translation, linking of content across languages, and transliteration can greatly help connect users.

Of course, the greatest connecting power is the will of the users who want to be connected. Recent events have drawn many Arabic- and English-speaking users together on Twitter, as the network diagram of Twitter following relationships below shows. I hope that these connections persist and that English speakers will reach out beyond their language for information.


Please click the image for a larger version and explanation at Kovas Boguta's website. Twitter users posting only in English are in blue; users posting only in Arabic are in red. Connections are following relationships.

Posted in Blog, multilingual, OII | Tagged | Leave a comment

Content Providers and Neutrality

My posts don't seem to be very timely, but I hope they nevertheless remain interesting. Last week the course I TA for discussed the idea of network neutrality. Broadly defined, this is a debate over whether ISPs can discriminate (via packet shaping) between different services or content sources.

During the discussion group, however, I was thinking about another element of neutrality on the network, which I'll call "device neutrality." Browsers and devices of course differ in their capabilities, which permits certain websites to function on one device and not on another. The lack of support for Adobe Flash on iPod/iPad/iPhone devices is an easy example. The smaller screens of mobile devices might also make it desirable for a website operator to serve different content to different devices (think of Mobile Gmail, for instance).

To this end, each browser/device sends information about itself when it requests a web page from a server. When GoogleTV launched, it followed this convention and informed websites that the device accessing them was a (slightly) modified version of the Chrome browser on a GoogleTV device. GoogleTV has the same features as a standard browser, with most of the plug-ins including Adobe Flash, so websites should have been able to serve their existing pages with no issues. Most sites did, but notably, the US video streaming service Hulu, along with broadcasters NBC, CBS, and ABC, singled out GoogleTV devices for special attention.
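
The mechanism behind all of this is the HTTP User-Agent header. A quick sketch of how any client announces itself; the GoogleTV string here is an approximation of those reported at launch, not an exact copy:

```python
import urllib.request

# Approximation of a GoogleTV-era User-Agent string (illustrative only).
GOOGLETV_UA = ("Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/533.4 "
               "(KHTML, like Gecko) Chrome/5.0.375.127 "
               "Large Screen Safari/533.4 GoogleTV")

# The header rides along with every page request; a server can match on
# it and serve (or refuse) content accordingly.
req = urllib.request.Request("http://example.com/",
                             headers={"User-Agent": GOOGLETV_UA})
with urllib.request.urlopen(req) as response:
    page = response.read()
print(len(page), "bytes received")
```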

As noted on NPR and in the New York Times, Hulu incorrectly told visitors using GoogleTV that the device was unsupported. I suppose this could initially have been a simple coding error, but given that a loophole was found and then stopped working, and given Hulu's history of blocking browsers, it seems far more likely that this was a deliberate decision not to permit users of GoogleTV (and Boxee, etc.) to access the service. Of course, desktop users are still welcome to simply connect a PC directly to the TV and watch Hulu just the same as a GoogleTV user would have.

These are the sort of knee-jerk reactions by content providers that keep file sharing alive and counsel caution about new cloud-based multimedia storage services (a.k.a. digital lockers). A downloaded file either works on a device or it doesn't, but the device generally doesn't lie about the possibility, and it isn't susceptible to the whims of a bureaucrat at a distant company.

In general, greater transparency is needed. Transparency, yes, in the bandwidth-shaping practices of ISPs, but also in protocols and content. If Hulu wants to block GoogleTV, it should have the cojones to admit what it is doing and tell the user so. To erroneously state that the device is unsupported is weak and pathetic. The same can be said of most British ISPs and their filtering practices. (Many served a fake "404 page not found" error message when they were in fact blocking the Virgin Killer Wikipedia page in late 2008.)

One way to enhance privacy and transparency would be to slightly modify the initial HTTP request protocol. The amount of information sent by a browser to a web server is staggering, and as the EFF's Panopticlick project demonstrates, this information can be used to track users even after they remove cookies, etc. Instead of sending all this information, one option would be to add a method allowing the client to query a server for the available versions of a page. The client device/browser could then select the version it wanted: mobile devices could select a mobile version (if available) and desktop browsers a desktop version, and the desired language could be chosen from those the server says it supports. There is no need to tell a web server with only one version of a page my screen size, operating system, browser make, language preference, etc. on every request. These preferences could be selected automatically by the browser or manually by the user, and they could be the same for every website or differ from site to site.
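
To illustrate, here is a toy sketch of the exchange I have in mind. Nothing in HTTP works this way today; the query method and server behavior are entirely hypothetical:

```python
# Step 1 (hypothetical): the client asks which variants of a page exist.
# Step 2: the client picks one, revealing only that single preference.

SERVER_VARIANTS = {"/news": ["desktop", "mobile"], "/about": ["desktop"]}

def query_versions(path):
    """Stand-in for a new 'list available versions' request."""
    return SERVER_VARIANTS.get(path, ["desktop"])

def fetch(path, preference="mobile"):
    """The client selects a variant; a single-variant page learns
    nothing about the requesting device at all."""
    versions = query_versions(path)
    chosen = preference if preference in versions else versions[0]
    return f"GET {path} [variant: {chosen}]"

print(fetch("/news"))   # -> GET /news [variant: mobile]
print(fetch("/about"))  # -> GET /about [variant: desktop]
```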

Overall, however, I just wish that content providers (and ISPs) would be more transparent in their practices.

Posted in OII | Leave a comment

Multilingual sharing in video

I've thought a lot about translation and multilingual sharing online in text environments (blogs, Wikipedia, social networking sites), but I'm reminded how quickly platforms change on the web. Text-only exchanges seem outdated: YouTube has been around for five years now, and the prevalence of online video has only increased with Internet speeds. So, one major media format to consider in cross-lingual sharing is online video.

Earlier this year, YouTube announced it would use voice recognition (speech-to-text) to subtitle all of its English-language videos, which can be helpful to non-native speakers of English (Shimogori et al. 2010) as well as to hearing-impaired individuals. It doesn't seem much of a leap to allow for correction of these subtitles and for machine or human translation of them. YouTube does support uploading subtitles, but I wasn't aware of an easy way to create them until recently.

Two tools, dotSUB and Universal Subtitles, provide an easy way to transcribe and translate speech in videos. Neither site uses machine translation or voice recognition, but both provide the opportunity for crowd/social transcription and translation of videos. Videos can then be viewed with subtitles and embedded around the web or viewed directly on each platform.

Both seem like good tools. dotSUB has been around for a number of years and seems more established and less buggy, but Universal Subtitles provides some nice new features. While videos are hosted on dotSUB itself, Universal Subtitles uses JavaScript to simply overlay text on top of videos hosted elsewhere. This seems especially useful for translating videos when a user does not have access to the original video file to upload it (again). Since Universal Subtitles does not actually host videos, but simply adds text alongside (over) existing ones, I would hope that any video freely viewable online could be used with the tool. The technology allows this, but the legal implications are unclear, since (as We Blog the World correctly points out) a subtitled video is a derivative work protected by copyright law and thus requires the permission of the copyright holder. I agree, however, with We Blog the World's argument that "Universal Subtitles as a tool does not produce subtitled videos; rather they facilitate a way for viewers to lay a text file over an already existing video."
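
(In data terms the separation is easy to see: a subtitle track is nothing more than a list of timed text cues kept apart from the video file, roughly as in this sketch.)

```python
# A subtitle track is just timed text stored separately from the video,
# which is why a tool can "add" subtitles without rehosting the file.
cues = [
    (0.0, 2.5, "Hello and welcome."),     # (start sec, end sec, text)
    (2.5, 5.0, "Today we talk about translation."),
]

def cue_at(t, cues):
    """Text to overlay on the player at playback time t (in seconds)."""
    for start, end, text in cues:
        if start <= t < end:
            return text
    return ""

print(cue_at(1.0, cues))  # -> Hello and welcome.
```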

In any case, these two communities of subtitlers and translators provide an exciting new possibility for multilingual sharing. The standard caveats about crowdsourcing and keeping a community engaged apply, but I think these tools have great potential. I thought about a tool along these lines in particular when I was teaching English in Japan. One reason for delays in broadcasting foreign TV in Japan, I was told, is the process of subtitling and translating. Transcribing and translating a video would be a perfect real-world application for my English students, and it could actually save a content producer money by providing a rough translation to be checked and edited later by professional translators if needed. I realize it may be some time before broadcasters move from blocking Hulu, iPlayer, etc. internationally to actually trying to capture some of the great potential of a geographically and linguistically diverse online viewership. I'm happy to say TED is using dotSUB to translate its talks (albeit with more oversight than the general crowdsourced free-for-all approach).

For these sites to actually have an impact, another challenge is attracting a diverse viewership. In this respect, some changes could be made to make videos easier to find. If you search on dotSUB, for example, you have to search for the title of a video in its original language, because the titles themselves are not translated. Integration with or recognition by an existing video hosting platform (e.g. YouTube) could bring even more viewers. Social networking sites, blogs, and good ol' word of mouth will spread awareness for now.

More information on dotSUB is here, including an interview with its founder.

There is also a useful comparison of the two tools at We Blog the World.

Posted in crowd sourcing, multilingual, OII, video | Tagged | 2 Comments

Purposefully Restricted and Network Visualizations

A few interesting links to share:

Posted in news only, OII, Uncategorized | Leave a comment

Japan/China crosslinks on- and off-line

Just over a month ago, I wrote about the difficulty international platforms such as Google Maps have in naming disputed geographic features. A recent incident in the East China Sea, involving a Chinese fishing boat and Japanese Coast Guard vessels around islands claimed by both China and Japan, is showing how intense and complicated these issues can be.

With official ties now cut off, I am curious whether we will see a similar fluctuation in online traffic and/or links between Japanese and Chinese blogs, websites, etc. Does an online decrease precede or follow the Chinese government's official decision to suspend high-level contacts between the two governments? Will online cross-links between the languages increase before (or after) the resumption of normal diplomatic relations? I don't have the time to develop the dataset fully now, but this could be a fruitful area of inquiry.
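
If someone does build the dataset, the core measurement is straightforward. A sketch with made-up link records (date created, source-post language, target-post language):

```python
from collections import Counter
from datetime import date

# Made-up records; a real dataset would have one row per hyperlink.
links = [
    (date(2010, 9, 8), "ja", "zh"),
    (date(2010, 9, 8), "zh", "ja"),
    (date(2010, 9, 9), "ja", "ja"),
    (date(2010, 9, 9), "zh", "ja"),
]

# Daily count of Japanese<->Chinese cross-links; lined up against the
# dates of diplomatic announcements, it would show which moved first.
daily = Counter(d for d, src, dst in links if {src, dst} == {"ja", "zh"})
for day in sorted(daily):
    print(day, daily[day])
```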

A comment in Chinese on the second story indicates the difference in how Japanese and Chinese media are handling the event.

我曾经看过日本有关这个的一则新闻 让我极其不爽的就是他宣称那个岛屿是他们的 是我们中国人先撞了他们的船
I once saw a Japanese news story about this. What made me extremely unhappy was that it claimed the islands are theirs and that we Chinese rammed their ship first.*

In line with my previous research about the Haitian earthquake (I really promise this should be online soon), I would not be surprised if online news media links between the two countries were absurdly low or non-existent. In my research, I found most of the cross-lingual links were created by individuals writing alone or in small groups.

*My Chinese is really elementary, and I’m happy for someone to contribute a better translation.

Posted in OII, Uncategorized | 2 Comments

How geographically or linguistically diverse is your online social network?

Three recent news stories highlight the international nature of many social media platforms:

This raises the question of how much an average user's friend network reflects the geographic/linguistic diversity of the platform. My guess, based on homophily and past work, is not very much. Despite being on the same platform as individuals from many diverse countries, the average user seeks out individuals similar to him or herself. Networks like Facebook often simply reflect real-world ties by encouraging individuals to use their real names and prompting them to friend past high school and college friends. Twitter, with its pseudo-anonymous screen names and no obligation to follow back, seems like it might promote more diverse ties. On the other hand, I am not certain this is the case: Ethan Zuckerman recently wrote about one Portuguese-language phrase, "cala boca Galvao," that confused many users when it became a top trending topic on Twitter. I am building an application to let users plot the geographic diversity of their friend networks, and I hope to make it live this fall. In the meantime, I am open to suggestions and tips on research to look at.
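
As a taste of what such an application might compute, here is one candidate measure, Shannon entropy over friends' countries (a sketch only, and not necessarily the measure I will settle on):

```python
import math
from collections import Counter

def geographic_diversity(countries):
    """Shannon entropy (in bits) of a friend list's countries: 0.0 when
    every friend shares one country; higher values mean more diversity."""
    counts = Counter(countries)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(geographic_diversity(["US", "US", "US", "US"]))  # -> 0.0
print(geographic_diversity(["US", "BR", "JP", "EG"]))  # -> 2.0
```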

In addition to the limits of personal social networks, there are of course also limits to platforms. I was reminded of this today when a friend was trying to join Mixi, the most popular social network in Japan. Not only is the network entirely in Japanese, but it also requires users to give a Japanese cell phone number and confirm it via a text message sent to that phone. The lack of demand for forming geographically diverse ties partly explains why social media platforms can exist without interoperability and why country-isolated networks like the Facebook clone in China can succeed. It's my hope that an increase in awareness, tools showing users the diversity of their personal networks, and easily accessible information sources might increase the diversity of our information consumption on the Internet. In parting, I stumbled on a nice service today that aggregates Google News feeds across several countries:

Aggregation of Google News across several countries

Posted in OII, Uncategorized | Tagged | 4 Comments

Naming Places/Features on Google Maps

Google Maps must engage in a cross-cultural (and often cross-lingual) act to publish its maps. Each place or feature name can be given in multiple languages, and occasionally, as NPR's On the Media discusses, cultures don't agree on the name of a shared feature.

I've been aware of differing names for some features, such as the Gulf of California vs. the Sea of Cortez and the Falkland Islands vs. Las Malvinas. However, the difficulty of constructing even one world map upon which multiple cultures can agree is amazing. The NPR story is a good listen or read for those interested.

Google must pick a name (or list multiple names) when labeling these locations, and it is able to name features differently on different language versions. Thus, the map of China on google.cn has a slightly wider, more definite reach than the google.com version, which shows a dashed border that leaves unclear whether Arunachal Pradesh is part of India or China.

Google.cn


Google.com

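The border difference is specific to the google.cn edition, but the labeling side can be requested programmatically: the Static Maps API documents a language parameter for label text. The parameters below follow that documentation but should be treated as illustrative rather than guaranteed:

```python
import urllib.parse

def static_map_url(center, lang):
    """Build a Static Maps request; `language` controls label text."""
    params = {"center": center, "zoom": 5, "size": "400x400",
              "sensor": "false", "language": lang}
    return ("http://maps.google.com/maps/api/staticmap?"
            + urllib.parse.urlencode(params))

# The same region, labeled for different language audiences.
print(static_map_url("Arunachal Pradesh", "en"))
print(static_map_url("Arunachal Pradesh", "zh-CN"))
```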

NPR On the Media
http://www.onthemedia.org/transcripts/2010/07/23/04

Posted in OII, Uncategorized | Tagged | 2 Comments

Homophily and the Internet

I recently had the opportunity to meet Ethan Zuckerman while he was visiting Oxford for TED, where he gave a talk on Listening to global voices. Ethan has been doing amazing work promoting more global interaction through Global Voices. In preparation for our meeting, I read a few of his past blog posts and found this one on homophily, serendipity, and xenophilia. While it's a couple of years old, I think these terms are still important today, and I want to add to the discussion.

In particular, reading some of the linked sources made me ask: how do you defeat homophily without promoting homogenization?

(Homophily, btw, is a nifty sociological term meaning "love of the same," commonly expressed by the adage "birds of a feather flock together.")

Ethan links to an article in the Washington Post, which states:
“Ever larger numbers of people seem to be sealing themselves off in worlds where everyone thinks the way they do. No Walter Cronkite figure unites audiences today, the sociologist noted. We can now choose cable stations, magazines and blogs that see the world exactly as we do.”

The elimination of competition for limited newspaper space, limited radio frequencies, and limited TV channels means many more diverse information sources can co-exist. This decreases the likelihood of a single, unifying figure like Walter Cronkite, but it is a good thing in that, on a macro (perhaps national) level, more diverse media is being consumed.

The challenge is that we tend to consume information similar to that of our friends, and personalized recommender systems promote this. So, while the national diversity of media may have increased, each person is less likely to individually be exposed to this diversity.

The challenge, then, is how to increase the diversity of an individual's media consumption without homogenizing media consumption on a macro level. Personalized recommender systems don't increase an individual's diversity of media, while a generic (non-personalized) recommender system (like a return to limited-channel TV) would give everyone the same recommendations and lead to a national decrease in media diversity.

If we think technology fuels homophily, could it be only so because it also propels heteroization? A paper by Daniel Lemire, Stephen Downes, and Sébastien Paquet looks at diversity in social networks and recommender systems. The paper argues that dense social networks with a long tail are best at promoting diversity, but that elements of randomness might also be incorporated into recommender systems. Another path to diversity might be "representation," where groups of similar users are clustered and each group is given only one vote regardless of its size (think US Senate). Ultimately, however, the paper concludes that "users should avoid relying on a single aggregation strategy to filter content."
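
A toy sketch of that "representation" idea as I read it (my own rendering, not the paper's algorithm): cluster similar users first, then tally one vote per cluster.

```python
from collections import Counter, defaultdict

def representative_recommend(votes, cluster_of, top_n=3):
    """One vote per cluster of similar users, regardless of cluster size,
    so a large like-minded bloc cannot drown out small groups.
    (The clustering itself is assumed to be done elsewhere.)"""
    by_cluster = defaultdict(set)
    for user, item in votes:
        by_cluster[cluster_of[user]].add(item)
    tally = Counter()
    for items in by_cluster.values():
        tally.update(items)  # each cluster votes once per item
    return [item for item, _ in tally.most_common(top_n)]

votes = [("u1", "A"), ("u2", "A"), ("u3", "A"), ("u4", "B")]
clusters = {"u1": 0, "u2": 0, "u3": 0, "u4": 1}
print(representative_recommend(votes, clusters))  # -> ['A', 'B'] (a tie)
```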

At present, the best way to increase individual and national diversity is to change user behavior. As Ethan says in his TED talk, we need to cultivate xenophilia (love of the foreign) and bridge figures who, like musical DJs, can curate foreign news and draw our interest to it. Other systems, from the Internet to traditional media to education, also need to be examined for ways to increase cross-cultural and cross-lingual communication.

Ethan’s talk from TED is now online:


http://www.ted.com/talks/ethan_zuckerman.html

Posted in OII, Uncategorized | Tagged , , | 1 Comment

Resuming (or at least attempting)

After a break from blogging while I completed an intense MSc at the Oxford Internet Institute, I am going to attempt to revive this blog.

Although I haven't written in some time, I have added numerous news stories and other blogs that caught my eye to the right sidebar. Over the next few weeks, I hope to blog about my research work this past year and where I am looking to go.

I am very happy to be able to put my passions for language and computers together in a social science perspective to look at how speakers of different languages communicate online. I've recently submitted my master's thesis, which looked at cross-lingual links among Japanese, Spanish, and English bloggers writing about the Haitian earthquake. I was particularly interested in how often bloggers link to material in foreign languages and what type of blogger (e.g. professionally affiliated vs. personal) is most likely to create these links. I'll have many more details to share, but the much-abridged version is that in my dataset about the Haitian earthquake, cross-lingual hyperlinks are mostly created by individual Japanese- or Spanish-writing bloggers and point primarily to English-language media blogs. There are many more layers to explore, and I hope to share this complexity and the paper itself in a future post.

I am going to continue this research as a doctoral student at the Oxford Internet Institute this fall. I plan to look at additional languages and platforms and to try to tease out the real-world consequences of these interactions. In an attempt to increase my own cross-lingual presence, I've decided to mirror this blog at mojofiti, which seems a really cool community combining human and machine translation. Every post will automatically be translated into 28 languages on mojofiti.

Thank you for reading! The original version of this post is at:
http://www.scotthale.net/blog/?p=4

Posted in Uncategorized | Leave a comment