September 8, 2013

Different cultures, languages, and alphabets add an extra dimension to the search for strategically important information in unstructured Big Data.

The internet has accelerated the globalization process, bringing people, businesses, governments, and countries closer together. Online, we can form communities, make business deals, and maintain personal relationships around the world. Content is created at a rapid pace, in the form of web sites, social media posts, blogs, news items, product reviews, comments, and other unstructured text.


Although English has become the “lingua franca” in large parts of the world, the internet still contains a multitude of languages. For companies that want to find out how their products are talked about online, or researchers monitoring events in a specific field, this presents a lot of opportunities.

Imagine if an analyst could learn, in real time, what people around the world were saying about a company's products, not just in English, but also in French, Spanish, and German, as well as Russian, Chinese, Japanese, and Arabic. How could the analyst accomplish this, and how could the company use the results to improve its sales and bottom line?


Not surprisingly, simply translating words from one language to another is not sufficient. Languages differ in more than vocabulary and spelling. Expressing the same meaning in another language often means rethinking the phrasing, not just swapping words.

As we are increasingly aware, cultural context also matters. National and local customs and cultural references remain important. Respecting those references, and understanding their significance, is necessary to get a good grasp of the global marketplace.


There are also some practical issues in dealing with content in “all” languages. Many of the major languages, for instance Chinese, Japanese, Russian, Arabic, Farsi, Urdu, and Hindi, do not use the Latin alphabet. Searching and mining this unstructured content can pose technological challenges, but these can be solved. Readable text on a web page in, say, Cyrillic does not mean that the page's source is readable by a search application. Sometimes downloaded content has to be converted and saved in a specific format. An analyst who can automate this process and build it into the content retrieval workflow will save a lot of manual effort.
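One common form of the conversion step is character encoding: a page may be stored in a legacy encoding rather than UTF-8. The Python sketch below is one way such a step might be automated, not a description of any particular product. The function names and the list of candidate encodings are assumptions made for illustration.

```python
# Assumed candidate encodings, tried in order; adjust for the languages monitored.
# Single-byte encodings such as windows-1251 accept almost any input,
# so they should come after the stricter UTF-8 check.
CANDIDATE_ENCODINGS = ("utf-8", "windows-1251")


def decode_download(raw: bytes, candidates=CANDIDATE_ENCODINGS) -> str:
    """Decode downloaded bytes using the first candidate encoding that fits."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what decodes and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")


def save_as_utf8(raw: bytes, path: str) -> None:
    """Store the decoded text as UTF-8 so later tools see one encoding."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(decode_download(raw))
```

Trying encodings in order is a heuristic and can guess wrong. When a page declares its encoding in its HTTP headers or markup, that declaration is usually the better starting point.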

Not all search tools are capable of mining content in different alphabets or character sets. Researchers planning to analyze unstructured text in Chinese, Russian, Japanese, or other “non-Latin” languages need technology that supports them.
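As a rough illustration of what such support involves, the Python sketch below normalizes text before matching a search term, so that differences in letter case or character width do not hide a match. The function names are hypothetical, and plain substring matching is a simplification; real search tools typically add language-specific tokenization on top.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold compatibility forms (e.g. full-width letters) and letter case."""
    return unicodedata.normalize("NFKC", text).casefold()


def contains_term(document: str, term: str) -> bool:
    """Report whether the term occurs in the document after normalization."""
    return normalize(term) in normalize(document)


# A Cyrillic term matches regardless of case.
print(contains_term("Новый ТЕЛЕФОН уже в продаже", "телефон"))
# A full-width Latin brand name in Japanese text matches its plain form.
print(contains_term("ＳＯＮＹの新製品", "sony"))
```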

It may also take some time to find the best and most relevant content sources in the country and language being examined. In social media, Facebook rules most of the world, but not, for instance, Russia, Brazil, or East Asia. In China, Baidu, not Google, is the most popular search engine. There are also differences from country to country in web site design and coding, blogs and communities, and online news sources.


Services like Google Translate and Bing Translator have come a long way and produce decent results most of the time, especially on word-by-word translations. However, auto-translating large blocks of text, such as a news article or blog post, often gives substandard results, and in some cases errors that lead to misinterpretation.

To achieve great results, you need human analysts: experts who understand the topics, locations, cultures, and languages you want to examine. It’s certainly possible to throw a set of translated words into a search application and get results. The problem is that an analyst who does so is likely to miss important content, and not even know what was missed.

At the very least, the content one finds needs to be examined by knowledgeable people. Auto-translation might be helpful in filtering content and weeding out the irrelevant items. However, in the final stage, the analyst must be able to fully understand the essence of the content and convey its significance to managers, clients, and decision makers.


Monitoring and mining online content in a variety of languages can be of great benefit to a business by providing business opportunities, opening new markets, and spurring innovation. There are, however, a number of issues to consider, semantically, technologically, and analytically. When the work is done correctly, the world is at an analyst’s fingertips.

TextOre, Inc.