This past Thursday, 31 March 2016, I had the opportunity to do a walkabout at Strata+Hadoop World, the Big Data conference held annually in San Jose, CA. The conference is presented by O’Reilly Media and Cloudera, and this year marked the 10th anniversary of Hadoop, the Apache Software Foundation (ASF) open-source big data project for distributed computing. Comprised of a number of foundational technologies, from HDFS (Hadoop Distributed File System) to YARN, MapReduce, Hive, Pig, and more, the Hadoop ecosystem has radically changed the way companies store, process, analyze, and draw insights from huge quantities of information. In recent years streaming technologies such as Apache Spark have also come into their own, and publish/subscribe mechanisms like Kafka have helped take us from a world primarily made up of data-at-rest to a world increasingly comprised of data-in-motion.

Another major trend in big data over recent years has been to externalize data that had formerly been kept deep in the bowels of organizations. After all, if you collect data and no one can ever use it, is it doing anyone any good? Bruce Andrews, Deputy Secretary of the U.S. Department of Commerce, addressed the Strata+Hadoop audience in a keynote highlighting the availability of U.S. federal government data sources, and initiatives such as the Commerce Data Service and the Commerce Data Advisory Council.

This externalization is increasingly true for companies, NGOs and governments around the world. For example, the Humanitarian Data Exchange (HDX), managed by the UN Office for the Coordination of Humanitarian Affairs (OCHA), is a key resource for the public sharing of data on day-to-day issues like food prices, or crises such as Ebola, the Nepal Earthquake, and the global effects of El Niño. Other exchanges exist for water point data and health data.

While Mr. Andrews emphasized that “our data on climate reaches from the depths of the ocean to the surface of the sun,” most of the show’s attention was focused on the domestic U.S. business market. In contrast with such high-minded public benefits from global, even extraterrestrial data, what you saw from booth to booth at Strata+Hadoop were more down-to-business, commercially focused applications: business intelligence, fraud detection, security, industrial, retail and advertising. The technologies underlying them were squarely targeted at corporate IT. And, to state the painfully obvious, some vendors were targeting exclusively a U.S.-centric, English-language-only IT world.

In a way, there is even a problem of terminology. Take the term localization. In the context of linguistics, localization is the step beyond simple translation. But in Big Data, the term has (at least) two additional meanings. The first is data localization: storing data within a legal geographical region for regulatory compliance. The second is localization of resources, such as in YARN: copying remote information to a local file system for improved processing. During my conversations I got a few double-takes when I so much as brought up the term localization, and needed to clarify what I meant by it.

Such an Anglophone stereotype of the B2B world is neither universally true nor altogether fair. In private talks with vendors, I learned that many have apps, tools and technology already internationalized, localized and in use by customers around the world. If a vendor supported localization, it was often a need-to-have for compliance in the EU or for acceptance in Asian markets. One trend I spotted: in my informal survey of which languages vendors were looking to add, beyond what they might already be supporting, “Arabic” was the most frequently cited.

Other vendors knew they weren’t localized — yet — but had it on their strategic roadmap. For them, it was not a matter of “if” but “when.” Others still were open to the conversation, but it was apparent their whole focus had been on the innate data science or engineering side of their own business; they simply had never gone through localization with their current or any prior product.

There is a whole spectrum of work required to truly be a globalized brand, from the obvious points of customer contact, such as social media and web content, to webinars that need dubbing or subtitling, and translations for presentations and collateral material, to changing the very UI/UX, documentation and help files, and potentially even the source code of the products themselves. I had quite a few lively conversations at the show.

For example, at the booth for RStudio I spoke with one of their staffers who hadn’t heard about RL10N, the project to localize R. It was one of those “The More You Know” moments for him. If R and its packages are localized, the next wave will require tool kits like RStudio to support localized development.

Speaking of Big Data and translation, did you know the name for Apache Flink was taken from a word that exists in German, Swedish and Dutch, flink, which can be translated as “agile”? I found that out from a great blog post on MapR’s site.

I also ran into Information Builders at the show. Founded in 1975, they have weathered many radical changes in the IT industry, and rank highly among Business Intelligence (BI) providers. Visiting the Information Builders website, I noted it was already localized for multiple markets, with worldwide offices spanning every continent apart from Antarctica.

I ran into many other companies at the show, and had many great conversations, from the folks who ignited Spark at Databricks, to Data Science, Inc., Jethro, Snowflake, DataTorrent, plus old friends (and new) at Aerospike, and others with Basho and Objectivity.

If anything, companies can benefit from translation and localization even within the Strata+Hadoop World environment itself, because later this year, the event will move on to Beijing (4-6 August 2016). The Call for Proposals (CFP) period for presentations in Beijing is now open, and most sessions will be in Mandarin! Are you ready for your Chinese audience?

What are your thoughts on language localization and its role in big data? Are these merely two trends passing in the night, or is there a need for more attention to customizing applications, interfaces, documentation, even APIs for global audiences? Let me know! I’m on Twitter at @PeterCorless.

Do you have a big data or other major IT project you need to localize as you go global? Send an email and share your challenges.