New Urdu-Punjabi Machine Translation System

Guru Gobind Singh Bhawan at Punjabi University, Patiala, India. (Image: Wikipedia)

A new project from Punjabi University, Patiala, in India has created a Machine Translation (MT) system designed specifically to improve the quality of direct translation between Urdu and Punjabi.

An article in the Times of India details some of the technical challenges the research team faced, such as handling missing diacritical marks and the ambiguous meanings of many Urdu words, as well as creating a parallel corpus and dictionaries.
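To see why missing diacritical marks cause trouble, consider the small Python sketch below. It is only an illustration, not the research team's method: the helper name strip_diacritics and the example word pair are our own assumptions, chosen because the two words are spelled identically once their vowel marks are dropped, as they usually are in everyday Urdu text.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (short-vowel signs and similar) from text.

    Hypothetical helper for illustration only; it mimics how everyday
    Urdu text is usually written without these marks.
    """
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# Two different fully marked words, written with escapes so the marks are visible.
ILM = "\u0639\u0650\u0644\u0652\u0645"   # 'ilm, "knowledge"
ALAM = "\u0639\u064e\u0644\u064e\u0645"  # 'alam, "flag"

bare_ilm = strip_diacritics(ILM)
bare_alam = strip_diacritics(ALAM)

# With marks the words differ; without them the spelling is identical,
# so a translation system has to pick the meaning from context.
print("Distinct with marks:", ILM != ALAM)
print("Identical without marks:", bare_ilm == bare_alam)
```

A translation system that receives only the unmarked spelling has to decide from the surrounding words which meaning was intended, which is one reason dictionaries and a parallel corpus matter.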

Because it was designed with these specific languages in mind, its creators claim that the system is more accurate than a general-purpose translation system like Google Translate. The software has been integrated into Akhar 2016, an Indic-language word processor. The stable release is planned for 23 September 2016.

Urdu-to-Punjabi: Twice the population of German-to-Italian

Urdu is the national language of Pakistan (officially replacing English in 2015), and is one of 22 scheduled languages of India. Urdu is spoken by over 162 million people, primarily spread across India and Pakistan. However, the adoption of Urdu in Pakistan is not without controversy. No region in Pakistan is majority-Urdu speaking, and many see it as the language of the Muhajir (or Mohajirs), refugees who came to Pakistan after the 1947 partition. Regardless, because of its large linguistic base, it is a vital language for communications in South Asia.

Next to the official Urdu (which many only speak as a second language), Punjabi is the largest language bloc in Pakistan. Its primary speakers account for 48% of the country’s population, nearly half of Pakistan’s 182 million citizens. Like Urdu, the Punjabi community is split between India and Pakistan: 90 million speak Western Punjabi in Pakistan (and predominantly write using the Shahmukhi alphabet), and another 29 million speak Eastern Punjabi in India (and primarily write in the Gurmukhi alphabet), for a total linguistic base of 119 million.

By way of comparison, about 80 million people live in Germany and 60 million in Italy. The Urdu community (162 million) is roughly double the population of Germany, and the Punjabi community (119 million) is roughly double that of Italy.

Additional Research

This is not the only project produced by the Research Centre for Punjabi Language Technology, which was founded in 2004. Other projects include web resources to teach Punjabi (such as a Punjabi grammar checker), an Urdu-to-Hindi translation system, a 97% accurate Gurmukhi OCR system, a Shahmukhi-to-Gurmukhi transliteration system, and much more.

These projects are harbingers of systems that translate directly between languages without English as an intermediary. While English dominated the Internet and computer software from their Anglophone origins, a rising cluster of languages is emerging as their speakers become more engaged in the web and social media via mobile phones. In 2016, there will be over 204 million smartphones in India, and 40 million in Pakistan.

In both of these polyglot nations, India and Pakistan, nearly seven decades of decolonization have made English far less of a lingua franca than it was in years past. There is more need to communicate peer-to-peer in the common vernacular, such as Urdu and Punjabi. The same is true of other parts of the world where English, French, and other western languages associated with colonial empires and commerce are declining relative to the growth of native languages. As more computing power becomes available worldwide, and more smartphones reach developing markets, expect evolving methods to support language translation without any English in the equation.

What are your thoughts? What translation language pairs do you find are vital to your organization’s localization success? We’d love to hear. Email us at projects@e2f.com and let us know.
