Spanish as a Native AI Language

Artificial intelligence is an integral part of our day-to-day lives, but it still struggles to process the world’s second-most spoken language: Spanish. Professor Elena Gonzalez-Blanco, whose work includes an AI-based recommendation engine for song lyrics in Spanish, explains why and highlights the importance of digitization in preserving linguistic heritage as AI develops.

With each day that passes, we hear more about artificial intelligence. In fact, AI is fast becoming part of our daily routine, even if we don’t fully understand what it is. For example, we no longer think twice about smiling to unlock our mobile phone, all the while (probably) unaware that in that instant, thousands of pixels are being converted into a data feed that the latest deep learning algorithms use to carry out facial recognition with more than 98% accuracy.

AI’s rise has been rapid. DeepMind’s AlphaGo beat Go world champion Lee Sedol in 2016, and since then a combination of colossal amounts of data, powerful processing hardware (GPUs), and mature neural network frameworks (such as TensorFlow) has turned the machine learning theories developed more than 60 years ago by Marvin Minsky and John McCarthy into a programmable reality.

It is incredible indeed. However, beneath that magic that allows computers to behave like the human brain, there lies a combination of technologies and data that go about solving problems in very different ways from the human brain – and these methods sometimes fail. The paradox is that on one hand, we are frightened by a world in which robots could take our jobs, yet at the same time we are unable to communicate for longer than a few minutes with Siri, Alexa, or Google Home. In fact, our “conversations” with today’s virtual assistants or chatbots (using voice or text) do not go beyond requesting basic information, giving simple commands, or establishing specific routines.

Making machines talk and write is one of the most complex tasks computing has ever faced. As early as 1950, the British scientist Alan Turing proposed the imitation game to test a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing’s goal remains distant today. Why? Human language is highly varied and complex. It is a living system, and the algorithms behind AI’s digital neurons learn only from the data they are fed: they acquire vocabulary and improve their linguistic structures through constant exposure to conversation.

The main scientific developments and the large companies currently working on AI originated in English-speaking countries and have therefore trained their systems on data in English. It is thus no surprise that, with regard to the linguistic side of AI, much more progress has been made in English than in any other language. Of course, linguistic reality is very different from the technological one: Spanish is the world’s second-most spoken language, with more than 585 million speakers and an average growth rate of 7.5% per year. Yet there is no AI capable of processing the Spanish language’s numerous dialectal variants (due to different geographical, social, or contextual circumstances), or at least not to a particularly high standard.

Spanish lags behind English largely because of the fragmentation of Spanish language technology companies, which are generally small, oriented to specific functions such as translation, and focused on the Spanish spoken in Spain rather than, for example, in Latin America. Moreover, despite the large amount of data now available in Spanish, much of it cannot be used because it is owned by private companies, and even the data held by public and cultural institutions sits in silos and is not easily accessed. Therefore, companies and large clients looking to use AI solutions in Spanish must choose programs that were not “manufactured” in Spanish and, worse yet, were trained on English data translated into Spanish. As a result, both efficiency and accuracy are greatly reduced.

In order to train a native Spanish “speaking” robot for use in the legal field, a huge amount of Spanish legalese is required – in addition to knowledge of Roman Law and the functioning of jurisprudence in Spain. In the case of Latin America, for example, differentiating the many varieties of Spanish on the continent demands knowing not only the lexical variants but also the phonetics and even the situational functioning (pragmatics) of some expressions in certain contexts. These are the nuances that are easily lost in translation.

There is progress, though, thanks to a growing interest in the development of AI applied to language. The number of scientific papers on Natural Language Processing (NLP) and AI applied to language increased by 34.5% between 2019 and 2020, which illustrates the growing maturity of the technology. Moreover, the development of AI is key to economic development: China leads the technological revolution, followed by the United States, while Europe struggles to avoid falling further behind and looks for niches linked to new opportunities and to its cultural, economic, and historical reality. Language is undoubtedly one of these opportunities, since the asset that serves as a starting point – the data – is available and not yet being properly put to use.

We have barely begun to explore the market potential and data variety of Spanish as a native AI language. The wheel does not need reinvention. Rather, we just need to open up the data and make it available to train existing algorithms and align business so as to create an AI as powerful as the number of Spanish speakers around the world. This will not only encourage the creation of new companies and better algorithms but reinforce the digitalization and digital preservation of an entire cultural, linguistic, and historical heritage that deserves a privileged space in the future of international digital transformation.


© IE Insights.

