Data from the People for the People

New tools and huge amounts of open-source data are democratizing the Big Data revolution, but what are the possibilities and limits of this transformation?

“How spies, soldiers, and the public should use open-source intelligence.” This is the title of a recent article in The Economist that details the impact of the vast amount of information that we receive about the Ukrainian war via “mobile-phone video, drone footage, satellite images and other forms of open-source intelligence (OSINT).” 

The main goals of OSINT are discovering, collecting, and evaluating public information, as well as developing plans and tools to collect and process this vast, unstructured information. It is no wonder that it is so appealing to governments, intelligence agencies, official investment offices, international organizations, and researchers. Of course, the idea of collecting and analyzing information is not new for such offices, but several features have begun to alter the playing field: the volume of information, its sources, the ability to process it, and its open-source nature.

The first important change is the amount of information. It is not only that vast information is already available but also that new data is being generated every second. It makes little sense to compute the size of the Internet in “zettabytes,” as this is a moving target. In fact, back in 2017, BNEF forecast that the global storage market would double six times by 2030. Just imagine, as we find ourselves on the verge of Web 3.0, what the future holds with new bytes of information generated by all the devices of the Internet of Things.

The second radical change on the horizon is what we consider information. The new century opened many possibilities for OSINT and information sharing. The collaborative web, Web 2.0, allowed people to participate and share information – and data that previously had no relevance was suddenly key and began to increase exponentially. This includes comments, conversations, information on social and professional networks, photos, videos, audio, and satellite images, as well as information from companies, including data from financial transactions and e-commerce. On top of this, most of these new bytes of information are geographically referenced, and with great precision.

Although these new sources generate large amounts and many types of information, that information is not “structured” in a way that is ready to use. Here is where the third radical change comes in. There has been a rapid increase in the processing power of computers and the development of new algorithms that allow us to convert this new information into structured information that is ready to be processed and analyzed. Algorithms that convert text, images, audio, and video to numbers improve every year, from basic algorithms to more complex deep learning algorithms. Just look at what ChatGPT can do.
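As a minimal sketch of what "converting text to numbers" can mean in its simplest form, the Python snippet below builds a bag-of-words count matrix using only the standard library. The function name and the two sample sentences are illustrative assumptions, not something taken from the article, and modern deep learning models rely on far richer representations than raw word counts.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    # Lowercase each document and split it into alphabetic tokens.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    # The vocabulary is the sorted set of all distinct tokens.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not appear in this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical sample sentences, used only for illustration.
docs = ["Prices rise as demand grows", "Demand falls and prices fall"]
vocab, vectors = bag_of_words(docs)
```

Each document becomes a row of counts over a shared vocabulary, which is the kind of structured input that statistical models can then process.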

Last but not least, an important development is the open-source nature of information: data from people and for people. Governments, international organizations, statistical agencies, and private institutions are all improving their efforts to make this information public. Moreover, many of the tools needed to process it (Python, for example) are open source – that is, there is a plethora of information and also low-cost tools to process that information. The result is truly a democratization of the data.

It is easy to agree that the sheer amount and availability of information is enormous, but important questions remain: how can we use information for analytical purposes in social sciences, including economics, politics, and international relations – and what are the limits? As a practitioner implementing Big Data techniques for economic, social, and geopolitical analysis, my answer to the first question is simple: the potential of open-source intelligence is enormous. I would not be surprised if social science research starts to adapt to this new paradigm by intensifying empirical and experimental work and relying more on interdisciplinary teams.

There is an increasing trend of using these new techniques in the field of economics, despite some initial reluctance to use AI and Big Data some years ago. Covid showed the world some of the value of Big Data. The magnitude and speed of transmission of the pandemic made researchers focus on high-frequency Big Data, which could shed light on what was happening in real time and in high definition. New data on mobility and on transactions from financial institutions were made available for the public good to complement the set of traditional statistics.

Beyond the real-time capacity of Big Data, the potential for other types of analysis is also enormous. The high granularity, or high definition, of this new information will improve the ability to analyze the heterogeneity of individuals. Indeed, these new data sets are key to understanding the magnitude and source of inequality, the impact of inflation on different households, and the effect of policies on different households and corporations. Furthermore, this new information will clearly encourage geographic, sectoral, urban, and climate change analysis. This will be particularly true in the case of sustainability and climate change analysis, where the availability of precise data is key in responding to the climate change challenge.

Understanding that “one size does not fit all,” the availability of such granular information opens the door to the design of smart policies, those that can be directed to where they are most needed and can have a greater impact.

Politics and international relations are also fields with big potential for new sources of information and tools for analytical purposes. The Big Data revolution and the ability to convert text from news, video, official documents, images, and maps into numbers can improve the quality of the analysis. In fact, these new sources of information are key to these disciplines, particularly with regard to the possibilities of real-time information. For example, in 2011, we learned of events in Cairo as they were happening thanks to social networks and particularly some Twitter accounts. We tracked, with high precision, the immigrant flows from Syria to Europe in 2015 just by extracting and processing information from the news. Yet the number of images that have been coming to us from Ukraine since the start of the war has no precedent.

This new wave of AI algorithms can help us observe what is happening as well as understand how the world feels about it, the “how it is happening.” Beyond real-time uses, information in the news can be used for sentiment analysis, with the potential to measure abstract concepts such as uncertainty, happiness, and discontent from speeches, documents, and even people. Understanding what governments say, how they say it, and the public’s response to it is now much easier than ever before.
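One very simple way to turn news text into a measure of an abstract concept such as uncertainty is to count the share of articles that mention uncertainty-related words. The Python sketch below does this with the standard library only. The word list, the function name, and the sample headlines are hypothetical choices made for illustration; real sentiment analysis typically relies on curated lexicons or trained language models.

```python
import re

# Hypothetical lexicon of uncertainty-related terms (an assumption for this sketch).
UNCERTAINTY_TERMS = {"uncertain", "uncertainty", "risk", "unclear", "volatile"}


def uncertainty_index(articles):
    """Return the share of articles containing at least one uncertainty term."""
    if not articles:
        return 0.0
    flagged = 0
    for text in articles:
        # Reduce each article to its set of lowercase words.
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & UNCERTAINTY_TERMS:
            flagged += 1
    return flagged / len(articles)


# Invented headlines, used only to show the calculation.
headlines = [
    "Markets volatile as talks stall",
    "Central bank holds rates steady",
    "Outlook unclear amid rising uncertainty",
    "Exports reach record high",
]
index = uncertainty_index(headlines)
```

Computed day by day over a news archive, an index of this kind would produce a time series that can be compared with traditional statistics.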

Speed and processing capacity come into play here. For example, years ago, it would have taken significant time to analyze the documents coming out of the Chinese Communist Party meetings, but now there are natural language processing techniques that can access, clean, and classify them by topic, relevance, and the connections among policies and strategies – and can do so with relative speed. Indeed, network analysis promises to help us better understand the interconnectedness of people, organizations, policies, and countries. In a world where uncertainty has been on the rise, understanding how any country, sector, or leader interacts with the rest can be key for researchers and practitioners.
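To make the idea of network analysis concrete, the Python sketch below computes degree centrality, one of the most basic measures of how connected an actor is, from a list of links. The actor labels and links are invented for illustration, and the sketch assumes an undirected network without self-links; it does not describe any particular tool.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return each node's share of the other nodes it is directly linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        # Treat every link as undirected.
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


# Hypothetical links among four actors (countries, leaders, or organizations).
links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
scores = degree_centrality(links)
most_connected = max(scores, key=scores.get)
```

In this toy network, actor A is linked to every other actor and therefore has the highest score.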

So, what are the limits to all this information and these algorithms? In principle, it is easy to imagine that the amount of information will continue to increase exponentially and that much of it will be included on e-government platforms and other forms of open-source information or OSINT. The limits and open questions here are more related to how governments regulate information and how the balance among efficiency, legal, and ethical considerations is struck. In fact, this strategic issue is being addressed at different speeds by the United States, the EU, and China.

The potential will continue to grow. Higher processing power, new quantum computers, and new developments will make it possible to process more complex phenomena, and to do so faster. The algorithms are very good at observing and replicating patterns and will continue to improve at these initial levels of the causality ladder. They are masters of seeing and observing statistics and correlations, and can replicate patterns more efficiently than humans can. They are good at describing and at answering what happened. Ask ChatGPT a question about coding, history, or politics and you will be surprised.

However, there is room for improvement in the areas of causality and, in particular, imagining. It’s one thing to detail what happened, and another to explain why it happened (causal inference), to imagine what would have happened if, for example, the Soviet Union had not collapsed at the beginning of the nineties, or to explore how the world should be. These basic questions are key to forecasting, simulating alternative scenarios, and designing good policies.

New technologies, massive amounts of information, new sources of data, and more sophisticated tools are here to stay and will, from here on out, transform the way we approach problems and develop social science research, which will be beneficial for a range of other disciplines. These opportunities bring challenges that require changes in our skills and in the way we work: we need skills that allow us to distinguish between information and disinformation, to understand data and their biases, and to know how to deal with them. In order to fully embrace the revolution and bring it to its full potential, we need multidisciplinary teams and increasing partnerships between government, corporations, and academia.


© IE Insights.