Enhancing Human Perception with ML, AI, IR and NLP

Interactive Text Mining and Information Retrieval at Work

Marco Brambilla
Towards Data Science
3 min read · Dec 12, 2017

Textual data is still one of the most prominent sources in big data analytics. Just as sensors perceive the status of the environment and report the data they collect, humans perceive the surrounding world and express their understanding as textual data. In a sense, humans are already transforming the input data into some kind of valuable information. Human-generated textual content is therefore already a step beyond the “raw data” level: it embeds elaborated, perceived insights about reality.

We can mine the content of such textual information and extract knowledge about the observed world, about the observer (the human), and about the language used. On top of that, predictive analytics can infer real-world variables, and we can add all the context data and metadata that may be available (e.g., timestamps, geolocation, and so on).

However, performing real NLP (natural language processing) is a difficult task. Here we are not talking about basic text processing, which has become common practice, but about real text understanding.

So, how can we leverage imperfect NLP to generate perfect applications? Simple: by bringing humans in the loop!

Just as you need a microscope, a telescope, or a macroscope depending on which part of the real world you want to analyze, you need a textscope for studying text. This must be an intelligent and interactive tool that combines information retrieval, text mining, data analysis, and human-in-the-loop techniques.
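
A minimal sketch of what such a loop could look like in practice (my own illustration, not something presented in the talk): retrieve documents with TF-IDF, surface the dominant terms of the results to the analyst, and let the analyst refine the query for the next round. The toy documents, queries, and function names below are purely hypothetical.

```python
# Toy human-in-the-loop retrieval loop: TF-IDF ranking plus term inspection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "flu cases rise as temperatures drop in the northern regions",
    "new vaccine trial reports promising results against influenza",
    "stock markets rally after central bank announcement",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)

def retrieve(query, top_k=2):
    """Rank documents by cosine similarity to the query."""
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(docs[i], round(float(scores[i]), 3)) for i in ranked]

def salient_terms(doc_index, top_n=3):
    """Surface the highest-weighted terms of a document to the analyst."""
    row = doc_matrix[doc_index].toarray().ravel()
    terms = vectorizer.get_feature_names_out()
    return [terms[i] for i in row.argsort()[::-1][:top_n]]

# The analyst inspects the results and the salient terms, then refines the
# query (hard-coded here, but it would come from the human in the loop).
print(retrieve("flu outbreak"))
print(salient_terms(0))
print(retrieve("flu outbreak vaccine"))
```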

Several techniques can be combined: for instance, entity extraction, word classification, the integration of text mining with causal analysis over time series (including non-textual data), and so on.
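
Entity extraction on its own, for instance, is nowadays a few lines of code with any modern NLP toolkit. A minimal, illustrative sketch with spaCy (assuming the small English pipeline is installed) could look like this:

```python
# Minimal entity extraction sketch with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline has been downloaded
doc = nlp("Marco Brambilla attended the IEEE BigData Conference 2017 in Boston.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PERSON, ORG, GPE, DATE
```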

One example is a heuristic for jointly optimizing causality (based on time series analysis) and coherence (based on topic analysis).
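
The sketch below is only a rough approximation of that idea, not the method presented in the talk: it scores a candidate topic term by how strongly its frequency-over-time series correlates with an external, non-textual time series (the “causality” proxy) and by how well it co-occurs with the other terms of the topic (the “coherence” proxy), and then trades the two off with a weight. The function names, the correlation-based proxy, and the parameter alpha are all assumptions made for illustration.

```python
import numpy as np

def causality_score(term_series, external_series):
    """Proxy for causality: absolute Pearson correlation between a term's
    frequency-over-time series and an external (non-textual) time series."""
    return abs(np.corrcoef(term_series, external_series)[0, 1])

def coherence_score(term, topic_terms, cooccurrence):
    """Proxy for coherence: average co-occurrence strength between the term
    and the other terms currently assigned to the topic."""
    others = [t for t in topic_terms if t != term]
    if not others:
        return 0.0
    return float(np.mean([cooccurrence.get((term, t), 0.0) for t in others]))

def combined_score(term, term_series, external_series,
                   topic_terms, cooccurrence, alpha=0.5):
    """Heuristic trade-off between causality and coherence (alpha is assumed)."""
    return (alpha * causality_score(term_series, external_series)
            + (1 - alpha) * coherence_score(term, topic_terms, cooccurrence))
```

Terms with a high combined score would be kept or up-weighted in the next iteration of topic modeling, and an analyst could inspect and override the ranking at each step.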

Unfortunately, each application requires massive customization and training of such a tool. We need abstraction and unification to minimize the learning effort when covering different domains and applications.

This is closely related to our research on Social Knowledge Extraction, which analyses human-generated content with the purpose of generating new, formalized knowledge (for instance, to enrich existing knowledge bases such as DBpedia).
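
As a hedged illustration of what feeding such extracted knowledge back into a knowledge base can involve, the sketch below looks up a candidate entity label on the public DBpedia SPARQL endpoint to find matching resources. The entity label and the surrounding code are illustrative, not the actual pipeline used in this research.

```python
# Illustrative lookup of an extracted entity label on the public DBpedia endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?resource WHERE {
        ?resource rdfs:label "Politecnico di Milano"@en .
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["resource"]["value"])  # candidate DBpedia URIs for the entity
```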

This story is inspired by a keynote speech by ChengXiang Zhai, ACM Fellow from the University of Illinois at Urbana-Champaign (Institute for Genomic Biology, School of Information Sciences, Dept. of Statistics), given at the IEEE BigData Conference 2017.

