Deciphering the stagnant pool of unstructured data
IDG Staff Feb 01st 2018

The amount of data on the internet is growing exponentially, and with it the volume of unstructured data. Inducing structure into unstructured data is a complex task, but it has direct benefits for user experience and customer engagement. Advances in machine learning are being used to crack some of the most difficult problems in this domain.

Exploding volume of data

As content on the web explodes, so does the amount of unstructured data it houses. Unstructured data is data that cannot be converted into a machine-interpretable structure using simple parsing methods. Examples include blobs of text, audio files, videos, images, and chats, among others.

Unstructured data is an important component of the internet and is often the central focus of the content we browse. Even though there are places where this unstructured data is “aggregated”, “aggregation” does not mean “structuring the unstructured data”. For example, even though platforms like YouTube aggregate videos, they do not necessarily structure the data inside the videos, as the following sections show.

How structuring the unstructured can change things

Let’s go back to the YouTube example. YouTube aggregates videos but does not induce structure into them. What would happen if a video aggregation platform gained the capability to structure its videos? Users on the platform would be able to search for a video by entering a description of its scenes in the search box.

[Figure: Structured and unstructured data]

For example, if a user wants to see a video that contains a part where a “car is entering a poultry farm”, the user can simply search for this phrase, and the platform will find all the videos with scenes in which a car enters a poultry farm. This example also clarifies what exactly it means to “induce structure in data”.
Inducing structure into data implies that the machine is able to “understand what the unstructured data means”. In the example, the video platform understands the content of the videos it hosts, which is why it can pick out exactly those videos that contain scenes of a car entering a poultry farm.

Structure can be induced not only in videos but also in audio, images, and text blobs. Inducing structure in unstructured data sets enables deeper context-based searches, in which the user can search not only on superficial attributes like the file title but also on the actual semantic meaning of the content inside the video, audio, or other media.
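To make the idea of searching on content concrete, here is a minimal sketch in Python. It assumes that some model has already produced scene-level tags for each video; the file names, the tags, and the `build_index` and `search` helpers are hypothetical and only illustrate how a structured index lets a platform search on what a video contains rather than on its title.

```python
from collections import defaultdict

# Hypothetical output of a video-understanding model: scene tags per video.
# The file names and tags are made up for illustration.
video_tags = {
    "clip_001.mp4": {"car", "poultry farm", "gate"},
    "clip_002.mp4": {"car", "highway"},
    "clip_003.mp4": {"tractor", "poultry farm"},
}

def build_index(tags_by_video):
    """Map each tag to the set of videos in which it appears."""
    index = defaultdict(set)
    for video, tags in tags_by_video.items():
        for tag in tags:
            index[tag].add(video)
    return index

def search(index, required_tags):
    """Return the videos that contain every requested tag, sorted by name."""
    if not required_tags:
        return []
    matches = None
    for tag in required_tags:
        videos = index.get(tag, set())
        matches = videos if matches is None else matches & videos
    return sorted(matches)

index = build_index(video_tags)
print(search(index, ["car", "poultry farm"]))
```

Only the first clip carries both tags, so it is the only result. The hard part in practice is producing the tags, which is what the techniques in the next section address.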

Techniques for structuring the unstructured

Machine learning is the key tool for inducing structure in unstructured data. Natural language processing can be used to structure a massive blob of text. Recent advances in natural language processing let us use learned vector representations of words, known as ‘word embeddings’, to build highly accurate language models. These language models are essentially weights assigned to the connections of a neural network, often many layers deep, with non-linear activations. They are combined with techniques such as generative recurrent networks to cluster, classify, and summarize text.
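The sketch below shows, under simplifying assumptions, why word embeddings are useful for grouping text: words with related meanings end up with vectors that point in similar directions. The three-dimensional vectors here are hand-written for illustration and are not taken from any trained model; real embeddings are learned from large text collections and have far more dimensions. The `most_similar` helper is likewise hypothetical.

```python
import math

# Toy 3-dimensional "embeddings", hand-written for illustration only.
embeddings = {
    "car":     [0.9, 0.1, 0.0],
    "truck":   [0.8, 0.2, 0.1],
    "chicken": [0.1, 0.9, 0.2],
    "hen":     [0.0, 0.8, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors):
    """Return the other word whose vector is closest to that of `word`."""
    return max(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
    )

print(most_similar("car", embeddings))      # truck
print(most_similar("chicken", embeddings))  # hen
```

Clustering and classification of text build on the same kind of distance comparison, applied to vectors for words, sentences, or whole documents.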

Recurrent neural networks are used extensively to process audio. Recurrent nets rely on the assumption that the statistical distribution of a sound file remains the same throughout the file, which lets the network apply the same weights at every time step while carrying a hidden state forward. This makes them very effective at parsing and inducing structure in sound files.
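As a rough illustration of the recurrent idea, the following sketch runs a single recurrent unit over a short sequence of numbers standing in for audio samples. The sample values and the weights are hand-picked assumptions, not trained parameters, and the `rnn_forward` helper is hypothetical; the point is only that the same weights are reused at every step and that the hidden state carries information forward.

```python
import math

def rnn_forward(inputs, w_in, w_rec, bias):
    """Run a one-unit recurrent cell over a sequence of numbers.

    The same three parameters are applied at every time step.
    """
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        states.append(hidden)
    return states

# Hypothetical audio samples and untrained, hand-picked weights.
samples = [0.0, 0.5, 1.0, 0.5, 0.0]
states = rnn_forward(samples, w_in=1.0, w_rec=0.5, bias=0.0)
print(states)
```

The first and last samples are both zero, yet the last hidden state is not zero, because it still reflects the samples that came before it.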

Image processing has also benefited greatly from recent advances in deep learning, through the refinement of a specialized neural network known as the Convolutional Neural Network, or CNN. CNNs exploit the spatially or temporally repetitive nature of images and speech to build neural networks that need far fewer connections than a fully connected neural network.

This makes CNNs much easier to train, and they require much less data. CNNs can be used very effectively for structuring images and video. Pre-trained CNNs are also publicly available, and specialized CNNs can be built on top of them for specific tasks by training further on task-specific data.
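The saving in connections can be checked with simple arithmetic. The sketch below counts the weights in a fully connected layer and in a convolutional layer for one assumed case: a 28x28 grayscale image mapped to 16 feature maps of the same size with a 3x3 kernel. The image size, kernel size, and helper names are illustrative choices, and biases are ignored.

```python
def dense_weights(n_inputs, n_outputs):
    """Weights in a fully connected layer (biases ignored)."""
    return n_inputs * n_outputs

def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Weights in a 2D convolutional layer (biases ignored).

    The count does not depend on the image size, because the same
    kernel is reused at every position of the image.
    """
    return kernel_h * kernel_w * in_channels * out_channels

# Assumed example: 28x28 grayscale input, 16 output feature maps of 28x28.
height, width = 28, 28
dense = dense_weights(height * width, 16 * height * width)
conv = conv_weights(3, 3, in_channels=1, out_channels=16)
print(dense, conv)  # 9834496 144
```

In this example the fully connected layer needs about 9.8 million weights while the convolutional layer needs 144, which is why a CNN has far fewer parameters to fit.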

Recently, CNNs have also been applied to videos with a high success rate. A video is treated as a stack of image frames, which is then processed like a multidimensional image. CNNs have been used to tag objects in both images and videos with very high accuracy.

The future

As the volume of data grows, so will its complexity, and structuring more complex data will become harder over time. Fortunately, techniques for structuring data have advanced greatly, and as computational power grows and computation becomes cheaper and more available, it will become possible to parse the massive amounts of unstructured data that float around the internet like unmined gold.