Unstructured vs semi-structured data: Order from chaos

Structured vs unstructured data – it’s a common way of categorising things.
But it’s not quite that simple.

Structured data is easy to grasp. The world of unstructured data, and its transformation into more easily understood, usable and analysable semi-structured data, is less simple.

In this article, we look at structured data, unstructured data, and how semi-structured data brings some order from potential chaos, and with it benefits for organisations that want to gain value from often very large stores of documents, images, sound files, video, social media posts, and so on.

Structured data has… structure

Business information is mostly generated by systems or people. Data from systems is most likely to be structured.

In its traditional form, this is typified by data in relational databases that are queried with SQL (Structured Query Language). In these, structure is everything. Columns that represent variables are set up in advance and populated by rows of data, with a value sitting at the intersection of each row and column.

It’s something we can all visualise. It looks like a spreadsheet, though whether spreadsheets count as structured data is up for debate. But complex SQL database schemas involve the equivalent of numerous spreadsheets (tables, in database-speak) that relate (whence “relational”) to each other and can be filtered, joined and manipulated in many ways because they have common elements (keys).
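To make the idea of related tables and keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and values are invented for illustration and are not taken from any real system.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")

# Schema-on-write: columns and their types are declared before any data arrives.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    );
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ben")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)])

# The shared key (customer_id) is what lets the two tables be joined.
rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id), SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Asha', 2, 65.0), ('Ben', 1, 15.5)]
```

The join only works because both tables carry the same key column, which is the "common element" that makes the data relational.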

Despite the prevalence of unstructured data and the rise of formats that are better described as semi-structured, structured databases are important and won’t go away soon.

They are easy for everything from large-scale enterprise applications to machine learning tools to use, but can be limited in how they are accessed and used, and can be relatively onerous to maintain and to change once initially configured.

The mass of unstructured data

Unstructured data is often generated by people – although not solely – and includes media such as images and sound recordings, social media posts, agent notes, websites and emails.

Unstructured data holds to no predefined data model, and files and objects come in a wide range of sizes, from a few kilobytes for a social media post to potentially terabytes for uncompressed video footage.

Estimates often suggest that the vast bulk of data is unstructured – up to 80% or 90% of data held by organisations.

If that is the case – and we can safely assume it often is – then this presents huge challenges for organisations. Unstructured data is, to a greater or lesser extent, undefined and opaque to search and classification.

That means organisations may not know what is actually there, and that can be a security and compliance risk. At the same time, it means missing out on opportunities to interrogate that data to gain insights and value from it.

No such thing as unstructured data?

But in fact, it is arguable that no data is truly unstructured. The most unstructured data you can think of – image and sound files, for example – comes with metadata headers that provide high-level information on file contents that can be searched and questioned.
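As a rough illustration of header metadata, the sketch below uses Python's standard wave module to create a one-second WAV file in memory and then read back what its header says. The audio itself is just silence generated for the example, and the dictionary keys are our own labels, not part of the WAV format.

```python
import io
import wave

# Build a tiny WAV file in memory so the example is self-contained.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)                   # mono
    writer.setsampwidth(2)                   # 16-bit samples
    writer.setframerate(8000)                # 8 kHz sample rate
    writer.writeframes(b"\x00\x00" * 8000)   # one second of silence

# Read the header back without interpreting the audio content at all.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    header = {
        "channels": reader.getnchannels(),
        "sample_width_bytes": reader.getsampwidth(),
        "frame_rate_hz": reader.getframerate(),
        "frames": reader.getnframes(),
    }

# Duration is derived purely from header fields.
header["duration_seconds"] = header["frames"] / header["frame_rate_hz"]
print(header)
```

Even though the sound itself is opaque, the header alone is enough to search or filter a collection by channel count, sample rate or duration.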

And it is increasingly possible to examine the contents of such files using artificial intelligence/machine learning techniques, for example to categorise the contents of sound and video files. YouTube does this to ensure copyright on music is not contravened when you upload a video, for instance. So these types of data can be interrogated by algorithms and tagged with new metadata, should an organisation wish to throw compute at them.

The semi-structured data revolution

At the same time, there is a growing trend towards more use of semi-structured ways of holding data. Some forms of semi-structured data have been around for some time, such as CSV and XML. A bit later came JSON. All these pair variable names with values in the data itself: CSV through a header row, XML through tags and attributes, and JSON through an explicit key:value format.
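A small sketch can show how the three formats carry the same name-and-value information. It uses only Python's standard library, and the record (a customer name and city) is invented for the example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same illustrative record expressed in each format.
csv_text = "name,city\nAsha,Leeds\n"
xml_text = "<customer><name>Asha</name><city>Leeds</city></customer>"
json_text = '{"name": "Asha", "city": "Leeds"}'

# CSV: the header row supplies the keys for each data row.
from_csv = dict(next(csv.DictReader(io.StringIO(csv_text))))

# XML: child element tags act as keys, their text as values.
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

# JSON: keys and values are explicit in the syntax.
from_json = json.loads(json_text)

print(from_csv)
print(from_csv == from_xml == from_json)
```

In each case the structure travels with the data rather than living in a separately defined database schema, which is what makes these formats semi-structured.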

Later came a wide range of ways of holding and analysing data that were not restricted by predefined structure. Broadly speaking, these can be lumped together as so-called NoSQL databases, but there are a number of types within that catch-all.

They include wide-column stores like Cassandra and HBase (which runs on top of Hadoop), document stores like MongoDB and CouchDB, key-value stores like Riak, as well as graph databases, object databases, and so on. The list gets pretty long.

What links these is the lack of the predefined structure (schema-on-write) that defines SQL databases. So, with these non-SQL formats, potentially any data in any existing format, ie unstructured, can be given a structure as it is queried (schema-on-read). It is even possible to include sound and video files, the ultimate in unstructured-ability, in things that get called databases, such as MongoDB (although there are limitations).
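Schema-on-read can be sketched in a few lines of plain Python, without any particular NoSQL product. In this hypothetical example, documents with inconsistent field names are stored exactly as they arrive, and a structure is imposed only at the moment they are read. The field names and the read_with_schema helper are assumptions made for illustration.

```python
import json

# Raw documents stored as-is; no common schema was enforced when they were written.
raw_documents = [
    '{"user": "asha", "text": "Great service", "likes": 3}',
    '{"user": "ben", "body": "Parcel arrived late", "tags": ["delivery"]}',
    '{"author": {"handle": "chen"}, "text": "Works fine"}',
]

def read_with_schema(raw: str) -> dict:
    """Project one raw document onto the fields this particular query needs."""
    doc = json.loads(raw)
    author = doc.get("author")
    nested_handle = author.get("handle") if isinstance(author, dict) else None
    return {
        "user": doc.get("user") or nested_handle,
        "text": doc.get("text") or doc.get("body") or "",
        "likes": int(doc.get("likes", 0)),
    }

# The schema is applied here, at read time, not when the data was stored.
records = [read_with_schema(raw) for raw in raw_documents]
for record in records:
    print(record)
```

A different query could apply a different projection to the same stored documents, which is the flexibility that a fixed, write-time schema does not offer.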

The big advantage of being able to put unstructured data into some form of semi-structured format is that it enables a range of use cases to emerge, such as analytics to spot consumer behaviour and market trends, and sentiment analysis.

Arguably, analytics on this kind of data gives deeper insight into users. An SQL database might hold name, date of birth, address, etc, but analysing unstructured data – via making it semi-structured – can get closer to what consumers think.

It is also possible to put some structure on the unstructured and make use of it. A photograph of a delivered item would be unstructured data, but metadata from the image file could be combined with geo-tracking information from delivery vehicles in a business intelligence tool.
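The delivery example might be sketched as follows. All of the data here is hypothetical: the photo records stand in for metadata already extracted from image files, the pings stand in for vehicle tracking data, and matching on vehicle plus nearest timestamp is just one plausible way to combine them.

```python
from datetime import datetime

# Hypothetical metadata already extracted from delivery photo files.
photos = [
    {"file": "parcel_001.jpg", "vehicle": "VAN-7", "taken_at": "2024-03-01T10:15:30"},
    {"file": "parcel_002.jpg", "vehicle": "VAN-7", "taken_at": "2024-03-01T11:02:10"},
]

# Hypothetical geo-tracking pings from delivery vehicles.
pings = [
    {"vehicle": "VAN-7", "at": "2024-03-01T10:15:00", "lat": 53.80, "lon": -1.55},
    {"vehicle": "VAN-7", "at": "2024-03-01T11:00:00", "lat": 53.79, "lon": -1.53},
    {"vehicle": "VAN-9", "at": "2024-03-01T10:15:20", "lat": 51.50, "lon": -0.12},
]

def nearest_ping(photo: dict, all_pings: list) -> dict:
    """Return the ping from the same vehicle closest in time to the photo."""
    taken = datetime.fromisoformat(photo["taken_at"])
    candidates = [p for p in all_pings if p["vehicle"] == photo["vehicle"]]
    # Raises ValueError if the vehicle has no pings; acceptable for a sketch.
    return min(candidates,
               key=lambda p: abs(datetime.fromisoformat(p["at"]) - taken))

# Enrich each photo record with the location of the closest matching ping.
enriched = []
for photo in photos:
    ping = nearest_ping(photo, pings)
    enriched.append({**photo, "lat": ping["lat"], "lon": ping["lon"]})

for row in enriched:
    print(row)
```

The enriched rows are ordinary tabular records, so they could be loaded into a reporting or business intelligence tool like any other structured data.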
