What is Unstructured Data?
Unstructured data refers to datasets that aren’t stored in a traditional, structured database format (a relational database or RDBMS). Unstructured data often does have an internal structure, but because that structure is not predefined through a data model or schema, it does not fit neatly into the rows and columns of an RDBMS.
Unstructured data represents the vast majority of data in the world, quite simply because it could be anything: text and other business documents, email, social media, images, video, audio, sensor and IoT data, and much, much more. Unstructured data can be human-generated or machine-generated and may appear in a textual or a non-textual format.
Unstructured data is commonly estimated to make up 80-90% of all data in the world today, and its exponential rate of growth far outpaces that of structured data. Unstructured data is not only immense in quantity, but also holds a tremendous amount of business value in the modern enterprise. Yet the fundamental nature of unstructured data means that much of this value remains untapped and unsecured.
What is structured data vs. unstructured data?
To better understand unstructured data, let’s look at how it compares to structured data:
Structured data is typically made up of discrete, well-defined values: dates, names, addresses, phone numbers, SSNs, credit card numbers, medical record numbers, product numbers, transaction information, etc. More importantly, structured data is organized in a pre-defined format, such as the columns and rows of an Excel spreadsheet or Google Sheet. Structured data usually lives within a relational database (RDBMS).
Most importantly, the framework of structured data makes it easy for data entry, searching, comparison and data extraction functions. The structured format means it is eminently searchable, whether with human-generated queries using a structured query language (SQL) — or via algorithms that pull in specific types of data and field names, such as alphabetical or numeric, currency or date.
By contrast, unstructured data is usually text-heavy, does not have a pre-defined format and does not live within an RDBMS. As a result, traditional search and analytics tools have great difficulty searching and analyzing it. For example, while emails or social media posts can be searched based on hashtags or metadata, traditional search and analytics tools cannot parse the full text of these messages.
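To make the contrast concrete, here is a minimal Python sketch. The table, customer names and email text are all invented for illustration. It shows how a pre-defined schema lets a SQL query answer a precise question, while the same fact buried in free text can only be matched as a keyword.

```python
import sqlite3

# Structured data: a pre-defined schema makes exact queries straightforward.
# The table layout and rows below are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Ana Ruiz", "Austin", "2023-01-15"),
        ("Ben Cole", "Boston", "2023-03-02"),
    ],
)
austin_customers = conn.execute(
    "SELECT name FROM customers WHERE city = ?", ("Austin",)
).fetchall()

# Unstructured data: the same kind of fact sits in free text with no field names.
email_body = (
    "Hi team, I spoke with Ana yesterday. She recently moved her office "
    "down to Austin and wants to revisit the contract next quarter."
)
# A keyword search can confirm the word appears, but not who is located where.
mentions_austin = "austin" in email_body.lower()
```

The query returns exactly the rows whose `city` field matches, whereas the keyword check only says that "Austin" appears somewhere in the message.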
Unstructured Data Examples
Here are a few of the most common examples of unstructured data in the modern enterprise:
Business documents include everything from Word documents and PDFs to PowerPoint presentations, many spreadsheets and XML files. Even though the text in many of these files may follow a common format, the data is not fully structured in a pre-defined schema that can be analyzed with traditional search and analytics tools.
Text messages, chat and messaging apps like Slack and Microsoft Teams, as well as phone recordings are all examples of unstructured data generated by intra-business communications every day.
Webpages and Apps
Developers know that coding is a very consistent and methodical practice, but the code itself does not follow a pre-defined structure or schema, meaning that the code behind almost all web pages and applications is unstructured data.
Another extremely common and voluminous source of unstructured data in the enterprise is customer interaction data, whether from multichannel contact center interactions (phone, chat, text, etc.), online reviews or surveys, etc.
Business emails generate a tremendous amount of unstructured data every day. Emails may be considered semi-structured by metadata-driven categories, but the full text of the email is unstructured.
Media (Images, Audio and Video)
Media files can be tagged using metadata and saved in consistent formats, but the full content of those media files represents unstructured data.
Incoming data from social media networks like Facebook, Twitter, YouTube, LinkedIn and the like are much like emails. Hashtags and metadata-driven categorization present some structure, but the full text/content of the posts represents unstructured data.
Machine-Generated Unstructured Data
One of the fastest-growing categories of unstructured data in the enterprise is machine-generated unstructured data. This includes monitoring data, such as photo and video surveillance, as well as scientific monitoring and measurement devices. But it also includes the exponentially growing Internet of Things (IoT) environment where nearly every machine, device or “thing” is monitoring something (its own condition, its usage, external factors, etc.) and reporting back. This machine-generated unstructured data is now essential to operations in manufacturing, tech, consumer electronics and more.
What about semi-structured data?
A third category of data, semi-structured data, falls in between structured and unstructured. While not stored in a relational database like structured data, semi-structured data has some organizational properties (such as native metadata, tags or other markings) that make it easier to search, parse and analyze compared to truly unstructured data. Semi-structured data makes up just 5-10% of enterprise data, but it includes some common data types that have already been mentioned. For example, most email clients today include native metadata tagged onto all emails that allows for basic classification, keyword searching and more. Social media includes hashtags that allow limited searching capabilities. Other examples of semi-structured data include XML, JSON and NoSQL databases.
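The email case can be sketched with Python's standard `email` module. The message below is a made-up example. The headers parse cleanly into named fields, while the body comes back as a single block of free text.

```python
from email import message_from_string

# A hypothetical raw email: headers (metadata) first, then a free-text body.
raw_email = """\
From: ana@example.com
To: ben@example.com
Subject: Q3 launch plan
Date: Mon, 03 Jun 2024 09:30:00 +0000

Ben, the draft is attached. I think we should move the launch to September.
"""

msg = message_from_string(raw_email)

# The semi-structured part: named header fields that are easy to filter on.
metadata = {
    "from": msg["From"],
    "to": msg["To"],
    "subject": msg["Subject"],
}

# The unstructured part: the body is just text with no fields to query.
body = msg.get_payload()
```

Filtering by sender or subject is simple here, but understanding what the body actually proposes still requires reading or more advanced text analysis.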
Why is unstructured data important?
In the modern knowledge economy, many organizations’ “work product” is unstructured data. For example:
- Word documents where strategies are outlined
- PDFs containing critical business reports
- PowerPoint presentations used to launch and sell products and services
- Excel spreadsheets containing customer lists and other customer information
- Media files produced by content creators
- Web pages, web content and web apps created by developers
Moreover, all the work that goes into the polished end results and “finished products” discussed above typically takes the form of unstructured data. That includes drafts of all of the documents and files mentioned, as well as the emails, chats and other communications that drive collaborative productivity and innovation.
In other words, nearly all work in the modern enterprise takes the form of unstructured data.
The untapped value of unstructured data
While traditional analytics tools struggle with unstructured data, next-generation analytics tools are using advanced data mining, AI, deep learning and neural networks to dig into unstructured data — for example, using natural language processing to understand and analyze audio files. These next-generation tools enable organizations to analyze and query their tremendous troves of unstructured data to unlock insights and extract value for better data-driven decision-making. Below are just a few use cases:
Aggregating unstructured data from contact center interactions, customer surveys, social media and more to understand trends in customer sentiment and improve customer experience, predict customer needs and market demand, and drive customer-centric business models.
Using IoT sensor data to identify and even predict equipment, mechanical or technology issues and failures has proven incredibly valuable for manufacturing, utilities and other industries where uptime and operational performance are business-critical.
Data-powered IT optimization
Analyzing log data from IT systems and technologies to help manage capacity and balance bandwidth demand, as well as identify issues and potential improvements that can drive meaningful value for the business.
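As a rough illustration of this kind of log analysis, the Python sketch below counts errors per service from raw log lines. The log format (timestamp, level, service, message) and the sample lines are assumptions made for this example. Real systems vary widely, which is part of what makes log data hard to analyze.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <service> <free-text message>".
LOG_LINE = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)$"
)

# Hypothetical sample lines, including one that does not follow the format.
sample_logs = [
    "2024-06-03T09:30:00Z INFO api request served in 120ms",
    "2024-06-03T09:30:05Z ERROR db connection pool exhausted",
    "2024-06-03T09:30:09Z ERROR db connection pool exhausted",
    "2024-06-03T09:31:00Z WARN api latency above threshold",
    "kernel panic - not syncing",
]

def count_errors_by_service(lines):
    """Count ERROR-level lines per service, skipping lines that do not parse."""
    errors = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # line does not fit the assumed format
        if match["level"] == "ERROR":
            errors[match["service"]] += 1
    return errors

error_counts = count_errors_by_service(sample_logs)
```

Note that the last sample line is silently skipped because it does not fit the assumed pattern. A production pipeline would need a strategy for such lines rather than simply ignoring them.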
Why is unstructured data challenging?
The same thing that makes unstructured data so promising also makes it challenging: It’s inherently difficult to fully see, analyze and understand. While mature analytics tools have long existed for structured data, these tools don’t work with unstructured data. The aforementioned next-generation analytics tools, which use sophisticated AI, are just becoming available and practical for most organizations.
The result, experts say, is that the vast majority of unstructured data remains completely untapped. More concerningly, the lack of visibility into unstructured data also means that the vast majority of unstructured data in the typical enterprise remains largely unprotected by data security tools and strategies.
Importance of securing unstructured data
It’s common sense, really: If it’s valuable, then it’s worth protecting. That’s patently obvious in cases where the essential work product of an organization exists as unstructured data. Of course, you want to protect your employees’ work product and secure your IP, your “crown jewels.”
But it’s also critical to protect all that potential value, because it also has potential in a competitor’s hands. Imagine a competitor gaining access to a retailer’s customer experience insights, a manufacturer’s operational insights, a tech platform’s IT operational insights or the higher-level business intelligence (BI) insights at any organization. That competitor could just as easily take that raw unstructured data and extract the value for their own gain (and to your detriment).
Why conventional data security tools & approaches don’t fit unstructured data
Conventional data security tools were built to protect structured data. DLP, CASB and other policy-based tools were designed for a long-gone world where all business value (data in need of protecting) existed as structured data, and where the biggest concern was regulated data types like social security numbers, medical records and other personally identifiable information (PII). These policy-driven tools are designed to look for the clear hallmarks of structured data (the numerical format of a social security number or credit card number, for example) and then let the security team block the movement or exfiltration of those protected data types.
DLP & CASB can’t see unstructured data
Conventional data security tools cannot see unstructured data. In fact, they simply were never meant to be used with unstructured data — yet they are increasingly applied to the problem of protecting valuable unstructured data. Case in point: a Word document could contain extraordinarily valuable information, or it could be a junk draft of an inconsequential memo. DLP and CASB struggle to tell the difference because they’re not designed to understand what value looks like in all these unstructured files.
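The limitation is easy to see in a simplified sketch. The Python below is not how any particular DLP or CASB product works. It is a minimal pattern-matching rule, with a hypothetical SSN pattern and invented sample text, that shows why format-based detection catches regulated data but says nothing about the value of free-form content.

```python
import re

# Hypothetical policy rule: flag anything shaped like a US SSN (3-2-4 digits).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def flags_regulated_data(text: str) -> bool:
    """Return True if the text contains something formatted like an SSN."""
    return SSN_PATTERN.search(text) is not None

# Invented examples: one regulated record, one valuable but pattern-free memo.
hr_record = "Employee: J. Smith, SSN 123-45-6789"
strategy_memo = (
    "Draft: we plan to undercut the competitor's pricing in Q4 "
    "and launch in two new regions."
)

hr_flagged = flags_regulated_data(hr_record)
memo_flagged = flags_regulated_data(strategy_memo)
```

The HR record is flagged because it matches the expected format. The strategy memo passes untouched, even though it may be far more damaging in a competitor’s hands.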
Unstructured data evolves too quickly
The usual workaround for the visibility problem is an exhaustive data classification exercise to identify exactly which unstructured data and files need to be protected. This isn’t just a time-consuming pain; it doesn’t fit the dynamic nature of valuable unstructured data. Going back to the Word doc example from above: What starts as a junk draft might eventually become a new go-to-market strategy. Likewise, what starts as one of the countless explorations in R&D might eventually become your groundbreaking new product. There’s no clear line between valuable and not valuable. Moreover, policy-based tools depend on the security team to update the policy each time a document becomes valuable. In other words, these tools can only look for what the security team tells them to look for. The modern world of work just moves too fast for this approach to be practical.
Unstructured data needs to move
Policy-based tools use rigid blocking capabilities to protect sensitive, regulated data. That’s great when the data and the rules are explicitly clear: A social security number should NEVER be exfiltrated outside the organization. But it’s not so great for the dynamic nature of the unstructured data that drives the modern enterprise. Work product needs to evolve — rapidly. In our collaboration-powered work culture, sharing and iterating is how work happens. Policy-based tools are too rigid to handle this. The result is a litany of false positives that frustrate employees and stifle productivity and collaboration, along with alert fatigue and endless policy exceptions (read: blind spots) from the security team.
A modern approach to protecting unstructured data
- Start with visibility
You can’t protect what you can’t see. Organizations need to build a foundation of comprehensive visibility, with the right tools to be able to see, analyze and understand all their unstructured data — from every device, every application (sanctioned or not), in the cloud, and on or off the network.
- All data matters
This comprehensive visibility stems from embracing the paradigm that all data matters. Organizations can’t just focus on what they think is valuable, sensitive, or in need of protecting today. They need to be able to see all data in order to address the dynamic realities of collaboration culture.
- Sophisticated analytics unlock the real value
Advanced AI and BI tools are unlocking the value buried in mountains of unstructured data, and the same potential exists within data security. Advanced analytics capabilities can help organizations tune out the noise of harmless everyday productivity and collaboration by leveraging metadata to put context around unstructured data and understand it. This gives security teams a clear signal of risk, so they can better identify when data that may be valuable moves in abnormal ways that are more likely to present a risk to the business.
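One very simple way to picture a metadata-based risk signal is to compare today’s file movement against a user’s own baseline. The Python sketch below uses a basic z-score with an arbitrary threshold and invented numbers. It is an illustrative assumption, not a description of how any specific analytics product computes risk.

```python
from statistics import mean, stdev

def is_unusual_movement(history_mb, today_mb, threshold=3.0):
    """Return True if today's volume is far above this user's own baseline.

    history_mb: daily data-movement volumes (in MB) from previous days.
    threshold: how many standard deviations above the mean counts as unusual
               (3.0 is an arbitrary choice for this sketch).
    """
    if len(history_mb) < 2:
        return False  # not enough history to define a baseline
    baseline = mean(history_mb)
    spread = stdev(history_mb)
    if spread == 0:
        return today_mb > baseline
    return (today_mb - baseline) / spread > threshold

# Hypothetical history: megabytes one user moved to cloud storage per day.
typical_days = [40, 55, 38, 60, 47]

normal_day = is_unusual_movement(typical_days, 52)
spike_day = is_unusual_movement(typical_days, 900)
```

Only metadata (who moved how much, and when) is used here. The signal does not depend on classifying the content of each file in advance, which is the point of the approach described above.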
- Don’t inhibit the business — focus on right-sized response
The rigid blocking approach has always been problematic, frustrating users and security teams alike. Today, organizations simply can’t afford to inhibit speed, agility, collaboration, and innovation. They need to empower their employees’ ingenuity, and that means allowing unstructured data and files to move and evolve. But with comprehensive visibility, context and analytics, they can protect the value created through that free collaboration. They’ll have not only a clear signal of what’s actually risky, but also the contextual visibility to investigate immediately and drive a rapid, right-sized response to protect the data, without inhibiting the business.