DQ (Data Quality) management of unstructured data
Handling Data Quality in Unstructured Data
Unstructured data in heterogeneous sources (documents, PDFs, web pages, spreadsheets, podcasts, etc.), by its very nature, lacks a defined format, type, description (e.g. “what is this information?”), and clearly identified relationships to other information. If the information in an unstructured document is presented in a logical pattern, humans can ‘read’ and ‘understand’ it, but current databases, BI, and reporting technology cannot: the patterns are too complex to ‘read’ even for today’s data management software.
Thus, over 80% of corporate information (i.e. that held in unstructured form) remains out of reach of current database management and data quality technology. To leverage unstructured data tied up in corporate repositories, documents, or content, three tasks need to be completed:
- data first needs to be identified as useful and categorized
- then unstructured data needs to be ‘extracted’ into a structured format
- once in structured format, data management and quality rigor needs to be applied before data can be utilized for decision-making and subsequent integration with structured data sources.
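The three tasks above can be sketched as a simple pipeline. This is an illustrative toy, not a Zoral Labs API: the keyword rules, the regex, and the function names are all assumptions made for the example.

```python
import re

# Toy three-stage pipeline: categorize -> extract -> quality-check.
# All logic is illustrative; a real system would use trained models.

def categorize(document: str) -> list[str]:
    """Assign subject areas by naive keyword matching (illustrative only)."""
    keywords = {"Sales": ["invoice", "quota"], "HR": ["employee", "payroll"]}
    return [area for area, words in keywords.items()
            if any(w in document.lower() for w in words)]

def extract(document: str) -> dict:
    """Pull structured fields out of free text (here: a trivial email grab)."""
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", document)
    return {"emails": emails}

def quality_check(record: dict, required: list[str]) -> list[str]:
    """Return names of required fields that are missing or empty."""
    return [f for f in required if not record.get(f)]

doc = "Payroll invoice sent to employee j.doe@example.com"
areas = categorize(doc)          # a document can match more than one area
record = extract(doc)
issues = quality_check(record, ["emails", "amount"])
```

Note that the same document lands in both the Sales and HR subject areas, which mirrors the point made later that a document can pertain to more than one area of interest.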
Zoral Labs possesses technological know-how and deep expertise in categorizing content and extracting high-quality data from unstructured sources. We are experts in providing data quality management consulting services for structured and unstructured data.
Data Quality Management for unstructured data consists of two major steps:
1. Data Quality management of data categorization and data extraction process
2. Data Quality management of the resulting extracted structured data
Diagram 1 below illustrates the categorization and extraction of unstructured data and the integration of such data into a corporate Operational Data Store (ODS) or Data Warehouse. Data Quality ‘touch points’ are clearly defined in order to help explain Data Quality management for unstructured data and to bring such data into the mainstream of corporate Data and Information Architecture.
Unstructured Data Repository – we can identify, gather (a physical or a logical operation), and extract unstructured data files from any computer on a corporate distributed network and/or websites of interest on the WWW. This in itself is a technically challenging task. Ideally, IT management will have installed and implemented a data or content management and migration system for unstructured data that periodically gathers or harvests unstructured data from identified sources, computers, hard disks, directories, and content management systems (CMS), and consolidates all unstructured content in its original form. This might use tools from Microsoft (SharePoint), EMC, Hitachi, IBM, etc.; most enterprise storage vendors sell this kind of ‘ETL-like’ software for unstructured content management. However, such software simply collates the data in a central “library” or repository.
Categorization or manual selection – in order to extract useful data, we need to identify documents or content in the Unstructured Data Repository by ODS subject areas of interest, for example: Sales, Marketing, PR, Finance, Engineering, Manufacturing, Operations, Customer Service, Human Resources, Legal, etc. A document can contain information pertaining to more than one area of interest. We can try to do this manually or using primitive rules that we construct and maintain, but frequently this is not a viable solution.
Any repository can contain important information of a disruptive or even non-compliant nature. Manual or rule-based sorting or tagging of unstructured data files DOES NOT necessarily identify what is inside those files. Automated, high-quality categorization software is required to ‘read’ unstructured data files, correctly identify their contents, and assign the correct categories or subject areas to the files. (Please note: as content is updated and changed over time, the assigned categories of that content can also change.)
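The point that categories must track content changes can be illustrated with a small sketch. The class, rule table, and file names below are hypothetical: the only mechanism shown is re-categorizing a file whenever its content hash changes.

```python
import hashlib

class Categorizer:
    """Toy content-based categorizer. It re-runs its rules whenever a file's
    content hash changes, so assigned categories track updates over time.
    The keyword rules stand in for real categorization models."""

    RULES = {"Finance": ("budget", "revenue"), "Legal": ("contract", "clause")}

    def __init__(self):
        self._seen: dict[str, str] = {}            # path -> content hash
        self.categories: dict[str, list[str]] = {} # path -> assigned areas

    def process(self, path: str, content: str) -> bool:
        """Return True if the file was (re)categorized, False if unchanged."""
        digest = hashlib.sha256(content.encode()).hexdigest()
        if self._seen.get(path) == digest:
            return False                           # content unchanged, skip
        self._seen[path] = digest
        text = content.lower()
        self.categories[path] = [c for c, words in self.RULES.items()
                                 if any(w in text for w in words)]
        return True

cat = Categorizer()
changed_first = cat.process("memo.txt", "Draft contract clause on revenue share")
changed_again = cat.process("memo.txt", "Draft contract clause on revenue share")
```

On the second call the hash matches, so no work is done; a changed document would be re-read and could receive different categories.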
Knowledge Base – categorization technologies and algorithms often require some form of ‘knowledge’. Just like humans, they require knowledge prior to ‘understanding’ a subject. The quality and depth of such knowledge determine the quality and coverage of the categorization produced by automated systems.
Knowledge Management (KM) Tools – Zoral Labs Knowledge Management Tools implement a highly automated, rapidly convergent, scalable algorithm for building quality knowledge bases, customized for a given, non-standard categorization or sub-categorization task (NB: knowledge for common categories is already acquired and optimized). Knowledge bases are subsequently utilized as model inputs for a set of machine learning engines that categorize raw content. The algorithms clearly identify where manual human intervention is needed and greatly reduce it, predictably cutting the time and cost of building a comprehensive, quality corpus by two orders of magnitude. KM Tools facilitate extensive regression testing to assure that quality and coverage improve, and that there are no adverse effects, with each new Knowledge Base release. Knowledge Management quality assurance is an integral part of the overall unstructured data quality management process.
Selected Data Universe – is a driver list of categorized subject areas from which document data extraction is to take place. Extraction of data from unstructured data sources and its integration into the ODS can thus be divided by subject area and file type and performed in prioritized phases.
Document Extractor – is a specialized, scalable, intelligent crawler that abstracts the complexity of extracting unstructured data from various distributed, heterogeneous sources such as databases, web sites, FTP, RSS, SharePoint, CMS, etc. Document Extractor performs a similar role to the extraction layer of an ETL platform and can be scheduled to run periodically to extract new or updated content.
Unstructured Data Cache – is a time-series unstructured data repository with operational, time-series metadata of previously processed files. It is also used as an audit-trail to verify transactional integrity when data is extracted from unstructured content periodically: (hourly, daily, weekly, monthly, quarterly).
Document Parser – understands different unstructured data and file formats and is able to ‘read’ them, stripping out ‘noise’ and transferring heterogeneous unstructured data formats into a homogeneous, normalized unstructured data file format or normalized document that can be ‘read’ and ‘understood’ by automated extraction software.
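A Document Parser of this kind can be sketched as a dispatcher that maps each file format to one normalized plain-text form, stripping markup ‘noise’ on the way. The sketch below handles only HTML and plain text with Python’s standard-library parser; a real parser would also cover PDF, Office formats, and so on.

```python
import html.parser

class _TextOnly(html.parser.HTMLParser):
    """Collects visible text, discarding tags and script/style 'noise'."""
    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip = False
    def handle_starttag(self, tag, attrs):
        self._skip = tag in ("script", "style")
    def handle_endtag(self, tag):
        self._skip = False
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def normalize(filename: str, raw: str) -> str:
    """Map heterogeneous formats to one homogeneous plain-text form
    (illustrative: only HTML and plain text are handled here)."""
    if filename.endswith((".html", ".htm")):
        parser = _TextOnly()
        parser.feed(raw)
        return " ".join(parser.parts)
    return raw.strip()

doc = normalize(
    "page.html",
    "<html><script>x=1</script><body><p>Q3 sales rose.</p></body></html>")
```

The normalized output (`"Q3 sales rose."`) is what downstream extraction software ‘reads’, regardless of the original file format.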
Automated Extraction – of unstructured data consists of the following components:
Language Detector – dynamically determines the language of a document or of parts of its content. This is needed where the unstructured data and information of a multi-national enterprise is in more than one language.
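A minimal sketch of language detection, assuming a stopword-counting approach (real detectors use statistical models; the tiny word lists here are illustrative only):

```python
# Naive language detector: score each language by how many of its common
# stopwords appear in the text, and pick the highest score. Illustrative only.
STOPWORDS = {
    "en": {"the", "and", "of", "is"},
    "fr": {"le", "et", "de", "est"},
    "de": {"der", "und", "von", "ist"},
}

def detect_language(text: str) -> str:
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

lang_en = detect_language("The quality of the data is low")
lang_fr = detect_language("Le rapport est de bonne qualité et complet")
```

Applied per section rather than per document, the same idea lets mixed-language content be routed to the right extraction models.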
Entity Extractors – are advanced, AI-based technology that can dynamically, and at high speed, ‘read’, ‘understand’, and extract specified data types from normalized documents, for example: Company Names, People, Emails, Organizations, Clients, Client Names, Client Titles, Partners, Partner Names, Partner Titles, Country, City, Author, Date and Time, Industry, Deal Type, Event Type, Topics, Products, Relationships, Sentiment/Entity Associations, Document Summary, etc.
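For a few of the simpler entity types, the idea can be sketched with regular expressions standing in for the AI models the text describes. The patterns and entity names below are assumptions for illustration:

```python
import re

# Toy entity extractor: regexes stand in for trained extraction models.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "date":  re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}

def extract_entities(text: str) -> dict[str, list[str]]:
    """Return every match of every pattern, keyed by entity type."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}

entities = extract_entities(
    "On 2023-05-01 Acme paid $1,250.00; contact billing@acme.com for details.")
```

Entity types such as Topics, Relationships, or Sentiment/Entity Associations cannot be captured by patterns like these, which is precisely why the text calls for AI-based extractors.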
Transaction Engine – is configured to periodically request that Document Extractor extract updated or new content from the unstructured data repository. The Transaction Engine utilizes the Unstructured Data Cache to examine how content ‘looked’ at time Tn, compare it with the way content ‘looks’ at time Tn+1, and generate delta changes, or structured transactions, from the updated or new unstructured content into a staging area – the Structured Data Repository.
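The Tn / Tn+1 comparison can be sketched as a diff over the cache: content present only at Tn+1 yields inserts, changed content yields updates, and content that has disappeared yields deletes. The dictionaries and transaction shape below are illustrative assumptions.

```python
def delta_transactions(cache_tn: dict[str, str],
                       extract_tn1: dict[str, str]) -> list[dict]:
    """Compare content as cached at Tn with a fresh extract at Tn+1 and emit
    structured insert/update/delete transactions (illustrative sketch)."""
    txns = []
    for path, content in extract_tn1.items():
        if path not in cache_tn:
            txns.append({"op": "insert", "path": path})   # new at Tn+1
        elif cache_tn[path] != content:
            txns.append({"op": "update", "path": path})   # changed since Tn
    for path in cache_tn:
        if path not in extract_tn1:
            txns.append({"op": "delete", "path": path})   # gone at Tn+1
    return txns

tn  = {"a.txt": "v1", "b.txt": "v1"}   # cache as of Tn
tn1 = {"a.txt": "v2", "c.txt": "v1"}   # extract at Tn+1
txns = delta_transactions(tn, tn1)
ops = {t["path"]: t["op"] for t in txns}
```

Only the deltas flow into the staging area, which keeps periodic (hourly, daily, weekly) runs cheap and gives the audit trail the cache is described as providing.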
DQM Tool – the Automated Data Quality Management (DQM) Tool assures data quality and data coverage during the data extraction process from unstructured content. The goal of automated data quality monitoring is to measure, at a given point in time, the data quality gap between the software-extracted data output (from a representative set of test unstructured content files stored in the Unstructured Data Cache) and the “Golden Corpus” – manually or expert-verified test data from the same test files. As this gap is closed, the data quality of the Structured Data Repository improves. Many data quality problems can arise over time. Known data quality problems (and the ones we commonly monitor) generally fall into four categories – the 4 I’s: Incomplete, Invalid, Inconsistent, and Inaccurate. Each of the 4 I’s is described below.
Incomplete describes data values that are required but are missing.
Invalid describes data values that fail to conform to an agreed and defined set or range of values.
Inconsistent describes a set of data in which two or more data values, though possibly complete and valid, are inconsistent with each other. For instance, a single transaction for the HR subject area with one value describing a male and another indicating pregnancy (the two values are individually valid but mutually exclusive).
Inaccurate describes a set of data that is complete, valid, and consistent but nonetheless wrong. For example, a person’s bank records can carry the wrong currency: instead of lira, their holdings are erroneously registered in dollars – a logical error.
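The four definitions above can be made concrete with a toy checker over a single record. The field names, valid-value sets, and rules are illustrative assumptions, chosen to reproduce the HR and currency examples just given:

```python
# Toy 4I checks over one record; schema and rules are illustrative only.
REQUIRED = ["name", "gender", "currency"]
VALID = {"gender": {"M", "F"}, "currency": {"USD", "TRY", "EUR"}}

def four_i_issues(record: dict, reference_currency: str) -> dict[str, list[str]]:
    issues = {"incomplete": [], "invalid": [], "inconsistent": [], "inaccurate": []}
    # Incomplete: required but missing or empty.
    for f in REQUIRED:
        if not record.get(f):
            issues["incomplete"].append(f)
    # Invalid: present but outside the agreed set of values.
    for f, allowed in VALID.items():
        if record.get(f) and record[f] not in allowed:
            issues["invalid"].append(f)
    # Inconsistent: two individually valid values that exclude each other.
    if record.get("gender") == "M" and record.get("pregnant"):
        issues["inconsistent"].append("gender/pregnant")
    # Inaccurate: valid and consistent, but wrong against an external truth
    # (here, the account's known reference currency).
    if record.get("currency") and record["currency"] != reference_currency:
        issues["inaccurate"].append("currency")
    return issues

issues = four_i_issues(
    {"name": "", "gender": "M", "pregnant": True, "currency": "USD"},
    reference_currency="TRY")
```

Note how Inaccurate differs from the other three: it can only be caught against an external reference, not from the record alone.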
In order to perform ongoing data quality monitoring, at each unstructured data processing / data transformation ‘touch point’ we capture 4I time-series data quality measures for each output data type, for a statistically significant sample of corporate unstructured content. The data quality measures are available in aggregated as well as granular form, in order to identify areas to improve over time and to measure progress. The Data Quality Monitoring (DQM) Tool and data quality management infrastructure facilitate this process.
After a set of test data from a statistically significant sample of unstructured corporate content is captured or manually entered/verified in the DQM Tool’s Golden Corpus, we can measure the 4 I’s, at any level of granularity, for the test content extracted by the automated extraction software.
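The gap measurement itself reduces to comparing extracted records against their Golden Corpus counterparts, cell by cell. The metric and record shapes below are a simplified assumption; a real DQM tool would report the gap per field, per 4I category, over time.

```python
def quality_gap(extracted: list[dict], golden: list[dict],
                fields: list[str]) -> float:
    """Fraction of (record, field) cells where automated extraction differs
    from the manually verified Golden Corpus (illustrative metric)."""
    total = mismatched = 0
    for ext, gold in zip(extracted, golden):
        for f in fields:
            total += 1
            if ext.get(f) != gold.get(f):
                mismatched += 1
    return mismatched / total if total else 0.0

golden    = [{"company": "Acme", "city": "Paris"},
             {"company": "Initech", "city": "Lyon"}]
extracted = [{"company": "Acme", "city": "Pari"},   # truncated extraction
             {"company": "Initech", "city": "Lyon"}]
gap = quality_gap(extracted, golden, ["company", "city"])
```

As extraction software improves release over release, this number trends toward zero, which is the "closing the gap" described above.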
As an example, the following DQM screen snapshot demonstrates aggregated 4I Quality Measures for a set of client companies, for a set of fields, at a particular point in time. Automated extraction of unstructured data took place for the Client Companies subject area, and extraction was performed from a Selected Data Universe: client company websites.
If we drill down on the above, we get data quality detail per company, per extracted data type of interest. The following DQM screen snapshot demonstrates a detailed view of 4I Quality Measures for a set of companies, for a set of fields, for a particular point-in-time automated data extraction run:
After data quality issues have been identified, prioritized, and resolved, we can, over time, monitor and measure data quality improvement by analyzing 4I historical statistics, or data quality trends, for a selected list of companies, and thus gain control over data quality management in perpetuity, as follows:
Having set up a process to periodically categorize, extract, and quality-assure data from heterogeneous unstructured data sources, we can then apply data management rigor to the resulting structured time-series data in the staging area, or Structured Data Repository, prior to integrating this data with other structured data sources in the ODS, or feeding it to upstream vertical search, BI, and data warehousing applications.
The fact that we have quality-assured data extraction from unstructured data sources does not necessarily mean that the data is fit for purpose or of 100% data quality.
We are now at major step 2: Data Quality management of the resulting extracted structured data.
So what is typically done to quality-assure structured data?
State of the art data quality management techniques for structured data include:
This is good, but it can be better.
Zoral Labs has developed expertise in Inductive Logic Programming (ILP), using it to provide automated discovery of 4I data quality rules in structured data. This enables automatic association data mining and monitoring of the root causes of structured data quality problems that impact key business metrics.
An example is the Point of Sale (POS) data, shown below:
… and automatically identify these errors
This gives Zoral Labs the ability to perform highly automated, granular Data Profiling of structured data, along with aggregated Data Quality Monitoring that associates data quality problems with key business metrics.
Data Quality Monitoring or DQM Monitoring – uses a number of analytical techniques to automate data quality monitoring of a data warehouse staging area such as Structured Data Repository (our consolidation point of data extracted from unstructured content in illustration above) or ODS. The analytical techniques include:
- ILP to automatically identify 4I data quality problems
- Association data mining to associate derived data quality problems with key business metrics
- 6-sigma analysis to show significant data quality deviations from the baseline that impact key business metrics and require immediate attention to resolve.
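The sigma-deviation idea in the last bullet can be sketched directly: build a statistical baseline of a defect rate over a representative period, then flag incoming batches whose rate sits many standard deviations away. This is a simplified stand-in for the 6-sigma analysis described; the rates and threshold are illustrative.

```python
import statistics

def sigma_deviation(baseline: list[float], observed: float) -> float:
    """How many standard deviations the observed defect rate sits from the
    baseline mean (simplified stand-in for the 6-sigma check)."""
    mean = statistics.mean(baseline)
    sd = statistics.pstdev(baseline)
    return abs(observed - mean) / sd if sd else 0.0

# Daily invalid-transaction rates over a representative baseline period.
baseline_rates = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.009]

# A new batch with a 5% invalid rate is far outside the baseline and would
# be flagged (a 'red cell') for immediate attention.
alert = sigma_deviation(baseline_rates, observed=0.050) > 6
```

A batch near the baseline mean scores well under one sigma and passes silently; only genuinely anomalous batches demand remedial action.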
For example, given the following sample reference and transactional data:
We can automatically derive 4I metrics for each reference data field and discover data associations between reference data fields and key business metrics (cost, in the data example above). If we create a statistical baseline of transactions over a representative period of time, we can then observe 6-sigma data quality deviations in our incoming transactions from unstructured data sources and perform the required remedial action to keep the ODS and upstream analytical applications at a sufficient and known level of data quality.

A sample Data Quality Monitoring application is shown below. Red indicates a significant deviation in data quality from the baseline. If we drill down on a red cell, we see that for the country of France there are a number of Incomplete and Invalid transactions significantly impacting the quality of our Cost business metric – in other words, costs for a set of reference data fields will be misplaced during financial reporting and analysis.

To identify the offending transactions and the data quality root causes that impact our business metrics, we drill down on the aggregated 4I Details screen and identify the affected transactions and the specific data quality problems that need to be reviewed and fixed.
For more information on how Zoral Labs can assist your organization in discovering, measuring, managing, and improving the data quality of both structured and unstructured data, please contact our sales department at: sales@zorallabs.com