
Document / website categorization

Zoral Labs can integrate and add significant value and advanced capabilities to most:

  • enterprise content – document management platforms,
  • document processing systems and solutions,
  • digital asset management infrastructure,
  • unstructured data ETL, and
  • data management technology

 

by automatically parsing and categorizing documents and content based on defined categories.



This advanced, artificial intelligence (AI) based categorization capability gives document processing vendors, and enterprises that use document management systems, far greater control over their unstructured information assets. It also facilitates the integration of unstructured and structured data as part of a next-generation Information Architecture.

Document & Content Categorization Management Platform: High-level Architecture

Major Platform Components

Categorization Engines

The Categorization Engines assign an object to one or more categories from a predefined set. The object can be of various kinds: a web page, a text document, a text stream, etc. Most of the implemented Categorization Engines are based on supervised algorithms. Such engines operate in two modes: training mode and execution mode. In training mode, the engine builds a user-defined categorization model, the internal structure that is subsequently used to categorize objects.

A single categorization algorithm isn't always sufficient to achieve high categorization accuracy. An advanced boosting methodology is used to solve this problem: the categorization subsystem provides several boosting algorithms, inspired by AdaBoost, bootstrap aggregating (bagging) and the use of neural networks for aggregation.
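As a rough illustration of the aggregation idea (not Zoral Labs' actual implementation), the sketch below combines the votes of several categorizers, weighting each vote by an AdaBoost-style factor derived from that categorizer's error rate. The function names, category labels and error rates are hypothetical, and a full AdaBoost procedure would also re-weight the training documents between rounds.

```python
import math
from collections import defaultdict

def vote_weight(error_rate):
    """AdaBoost-style weight (binary form); error_rate must lie in (0, 1)."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def aggregate(predictions, error_rates):
    """Return the category with the largest total weighted vote."""
    totals = defaultdict(float)
    for category, error_rate in zip(predictions, error_rates):
        totals[category] += vote_weight(error_rate)
    return max(totals, key=totals.get)

# Hypothetical outputs of three categorizers for one document, with
# error rates assumed to have been measured on held-out data.
predictions = ["finance", "sport", "sport"]
error_rates = [0.05, 0.30, 0.35]
winner = aggregate(predictions, error_rates)
print(winner)
```

With these assumed error rates, the single very accurate categorizer outweighs the two weaker ones; with equal error rates the result reduces to a simple majority vote.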

Statistical Categorizers

The algorithmic kernel of these components is based on word and phrase frequencies. To be categorized, each document is transformed into a Feature Vector: a set of frequencies of predefined words and phrases. These frequencies are used to calculate the document/category relation weights.
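A minimal sketch of how a document might be turned into such a feature vector is shown below. The dictionary terms, the tokenization rule and the choice of relative (rather than raw) frequencies are assumptions made for illustration only.

```python
import re

def feature_vector(text, dictionary):
    """Relative frequency of each dictionary word or phrase in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    total = len(tokens) or 1
    vector = []
    for term in dictionary:
        words = term.lower().split()
        n = len(words)
        # Slide a window of the term's length across the token stream.
        count = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words
        )
        vector.append(count / total)
    return vector

# Hypothetical predefined dictionary of words and phrases.
DICTIONARY = ["loan", "credit score", "interest rate"]
doc = ("Your credit score affects the interest rate on a loan. "
       "A loan with a low interest rate is cheaper.")
vec = feature_vector(doc, DICTIONARY)
print(vec)
```

The resulting vector has one position per dictionary entry, so every document maps to a vector of the same length regardless of its size.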

The following categorization kernels, optimized by Zoral Labs, are used: Support Vector Machine, Bayesian, and Markov Chains.

The lists of words and phrases can be defined and tuned manually or on a fully automatic basis. In the automatic case, the Dictionary Selection tool is used. It implements a word/phrase selection technique based on association, chi-square and frequency measures.
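To make the chi-square measure concrete, here is a small sketch that scores each candidate term against one category using the standard 2x2 chi-square statistic and keeps the highest-scoring terms. It is not the Dictionary Selection tool itself; the helper names and the toy documents are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table.

    a: in-category docs containing the term
    b: out-of-category docs containing the term
    c: in-category docs without the term
    d: out-of-category docs without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_terms(docs, labels, category, top_n):
    """Rank vocabulary terms by chi-square against one category."""
    vocabulary = sorted(set().union(*docs))
    scores = {}
    for term in vocabulary:
        a = b = c = d = 0
        for words, label in zip(docs, labels):
            present, in_cat = term in words, label == category
            if present and in_cat:
                a += 1
            elif present:
                b += 1
            elif in_cat:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Toy training set: each document is a set of words (assumed data).
docs = [{"loan", "rate"}, {"loan", "bank"}, {"goal", "match"}, {"match", "rate"}]
labels = ["finance", "finance", "sport", "sport"]
selected = select_terms(docs, labels, "finance", top_n=2)
print(selected)
```

Note that chi-square is symmetric: a term that appears only outside the category ("match" here) scores as highly as one that appears only inside it, because both are strongly informative.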

Statistical categorizers allow Zoral Labs to achieve accuracy in the range of 80-95% (it can be higher, depending on the data) and coverage of 98%.

When integrated, the document categorization engine offers significant throughput and scalability, so it will not create bottlenecks in the enterprise document management infrastructure.

The following performance (throughput) measurements of Zoral Labs categorizers were taken under these conditions:

  • CPU: 2.4GHz (one core is used)
  • Document type: HTML
  • Average document size: 20KB
  • Number of documents in the training set: 720

 

                                          Bayes       MarkovBayes   SVM
Training time                             5.6 s       7 s           10 s
Execution time (average, per document)    0.0003 s    0.001 s       0.0015 s

 

Rule-Based Categorizers

The rule-based categorizers allow Zoral Labs to achieve even more accurate categorization results. However, they require human involvement during the training process.

This subsystem consists of two major components. The first is a mixture of rule-based and statistical categorizers. A set of rules written by a human is used to form a feature vector for each input document. A statistical categorizer (Neural Network) then makes the final categorization decision. This allows the engine to control the features used in the categorization process without the need to define feature/category relation weights (these weights are automatically derived from the training data).
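The sketch below illustrates the general idea under simplifying assumptions: hand-written rules (here, regular expressions) produce a binary feature vector, and a single perceptron, the simplest possible stand-in for the Neural Network mentioned above, learns the feature/category weights from labeled examples. The rules, the training documents and the two-category setup are all hypothetical.

```python
import re

# Hypothetical hand-written rules; each one becomes a feature.
RULES = [
    re.compile(r"\binvoice\b", re.IGNORECASE),
    re.compile(r"\btotal due\b", re.IGNORECASE),
    re.compile(r"\bdear\b", re.IGNORECASE),
]

def rule_features(text):
    """1.0 where a rule fires on the text, 0.0 where it does not."""
    return [1.0 if rule.search(text) else 0.0 for rule in RULES]

def predict(weights, bias, features):
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_perceptron(samples, labels, epochs=20, rate=1.0):
    """Learn feature weights from data instead of setting them by hand."""
    weights, bias = [0.0] * len(RULES), 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

# Toy training set: 1 = "invoice", 0 = "letter" (assumed labels).
texts = ["Invoice 42, total due: $100", "Dear John, see you soon",
         "Invoice attached", "Dear customer, thanks"]
labels = [1, 0, 1, 0]
weights, bias = train_perceptron([rule_features(t) for t in texts], labels)

def classify(text):
    return predict(weights, bias, rule_features(text))

print(weights, bias, classify("Please find the invoice"))
```

The human controls which features exist by writing the rules, while the weights attached to them come entirely from the training data.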

The second component implements a pure rule-based categorization technique. In this case, a human defines not only the categorization rules, but also the rule/category relation weights.
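A pure rule-based categorizer of this kind could look roughly like the following sketch, in which every rule carries human-assigned weights toward one or more categories. The rules, weights, category names and the abstention threshold are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules with hand-assigned rule/category weights.
RULES = [
    (re.compile(r"\binvoice\b", re.IGNORECASE), {"billing": 0.9}),
    (re.compile(r"\brefund\b", re.IGNORECASE), {"billing": 0.4, "support": 0.6}),
    (re.compile(r"\bpassword\b", re.IGNORECASE), {"support": 0.8}),
]

def categorize(text, threshold=0.5):
    """Sum the weights of all firing rules; abstain when evidence is weak."""
    scores = {}
    for pattern, weights in RULES:
        if pattern.search(text):
            for category, weight in weights.items():
                scores[category] = scores.get(category, 0.0) + weight
    if not scores:
        return None  # no rule fired: leave the document uncategorized
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

print(categorize("Invoice and refund request"))
print(categorize("Hello there"))
```

Returning `None` when no rule fires strongly enough is one plausible way such a system trades coverage for accuracy: it only answers when the rules give clear evidence.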

The combination of these techniques allows Zoral Labs to achieve categorization accuracy of 99% and coverage of 60%.

The following performance measurements of the Rule-Based categorizers were taken under these conditions:

  • CPU: 2.4GHz (one core is used)
  • Document type: HTML
  • Average document size: 20KB
  • Number of documents in the training set: 720

 

                                                 Rule-based   Rule+Statistics
Training time (excluding time to write rules)    0.15 s       1200 s
Execution time (average, per document)           0.0074 s     0.0102 s

 

Similarity Search Categorizer

This component uses the KNN (K-Nearest Neighbor) categorization technique. Manually categorized documents are indexed for fast search (refer to the Search Engine component). To categorize a document, the system parses it, extracts statistically significant keywords and forms a search query. The system then executes this query against the indexed documents and selects the K most relevant results. The categories of the selected documents are used to assign the final category to the input document.

This technique allows Zoral Labs to dispense with the training step, which is replaced by the initial indexing step. Note that fast incremental indexing is supported.
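The flow described above can be sketched as follows. This is an illustrative toy, not the platform's code: the most frequent non-stopword terms stand in for "statistically significant keywords", relevance is a plain term-frequency sum, and the class and function names are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "on", "in", "by", "of", "and", "every"}

def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def keywords(text, top_n=5):
    """Stand-in for keyword extraction: the most frequent terms."""
    return [word for word, _ in Counter(tokenize(text)).most_common(top_n)]

class LabeledIndex:
    """Manually categorized documents; add() may be called at any time."""
    def __init__(self):
        self.entries = []

    def add(self, text, category):
        self.entries.append((Counter(tokenize(text)), category))

    def search(self, query_terms, k):
        scored = [(sum(tf[t] for t in query_terms), cat)
                  for tf, cat in self.entries]
        hits = [hit for hit in scored if hit[0] > 0]
        hits.sort(key=lambda hit: hit[0], reverse=True)
        return hits[:k]

def knn_categorize(index, text, k=3):
    """Majority category among the K most relevant indexed documents."""
    hits = index.search(keywords(text), k)
    if not hits:
        return None
    return Counter(cat for _, cat in hits).most_common(1)[0][0]

index = LabeledIndex()
index.add("loan interest rate bank credit", "finance")
index.add("bank loan mortgage payment", "finance")
index.add("football match goal team", "sport")
index.add("team wins match in final", "sport")
print(knn_categorize(index, "The bank raised the interest rate on every loan"))
```

Because there is no model to retrain, adding one more labeled document with `index.add(...)` immediately affects later categorization decisions.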

Search Query-Based Categorizer

This technique is based on the idea that each category can be defined as a search engine query (i.e. a combination of keywords and phrases joined by logical operations such as OR, AND, NOT). The queries can be defined manually or derived automatically. In the automatic case, the system uses Inductive Logic Programming (ILP) and Genetic Programming to derive search queries from the training documents.
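As an illustration of "a category is a query", the sketch below represents each query as a nested tuple of OR, AND and NOT over keywords and phrases, and assigns a document to every category whose query it satisfies. The tuple representation, the category names and the queries are hypothetical, and the automatic derivation via ILP or Genetic Programming is not shown.

```python
import re

def matches(query, text):
    """Evaluate a nested (operator, operand, ...) query against a text."""
    if isinstance(query, str):
        # A keyword or phrase matches on whole-word boundaries.
        pattern = r"\b" + re.escape(query) + r"\b"
        return re.search(pattern, text, re.IGNORECASE) is not None
    operator, *operands = query
    if operator == "AND":
        return all(matches(q, text) for q in operands)
    if operator == "OR":
        return any(matches(q, text) for q in operands)
    if operator == "NOT":
        return not matches(operands[0], text)
    raise ValueError(f"unknown operator: {operator}")

# Hypothetical manually defined category queries.
CATEGORY_QUERIES = {
    "finance": ("AND", ("OR", "loan", "mortgage"), ("NOT", "football")),
    "sport": ("OR", "football", "match report"),
}

def categorize(text):
    return [cat for cat, query in CATEGORY_QUERIES.items() if matches(query, text)]

print(categorize("Apply for a mortgage today"))
print(categorize("Football club takes out a loan"))
```

A query of this form is readable and editable by a person, which makes the resulting category definitions easy to audit and refine.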

Language Detector

The component identifies the language of a given document. It facilitates more advanced categorization, natural language processing and document clustering techniques that rely on language-dependent models. The current implementation supports more than 105 languages (including Japanese, Chinese, Arabic, Russian, etc.) with 99% accuracy.

Cluster Analysis Engine

This component allows Zoral Labs to automatically divide a set of documents into subsets (clusters) so that documents in the same cluster are similar. It also extracts the most important keywords and phrases from each cluster and uses them to label the clusters. Thus an automated category tree can be constructed, and subsequently refined, from any enterprise set of heterogeneous documents and content.
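The page does not say which clustering algorithm is used, so the following is only a generic sketch of the two steps, grouping similar documents and labeling each group with its most common terms. It assumes a simple single-pass ("leader") clustering with Jaccard similarity; the threshold and the toy documents are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Share of terms two documents have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.3):
    """Put each document into the first cluster whose leader is similar enough."""
    clusters = []
    for terms in docs:
        for members in clusters:
            if jaccard(terms, members[0]) >= threshold:
                members.append(terms)
                break
        else:
            clusters.append([terms])  # start a new cluster
    return clusters

def label(members, top_n=2):
    """Label a cluster with its most frequent terms."""
    counts = Counter(term for terms in members for term in sorted(terms))
    return [term for term, _ in counts.most_common(top_n)]

# Toy documents, already reduced to sets of terms (assumed data).
docs = [
    {"loan", "bank", "rate"},
    {"loan", "bank", "credit"},
    {"goal", "match", "team"},
    {"match", "team", "final"},
]
clusters = cluster(docs)
labels = [label(members) for members in clusters]
print(labels)
```

Each labeled cluster is a candidate category; a reviewer can then rename, merge or split the clusters to refine the category tree.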

The screenshot below illustrates how automatic categories are constructed by applying the Zoral Labs clustering technique to HTML documents extracted from corporate websites.

Entity/Fact Extraction Engines

The Entity Extraction task is to locate atomic text elements and classify them into predefined categories such as person name, company name, product name, postal address, email, phone, etc. When processing unstructured documents, the system produces a fixed-format, structured output.

Entity Extraction is organized as a pipeline: a sequential application of several document-processing modules. Each module produces information that is used for further processing or is valuable to the end user. The pipeline includes document parsing and segmentation, text analysis (rule-based and statistical algorithms such as CRF, HMM, etc.), analysis of tables, and aggregation of extracted/identified facts.
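A heavily simplified version of such a pipeline is sketched below, assuming only three stages: parsing (stripping HTML tags), rule-based text analysis (two regular expressions for emails and phone numbers) and aggregation into XML. The stage names, the patterns and the XML element names are hypothetical, and statistical models such as CRF or HMM are left out.

```python
import re
import xml.etree.ElementTree as ET

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def parse(html):
    """Stage 1: strip tags and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()

def extract(text):
    """Stage 2: rule-based text analysis producing (type, value) pairs."""
    entities = [("email", m.group()) for m in EMAIL.finditer(text)]
    entities += [("phone", m.group()) for m in PHONE.finditer(text)]
    return entities

def to_xml(entities):
    """Stage 3: aggregate the extracted facts into a fixed-format output."""
    root = ET.Element("entities")
    for kind, value in entities:
        ET.SubElement(root, "entity", type=kind).text = value
    return ET.tostring(root, encoding="unicode")

html = "<p>Contact <b>Jane</b> at jane.doe@example.com or +1 555 010 9999.</p>"
entities = extract(parse(html))
xml_output = to_xml(entities)
print(xml_output)
```

Each stage consumes the previous stage's output, so individual modules can be replaced (for example, swapping the regular expressions for a trained sequence model) without changing the rest of the pipeline.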

The following screenshots demonstrate a sample output of this subsystem:

Source HTML document

Generated XML output

Source document extended with automatically annotated entities


Search Engine

This component allows Zoral Labs to index a set of documents for fast search. The search component supports both batch and incremental indexing. Each document can be described as a set of text fields. A search query is defined as a set of keywords and phrases joined by logical operations such as OR, AND, NOT. The query can refer to any number of the document’s fields.

Example of a ‘vertical search engine’ query:

  • FIELD1: (keyword1 OR keyword2) AND NOT keyword3
  • AND FIELD2: keyword1 AND “exact phrase 1”
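To show how a fielded query like the one above might be evaluated, here is a minimal inverted-index sketch in which OR, AND and NOT map to set union, intersection and difference. The `FieldIndex` class and the sample documents are hypothetical, relevance ranking is omitted, and this is not a description of the actual Search Engine component.

```python
import re
from collections import defaultdict

class FieldIndex:
    """Tiny per-field inverted index that supports incremental add()."""
    def __init__(self):
        self.postings = defaultdict(set)  # (field, word) -> document ids
        self.tokens = {}                  # doc id -> {field: [words]}

    def add(self, doc_id, fields):
        self.tokens[doc_id] = {}
        for field, text in fields.items():
            words = re.findall(r"\w+", text.lower())
            self.tokens[doc_id][field] = words
            for word in words:
                self.postings[(field, word)].add(doc_id)

    def term(self, field, word):
        return set(self.postings.get((field, word.lower()), ()))

    def phrase(self, field, phrase):
        words = re.findall(r"\w+", phrase.lower())
        # Candidates contain every word; then confirm the words are adjacent.
        candidates = set.intersection(*(self.term(field, w) for w in words))
        needle = " " + " ".join(words) + " "
        return {d for d in candidates
                if needle in " " + " ".join(self.tokens[d].get(field, [])) + " "}

def example_query(index):
    """FIELD1: (keyword1 OR keyword2) AND NOT keyword3
       AND FIELD2: keyword1 AND "exact phrase 1" """
    field1 = (index.term("FIELD1", "keyword1") | index.term("FIELD1", "keyword2")) \
        - index.term("FIELD1", "keyword3")
    field2 = index.term("FIELD2", "keyword1") & index.phrase("FIELD2", "exact phrase 1")
    return field1 & field2

index = FieldIndex()
index.add(1, {"FIELD1": "keyword1 overview",
              "FIELD2": "keyword1 appears with exact phrase 1 here"})
index.add(2, {"FIELD1": "keyword2 and keyword3", "FIELD2": "keyword1 exact phrase 1"})
index.add(3, {"FIELD1": "keyword2 notes", "FIELD2": "keyword1 phrase exact 1"})
first_result = example_query(index)

# Incremental indexing: a later document is searchable immediately.
index.add(4, {"FIELD1": "keyword2", "FIELD2": "exact phrase 1 with keyword1"})
second_result = example_query(index)
print(first_result, second_result)
```

Document 2 is excluded by the NOT clause and document 3 by the exact-phrase requirement, which is why only documents 1 and 4 are returned.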


During the search, the relevancy measure is calculated for each search result in order to select the most relevant documents.

The search results can be clustered and labeled automatically, on the fly.

The ability to search and cluster documents helps to quickly analyze the natural organization of the document universe.

Calculation Cluster

The calculation cluster is a set of highly scalable, cloud-enabled FreeBSD servers managed by Sun Grid Engine. It is used for distributed calculations, load balancing and distributed in-memory caching. The cluster is designed to run any of the platform tasks, such as categorization, clustering, searching, data mining, etc.

Document and content categorization usage scenario

The following example scenario demonstrates how the Zoral Labs document auto-categorization platform can be used to create categories and to categorize a set of documents or heterogeneous content for the enterprise.

1. Upload and index the documents / content

2. Create categories

3. Run clustering tool and review natural document clusters

4. Correct the category tree (if needed)

5. Find several keywords or patterns (by reviewing one of the interesting clusters)

6. Form and test the search query using discovered keywords

7. Move search results to corresponding category

8. Train the categorizer and run it on uncategorized documents

9. Review the categorization results, correct if needed (and retrain the categorizer)

10. Move categorized documents to the corresponding categories

11. Repeat steps 1 - 9

 

For more details please contact us at sales@zorallabs.com