
Entity to Sentiment and Entity to Entity relationship extraction

The goal of SENTRA and its innovative, component-based algorithms is to obtain granular information about both the feelings and the facts that writers express in positive or negative reviews, comments and questions, by analyzing a large number of on-line sources and documents. This includes extraction of named entities, anaphora resolution, annotation of sentence-level sentiments and mining of entity-entity and entity-sentiment relations. To date, such systems have delivered low quality (around 70%).

Significant innovation is required to boost quality above the 90% mark.

The Sentiment/Relationship mining components of SENTRA deal with real-world web content, including news and product review websites, blogs and forums, as well as on-line reference data sources such as corporate websites, on-line databases, unstructured encyclopedias and social networks. Due to the huge volume of data, the architecture of the system is designed to be distributed, grid-based, cloud-enabled and highly scalable. The system covers multiple domains and languages, which means that most of the system's analytical components are configurable with language-specific and domain-specific knowledge.

The figure below shows SENTRA’s typical web text-mining lifecycle.

The first step is the retrieval of unstructured or semi-structured information from the Web. This information consists of the textual and media content present in web pages and attachments. Crawling is implemented by the Intelligent Crawler component, as shown in the figure below.

The Intelligent Crawler component of SENTRA uses a user-defined Driver List of predefined sources, such as news and blog RSS feeds or news and social media aggregators (for example, a Moreover or Lexis Nexis feed), to automatically extract the latest information.

The second step is pre-processing, which creates a repository of the information gathered from the web, represented in normalized form. During this phase each document is parsed into UTF-8 encoded text and related meta-information such as source name, crawling date, etc. This step is implemented by the Parser and Meta-information Extractor sub-components.
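As a minimal sketch of what such a normalized record might look like, the Python snippet below decodes raw crawled bytes into text and attaches the source name and crawling date. The `NormalizedDocument` type, the `normalize` function and the idea that the crawler reports a declared encoding are all assumptions made for illustration; they are not SENTRA's actual Parser or Meta-information Extractor.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class NormalizedDocument:
    """Hypothetical record for one pre-processed document."""
    text: str             # decoded body text, stored/serialized as UTF-8
    source_name: str      # meta-information: where the document came from
    crawl_date: datetime  # meta-information: when it was crawled


def normalize(raw: bytes, declared_encoding: str, source_name: str,
              crawl_date: datetime) -> NormalizedDocument:
    # Decode with the encoding reported for the page (an assumption here);
    # undecodable bytes are replaced instead of discarding the document.
    text = raw.decode(declared_encoding, errors="replace")
    # Collapse runs of whitespace left over after markup removal.
    text = " ".join(text.split())
    return NormalizedDocument(text, source_name, crawl_date)


# Illustrative input: a Latin-1 encoded snippet from a made-up source.
doc = normalize("Caf\u00e9  review:\n great".encode("latin-1"), "latin-1",
                "example-blog", datetime(2024, 1, 1, tzinfo=timezone.utc))
utf8_bytes = doc.text.encode("utf-8")  # the form written to the repository
```

Keeping the text and the meta-information in one record means later steps can filter or aggregate by source and date without re-reading the original page.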

The system architecture implements focused mining: it is designed to gather and process only documents on a specific set of topics for a specific user or user instance. This capability is implemented by the Categorization step, which is composed of three sub-components: Language Detector, Categorizer and Filter.
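The interplay of the three Categorization sub-components can be sketched in Python as follows. This is a toy illustration only: the stop-word based language guess, the keyword-based topic matching, and the names `STOPWORDS`, `TOPICS`, `detect_language`, `categorize` and `keep_document` are all hypothetical and stand in for whatever models SENTRA actually uses.

```python
from typing import Optional

# Toy stop-word lists standing in for a real language detector (assumption).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "und", "ist", "das"},
}

# Hypothetical per-user topic configuration: topic -> trigger keywords.
TOPICS = {
    "smartphones": {"phone", "battery", "screen"},
    "banking": {"loan", "account", "bank"},
}


def detect_language(text: str) -> Optional[str]:
    """Language Detector: pick the language with the most stop-word hits."""
    words = set(text.lower().split())
    scores = {lang: len(words & stop) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def categorize(text: str) -> set:
    """Categorizer: return every topic whose keywords occur in the text."""
    words = set(text.lower().split())
    return {topic for topic, keywords in TOPICS.items() if words & keywords}


def keep_document(text: str, wanted_langs: set, wanted_topics: set) -> bool:
    """Filter: keep only documents in a wanted language and on a wanted topic."""
    if detect_language(text) not in wanted_langs:
        return False
    return bool(categorize(text) & wanted_topics)


kept = keep_document("The battery of the phone is great", {"en"}, {"smartphones"})
```

Running the filter this early means the more expensive annotation steps only see documents that a given user instance actually cares about.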

The Annotation and Relation Extraction steps are implemented by the following sub-components:

- Sentence and Quotation Annotator: splits raw text into sentences. In the simplest cases “.”, “!” and “?” can be treated as the end of a sentence. However, these symbols are also used in many other contexts such as abbreviations, product and company names, dates, prefixes, etc. The sentence annotator handles all these cases using a rich set of rules and machine learning techniques;

- Named Entities Annotator: detects entities and their categories (product names, company names, persons, dates, etc.). It is based on machine learning, rules and automatically updated reference data.

- Deep Parser: generates the parsed syntax tree, grammatical features (parts of speech, verb tense, gender of nouns, noun-number tags) and dependency relations (named relations between pairs of words or phrases). In a syntax tree, each constituent is a word or a group of words that functions as a single unit within the hierarchical structure of a sentence. The set of grammatical relations defines a directed graph over the words of a sentence. Common examples are the subject and object relations between a verb and a noun. The deep parser also performs semantic generalization of dependencies, including the following steps:

  • Passive voice constructions are normalized: passive subjects are converted into objects, and the corresponding verbs are tagged with a passive-voice feature;
  • Entities are treated as single words, which avoids generating unnecessary dependencies between entity tokens;
  • Prepositional relations are generalized to collapse object and complement into a single relation;
  • Comparative relations are extracted between objects that are explicitly compared in the sentence;
  • Frequent collocations that play the role of conjunctions or logical connectors are collapsed into a single dependency (e.g. "as soon as", "as a result of").

 

- Anaphora Resolver: uses morphological features (POS, gender, etc.), grammatical features (subject, direct object, etc.), entity annotations (product, organization, person, etc.) and document structure annotations (headings, paragraphs, sentences, quotations) to produce co-reference chain annotations (lists of co-referent entity mentions in the text) and to identify anaphoric references in the text with a high degree of accuracy (over 90%). An additional, advanced Anaphora Resolver component uses entity, entity-feature and topic relationships discovered from Wikipedia to resolve any remaining anaphora ambiguities.

- Sentiment Annotator: annotates sentiment units in the text and assigns the corresponding polarity (positive, negative). It is based on a shallow parsing technique configured with extensive sets of rules and a knowledge base representing the language/domain model. Target accuracy and coverage: over 95%.

- Object-Sentiment Association sub-Module: assigns the correct sentiment to each object mentioned in a sentence. It uses the syntax tree, the grammatical relations graph, the anaphora resolution output, annotated quotations and sentiment units to enrich the grammatical relations graph with information about sentiments. Additionally, extensive rules are added to boost the overall accuracy of the sub-module. Target accuracy: over 90%.

- Object-Object Relations Extractor: extracts predefined as well as new relations between entities, such as product-company, company-partner, company-competitor, company-client and company-acquirer relations.
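To make one of the Deep Parser's generalization steps more concrete, the Python sketch below normalizes a passive construction over a list of dependency triples. It is an illustration under stated assumptions, not SENTRA's parser: the relation labels (`nsubjpass`, `agent`, `nsubj`, `dobj`) follow a Stanford-dependencies style chosen for the example, the `Dep` type and `normalize_passive` function are hypothetical, and promoting the "by"-agent to subject goes beyond what the description above states.

```python
from typing import List, NamedTuple, Set, Tuple


class Dep(NamedTuple):
    relation: str   # e.g. "nsubj", "dobj", "nsubjpass", "agent" (assumed labels)
    head: str       # governing word, here the verb
    dependent: str  # dependent word; entities are kept as single tokens


def normalize_passive(deps: List[Dep]) -> Tuple[List[Dep], Set[str]]:
    """Rewrite passive dependencies into their active-voice equivalents.

    Returns the rewritten dependencies and the set of verbs tagged passive.
    """
    passive_verbs = {d.head for d in deps if d.relation == "nsubjpass"}
    normalized = []
    for d in deps:
        if d.relation == "nsubjpass":
            # The passive subject is the logical object of the verb.
            normalized.append(Dep("dobj", d.head, d.dependent))
        elif d.relation == "agent" and d.head in passive_verbs:
            # Assumption: the "by"-agent becomes the logical subject.
            normalized.append(Dep("nsubj", d.head, d.dependent))
        else:
            normalized.append(d)
    return normalized, passive_verbs


# "Nokia was acquired by Microsoft."
deps = [Dep("nsubjpass", "acquired", "Nokia"),
        Dep("agent", "acquired", "Microsoft")]
normalized, passive = normalize_passive(deps)
```

After normalization, "Nokia was acquired by Microsoft" and "Microsoft acquired Nokia" yield the same subject and object relations, so a downstream extractor of company-acquirer relations would need only one pattern for both phrasings.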

The last step in the data lifecycle is represented by the Analytical Database (DB) and Data-mining subsystem. The Analytical DB is a distributed, vectorized database that can handle huge volumes of transactions (the time-series sentiments, facts and relations extracted in previous steps) with high performance.
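The kind of time-series transaction such a database stores, and a simple aggregation over it, might be sketched in Python as below. The `SentimentTransaction` fields, the +1/-1 polarity encoding and the `daily_net_sentiment` function are assumptions for illustration; the actual Analytical DB schema is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SentimentTransaction:
    """Hypothetical shape of one extracted entity-sentiment transaction."""
    day: date      # when the source document was published or crawled
    entity: str    # the entity the sentiment is attached to
    polarity: int  # +1 for positive, -1 for negative (assumed encoding)
    source: str    # where the sentiment was found


def daily_net_sentiment(transactions):
    """Sum polarities per (entity, day) to obtain a simple time series."""
    totals = defaultdict(int)
    for t in transactions:
        totals[(t.entity, t.day)] += t.polarity
    return dict(totals)


# Made-up sample data for a fictional product.
sample = [
    SentimentTransaction(date(2024, 1, 1), "AcmePhone", +1, "blog"),
    SentimentTransaction(date(2024, 1, 1), "AcmePhone", +1, "news"),
    SentimentTransaction(date(2024, 1, 1), "AcmePhone", -1, "forum"),
    SentimentTransaction(date(2024, 1, 2), "AcmePhone", -1, "forum"),
]
series = daily_net_sentiment(sample)
```

A production system would push this grouping into the database itself; the sketch only shows why storing one row per extracted sentiment makes time-based roll-ups straightforward.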

Most of the components described above use multiple machine learning algorithms and extensive NLP rules. The Knowledge-management subsystem provides the environment for managing the annotated corpora used in training, re-training and testing the sub-components, including regression testing and advanced data quality control capabilities. The output of the SENTRA system is a set of structured, time-series transactions that can be used to enable a variety of vertical search, data discovery and near real-time Business Intelligence applications across a number of industries and subject areas.