How does Question Processing Work with Concepts and Tokens and Stemming? (Doc ID 2226973.1)

Last updated on MAY 15, 2017

Applies to:

Oracle Knowledge - Version 8.6 and later
Information in this document applies to any platform.

Goal

First read the overview of natural language processing (NLP) in the KM article How does Natural Language Search work and what are some recommended best Practices? (Doc ID 1039007.1), which gives an overview of NLP and Oracle Knowledge.

Each search request has a question (a string of words called tokens), a request language, and a number of result locales.

The question is interpreted and processed ONLY in the request language. If the request language is English and you type a German or Japanese question, the search performs only keyword or exact string matching against the documents in the result locales. To use NLP and concepts, the tokens must be in the request language, because only the synonyms for the request language are consulted. For more information on cross-locale searches, see the KM article What Results to Expect With Cross-Languages Search from OOTB ui.jsp or Infocenter? (Doc ID 1542934.1). The request language is determined by the user's profile in Information Manager, as described in the KM article How does the Preferred Language Selector For InfoCenter Search Work (Doc ID 2163709.1).
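The effect of the request language can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the product's implementation: the `CONCEPTS` dictionaries, the concept name `CONCEPT_INSTALL`, and the `interpret` function are all hypothetical.

```python
# Hypothetical synonym dictionaries, one per request language.
# Real dictionaries are maintained in the product, not in code like this.
CONCEPTS = {
    "en": {"install": "CONCEPT_INSTALL", "setup": "CONCEPT_INSTALL"},
    "de": {"installieren": "CONCEPT_INSTALL", "einrichten": "CONCEPT_INSTALL"},
}

def interpret(question, request_language):
    """Split a question into matched concepts and plain keywords.

    Only the synonyms of the request language are consulted, so a word
    from another language can never match a concept here.
    """
    synonyms = CONCEPTS.get(request_language, {})
    concepts, keywords = [], []
    for token in question.lower().split():
        if token in synonyms:
            concepts.append(synonyms[token])
        else:
            keywords.append(token)
    return concepts, keywords

# A German question under an English request language: keywords only.
english_request = interpret("software installieren", "en")
# The same question under a German request language: one concept matched.
german_request = interpret("software installieren", "de")
```

In this sketch the German word is treated as a bare keyword when the request language is English, which mirrors the keyword-only matching described above.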

To understand how the index is created and how the articles have been indexed, you can use the Service Browser, described in the KM article How can I use the Service Browser tool? (Doc ID 1039105.1). Internally, words are referred to as tokens, stems, and senses. Tokens are all the words in the question and in the articles. These words may be stemmed to their root word; for example, installing and installed both become install in the stemming process. Always use the root stem word in a concept; you do not need separate synonyms for installing and installed. The senses are the synonyms themselves.
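As a rough illustration of the difference between tokens and stems, the sketch below uses a deliberately naive suffix stripper. The `SUFFIXES` list and both functions are assumptions made for this example; the actual stemmer in the product is language-specific and not shown in this article.

```python
import re

# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "s")

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("Installing and installed products")
stems = [stem(token) for token in tokens]
```

Here installing and installed collapse to the single stem install, which is why a concept only needs the root word as a synonym.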

In the Service Browser you will find the index lists for tokens, stems, and senses. These lists are the part of the index that shows where these words are used in the documents as part of excerpts.
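Conceptually, each of these lists behaves like an inverted index that maps a word to the places where it occurs. The following minimal sketch assumes a simple word-position model; the function name `build_token_index` and the sample documents are hypothetical, and the real index format visible in the Service Browser will differ.

```python
from collections import defaultdict

def build_token_index(documents):
    """Map each token to the (document id, word position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return dict(index)

# Two hypothetical documents.
documents = {"a": "install product", "b": "product manual"}
token_index = build_token_index(documents)
```

A stems list or senses list could be built the same way by indexing the stem or the concept of each word instead of the raw token.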

The tokens are the words in the index. All words go into the index during the crawl, and all words in the question are considered unless they are ruled out by specific rules, such as the skiplist rule. The tokens are then stemmed to make up a list of basic words. If you have product and products in your content, both will be tokens, but they will be stemmed to one word, product. The senses are the words that have synonyms, and they are used if the question word is identified as a synonym. These are internal definitions of words and of how they are processed, as they relate to the statements in the logs and in the Service Browser.

When the question words are evaluated for the language being used, the words are matched to the synonyms of that language and stemmed for that language. The rules then rank these words to find the most appropriate documents for the question.

If more than one language is specified for the results, concepts from the request language can also be used to find articles related to all the synonyms in that concept. The synonyms are not limited to the request language; they are only identified by the request language. So if there is a Spanish synonym in an English document, it is also considered.
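The flow described here (skiplist, stemming, synonym lookup, ranking) can be sketched as follows. Everything in this example is an assumption made for illustration: the `SKIPLIST` and `SENSES` contents, the concept name `PRODUCT`, the one-rule stemmer, and the overlap-count ranking, which stands in for the product's actual ranking rules.

```python
# Hypothetical skiplist and synonym (sense) table.
# The Spanish word "producto" is a synonym in the same concept.
SKIPLIST = {"how", "do", "i", "the", "a"}
SENSES = {"item": "PRODUCT", "product": "PRODUCT", "producto": "PRODUCT"}

def stem(token):
    """Toy stemmer: strip a trailing plural 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze(text):
    """Drop skiplist words, stem the rest, and map synonyms to their concept."""
    terms = set()
    for token in text.lower().split():
        if token in SKIPLIST:
            continue
        stemmed = stem(token)
        terms.add(SENSES.get(stemmed, stemmed))
    return terms

def rank(question, documents):
    """Order documents by how many analyzed terms they share with the question."""
    question_terms = analyze(question)
    scored = [
        (len(question_terms & analyze(body)), doc_id)
        for doc_id, body in documents.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

documents = {
    "doc_en": "return the items",
    "doc_es": "devolver el producto",
    "doc_other": "reset password",
}
ranking = rank("how do i return products", documents)
```

In this sketch the Spanish document is still found through the shared concept, even though the question is analyzed with a single synonym table, which is the cross-locale effect described above.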

Parts of speech are not generally used. There are no out-of-the-box (OOB) concepts that are specific to a part of speech. Creating one is a real edge case, and it has been done in only a couple of business-specific cases. Likewise, there is no need to divide locales into different concepts, although that has also been done. Both can be done if needed.
