When Using the WORLD_LEXER, Indexing Certain 3-byte UTF8 Characters Fails with DRG-11302, DRG-11428: document contains invalid characters (Doc ID 1551860.1)

Last updated on MAY 08, 2013

Applies to:

Oracle Text - Version 11.1.0.7 to 11.2.0.3 [Release 11.1 to 11.2]
Information in this document applies to any platform.

Symptoms

When the database character set is AL32UTF8 and the WORLD_LEXER is used, indexing documents that contain certain 3-byte UTF-8 characters fails with the following errors:

DRG-11301: error while indexing document
DRG-11302: document may be partially indexed
DRG-11428: document contains invalid characters

The following are some of the characters that cannot be indexed with the WORLD_LEXER:

Code point  UTF-8 bytes  Character name
U+215E      0xE2859E     VULGAR FRACTION SEVEN EIGHTHS
U+215D      0xE2859D     VULGAR FRACTION FIVE EIGHTHS
U+215C      0xE2859C     VULGAR FRACTION THREE EIGHTHS
U+2158      0xE28598     VULGAR FRACTION FOUR FIFTHS

U+2140      0xE28580     DOUBLE-STRUCK N-ARY SUMMATION
U+2141      0xE28581     TURNED SANS-SERIF CAPITAL G
U+2149      0xE28589     DOUBLE-STRUCK ITALIC SMALL J
U+215F      0xE2859F     FRACTION NUMERATOR ONE
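
The byte values listed above follow directly from the standard UTF-8 encoding of each code point. The Python sketch below is not part of the original note. It reproduces the listed byte sequences and shows one way to locate these characters in a document before it is indexed. The names LISTED_CODE_POINTS, utf8_hex and find_listed_characters are hypothetical helpers introduced for illustration, and the set covers only the characters named in this note, which describes them as "some of" the affected characters.

```python
import unicodedata

# Code points named in this note as failing under the WORLD_LEXER.
# Assumption: the note lists only some of the affected characters,
# so this set is illustrative, not exhaustive.
LISTED_CODE_POINTS = [
    0x215E, 0x215D, 0x215C, 0x2158,
    0x2140, 0x2141, 0x2149, 0x215F,
]


def utf8_hex(code_point):
    """Return the UTF-8 encoding of a code point as an uppercase hex string."""
    return "0x" + chr(code_point).encode("utf-8").hex().upper()


def find_listed_characters(text):
    """Return (offset, code point label) pairs for listed characters in text."""
    listed = set(LISTED_CODE_POINTS)
    return [
        (offset, "U+%04X" % ord(ch))
        for offset, ch in enumerate(text)
        if ord(ch) in listed
    ]


if __name__ == "__main__":
    # Print each listed code point with its UTF-8 bytes and Unicode name.
    for cp in LISTED_CODE_POINTS:
        name = unicodedata.name(chr(cp), "UNKNOWN")
        print("U+%04X %s  %s" % (cp, utf8_hex(cp), name))
```

For example, find_listed_characters applied to a string containing U+215E at offset 1 returns a single pair giving that offset and the label U+215E, while a string with none of the listed characters returns an empty list.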

When the AUTO_LEXER is used, these characters can be indexed and are text-searchable.

Cause
