Oracle SES 11g Crawler does not detect correct Title and is not able to fetch additional Attribute
Last updated on FEBRUARY 11, 2017
Applies to: Oracle Secure Enterprise Search - Version 11.1.2 and later
Information in this document applies to any platform.
An attempt to crawl and index a web site with Oracle SES 11g does not work as expected. The SES crawler does not seem to find the title or the meta tags, and it does not seem to find any links either, so it indexes only the starting page.
Dumping the start of the file shows an unexpected three-byte sequence (octal 357 273 277) at the very beginning:
0000000 357 273 277 \r \n < ! D O C T Y P E h
0000020 t m l P U B L I C " - / / W
0000040 3 C / / D T D X H T M L 1 .
0000060 0 T r a n s i t i o n a l / /
0000100 E N " " h t t p : / / w w w .
0000120 w 3 . o r g / T R / x h t m l 1
0000140 / D T D / x h t m l 1 - t r a n
0000160 s i t i o n a l . d t d " > \r \n
0000200 < h t m l x m l n s = " h t t
0000220 p : / / w w w . w 3 . o r g / 1
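These three bytes (0xEF 0xBB 0xBF) are the UTF-8 byte order mark (BOM), the encoded form of U+FEFF (zero-width no-break space). Its presence at the beginning of the document seems to prevent SES from identifying the file as an HTML file.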
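The leading bytes in the dump can be checked programmatically. The following is a minimal Python sketch, not part of SES, for confirming whether a page's raw bytes start with the UTF-8 BOM. The helper names `has_utf8_bom` and `strip_utf8_bom` and the sample byte string are hypothetical and used only for illustration. This note does not state that removing the BOM resolves the crawler behaviour, so treat the stripping helper as a diagnostic aid only.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# Illustrative sample: the first bytes of the dump, with the BOM written
# in octal exactly as it appears there (357 273 277).
dump_prefix = bytes([0o357, 0o273, 0o277]) + b"\r\n<!DOCTYPE html PUBLIC"

print(has_utf8_bom(dump_prefix))    # whether the page starts with the BOM
print(strip_utf8_bom(dump_prefix))  # the same bytes with the BOM removed
```

Running the same check against the raw bytes of each affected page, as served by the web server, shows whether the BOM is present on all of them or only on some.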
All documents appear to receive a fallback title (part of the DOCTYPE declaration of the HTML page) and unusual-looking snippets (also taken from the start of the document).
It is also not possible to fetch additional metadata elements from the HTML head section (attributes added under Global Settings - Search Attributes and then manually added to the crawler definition).
The Oracle® Secure Enterprise Search Administrator's Guide, 11g Release 2 (11.2.1), Part Number E17332-04, clearly states: "You can define additional HTML metatags to map to a String attribute on the Home - Sources - Metatag Mapping page".