IM crawl overview and common problems - not returning documents (Doc ID 1516510.1)

Last updated on MARCH 01, 2017

Applies to:

Oracle Knowledge - Version 8.1.2.1 and later
Information in this document applies to any platform.
The IM crawler connects to the IM URL provided in the collection definition. From there it gets the repository data and the list of articles to crawl. After getting the repository information, it downloads all documents using the RESOURCE_HOST_URL or the IM console resources definition --> Published content URL prefix* value. The article display URLs for search are built using the build URL in the collection definition. The attachment display URLs are built using the Published content URL prefix* value or RESOURCE_HOST_URL. The crawler downloads the content XML and attachments and creates the IQXML files for that content, which the rest of the crawler uses in preprocessing, indexing and classification.
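As a rough illustration of why the prefix values matter, the sketch below joins a URL prefix and a relative resource path. The function name, host name and path layout are hypothetical and are not taken from the product; the sketch only shows that every download or display URL inherits whatever prefix is configured, so a wrong or unreachable prefix affects every document.

```python
def build_display_url(prefix, relative_path):
    """Join a URL prefix and a relative path with exactly one slash between them.

    Both arguments are illustrative; the real path layout under the
    resource mount point is not described in this note.
    """
    return prefix.rstrip("/") + "/" + relative_path.lstrip("/")


# Hypothetical values standing in for RESOURCE_HOST_URL and a resource path.
example_prefix = "http://imhost.example.com/resources/"
example_path = "/content/live/example.xml"
example_url = build_display_url(example_prefix, example_path)
print(example_url)
```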

If IM and the indexer run on the same box, file system access can be substituted for the HTTP connection when crawling the IM content. The Intelligent Search Admin guide says of the Force HTTP Access option: "Specify whether to access content using only HTTP. If the Information Manager repository and the content processor reside on the same processor, the content processor attempts to access certain data via the file system rather than HTTP by default, to reduce overhead. This option specifies that only HTTP access be used, which may be helpful in diagnosing content processing issues." (More details in #4 below.)

When adding a new IM collection to crawl, verify the following:
1 - The new channel content matches the publishing status of the collection. For example, if the collection publishing status is published, the channel must contain published content.
2 - The resource mount point has XML in live or staging for the new channel, and the RESOURCE_HOST_URL is working to bring the content back to the crawler.
3 - The resource mount point has content XML for the new channel and for the docs that you are trying to crawl. If it does not, run a buildxml from the repository, channel list, buildxml button.
4 - The collection definition matches other working IM collection definitions. If in staging or production, make sure that the appropriate config overrides have been added to the custom.xml for this collection.
5 - A full content crawl has been done on the collection. An incremental crawl cannot be done first. The full crawl can be run for a single new collection if needed.
6 - The collection tab in System Manager shows content indexed for the new collection.
7 - The usergroups and views work for the users being used to test the content search.
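
The RESOURCE_HOST_URL check in item 2 can be done from the indexer host with any HTTP client. The following is a minimal sketch using only the Python standard library; the function name and the example URL are hypothetical, and the real URL should be one built from your own RESOURCE_HOST_URL or Published content URL prefix value. The opener parameter exists only so the function can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def check_resource_url(url, timeout=10, opener=urllib.request.urlopen):
    """Issue one HTTP GET and return (ok, detail).

    ok is True only for a 2xx response. This is an illustrative helper,
    not part of Oracle Knowledge.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return 200 <= status < 300, f"HTTP {status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout and similar problems.
        return False, f"connection failed: {exc}"


if __name__ == "__main__":
    # Hypothetical URL; replace with a content XML URL from your environment.
    ok, detail = check_resource_url("http://imhost.example.com/resources/example.xml")
    print("reachable" if ok else "NOT reachable", "-", detail)
```

Run it on the machine where the indexer runs, since a URL that works from a desktop browser may still be blocked or unresolvable from the crawler host.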

Symptoms

The IM crawl suddenly returns zero documents or is missing documents, even though the documents can be found using IM find. End users may complain that content is missing while the crawls appear to succeed, but the System Manager Collections page may show only a small number of documents, or 0 documents. Check the crawl job in System Manager for the first task/collection that shows warnings or errors. The relevant messages may be visible in System Manager; at other times you will have to look at the indexer logs.

Cause
