Elasticsearch Document Crawl Downloads Unnecessary Files (like .zip, .7zip, .mdb, .pdf, .jpg) Which Increases the Indexing Time (Doc ID 2775099.1)

Last updated on JULY 25, 2023

Applies to:

Oracle WebCenter Portal - Version 12.2.1.4.0 and later
Information in this document applies to any platform.

Symptoms

ACTUAL BEHAVIOR

The Elasticsearch Document Crawl downloads unnecessary files (such as .zip, .7zip, .mdb, .pdf, .jpg), which increases the indexing time.
The Elasticsearch log shows messages like the following:

[<DATE-TIME>][INFO ][o.w.s.e.c.AttachmentProcessor]
AttachmentProcessor.execute: Starting content crawling for the document:{dDocName=<DOC_NAME>, fileName=<FILENAME>.7z,
docUrl=http://<HOST>:<PORT>/cs/idcplg?IdcService=SES_CRAWLER_DOWNLOAD_CONTENT&file=<FILE>.hda&source=default}

[<DATE-TIME>][INFO ][o.w.s.e.c.TikaExtractor ]
TikaExtractor.parse: Content type of the document =
application/x-7z-compressed, document info = {dDocName=<DOC_NAME>,fileName=<FILENAME>.7z,
docUrl=http://<HOST>:<PORT>/cs/idcplg?IdcService=SES_CRAWLER_DOWNLOAD_CONTENT&file=<FILE>.hda&source=default}

[<DATE-TIME>][INFO ][o.w.s.e.c.AttachmentProcessor]
AttachmentProcessor.execute: Content successfully crawled for the document:{dDocName=<DOC_NAME>, fileName=<FILENAME>.7z,

EXPECTED BEHAVIOR

Elasticsearch should not crawl files that cannot be indexed.

STEPS

The issue can be reproduced with the following steps:

1. Configure WebCenter Portal with Elasticsearch.
2. Upload documents of the following types to a portal:
   - txt
   - pdf
   - jpg
   - zip
   - 7z
   - mdb
3. Go to Administration > Tools and Services > Search Settings > Scheduler.
4. Select Documents Crawl > Schedule > Crawl All Items > Start Crawl now.
5. Check the Elasticsearch logs.
6. The logs show that all of the above file types are crawled.
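
To confirm which file types the crawl is actually downloading, the log can be tallied by file extension and by detected content type. The following is a minimal sketch, not an Oracle-supplied tool: it assumes the log entries look like the excerpt under ACTUAL BEHAVIOR (including entries that wrap across lines), and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed from the log excerpt: the AttachmentProcessor "Starting content
# crawling" entry carries a fileName=<name> field inside {...}, possibly
# wrapped over several lines.
CRAWL_START_RE = re.compile(
    r"Starting content crawling for the document:\{[^}]*?fileName=([^,}\n]+)"
)

# Assumed from the log excerpt: the TikaExtractor entry reports the detected
# MIME type after "Content type of the document =", possibly on the next line.
CONTENT_TYPE_RE = re.compile(r"Content type of the document =\s*([^,\s]+)")


def count_crawled_extensions(log_text):
    """Count crawl-start entries per lower-cased file extension."""
    counts = Counter()
    for match in CRAWL_START_RE.finditer(log_text):
        name = match.group(1).strip()
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
        counts[ext] += 1
    return counts


def count_content_types(log_text):
    """Count TikaExtractor entries per detected content type."""
    return Counter(CONTENT_TYPE_RE.findall(log_text))
```

Reading the Elasticsearch log file into a string and passing it to both functions gives a quick view of how much of the crawl is spent on archives, images, and other files that are not expected to be indexed.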

Changes

Cause
