Elasticsearch Document Crawl Downloads Unnecessary Files (like .zip, .7zip, .mdb, .pdf, .jpg) Which Increases the Indexing Time
(Doc ID 2775099.1)
Last updated on JULY 25, 2023
Applies to:
Oracle WebCenter Portal - Version 12.2.1.4.0 and later
Information in this document applies to any platform.
Symptoms
ACTUAL BEHAVIOR
Elasticsearch Document Crawl downloads unnecessary files (like .zip, .7zip, .mdb, .pdf, .jpg), which increases the indexing time.
The Elasticsearch log shows messages like this:
AttachmentProcessor.execute: Starting content crawling for the document:{dDocName=<DOC_NAME>, fileName=<FILENAME>.7z,
docUrl=http://<HOST>:<PORT>/cs/idcplg?IdcService=SES_CRAWLER_DOWNLOAD_CONTENT&file=<FILE>.hda&source=default}
[<DATE-TIME>][INFO ][o.w.s.e.c.TikaExtractor ]
TikaExtractor.parse: Content type of the document =
application/x-7z-compressed, document info = {dDocName=<DOC_NAME>,fileName=<FILENAME>.7z,
docUrl=http://<HOST>:<PORT>/cs/idcplg?IdcService=SES_CRAWLER_DOWNLOAD_CONTENT&file=<FILE>.hda&source=default}
[<DATE-TIME>][INFO ][o.w.s.e.c.AttachmentProcessor]
AttachmentProcessor.execute: Content successfully crawled for the document:{dDocName=<DOC_NAME>, fileName=<FILENAME>.7z,
EXPECTED BEHAVIOR
Elasticsearch should not crawl files that cannot be indexed.
STEPS
The issue can be reproduced with the following steps:
- Configure WebCenter Portal with Elasticsearch.
- Upload some documents with the following types to a Portal:
- txt
- jpg
- zip
- 7z
- mdb
- Go to Administration > Tools and Services > Search Settings > Scheduler.
- Select Documents Crawl > Schedule > Crawl All Items > Start Crawl now.
- Look at the Elasticsearch logs.
The logs show that the file types listed above are being crawled.
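To quantify which file types the crawl is processing, the log can be summarized by file extension. The following is a minimal sketch, not an Oracle-provided tool. It assumes the log entries follow the "Starting content crawling for the document" format shown in the Symptoms section; the function name `count_crawled_extensions` is hypothetical, and file names that contain commas or closing braces would not be parsed correctly.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Matches the start-of-crawl message shown in the Symptoms section and
# captures the fileName value. Only the "Starting content crawling" message
# is matched, so each document is counted once even though its file name
# also appears in the TikaExtractor and "successfully crawled" messages.
START_RE = re.compile(
    r"Starting content crawling for the document:\{[^}]*?fileName=([^,}]+)"
)


def count_crawled_extensions(log_text):
    """Return a Counter mapping lowercase file extension to crawl count."""
    counts = Counter()
    for match in START_RE.finditer(log_text):
        file_name = match.group(1).strip()
        # Files without an extension are grouped under "(none)".
        extension = PurePosixPath(file_name).suffix.lower() or "(none)"
        counts[extension] += 1
    return counts
```

To use it, read the Elasticsearch log into a string (the log location depends on the installation), pass the string to `count_crawled_extensions`, and inspect the result with `most_common()` to see which extensions account for most of the crawled documents.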
Changes
Cause