Content Server Data Feeds are not Cleaned After Elasticsearch Crawl from WebCenter Portal (Doc ID 2826129.1)

Last updated on MARCH 18, 2024

Applies to:

Oracle WebCenter Portal - Version 12.2.1.4.0 and later
Information in this document applies to any platform.

Symptoms

ACTUAL BEHAVIOR

Content Server data feeds are not being cleaned after Elasticsearch Crawl from WebCenter Portal.

After a document crawl with Elasticsearch, the status of the data feed files is not updated in the defaultFeeds.hda file to mark the documents as processed.
As a result, the data feed files are not cleaned up, and the same data feed files are processed multiple times.
This consumes disk space for files that have already been processed and wastes time reading and discarding data feed files that had already been processed.

EXPECTED BEHAVIOR

The data feed files for a document should be deleted after the document is crawled.

STEPS

The issue can be reproduced with the following steps:

Look at the contents of this folder:

<DOMAIN_HOME>/ucm/cs/data/sescrawlerexport/datafeeds

There will be a defaultFeeds.hda file and a default folder.

The default folder will contain many folders and xml files.
These XML data feed files are not cleaned up.
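
To gauge how much data has accumulated, the folder check above can be scripted. The following is a minimal sketch in Python, not an Oracle-supplied tool: the function name summarize_feed_files is hypothetical, and it assumes only that the leftover data feed files are .xml files somewhere under the folder passed in.

```python
from pathlib import Path


def summarize_feed_files(feed_root):
    """Return (file_count, total_bytes) for every .xml file under feed_root.

    feed_root is assumed to be the "default" folder under
    <DOMAIN_HOME>/ucm/cs/data/sescrawlerexport/datafeeds (hypothetical usage).
    """
    root = Path(feed_root)
    file_count = 0
    total_bytes = 0
    # rglob walks all nested subfolders, since the feed files sit in many folders.
    for path in root.rglob("*.xml"):
        if path.is_file():
            file_count += 1
            total_bytes += path.stat().st_size
    return file_count, total_bytes
```

Running this before and after a crawl and comparing the two results gives a rough indication of whether any files were removed. The sketch only reads the file system and does not modify or delete anything.
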
Look at the portal diagnostic log.

The log shows that the document crawl reads all files listed in the defaultFeeds.hda file.
Each XML file is scanned repeatedly and skipped.
A message like the following is logged for the same data feed files every time a document crawl runs:

[<TIMESTAMP>] [WC_Portal] [TRACE] [] [oracle.webcenter.doclib.crawl.rss.RSSDataFeedFetcher] [tid: <TID>] [ecid: <ECID>] [APP: webcenter] [partition-name: DOMAIN] [tenant-name: GLOBAL] [SRC_CLASS: oracle.webcenter.doclib.crawl.rss.RSSDataFeedFetcher]
[SRC_METHOD: endElement] Discarding feed = <DATAFEED_FILE_NAME>.xml as the lastBuildDate = YYYY-MM-DDTHH:MI:SS.000Z is before or same as the last crawl time = YYYY-MM-DDTHH:MI:SSZ
[<TIMESTAMP>] [WC_Portal] [NOTIFICATION] [] [oracle.webcenter.doclib.crawl.rss.RSSDataFeedFetcher] [tid: <TID>] [ecid: <ECID>] [APP: webcenter]
[partition-name: DOMAIN] [tenant-name: GLOBAL] Discarding feed = <DATAFEED_FILE_NAME>.xml as the lastBuildDate = YYYY-MM-DDTHH:MI:SS.000Z is before or same as the last crawl time = YYYY-MM-DDTHH:MI:SSZ

[<TIMESTAMP>] [WC_Portal] [NOTIFICATION] [] [oracle.webcenter.doclib.crawl.rss.RSSDataFeedFetcher] [tid: <TID>] [ecid: <ECID>] [APP: webcenter]
[partition-name: DOMAIN] [tenant-name: GLOBAL] Data feed processing is complete. Data feed URL = http://<HOSTNAME>:<PORT>/cs/idcplg?IdcService=SES_CRAWLER_DOWNLOAD_FEED&file=<DATAFEED_FILE_NAME>.xml&source=default
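
To confirm that the same data feed files are being discarded on every crawl, the diagnostic log can be scanned for the "Discarding feed" message shown above. This is a minimal sketch in Python under stated assumptions: the regular expression is derived only from the sample messages in this document, and the function name count_discards is hypothetical.

```python
import re
from collections import Counter

# Pattern based on the sample log lines in this document (an assumption;
# the wording may differ in other versions or log configurations).
DISCARD_RE = re.compile(
    r"Discarding feed = (\S+) as the lastBuildDate = (\S+) "
    r"is before or same as the last crawl time = (\S+)"
)


def count_discards(lines):
    """Count how many times each data feed file name appears in a discard message."""
    counts = Counter()
    for line in lines:
        match = DISCARD_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Pass the function an iterable of log lines, for example an open file handle for the portal diagnostic log. Because the samples show the same discard logged at both TRACE and NOTIFICATION level, a single crawl may contribute more than one hit per file. Counts that keep growing across crawls for the same file names match the symptom described here.
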

Cause
