Endeca Crawl Does Not Parse PDF Data and Logs Exception: parser not found for contentType=application/pdf (Doc ID 2074230.1)

Last updated on NOVEMBER 10, 2015

Applies to:

Oracle Commerce Guided Search / Oracle Commerce Experience Manager - Version 6.1.3 and later
Information in this document applies to any platform.
An upgrade or migration to a modern version of the product was recently performed.

Symptoms

Summary

When executing a CAS web crawl, the crawl proceeds to completion.

However, WARN messages are logged during the crawl, and the resulting data set includes fewer records than expected.

Errors

The CAS Webcrawler crawl.log resembles the following snippet:

INFO 2015-10-30 00:02:15,661 0 com.endeca.itl.web.Main [main] Reading seed URLs from: /path/to/endeca/CAS/webcrawl-config/seedURLs.txt
INFO 2015-10-30 00:02:15,670 9 com.endeca.itl.web.Main [main] Seed URLs: [http://www.oracle.com/]
INFO 2015-10-30 00:02:17,998 2337 com.endeca.itl.web.db.CrawlDbFactory [main] Initialized crawldb: com.endeca.itl.web.db.BufferedDerbyCrawlDb
INFO 2015-10-30 00:02:17,999 2338 com.endeca.itl.web.Crawler [main] Using executor settings: numThreads = 100, maxThreadsPerHost=7
INFO 2015-10-30 00:02:18,718 3057 com.endeca.itl.web.Crawler [main] Fetching seed URLs.
INFO 2015-10-30 00:02:19,574 3913 com.endeca.itl.web.Crawler [main] Seeds complete.
WARN 2015-10-30 00:02:25,244 9583 com.endeca.itl.web.UrlProcessor [pool-1-thread-29] process (parse) http://www.oracle.com/pdfs/oracle_information.pdf exception caught while parsing
org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.oracle.com/pdfs/oracle_information.pdf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:75)
at com.endeca.itl.web.UrlProcessor.createOutput(UrlProcessor.java:336)
at com.endeca.itl.web.UrlProcessor.process(UrlProcessor.java:146)
at com.endeca.itl.web.Crawler.processUrlWork(Crawler.java:372)
at com.endeca.itl.web.Crawler.access$100(Crawler.java:37)
at com.endeca.itl.web.Crawler$3.work(Crawler.java:543)
at com.endeca.itl.web.HostPartitionedUrlWorkExecutor$1.run(HostPartitionedUrlWorkExecutor.java:95)
...
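
To gauge how many documents are affected, the crawl log can be scanned for this exception. The following is a minimal sketch, not an Oracle-supplied tool: it assumes the message format matches the excerpt above exactly, and the function name find_unparsed is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the ParseException message has the same shape as in the
# log excerpt above: "parser not found for contentType=<type> url=<url>".
PATTERN = re.compile(r"parser not found for contentType=(\S+)\s+url=(\S+)")


def find_unparsed(lines):
    """Group URLs that failed to parse by their reported content type.

    `lines` is any iterable of log lines (for example an open file).
    Returns a dict mapping content type to a list of URLs.
    """
    unparsed = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            content_type, url = match.groups()
            unparsed[content_type].append(url)
    return dict(unparsed)
```

For example, passing an open handle on crawl.log to find_unparsed and printing the length of each list shows how many URLs were skipped per content type, which can be compared against the number of records missing from the data set.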

Changes

The Oracle Commerce application was recently migrated to a newer version of the product.

The old version was release 2.x.

The new version is release 3.x (or later).

Cause
