How are "guidDup" or "copyDup" records Created in ISS? (Doc ID 2074792.1)

Last updated on NOVEMBER 05, 2015

Applies to:

Oracle Communications Indexing and Search Service - Version 1.0.5 and later
Information in this document applies to any platform.

Goal

How are Copy duplicate or "guidDup" / "copyDup" records created in ISS?

When a new message is delivered to ISS, two documents are created and indexed: one for the content (in the content index) and one for the metadata (in the meta index). A document for each attachment is also created in the content index, but these are tied to the primary content document, so they are not discussed further here.

The content document contains most of the information of the email. All parts of this data are constant: once created it never changes.  The meta document contains a smaller part of the data, and some fields may change over time, such as flags, the name of the folder, etc.  Together these documents allow the search service to search for any email in the account.

The IMAP specification provides the Copy command (but no Move command), which places a copy of an email in a different folder from the original. Because a single Copy command can copy a large number of emails (UIDs), updating the index can be very expensive. However, all the information in the content document is the same for every copy, so instead of creating a new content document and a new meta document for each copied email, only a new meta document is created, and the content document is shared among all the copies. The mechanism that enables this sharing, and that lets search and other operations distinguish between the copies, is a Copy Duplicate (CopyDup) record, which is written in the account record in the dIndex.

Each time an email using the same content document is copied, the count in its CopyDup record is incremented, and whenever an email using the same content document is deleted, the count is decremented. When the last email referencing the content document is deleted, the CopyDup record is also deleted.

Most of the time, users do not "Copy" emails, but instead "Move" them into different folders. This "Move" appears to the MS store and ISS as a Copy followed by an Expunge of the original email. The Copy creates the CopyDup record with a count of two, and the Expunge of the original email then decrements that count to one. A very common pattern is therefore a CopyDup record whose content document is referenced by a single email, and that email is the copy rather than the original.
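The counting rules described above can be modelled in a few lines. The following Python sketch is illustrative only: the class name CopyDupTable, its methods, and the dictionary keyed by a content-document id are all assumptions made for this example, and they do not reflect how ISS actually stores CopyDup records in the dIndex.

```python
class CopyDupTable:
    """Toy model of per-account CopyDup bookkeeping (hypothetical, not ISS code)."""

    def __init__(self):
        # content document id -> number of emails currently sharing it
        self.uses = {}

    def copy(self, content_id):
        # The first Copy creates the record with two uses (original + copy);
        # each later Copy of the same content increments the count.
        if content_id in self.uses:
            self.uses[content_id] += 1
        else:
            self.uses[content_id] = 2

    def expunge(self, content_id):
        """Return True when no email references the content document any more."""
        if content_id not in self.uses:
            # Never copied: the single email owned its content document.
            return True
        self.uses[content_id] -= 1
        if self.uses[content_id] == 0:
            # Last reference gone: the CopyDup record is deleted as well.
            del self.uses[content_id]
            return True
        return False

    def move(self, content_id):
        # A "Move" is seen as a Copy followed by an Expunge of the original.
        self.copy(content_id)
        return self.expunge(content_id)


table = CopyDupTable()
content_removable = table.move("content-1")
print(table.uses)         # {'content-1': 1}
print(content_removable)  # False
```

After the move, the record remains with a count of one, which is the common single-reference pattern noted above. The record only disappears once the moved email is itself expunged.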


How CopyDup records impact the system

By using CopyDup records instead of re-indexing the content document for each copy, the cost of a Copy event is greatly reduced: the most expensive part of indexing an email is processing its content (including document attachments), and most Copies avoid this. These savings come at the cost of more complicated processing in Expunge and in the search service, plus the overhead of maintaining the data in the dIndex. Over time, the number of CopyDup records in an account may grow quite large, since a copied email might never be Expunged (unlike an email moved to the Trash, which is eventually Expunged).

During event processing, when the number of CopyDup records in an account exceeds a threshold value (default 4000), IndexSvc logs a message such as:

Wed Oct 28 21:02:29 PDT 2015 com.sun.comms.iss.store.ContentDupManager  convertGuidListToMap WARNING: convertGuidListToMap:guidDup map size: 417743 (in bytes: 21371209)  # of zero dups: 411520 (in bytes: 21053834) initialization took 61ms investigate account acctName hostName for reindex to improve performance
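
To track these warnings across many accounts, the counters can be pulled out of the log line with a regular expression. The Python sketch below is a rough illustration: the pattern is derived only from the single sample line above, so it is an assumption about the message format, and the function name parse_guid_dup_warning is hypothetical.

```python
import re

# Pattern derived only from the single sample line quoted above; other ISS
# versions may format this message differently.
GUID_DUP_RE = re.compile(
    r"guidDup map size: (?P<records>\d+) \(in bytes: (?P<bytes>\d+)\)\s+"
    r"# of zero dups: (?P<zero>\d+) \(in bytes: (?P<zero_bytes>\d+)\)"
)


def parse_guid_dup_warning(line):
    """Return the four counters from a convertGuidListToMap warning, or None."""
    match = GUID_DUP_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "Wed Oct 28 21:02:29 PDT 2015 com.sun.comms.iss.store.ContentDupManager  "
    "convertGuidListToMap WARNING: convertGuidListToMap:guidDup map size: 417743 "
    "(in bytes: 21371209)  # of zero dups: 411520 (in bytes: 21053834) "
    "initialization took 61ms investigate account acctName hostName "
    "for reindex to improve performance"
)

stats = parse_guid_dup_warning(sample)
zero_share = stats["zero"] / stats["records"]
print(f"{stats['records']} records, {zero_share:.1%} are zero dups")
```

For the sample line, roughly 98.5% of the records are reported as zero dups.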

Also, the --accountinfo output allows you to see how many CopyDup records there are:

 Group #   # accounts  status username     hostname   foldername
    4511    1        pno:07    meta index:   73M          content index:  1.1G  
                       A      user (id# 4413) host    (id# 1)(cd#12430)

The cd# at the end of the line indicates the number of CopyDup records in the account.  This also explains why you sometimes see folders with counts like these:

                45/26          E TRADE              (id# 68)

The first number is the number of meta documents in the folder (i.e., the number of emails) and the number after the "/" is the number of content documents in the folder. The difference is caused by CopyDup records for the missing content documents.  When the number of meta documents equals the number of content documents, only a single number is displayed. So from the --accountinfo output you can get an idea about where CopyDup records are used.
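
The two indicators described above (the cd# value on the account line and the meta/content counts on a folder line) can be extracted with a short script. This Python sketch is tentative: both patterns are inferred from the sample lines shown in this article, not from a documented output format, and the function names are hypothetical. The folder pattern is meant to be applied to folder lines only.

```python
import re

# Both patterns are inferred from the sample lines quoted above and are
# assumptions about the --accountinfo layout, not a documented format.
CD_RE = re.compile(r"\(cd#\s*(\d+)\)")
FOLDER_COUNT_RE = re.compile(r"^\s*(\d+)(?:/(\d+))?\s")


def copydup_count(account_line):
    """Return the cd# value from an account line, or None if absent."""
    match = CD_RE.search(account_line)
    return int(match.group(1)) if match else None


def folder_counts(folder_line):
    """Return (meta_docs, content_docs) for a folder line, or None."""
    match = FOLDER_COUNT_RE.match(folder_line)
    if match is None:
        return None
    meta = int(match.group(1))
    # A single number means the two counts are equal.
    content = int(match.group(2)) if match.group(2) else meta
    return meta, content


account_line = "                       A      user (id# 4413) host    (id# 1)(cd#12430)"
folder_line = "                45/26          E TRADE              (id# 68)"

meta, content = folder_counts(folder_line)
print(copydup_count(account_line))    # 12430
print(meta, content, meta - content)  # 45 26 19
```

In the sample folder, 19 of the 45 emails have no content document counted in that folder, which is the difference attributed to CopyDup records.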

We have seen data from some customer systems indicating that CopyDup records account for over half the total size of the dIndex. This can cause all reads and writes to the dIndex to take longer. As the log message above shows, most CopyDup records point to only one email (reported there as "zero dups"). If the account were re-indexed, all its CopyDup records would be deleted, and only the few content documents that are shared by multiple emails would be duplicated in the content index. The smaller dIndex would speed up processing for all accounts, while the larger content index would make searching and updating that account somewhat slower.


Possible future support

We have implemented some features to automate detecting and re-indexing accounts with many "zero dup" CopyDup records. We have also worked on moving the CopyDup records from the dIndex to the meta index of each individual account, to reduce dIndex overhead and improve performance. Neither of these capabilities is currently ready for production release, because we have needed to focus our efforts on other priorities.


 
