Find similar documents

Conditions

You have executed a search and opened a document in ZyVIEW. You can retrieve similar documents or detect near duplicates. The method differs slightly depending on the type of index in use (HAPI or TBIE).

Instructions

  1. Click the "Find similar documents" button.
    With a HAPI index, the Similarity Settings dialog appears (the settings are explained in steps 2 to 5). With a TBIE index, the Detect Near Duplicates dialog appears (see steps 6 to 9).

  2. In the Similarity Settings dialog, indicate whether you want to filter OCR errors, field values and/or numbers.
  3. If you want to set the Precision and Recall settings, click the Advanced button.

    ZyVIEW takes the first N words of the document and uses them to formulate a quorum search that finds documents containing at least M (<=N) of these N words. By default, N is 60% of the number of words in the document, with a maximum text buffer of 500 characters, and M is 60% of N.
    The Precision parameter sets the size of N relative to the file size, and the Recall parameter sets M. With a large precision, few documents are returned, but each has many matching words. A large recall yields a small M, so more documents are returned. Precision and recall trade off against each other when retrieving information from the index. Searching with a large precision suits someone who wants to find relevant information but does not mind if one or more documents are missed, e.g. a journalist looking for background information for a story. Searching with a large recall suits someone who wants to find all documents related to a certain topic, e.g. a lawyer who needs to find every piece of evidence in a court case.

  4. After you have adjusted the settings, click OK.
  5. Click Find Similar Documents.
  6. When using a TBIE index, in the Detect Near Duplicates dialog, select a method. Currently only one method is available, the Characteristic query. This method will analyze the original document and select characteristic terms (or phrases) which will be compared with other documents.
  7. Define the Similarity Threshold. This is the percentage to which documents must be similar, where 0% is not similar at all and 100% is an exact duplicate.
  8. Define the Histogram Size. A histogram is a graphical representation of the distribution of data. A digital image can be defined by its width, height and bit depth: 1-bit is 2 gray levels (black and white), 2-bit is 4 gray levels, 4-bit is 16 gray levels, 8-bit is 256 gray levels, 16-bit is 65,536 gray levels, and 32-bit is 4,294,967,296 gray levels.
    The advantage of increasing the bit depth is that each pixel can represent a greater range of values and record measurements more precisely. The disadvantages are that doubling the bit depth doubles the memory and increases the storage needed for the image.
  9. Define the Signature Count. This is the number of signatures to compare when establishing the similarity of documents. Signatures are representations of large sets of strings (semantic units) from the documents.
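
The quorum search described under step 3 can be illustrated with a minimal Python sketch. It is not ZyVIEW's implementation: the function names, the word tokenizer, the rounding, the removal of repeated words, and the order in which the 500-character buffer and the 60% rule are applied are all assumptions made for illustration.

```python
import re

def build_quorum(text, n_percent=60, m_percent=60, max_chars=500):
    """Pick the first N distinct words of a document and the quorum size M.

    Assumptions (not confirmed by the product documentation):
    the text buffer is truncated first, words are lowercased,
    N is rounded down and M is rounded up.
    """
    buffer = text[:max_chars].lower()
    # dict.fromkeys keeps the first occurrence of each word, in order.
    words = list(dict.fromkeys(re.findall(r"\w+", buffer)))
    n = len(words) * n_percent // 100
    m = (n * m_percent + 99) // 100  # integer ceiling of n * m_percent / 100
    return words[:n], m

def matches_quorum(candidate_text, query_words, m):
    """Return True if the candidate contains at least M of the N query words."""
    candidate_words = set(re.findall(r"\w+", candidate_text.lower()))
    hits = sum(1 for word in query_words if word in candidate_words)
    return hits >= m

source = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
query_words, m = build_quorum(source)
print(query_words, m)
print(matches_quorum("alpha beta gamma delta plus other text", query_words, m))
print(matches_quorum("alpha beta and nothing else", query_words, m))
```

In this sketch, a lower `m_percent` returns more candidates (the effect the text attributes to a large recall), while a higher `n_percent` puts more words into the query.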

    Limitations
    Near duplicates are not detected if a document does not contain enough text. For example, detection does not work for audio, video and image files, or for small text documents.

    Minimum Requirements for successful detection
    - It should be possible to perform text extraction on the document.
    - It should be possible to generate at least the requested number of valid signatures from the text in the document (the default number of signatures is 16 in ZyFind). A valid signature contains 3 different words following each other. Each word should be no longer than 13 characters and no shorter than 3 characters. An example of a correct signature is "test1 test2 test3". An example of an incorrect signature is "test 1 test", because the words are not unique and the number "1" has only one character.
    - The distance between the first correct signature and the last should be at least 100 words.
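
The minimum requirements above can be expressed as a short Python check. This is a sketch under stated assumptions, not the product's algorithm: the function names are hypothetical, words are split on whitespace, signatures are taken as overlapping three-word windows, and the 100-word distance is measured between the starting positions of the first and last valid signatures.

```python
def is_valid_signature(words):
    """Three different consecutive words, each 3 to 13 characters long."""
    return (
        len(words) == 3
        and len(set(words)) == 3
        and all(3 <= len(word) <= 13 for word in words)
    )

def meets_minimum_requirements(text, required_signatures=16, min_distance=100):
    """Check the signature count and the first-to-last distance in words.

    Assumptions: whitespace tokenization, overlapping windows, and
    distance measured between window start positions.
    """
    words = text.split()
    valid_starts = [
        i for i in range(len(words) - 2)
        if is_valid_signature(words[i:i + 3])
    ]
    if len(valid_starts) < required_signatures:
        return False
    return valid_starts[-1] - valid_starts[0] >= min_distance

print(is_valid_signature("test1 test2 test3".split()))  # valid example from the text
print(is_valid_signature("test 1 test".split()))        # invalid example from the text

long_text = " ".join(f"word{i}" for i in range(200))
short_text = " ".join(f"word{i}" for i in range(50))
print(meets_minimum_requirements(long_text))
print(meets_minimum_requirements(short_text))
```

The 50-word text in this sketch yields enough valid signatures but fails the distance requirement, which matches the note that small text documents are not detected.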

Result

You have searched for similar documents. If similar documents are found, they will be presented in ZyRESULT.
