Find similar documents
Conditions
You have executed a search and opened a document in ZyVIEW. You can retrieve similar documents or detect near-duplicates. The method differs slightly depending on the type of index in use (HAPI or TBIE).
Instructions
ZyVIEW takes the first N words of the document and uses them to formulate a quorum search that finds documents containing at least M (M <= N) of these N words. By default, N is 60% of the number of words in the document, limited by a maximum text buffer of 500 characters, and M is 60% of N.
N can be set through the precision parameter, relative to the file size, and M can be set through the recall parameter. A large precision returns few documents, but each contains many matching words. A large recall yields a small M, so more documents are returned. Recall and precision trade off against each other when retrieving information from the index. Searching with a large precision typically suits someone who wants to find relevant information but does not mind if one or more documents are missed, e.g. a journalist looking for background information for a story. Searching with a large recall typically suits someone who wants to find all documents related to a certain topic, e.g. a lawyer who needs to find every piece of evidence in a court case.
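The quorum search described above can be illustrated with a minimal sketch. This is not ZyVIEW's implementation: the function names `quorum_terms` and `matches_quorum`, the parameters `n_percent` and `m_percent`, the tokenization, and the way the 500-character buffer truncates the word list are all assumptions made for illustration.

```python
import re


def quorum_terms(text, n_percent=60, m_percent=60, max_chars=500):
    """Pick the first N words of a document and the quorum size M.

    Hypothetical helper: N is n_percent of the word count, cut off once
    the selected words no longer fit in max_chars (an assumed reading of
    the text buffer limit). M is m_percent of N, rounded up.
    """
    words = re.findall(r"\w+", text.lower())
    n = (len(words) * n_percent) // 100

    selected = []
    used = 0
    for word in words[:n]:
        extra = len(word) + (1 if selected else 0)  # +1 for a separator
        if used + extra > max_chars:
            break
        selected.append(word)
        used += extra

    m = (len(selected) * m_percent + 99) // 100  # integer ceiling
    return selected, m


def matches_quorum(candidate_text, terms, m):
    """Return True if the candidate contains at least m of the terms."""
    candidate_words = set(re.findall(r"\w+", candidate_text.lower()))
    distinct_terms = set(terms)
    hits = sum(1 for term in distinct_terms if term in candidate_words)
    return hits >= min(m, len(distinct_terms))


source = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
terms, m = quorum_terms(source)
print(terms, m)
print(matches_quorum("alpha beta gamma delta foo bar", terms, m))
print(matches_quorum("alpha beta gamma", terms, m))
```

In this sketch, lowering `m_percent` plays the role of raising recall (a smaller M lets more documents match), while raising `n_percent` plays the role of raising precision.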
Limitations
Near-duplicates are not detected if a document does not contain enough text. For example, detection does not work for audio, video, and image files, or for small text documents.
Minimum requirements for successful detection
- It should be possible to perform text extraction on the document.
- It should be possible to generate at least the requested number of valid signatures from the text in the document (the default number of signatures is 16 in ZyFind). A valid signature consists of 3 different consecutive words, each at least 3 and at most 13 characters long. An example of a valid signature is "test1 test2 test3". An example of an invalid signature is "test 1 test", because the words are not unique and the number "1" has only one character.
- The distance between the first correct signature and the last should be at least 100 words.
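The signature requirements in the list above can be checked with a short sketch. It is an illustration only, under several assumptions that the text does not confirm: words are split on whitespace, signatures are taken from every overlapping window of three words, words are compared case-insensitively, and the distance is measured between the start positions of the first and last valid signatures. The function names are hypothetical.

```python
def is_valid_signature(words):
    """Check one candidate signature: 3 different words of 3 to 13 characters."""
    lowered = [word.lower() for word in words]  # assumed: case-insensitive
    return (
        len(lowered) == 3
        and len(set(lowered)) == 3
        and all(3 <= len(word) <= 13 for word in lowered)
    )


def has_enough_signatures(text, required=16, min_span=100):
    """Check whether a text meets the minimum detection requirements.

    required: requested number of valid signatures (16 by default).
    min_span: minimum distance in words between the first and last
              valid signature.
    """
    words = text.split()
    # Start index of every valid three-word window (assumed sliding window).
    starts = [
        i for i in range(len(words) - 2)
        if is_valid_signature(words[i:i + 3])
    ]
    if len(starts) < required:
        return False
    return starts[-1] - starts[0] >= min_span


print(is_valid_signature("test1 test2 test3".split()))
print(is_valid_signature("test 1 test".split()))
```

A text of only a few dozen words fails the span check even when every window is a valid signature, which matches the limitation that small text documents are not detected.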
Result
You have searched for similar documents. If similar documents are found, they will be presented in ZyRESULT.