Introduction to ZyLAB Search Language

In this ZyLAB Search Language Guide, we explain how to use the ZyLAB search language to search for one or more terms within a data set. The data set usually contains files in many forms. Not only the terms used in these text, image, or audio files can and should be searched, but also the information about these files (the metadata).

Search Techniques

Searches are prone to produce over-inclusive and under-inclusive results. Several search techniques are designed to resolve this issue. For example, wild card searches help you find common spelling variations and misspellings, and Boolean searches let you specifically include or exclude certain terms. View the ZyLAB search language techniques in the table below.

Please note that although some operators are written in capital letters, this is done only for clarity. The search engine does not differentiate between capital and lowercase letters.
Search Queries

A search query consists of one or more terms. Terms can be enhanced with Term Operators (Fuzzy/Wild Cards) and connected with Boolean or Proximity Operators. When using Boolean or Proximity Operators in your search query, group terms or phrases with round brackets to show the order in which the connections should be interpreted. Brackets are not required, but they can influence the outcome of a search: queries placed between brackets are processed first.

Order of precedence

When no brackets are used to define the order of precedence, the following search order is applied:

1. NOT
2. OR
3. W/n, P/n (these operators are of equal precedence)
4. AND
5. TO

Search Results Explained

Once a search query has been executed, a result list appears. Retrieved terms are highlighted in the files. To be found, a term must of course be present in the file. However, whether a term is retrieved also depends on the settings in the character map, the indexing structure, and the tokenizer.

The building blocks of a text file are characters, (hyphenated) terms, and phrases.

Characters are letters, numbers, or symbols such as %, @, &, ^, and *.

Terms are characters or words; they are unique entries in the dictionary with a separator on either side.

Phrases are two or more terms with no intervening operator.

Hyphenated terms (such as sugar-free) are two or more separate terms, connected with a hyphen. Each term within a hyphenated term has the same token id, given by the tokenizer.
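The order of precedence listed above can be made visible by adding the implied brackets to a query. The following is a minimal sketch, not ZyLAB's actual parser: the function name `add_brackets` is hypothetical, NOT is assumed to be a prefix operator (as in `NOT plum`), and term operators such as wild cards are passed through untouched as part of a term.

```python
import re

# Binary operators from lowest to highest precedence, following the list above.
# W/n and P/n share one level. NOT is handled separately as a prefix operator,
# which is an assumption of this sketch.
BINARY_LEVELS = [
    re.compile(r"TO"),
    re.compile(r"AND"),
    re.compile(r"[WP]/\d+"),
    re.compile(r"OR"),
]

# A token is a quoted string, a round bracket, or a run of other characters.
TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()]+')


def add_brackets(query):
    """Return the query with the implied brackets written out."""
    tokens = TOKEN_RE.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_level(level):
        nonlocal pos
        if level == len(BINARY_LEVELS):
            return parse_unary()
        left = parse_level(level + 1)
        # Operators are matched case-insensitively; quoted tokens never match.
        while peek() is not None and BINARY_LEVELS[level].fullmatch(peek().upper()):
            op = tokens[pos].upper()
            pos += 1
            right = parse_level(level + 1)
            left = f"({left} {op} {right})"
        return left

    def parse_unary():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of query")
        if tok.upper() == "NOT":
            pos += 1
            return f"(NOT {parse_unary()})"
        if tok == "(":
            pos += 1
            inner = parse_level(0)
            if peek() != ")":
                raise ValueError("missing closing bracket")
            pos += 1
            return inner
        pos += 1
        return tok

    result = parse_level(0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return result


print(add_brackets("apple AND pear OR plum"))    # (apple AND (pear OR plum))
print(add_brackets("(apple AND pear) OR plum"))  # ((apple AND pear) OR plum)
```

Because OR ranks above AND in the list, `apple AND pear OR plum` groups `pear OR plum` first; explicit brackets change that grouping.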
If a term or combination of terms in a file contains a hyphen, that term will be found even if you did not include the hyphen in your search query. For example, a search for 'email' or 'e mail' also finds 'e-mail'. However, a search for 'e-mail' retrieves only 'e-mail'. In addition, 'e mail' does not find 'email', and 'email' does not find 'e mail'.

The character map determines which characters are used to separate terms, which characters are indexed, which ones are used for punctuation, etc. All characters that can be recognized and searched on are listed in the character map. By default, some characters are not indexed and will not be found unless the default character map is adjusted. How characters are defined in the character map influences the outcome of a search. For example, when brackets are set to be separators, the following text is identified as 3 terms: 'most definite(ly)'.

Use quotes to search for operators or reserved characters. Examples: "and", "http://localhost/?id=10"

In addition to the characters that the character map defines as separators for the tokenizer, the tokenizer creates separators to mark the end of a sentence (EOS, disabled by default), the end of a paragraph (EOP), the end of a line (EOL), the end of a page (EOG), or the end of a document (EOD). You can search for the operators EOP, EOL, EOG and EOD.

Tip: When searching for EOD, the query returns all files with nothing highlighted. Since each file has an EOD token, it is an easy query to find all files in a data set.

The tokenizer extracts text from a file and produces tokens, based on the settings defined in the character map. A token can be anything between two separators. Tokens are the small identified parts that form or define a file. Tokens are not terms! For example, the parts of a hyphenated term all have the same token id, but are separate terms. And a separator (for example, EOD) can be a token, but not a term.
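The effect of the character map on term boundaries can be illustrated with a small sketch. This is not ZyLAB's tokenizer: the function `split_terms` and the separator sets passed to it are hypothetical stand-ins for whatever the character map defines.

```python
def split_terms(text, separators):
    """Split text into terms at every character listed in `separators`."""
    terms, current = [], []
    for ch in text:
        if ch in separators:
            # A separator closes the term collected so far, if any.
            if current:
                terms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        terms.append("".join(current))
    return terms


# Brackets defined as separators: three terms, as in the example above.
print(split_terms("most definite(ly)", separators=" ()"))
# ['most', 'definite', 'ly']

# Brackets not defined as separators: only two terms.
print(split_terms("most definite(ly)", separators=" "))
# ['most', 'definite(ly)']
```

The same text yields different terms under different separator settings, which is why a search that works on one index may return nothing on another.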
A token id is a natural number that gives the position of a token, assigned by the tokenizer. Token ids are used to determine the distance between terms. Separators do not have token ids.
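As a rough illustration of how token ids relate to distance, the sketch below numbers whitespace-separated tokens and lets the parts of a hyphenated term share one id. The helper names, the choice to start numbering at 1, and the use of whitespace as the only separator are assumptions made for this example; the actual ids depend on the tokenizer and character map settings.

```python
def assign_token_ids(text):
    """Return (term, token_id) pairs; parts of a hyphenated term share one id."""
    pairs = []
    for token_id, chunk in enumerate(text.split(), start=1):
        for term in chunk.split("-"):
            if term:
                pairs.append((term.lower(), token_id))
    return pairs


def distance(pairs, first, second):
    """Smallest token id difference between the two terms, or None if one is absent."""
    ids_first = [i for term, i in pairs if term == first.lower()]
    ids_second = [i for term, i in pairs if term == second.lower()]
    if not ids_first or not ids_second:
        return None
    return min(abs(a - b) for a in ids_first for b in ids_second)


pairs = assign_token_ids("This sugar-free drink tastes fine")
print(pairs)
# [('this', 1), ('sugar', 2), ('free', 2), ('drink', 3), ('tastes', 4), ('fine', 5)]
print(distance(pairs, "sugar", "free"))  # 0
print(distance(pairs, "free", "fine"))   # 3
```

In this sketch 'sugar' and 'free' are separate terms at the same position, so the distance between them is 0.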
An occurrence is the number of times a given term occurs in the data set. Occurrences will be highlighted in the files.