Introduction to ZyLAB Search Language

In this ZyLAB Search Language Guide, we explain how to use the ZyLAB search language to search for one or more terms within a data set. A data set usually contains files in many formats. Not only the terms used in these text, image or audio files can and should be searched, but also the information about the files (the metadata).

Search Techniques

Searches are prone to produce over- and under-inclusive results. Several search techniques are designed to resolve this issue. For example, wild card searches help you find common spelling variations and misspellings, and Boolean searches let you specifically include or exclude certain terms. View the ZyLAB search language techniques in the table below.

Please note that although some operators are written in capital letters, this is done only for clarity. The search engine does not differentiate between capital and lowercase letters.
It is possible to specify language-specific versions of these operators (e.g. English, German, French).

Boolean and Proximity Operators

- AND
- OR
- NOT
- TO
- IN fieldname{query}
- Within: W/n ; W/n/term ; /n,m/
- Precedes: P/n ; P/n/term
- Field Filter: fieldname=query
- Quorum: n of {term, term, ..}

Term Operators

- Fuzzy: ~n
- Wild Cards: ? ; * ; [character(s)] ; [character-range] ; [^] ; + ; {m,n} ; {m} ; {m,}
- Number Range: < ; <= ; = ; <> ; > ; >=
- Exclude List of Terms from Fuzzy/Wild Card Query: fuzzy/wild card query - {exclude_term_1, ..., exclude_term_n}

Search Queries

A search query consists of one or more terms. Terms can be enhanced with Term Operators (Fuzzy/Wild Cards) and connected with Boolean or Proximity Operators. When you use Boolean or Proximity Operators in your search query, group terms or phrases with round brackets to show the order in which the connections should be interpreted. Brackets are not required, but they can influence the outcome of a search, because queries placed between brackets are processed first.
For example, searching for 'cars OR NOT used cars' will return different results than searching for 'cars OR NOT (used cars)'. The first query will return 'cars' and all terms in front of 'cars', except 'used'. The second query will only find the term 'cars'.

Order of precedence

When no brackets are used to define the order of precedence, the following search order is applied:

1. NOT

2. OR

3. W/n, P/n (these operators are of equal precedence)

4. AND

5. TO
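The precedence list above can be made concrete with a small sketch that adds the brackets the default search order implies. This is not ZyLAB code. The function name add_brackets is hypothetical, NOT is assumed to be a prefix operator, operators of equal precedence are assumed to group from left to right, and every operand is assumed to be a single term (phrases, existing brackets and the W/n/term form are left out).

```python
import re

# Binary operators; a higher number binds tighter. NOT is handled separately
# as a prefix operator (an assumption) and binds tightest of all.
BINARY_PRECEDENCE = {"OR": 4, "AND": 2, "TO": 1}
PROXIMITY = re.compile(r"^[WP]/\d+$", re.IGNORECASE)


def binary_precedence(token):
    """Return the precedence of a binary operator, or None for anything else."""
    if PROXIMITY.match(token):
        return 3  # W/n and P/n share one precedence level
    return BINARY_PRECEDENCE.get(token.upper())


def add_brackets(query):
    """Return the query with brackets that spell out the default search order."""
    tokens = query.split()
    pos = 0

    def parse_operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("expected a term")
        token = tokens[pos]
        pos += 1
        if token.upper() == "NOT":
            return f"(NOT {parse_operand()})"
        if binary_precedence(token) is not None:
            raise ValueError(f"expected a term, found operator {token!r}")
        return token

    def parse_expression(min_precedence):
        nonlocal pos
        left = parse_operand()
        while pos < len(tokens):
            precedence = binary_precedence(tokens[pos])
            if precedence is None or precedence < min_precedence:
                break
            operator = tokens[pos]
            pos += 1
            # The right-hand side may only contain tighter-binding operators.
            right = parse_expression(precedence + 1)
            left = f"({left} {operator} {right})"
        return left

    result = parse_expression(1)
    if pos < len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r} (phrases are not supported)")
    return result


print(add_brackets("a AND b OR c"))       # (a AND (b OR c))
print(add_brackets("a W/5 b AND NOT c"))  # ((a W/5 b) AND (NOT c))
```

Because OR ranks above AND in the list, 'a AND b OR c' groups as 'a AND (b OR c)', which may differ from what users of other search languages expect.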

Search Results Explained

Once a search query has been executed, a result list appears. Retrieved terms are highlighted in the files. Of course, to be found, terms need to be present in the file. However, whether a term is retrieved also depends on the settings in the character map, the indexing structure and the tokenizer.
The tokenizer processes all files based on the character map. This section explains how.

The building blocks of a text file are characters, (hyphenated) terms and phrases. Characters are letters, numbers or symbols such as %, @, &, ^ and *. Terms are characters or words; they are unique entries in the dictionary with a separator on either side. Phrases are two or more terms with no intervening operator. Hyphenated terms (such as sugar-free) are two or more separate terms connected with a hyphen. The terms that make up a hyphenated term all receive the same token id from the tokenizer. In the example below, x marks a token without a token id.

Token    | I | like | sugar | free | food | EOS | EOD
Token id | 1 | 2    | 3     | 3    | 4    | x   | x
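The numbering in the token example above can be reproduced with a short sketch. It is an illustration only, not the ZyLAB tokenizer: the function name assign_token_ids is hypothetical, whitespace is assumed to be the only separator in the input, and the EOS and EOD markers are simply appended with no token id (None stands for the x in the example).

```python
def assign_token_ids(sentence):
    """Return (token, token_id) pairs; separators get None instead of an id."""
    pairs = []
    for token_id, word in enumerate(sentence.split(), start=1):
        # Every part of a hyphenated term shares the id of the whole word.
        for part in word.split("-"):
            if part:
                pairs.append((part, token_id))
    # End-of-sentence and end-of-document markers carry no token id.
    pairs.append(("EOS", None))
    pairs.append(("EOD", None))
    return pairs


for token, token_id in assign_token_ids("I like sugar-free food"):
    print(token, "x" if token_id is None else token_id)
```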

If a term or combination of terms you are searching for contains a hyphen, that term will be found, even if you did not include a hyphen in your search query. For example, when you search for 'email' or 'e mail', it will also find 'e-mail'. However, 'e-mail' will only retrieve 'e-mail'. Also, 'e mail' will not find 'email' or the other way around ('email' will not find 'e mail').
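The hyphen rules above can be summarized as a small matching function. This is a sketch of the behaviour described in this guide, not of the search engine itself, and the name matches is hypothetical.

```python
def matches(query, text_term):
    """Return True if `query` would find `text_term` under the hyphen rules above."""
    query = query.lower()
    text_term = text_term.lower()
    if "-" in query:
        # A hyphenated query only retrieves the hyphenated form.
        return query == text_term
    if query == text_term:
        return True
    if "-" not in text_term:
        # 'email' and 'e mail' do not find each other.
        return False
    # A query without a hyphen also finds the hyphenated spelling.
    return query in (text_term.replace("-", ""), text_term.replace("-", " "))


print(matches("email", "e-mail"))   # True
print(matches("e mail", "e-mail"))  # True
print(matches("e-mail", "email"))   # False
print(matches("e mail", "email"))   # False
```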

The character map determines which characters are used to separate terms, which characters are indexed, which ones are used for punctuation, etc. All characters that can be recognized and searched on are listed in the character map. By default, some characters are not indexed and will not be found unless the default character map is adjusted. How characters are defined in the character map influences the outcome of a search. For example, when brackets are set to be separators, the text 'most definite(ly)' is identified as 3 terms: 'most', 'definite' and 'ly'.
For more information on the character map and how to configure it, please contact support (http://support.zylab.com).
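To illustrate how the choice of separators changes the terms that are identified, here is a minimal sketch. The function split_terms and its separators argument are hypothetical stand-ins for the character map, which in practice covers much more than a list of separator characters.

```python
import re


def split_terms(text, separators):
    """Split `text` into terms on any of the given separator characters."""
    if not separators:
        return [text] if text else []
    pattern = "[" + re.escape(separators) + "]+"
    return [term for term in re.split(pattern, text) if term]


# Brackets configured as separators: three terms.
print(split_terms("most definite(ly)", " ()"))  # ['most', 'definite', 'ly']
# Only the space is a separator: two terms.
print(split_terms("most definite(ly)", " "))    # ['most', 'definite(ly)']
```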

Use quotes to search for operators or reserved characters. Examples: "and", "http://localhost/?id=10"
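When queries are built by a script, a helper like the one below could add the quotes automatically. This is only a sketch under stated assumptions: the guide does not list the reserved characters, so the set used here is a guess based on the symbols in the operator table, the list of operator words is likewise assumed, and quote_if_needed is a hypothetical name. Terms that already contain a double quote are rejected, because escaping is not described in this guide.

```python
# Assumed operator words (searching is case-insensitive, so compare in upper case).
OPERATOR_WORDS = {"AND", "OR", "NOT", "TO", "IN"}
# Assumed reserved characters, taken from the symbols used by the operators.
RESERVED_CHARACTERS = set("?*[]{}~/<>=+^()")


def quote_if_needed(term):
    """Wrap `term` in double quotes if it looks like an operator or uses reserved characters."""
    if '"' in term:
        raise ValueError("terms containing double quotes are not handled by this sketch")
    is_operator = term.upper() in OPERATOR_WORDS
    has_reserved = any(ch in RESERVED_CHARACTERS for ch in term)
    return f'"{term}"' if is_operator or has_reserved else term


print(quote_if_needed("and"))                      # "and"
print(quote_if_needed("http://localhost/?id=10"))  # "http://localhost/?id=10"
print(quote_if_needed("cars"))                     # cars
```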

A period (.) is treated like a separator when defined as such in the character map, except when:

- the period is preceded and followed by a number ('0.1' is one term)
- the period is preceded by a space and followed by a number (' .1' is one term)
- the period is preceded and followed by one alphabetic character, which can be repeated ('A.B.C' is one term)
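The three exceptions above can be expressed as a small check. This sketch assumes that the period is configured as a separator and reads the third rule as "a single letter on each side of the period"; the function names are hypothetical and the real tokenizer may handle edge cases differently.

```python
def is_single_letter(text, index):
    """True if text[index] is a letter that is not part of a longer word."""
    if not (0 <= index < len(text)) or not text[index].isalpha():
        return False
    letter_before = index > 0 and text[index - 1].isalpha()
    letter_after = index + 1 < len(text) and text[index + 1].isalpha()
    return not letter_before and not letter_after


def period_is_separator(text, index):
    """Decide whether the period at text[index] separates two terms."""
    if text[index] != ".":
        raise ValueError("text[index] must be a period")
    previous = text[index - 1] if index > 0 else ""
    following = text[index + 1] if index + 1 < len(text) else ""
    # Rules 1 and 2: a number follows, and a number or a space precedes.
    if following.isdigit() and (previous.isdigit() or previous == " "):
        return False
    # Rule 3: single letters on both sides, as in 'A.B.C'.
    if is_single_letter(text, index - 1) and is_single_letter(text, index + 1):
        return False
    return True


print(period_is_separator("0.1", 1))        # False: '0.1' is one term
print(period_is_separator("A.B.C", 1))      # False: 'A.B.C' is one term
print(period_is_separator("end. Next", 3))  # True: the period ends a term
```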

In addition to the separators defined in the character map, the tokenizer creates separators to mark the end of a sentence (EOS, disabled by default), the end of a paragraph (EOP), the end of a line (EOL), the end of a page (EOG) and the end of a document (EOD). You can search for the operators EOP, EOL, EOG and EOD.

Tip: When searching for EOD, the query returns all files with nothing highlighted. Since each file has an EOD token, it is an easy query to find all files in a data set.

The tokenizer extracts text from a file and produces tokens, based on the settings defined in the character map. Tokens can be anything between two separators. Tokens are the identified small parts that form or define a file. Tokens are not terms! For example, hyphenated terms all have the same token id, but are separate terms. And a separator (for example, EOD) can be a token, but not a term.

A token id is the natural number or position of a token, given by the tokenizer. Token ids are used to determine the distance between the terms. Separators do not have token ids.

Token    | There | are | 5 | files | EOS | EOD
Token id | 1     | 2   | 3 | 4     | x   | x
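As an illustration of how token ids can be used to determine the distance between terms, the sketch below works on the example above. The helper names are hypothetical, and treating "within n" as "a distance of at most n token ids" is an assumption, not a statement about how the W/n operator is implemented.

```python
# The example above as (token, token_id) pairs; None means no token id.
TOKENS = [
    ("There", 1), ("are", 2), ("5", 3), ("files", 4),
    ("EOS", None), ("EOD", None),
]


def positions(tokens, term):
    """Return the token ids at which `term` occurs (case-insensitive)."""
    wanted = term.lower()
    return [tid for tok, tid in tokens if tid is not None and tok.lower() == wanted]


def min_distance(tokens, first, second):
    """Smallest difference in token ids between two terms, or None if one is missing."""
    first_ids = positions(tokens, first)
    second_ids = positions(tokens, second)
    if not first_ids or not second_ids:
        return None
    return min(abs(a - b) for a in first_ids for b in second_ids)


def within(tokens, first, second, n):
    """True if the two terms occur at most n token ids apart (assumed inclusive)."""
    distance = min_distance(tokens, first, second)
    return distance is not None and distance <= n


print(min_distance(TOKENS, "there", "files"))  # 3
print(within(TOKENS, "there", "files", 2))     # False
```

Separators such as EOS and EOD have no token id, so they never contribute to a distance.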

An occurrence is the number of times a given term occurs in the data set. Occurrences will be highlighted in the files.
