120-260 MetaServer Extract – Find Word with Mask / Words

01 What is a Line, Word Group and Word?

A line of text is all the text on the same horizontal line. All the text in the green box below is located on the same line.

Word Groups are clusters of words separated by large spaces or TABs. As you can see in the example image, the word groups are marked in pink. In MetaServer, a TAB is represented by a → character.

The line of text marked in green contains two word groups and is extracted by MetaServer as follows:
Customer ID: 173002→Req Date/ Time: 01/16/15 UPS

Words are separated with spaces. In our example, we marked some words in blue.

In conclusion, a line consists of 1 or more word groups, and a word group consists of 1 or more words.
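The hierarchy can be illustrated with a short Python sketch. This is not MetaServer code: it assumes that the → character marks a TAB, as in the extracted line above, and that words are separated by spaces. It ignores word groups separated by large spaces. The function name `split_line` is made up for this illustration.

```python
# Illustrative only: mimics the line -> word group -> word hierarchy.
TAB_MARK = "→"  # MetaServer displays a TAB as this character


def split_line(line):
    """Return a list of word groups, each group being a list of words."""
    groups = [g.strip() for g in line.split(TAB_MARK) if g.strip()]
    return [g.split() for g in groups]


line = "Customer ID: 173002→Req Date/ Time: 01/16/15 UPS"
word_groups = split_line(line)
for group in word_groups:
    print(group)
```

The example line yields two word groups, the first with 3 words and the second with 5 words.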

With MetaServer’s Find Word with Mask / Words rule, you can find specific words or words matching a mask. It’s frequently combined with a Find Line with Line number or Find Word Group rule.

The Find Word with Mask / Words rule is very useful when you need to extract data from documents that don’t have a fixed format. As an example, let’s take mortgage redemption letters. All letters have a reference number, account number, date, etc., but each lawyer places the data in a different location.

You typically define an Extract Text rule first to add the full text of the document in an index field we call Text Block or Full Text. Next, you define a Find Word Group rule to filter the text and add a Find Word with Mask / Words rule to only keep the required word(s).

In our example, we will make use of the “CB – FLOATING DATA” workflow. This workflow is automatically installed with CaptureBites MetaServer.

We want to extract the account number from redemption letters. The location varies with each lawyer, but the account numbers all have the same format. They all consist of 13 characters, start with 2 letters and end with a digit. The first separator is a slash “/” or dash “-”, the second separator is always a dash “-”.

You can find the account number by defining a mask in our Find Word with Mask / Words rule. We will explain this in more detail later.

Next, you add a Replace Text rule to replace all slashes with dashes and change the account number to a consistent format.

We will only focus on the Find Word with Mask / Words rule. For the full logic, please have a look at the “CB – FLOATING DATA” workflow setup which is included in the MetaServer installer.

Find Word with Mask / Words rules are defined in a MetaServer Extract or Separate Document / Process Page action.

To add this rule, press the Add button and select Find –> Word –> with Mask / Words.

First, add a description to your rule. Then, select a field to hold the extracted data. In this case, we select the field “Account Number”.

01 – Source field: press the drop-down arrow to select the source field. This is the field containing the text you want to parse to find the words containing the required data.

1) Match whole word: only returns words exactly matching the defined mask or word(s). When disabled, it will also return words containing the accepted words or mask. For example: with “Match whole word” disabled and when searching for “apple”, it will also return words like “pineapple”.

2) Match case: enable this option to make the search Case Sensitive. If you search for “Law Center”, for example, it will only return that word if it’s found in the exact same case. Disable the option to also find “law center”, “LAW CENTER”, “Law center” etc.
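The effect of both options can be sketched in Python. This is an approximation and not MetaServer’s implementation: it assumes words are split on whitespace, so it only handles single-word searches, and `find_words` is a made-up name.

```python
import re


def find_words(text, target, whole_word=True, match_case=False):
    """Return the words in text that match target under the two options."""
    flags = 0 if match_case else re.IGNORECASE
    pattern = re.compile(re.escape(target), flags)
    hits = []
    for word in text.split():
        if whole_word:
            # The complete word must equal the target.
            if pattern.fullmatch(word):
                hits.append(word)
        elif pattern.search(word):
            # The target may appear anywhere inside the word.
            hits.append(word)
    return hits


text = "apple pineapple Apple"
print(find_words(text, "apple", whole_word=True, match_case=True))
print(find_words(text, "apple", whole_word=False, match_case=True))
print(find_words(text, "apple", whole_word=True, match_case=False))
```

With “Match whole word” disabled, “pineapple” is returned as well. With “Match case” disabled, “Apple” is returned as well.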

Masks are used to search for words matching a specific format, also known as a regular expression. You don’t need to write complex regular expressions, however: MetaServer offers an easy-to-use format pick list, and you construct your mask by selecting the elements you need.

Example: the account numbers on our mortgage letters, like “SB-43964718-1” or “AP/22779911-1”, start with 2 letters, followed by a separator that can be either a slash “/” or dash “-”, 8 digits, a dash “-” as the 2nd separator, and end with a digit.

If we translate that into a mask, it would be { A, 2 }{ C }{ 9, 8 }-{ 9 } with a minimum length of 13. Each mask syntax element is described in more detail below.

The Reject mask is used to skip words matching the defined Reject mask. For example, if you want to skip all account numbers that start with “ZZ”, you could define a reject mask starting with ZZ.

TIP: When you have both Accept and Reject masks defined in a single rule, all the words matching the reject mask are eliminated first. Then, the remaining words are used to look for the ones matching the Accept mask.
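For readers who know regular expressions, the order of evaluation can be sketched in Python. The two patterns are hand-written approximations of an Accept mask for the account number and a Reject mask starting with “ZZ”. They are assumptions for illustration, not what MetaServer generates internally, and `filter_words` is a made-up name.

```python
import re

# Approximation of the accept mask: 2 letters, "-" or "/", 8 digits, "-", 1 digit.
ACCEPT = re.compile(r"[A-Za-z]{2}[-/][0-9]{8}-[0-9]")
# Hypothetical reject mask: any word starting with "ZZ".
REJECT = re.compile(r"ZZ.*")


def filter_words(words):
    # Step 1: eliminate every word that matches the reject mask.
    remaining = [w for w in words if not REJECT.fullmatch(w)]
    # Step 2: from what is left, keep the words that match the accept mask.
    return [w for w in remaining if ACCEPT.fullmatch(w)]


words = ["Account:", "ZZ-11111111-1", "SB-43964718-1", "AP/22779911-1"]
kept = filter_words(words)
print(kept)
```

“ZZ-11111111-1” has the right format but is eliminated in step 1, so only the last two words are returned.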

01 – Accept / Reject mask: here you define the masks. Words matching the Reject mask will be eliminated, words matching the Accept mask will be kept. Both masks use the same setup method.

By pressing the dropdown button, you can select different format types to compose your mask. You can even add a field to your mask, so it can change dynamically based on that field value.

1) Clear: clears the mask.

2) My text here: an example text. You can overwrite it with your own text. Use it if your masks consist of fixed characters. It’s also possible to type fixed text directly in the mask’s input box.

3) Any character: shown as { ? }, any character is allowed.

4) A letter: shown as { A }, any letter is allowed, both upper and lower case. If you want to only accept a specific case, you can use a custom character.

5) A letter or digit: shown as { X }, any letter or single digit is allowed. If you also want to allow periods, hyphens, commas, etc., you need to use the { ? } “Any character” type.

6) A digit: shown as { 9 }, any single digit is allowed.

7) A custom character: shown as { C }, only allows a list of defined characters. You can define these in the Custom Character Setup. Press the “…” button next to the Accept or Reject Mask to set up your custom characters.

The Custom Character Setup window opens…

The custom character definition above only allows a “-” or “/” for every C element in your mask.

1) Valid characters: you can choose if the custom character also allows uppercase letters, lowercase letters or digits.

2) Others: Here you can add, delete or modify specific custom characters. In the example above, a custom character can only be a “-” or “/”.

8) Any 5 … : the number 5 is just an example, replace the 5 with the number of characters you want.

For example: { ?, 6 } means any 6 characters, { A, 2 } means 2 letters, { X, 5 } means 5 letters or digits, etc.
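If you are used to regular expressions, the mask elements can be read as follows. The Python sketch below translates a mask into a regular expression. It is a rough equivalent built for this guide, not MetaServer’s own parser: the function name `mask_to_regex` is hypothetical, letters are assumed to be A-Z and a-z only, and the custom character { C } is assumed to allow “-” or “/” unless you pass other characters.

```python
import re

# Regex fragment for each mask element type described above.
CLASSES = {"?": ".", "A": "[A-Za-z]", "X": "[A-Za-z0-9]", "9": "[0-9]"}
# Matches one element such as "{ A }" or "{ 9, 8 }".
ELEMENT = re.compile(r"\{\s*([?AX9C])\s*(?:,\s*(\d+)\s*)?\}")


def mask_to_regex(mask, custom_chars="-/"):
    """Translate a mask string into a regular expression string."""
    classes = dict(CLASSES, C="[" + re.escape(custom_chars) + "]")
    parts, pos = [], 0
    for m in ELEMENT.finditer(mask):
        parts.append(re.escape(mask[pos:m.start()]))  # fixed text before the element
        count = m.group(2) or "1"                     # "{ 9 }" means exactly one
        parts.append(classes[m.group(1)] + "{" + count + "}")
        pos = m.end()
    parts.append(re.escape(mask[pos:]))               # fixed text after the last element
    return "".join(parts)


pattern = mask_to_regex("{ A, 2 }{ C }{ 9, 8 }-{ 9 }")
for word in ["SB-43964718-1", "AP/22779911-1", "4B/15687945-2"]:
    print(word, bool(re.fullmatch(pattern, word)))
```

The first two words match the mask. The third does not, because it starts with a digit.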

02 – Minimum length: If you only want to read a part of the mask, set the minimum length lower than the total length of the mask.

To explain how the Minimum length setting works, consider the mask { A, 2 }{ C }{ 9, 8 }-{ 9 } with a minimum length of 13 and “Match whole word” enabled.

Examples:

AB/15687945-2:
OK because the number of characters is greater than or equal to the defined minimum of 13 and the value starts with 2 letters followed by a custom character (“/” in this case), 8 digits, a dash and a single digit.

AB/157945-2:
NOT OK because the number of characters (11) is smaller than the defined minimum length of 13.

AB/156870945-02:
NOT OK because the word is longer (15 characters) than the total length of the defined mask (13 characters). If you want to accept words containing more characters, you would need to disable “Match whole word”.

4B/15687945-2:
NOT OK because it contains a digit instead of a letter in the first 2 characters and therefore does not comply with the defined mask.
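The four examples can be reproduced with a small Python sketch. It reflects one reading of the rules above and is not MetaServer’s actual logic: a word is accepted when its length lies between the minimum length and the mask length, and every character fits the mask element at the same position. The { C } element is assumed to allow “-” or “/”, and all names are made up for the illustration.

```python
def is_letter(c):
    return c.isascii() and c.isalpha()


def is_digit(c):
    return c.isascii() and c.isdigit()


# One character test per position of the mask { A, 2 }{ C }{ 9, 8 }-{ 9 }.
MASK = (
    [is_letter] * 2
    + [lambda c: c in "-/"]   # { C }: assumed custom characters
    + [is_digit] * 8
    + [lambda c: c == "-"]    # fixed text
    + [is_digit]
)
MIN_LENGTH = 13


def check(word):
    """Return 'OK' or 'NOT OK', with 'Match whole word' enabled."""
    if not MIN_LENGTH <= len(word) <= len(MASK):
        return "NOT OK"
    fits = all(test(c) for test, c in zip(MASK, word))
    return "OK" if fits else "NOT OK"


examples = ["AB/15687945-2", "AB/157945-2", "AB/156870945-02", "4B/15687945-2"]
verdicts = {word: check(word) for word in examples}
for word, verdict in verdicts.items():
    print(word, verdict)
```

Under this reading, lowering `MIN_LENGTH` would also accept words that only match the first part of the mask.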

Here you can specify words that should be extracted (Accept) or skipped (Reject).

Document types or separator keywords, as in the CB – INSPECTIONS REPORTS example, are common cases where you would need to find fixed words.

1) Accept: return the specified word(s).

2) Reject: reject the specified word(s).

Note: Spaces also count as characters.

With the “Accept words from database” option, you can maintain a list of Accept Words outside MetaServer using an external database.

Enable the “Accept words from database” option and press the Setup button.

The Accept words from database Setup window opens…

Here you can select the database table and column containing the Accept Words. Any changes you make in the selected database table are automatically applied.

01 – Hide digits after decimal: in Oracle databases accessed through ODBC, NUMBER(14) columns hold integers but are reported as DECIMAL. Enable this option to prevent an order number like 123456 from being returned as 123456.00.

This can also be useful if ID numbers are stored as a NUMERIC or DECIMAL data type in MS-SQL.

NOTE: thousand separators are never displayed in looked-up decimal values, regardless of whether the option is enabled or not.

This only applies to the lookup field, not the mapped fields.
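The effect of the “Hide digits after decimal” option can be pictured with a few lines of Python. This is only a sketch of the visible result. It assumes the digits are dropped rather than rounded, which the text above does not state, and the function name is made up.

```python
from decimal import Decimal, ROUND_DOWN


def hide_digits_after_decimal(value):
    """Return the looked-up value without its decimal part (assumed: no rounding)."""
    return str(Decimal(value).to_integral_value(rounding=ROUND_DOWN))


shown = hide_digits_after_decimal("123456.00")
print(shown)
```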

Here you select your database type:

ODBC

ODBC is a standard to connect to a wide range of databases. Configuring an ODBC data source is straightforward. For detailed instructions on how to define an ODBC data source, please have a look at this guide.

IMPORTANT: It is not possible to connect to a “single-access” database type (e.g. Excel). This technical limitation applies to database lookup during both extraction and validation. It does not apply to exporting to a database.

This is an overview of the ODBC settings:

1) Data source: select the data source you want to use. An ODBC data source needs to be defined first using the ODBC Data Source Administrator tool in Windows. To find step by step instructions how to define an ODBC Data Source in Windows, have a look here.

Select data source from field: you can use this option to switch databases dynamically using a field value.

To access this setup window, press the “…” button next to Data Source. You can select the field containing your database name by pressing the drop-down arrow.

Be aware that when you use this feature, all possible databases that can be loaded, must share the same table name and schema.

2) User name & Password: some databases require a login. If so, enter the user name and password in these fields.

3) Timeout: when the database does not respond in the specified time, the action will fail.

Log: enable this option to create a log file each time the database is called. This option is typically used during testing.

On the client side, you can find the information in the following folder:
C:\ProgramData\CaptureBites\Programs\Admin\Data\Log

On the server side, you can find the information in the following folder:
C:\ProgramData\CaptureBites\Programs\MetaServer\Data\Log

SQL Server

When you use a direct connection, it’s not required to set up an ODBC data source. Because the communication with the SQL server is direct, searching and updating SQL tables is more efficient.

NOTE: If you change the connection type from ODBC or a MetaServer Database to Direct SQL and you connect to the same table with identical field names, the mappings are preserved.

MetaServer

A MetaServer database is a shared CSV database. It doesn’t require any ODBC sources on any of the clients and is very easy to deploy. The MetaServer DB settings are very similar to the MS-SQL and ODBC settings.

For help on how to create a MetaServer database, you can refer to this guide.

In the Field Mapping tab, you can map MetaServer fields with database fields.

NOTE: Field mapping is only possible if you select “Keep first match” or “Keep last match” in the Value settings.

On the left side, all the fields of your current workflow are listed, except your lookup field. When the value of your lookup field is found in the database, MetaServer maps the corresponding fields.

In our example, if the Inspector name is found, we want to map the corresponding Inspector ID (called “Number” in the database).

01 – If multiple hits: when your database table contains more than 1 matching record, you can choose between the following options:

1) Skip mapping: if there are multiple hits, don’t map the other fields.

2) Keep first match: map the fields with their first match.

3) Keep last match: map the fields with their last match.

4) Keep all matches: map the fields with every matching value.

5) Keep unique matches: map the fields with every unique, matching value (no duplicates).
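The five options can be summarized in a short Python sketch. The mode names and the function `resolve_hits` are invented for this illustration, and keeping unique values in first-seen order is an assumption. The sketch only shows which values would be mapped for one field when several records match.

```python
def resolve_hits(values, mode):
    """Pick the value(s) to map when a lookup returns several records."""
    if len(values) <= 1:
        return list(values)                  # no multiple hits: nothing to decide
    if mode == "skip":
        return []                            # 1) Skip mapping
    if mode == "first":
        return values[:1]                    # 2) Keep first match
    if mode == "last":
        return values[-1:]                   # 3) Keep last match
    if mode == "all":
        return list(values)                  # 4) Keep all matches
    if mode == "unique":
        return list(dict.fromkeys(values))   # 5) Keep unique matches, first-seen order
    raise ValueError(f"unknown mode: {mode}")


hits = ["1001", "1002", "1001"]  # example values from three matching records
for mode in ["skip", "first", "last", "all", "unique"]:
    print(mode, resolve_hits(hits, mode))
```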

02 – Digits after decimal: the precision of decimals can be changed between 0 and 9 digits after the decimal.

By default, MetaServer uses the decimal precision as set in the Windows regional settings, which is typically 2 digits after decimal. This setting affects the precision of the following database field types: Choice, Decimal, Numeric, Number, Real and Float.

To create a MetaServer database, you simply create or copy a CSV file in:
C:\ProgramData\CaptureBites\Programs\MetaServer\Data\DB

The CSV file needs to comply with the following requirements:
1) The first line defines the column names
2) The following lines are data records
3) Fields are separated by “,” (comma) or “;” (semicolon)

Example of a basic CSV:

VENDOR_NAME,VENDOR_ID
ARROW ELECTRONICS,9492785400
Cisco WebEx LLC,8754441234
Dell,9598741234
Evernote Corporation,6584568754
K Software,8595140754
PremiumSoft CyberTech Ltd.,85224983422
Vivify Scrum,5554872315
WPForms LLC,8787775487

The “,” delimiter can also be a “;” delimiter.

Here is the example CSV again as seen in a CSV Viewer:

NOTE: the ; (semi-colon) delimiter is often used in Europe because the comma is commonly used as a decimal point in European countries.

If you use a comma-delimited CSV and you have values containing a comma, you need to put the value between double quotes. A value like 22500, Broadway would need to be quoted like “22500, Broadway” to prevent the comma in the address from being interpreted as a field separator.
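You can check how a CSV file will be split into fields with Python’s standard `csv` module. The sketch below is only a way to verify your own files. It does not show how MetaServer reads them, and the ADDRESS column is a hypothetical addition to the vendor example. Note that the quotes in the file itself must be straight double quotes.

```python
import csv
import io


def read_csv_db(text):
    """Parse CSV text, choosing "," or ";" based on the header line."""
    header = text.splitlines()[0]
    delimiter = ";" if header.count(";") > header.count(",") else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Comma-delimited: the quoted value keeps its comma.
comma_csv = 'VENDOR_NAME,ADDRESS\nDell,"22500, Broadway"\n'
# Semicolon-delimited variant of the same kind of file.
semi_csv = "VENDOR_NAME;VENDOR_ID\nDell;9598741234\n"

comma_rows = read_csv_db(comma_csv)
semi_rows = read_csv_db(semi_csv)
print(comma_rows[0]["ADDRESS"])
print(semi_rows[0]["VENDOR_ID"])
```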

Since MetaServer version 3.1.26, CSV databases are automatically converted to a SQLite database. The advantage is that the CSV data no longer has to be loaded into memory, which resolves memory issues without affecting performance.

The SQLite database is a file that lives next to the CSV files in the DB data folder. For example, if you copy over a MetaServer database called “zipcodes.csv”, the SQLite version is auto-generated and, after a few seconds, “zipcodes.sqlite” will appear alongside the original CSV.
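To give an idea of what such a conversion involves, here is a minimal Python sketch that loads CSV data into a SQLite table and looks up a value. It is a generic illustration using Python’s standard library, not the conversion MetaServer performs. The table name, the use of TEXT columns and the in-memory database are all assumptions.

```python
import csv
import io
import sqlite3


def csv_to_sqlite(csv_text, table, conn):
    """Load CSV text into a SQLite table, one TEXT column per CSV column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', records)
    conn.commit()


csv_text = "VENDOR_NAME,VENDOR_ID\nDell,9598741234\nK Software,8595140754\n"
conn = sqlite3.connect(":memory:")  # in-memory stand-in for a .sqlite file
csv_to_sqlite(csv_text, "vendors", conn)

# A lookup now queries the table instead of scanning the CSV in memory.
row = conn.execute(
    "SELECT VENDOR_ID FROM vendors WHERE VENDOR_NAME = ?", ("Dell",)
).fetchone()
print(row[0])
```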

The mechanism to create or update a MetaServer database remains the same as before.

IMPORTANT: When you need to make updates, we recommend editing only the CSV database file. Do not open or edit the SQLite file, to avoid locking or sync issues. The CSV file is considered the master database. Any changes made to the CSV are automatically applied to the SQLite database. This guarantees absolute compatibility with existing MetaServer workflows.

01 – Value: You can specify which words you want to keep. There are 3 options:

1) Keep all matches: this will return all words matching the defined Accept Mask or Accept Words and not matching the defined Reject Mask or Reject Word(s).

2) Keep first match: this will return the first word matching the defined Accept Mask or Accept Words and not matching the defined Reject Mask or Reject Word(s).

3) Keep last match: this will return the last word matching the defined Accept Mask or Accept Words and not matching the defined Reject Mask or Reject Word(s).

Note: If you want to output a specific word of many, like the 2nd word of 5 words, extract all words by keeping all matches and create an Edit / Set Field Value rule to extract the specific word using the Extract segments options.

02 – Overwrite: if enabled, the result will overwrite the previous field value. Otherwise, the result will be added to the value that is already in the field.

03 – Clear field if result is blank: if the result is blank, any values already in the selected field are cleared.

04 – Delete duplicates: this will delete all duplicate matches and the result will only consist of unique values.
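How the Overwrite, Clear field if result is blank and Delete duplicates options could interact is sketched below in Python. Several details are assumptions made for the illustration and are not confirmed by this guide: matches are joined with a space, an added result is appended after the existing value, and a blank result leaves the field untouched unless the clear option is enabled. The function name `store_result` is made up.

```python
def store_result(current, matches, overwrite=True, clear_if_blank=False,
                 delete_duplicates=False, separator=" "):
    """Return the new field value after a rule produced `matches`."""
    if delete_duplicates:
        matches = list(dict.fromkeys(matches))  # unique values, original order
    result = separator.join(matches)
    if not result:
        # Blank result: clear the field only when the option is enabled.
        return "" if clear_if_blank else current
    if overwrite or not current:
        return result
    # Overwrite disabled: add the result to the existing value.
    return current + separator + result


print(store_result("old", ["A", "B", "A"], delete_duplicates=True))
print(store_result("old", ["A"], overwrite=False))
print(repr(store_result("old", [], clear_if_blank=True)))
print(store_result("old", []))
```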

TIP: you can copy the current settings and paste them in another setup window of the same type. Do this by pressing the Settings button in the bottom left of the Setup window and by selecting Copy. Then open another setup window of the same type and select Paste.
