120-260 MetaServer Extract – Find Word with Mask / Words

01 Find Word with Mask / Words – Introduction

01 What is a Line, Word Group and Word?

A line of text is all the text on the same horizontal line. All the text in the green box below is located on the same line.

Word Groups are clusters of words separated by large spaces or TABs. As you can see in the example image, the word groups are marked in pink. TABs are represented by a → character in MetaServer.

The line of text marked in green contains two word groups and is extracted by MetaServer as follows:
Customer ID: 173002→Req Date/ Time: 01/16/15 UPS

Words are separated with spaces. In our example, we marked some words in blue.

In conclusion, a line consists of 1 or more word groups, and a word group consists of 1 or more words.
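The hierarchy described above can be illustrated with a minimal Python sketch. It is not MetaServer code: the function name `split_line` is hypothetical, and it assumes word groups have already been delimited by the → character (large spaces are not detected here).

```python
TAB_MARK = "→"  # MetaServer shows a TAB as this character


def split_line(line):
    """Split a line into word groups, and each word group into words.

    Assumption: word groups are delimited only by the TAB marker.
    """
    groups = [g for g in line.split(TAB_MARK) if g.strip()]
    return [g.split() for g in groups]


line = "Customer ID: 173002→Req Date/ Time: 01/16/15 UPS"
for words in split_line(line):
    print(words)
```

The example line yields two word groups: the first holds three words, the second holds five.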

02 What is MetaServer’s Find Word with Mask / Words rule?

With MetaServer’s Find Word with Mask / Words rule, you can find specific words or words matching a mask. It’s frequently combined with a Find Line with Line number or Find Word Group rule.

The Find Word with Mask / Words rule is very useful when you need to extract data from documents that don’t have a fixed format. As an example, let’s take mortgage redemption letters. All letters have a reference number, account number, date, etc., but the data is located in a different place for each lawyer.

You typically define an Extract Text rule first to put the full text of the document into an index field we call Text Block or Full Text. Next, you define a Find Word Group rule to filter the text and add a Find Word with Mask / Words rule to only keep the required word(s).

02 Find Word with Mask / Words – FLOATING DATA Case Study

In our example, we will make use of the “CB – FLOATING DATA” workflow. This workflow is automatically installed with CaptureBites MetaServer.

We want to extract the account number from redemption letters. The location varies with each lawyer, but the account numbers all have the same format. They all consist of 13 characters, start with 2 letters and end with a digit. The first separator is a slash “/” or a dash “-”; the second separator is always a dash “-”.

You can find the account number by defining a mask in our Find Word with Mask / Words rule. We will explain this in more detail later.

Next, you add a Replace Text rule to replace all slashes with dashes and change the account number to a consistent format.
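To make the two steps concrete (find by format, then normalize the separator), here is a rough Python equivalent. It only illustrates the logic and is not how MetaServer implements it: the regular expression, the function name `find_account_numbers` and the sample text are assumptions made for this sketch.

```python
import re

# 2 letters, a "/" or "-" separator, 8 digits, a "-" separator and 1 digit,
# bounded by whitespace (or the start/end of the text) on both sides.
ACCOUNT = re.compile(r"(?<!\S)[A-Za-z]{2}[-/]\d{8}-\d(?!\S)")


def find_account_numbers(text):
    """Return all account numbers, with slashes replaced by dashes."""
    return [m.group(0).replace("/", "-") for m in ACCOUNT.finditer(text)]


# Made-up sample text for illustration only.
text = "Our ref: AP/22779911-1 dated 01/16/15\nAccount SB-43964718-1 has been redeemed."
print(find_account_numbers(text))
```

Both account numbers are returned in the dash-only format, regardless of where they appear in the text.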

We will only focus on the Find Word with Mask / Words rule. For the full logic, please have a look at the “CB – FLOATING DATA” workflow setup which is included in the MetaServer installer.

03 Find Word with Mask / Words – Add Rule

Find Word with Mask / Words rules are defined in a MetaServer Extract or Separate Document / Process Page action.

To add this rule, press the Add button and select Find –> Word –> with Mask / Words.

04 Find Word with Mask / Words – Setup

First, add a description to your rule. Then, select a field to hold the extracted data. In this case, we select the field “Account Number”.

01 – Source field: press the drop-down arrow to select the source field. This is the field containing the text you want to parse to find the words containing the required data.

1) Match whole word: only returns words exactly matching the defined mask or word(s). When disabled, it will also return words containing the accepted words or mask. For example: with “Match whole word” disabled and when searching for “apple”, it will also return words like “pineapple”.

2) Match case: enable this option to make the search Case Sensitive. If you search for “Law Center”, for example, it will only return that word if it’s found in the exact same case. Disable the option to also find “law center”, “LAW CENTER”, “Law center” etc.
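The effect of these two options can be sketched in a few lines of Python. The function `word_matches` and its parameter names are hypothetical and only mirror the behavior described above for fixed words.

```python
def word_matches(word, target, whole_word=True, match_case=False):
    """Return True if `word` matches `target` under the two options."""
    if not match_case:
        # Case-insensitive comparison when "Match case" is disabled.
        word, target = word.casefold(), target.casefold()
    # Whole word: exact match. Otherwise: the target may occur anywhere in the word.
    return word == target if whole_word else target in word


print(word_matches("pineapple", "apple", whole_word=False))       # substring search
print(word_matches("pineapple", "apple", whole_word=True))        # exact match only
print(word_matches("LAW CENTER", "Law Center", match_case=True))  # case must match
```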

05 Find Word with Mask / Words – Mask Setup

Masks are used to search for words matching a specific format, similar to a regular expression. However, you don’t need to write complex regular expressions: MetaServer provides an easy-to-use formatting pick list, and you construct your mask by selecting the elements you need.

Example: the account numbers on our mortgage letters, like “SB-43964718-1” or “AP/22779911-1”, start with 2 letters, followed by a separator that can be either a slash “/” or a dash “-”, then 8 digits, a dash “-” as the second separator, and a final digit.

If we translate that into a mask, it would be { A, 2 }{ C }{ 9, 8 }-{ 9 } with a minimum length of 13. Each mask syntax element is described in more detail below.

The Reject mask is used to skip words matching the defined Reject mask. For example, if you want to skip all account numbers that start with “ZZ”, you could define a reject mask starting with ZZ.

TIP: When you have both Accept and Reject masks defined in a single rule, all the words matching the reject mask are eliminated first. Then, the remaining words are used to look for the ones matching the Accept mask.
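The order of evaluation described in this tip can be sketched as follows. This is an illustration only: the masks are written as Python regular expressions rather than MetaServer masks, and the function name and sample words are made up.

```python
import re


def filter_words(words, accept, reject=None):
    """Eliminate words matching `reject` first, then keep those matching `accept`."""
    remaining = [w for w in words if not (reject and re.fullmatch(reject, w))]
    return [w for w in remaining if re.fullmatch(accept, w)]


accept = r"[A-Za-z]{2}[-/]\d{8}-\d"  # the account number format
reject = r"ZZ.*"                     # anything starting with "ZZ"
words = ["SB-43964718-1", "ZZ-11112222-3", "Ref:", "AP/22779911-1"]
print(filter_words(words, accept, reject))
```

“ZZ-11112222-3” has the right format but is eliminated by the reject step, and “Ref:” never matches the accept step.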

01 – Accept / Reject mask: here you define the masks. Words matching the Reject mask will be eliminated, words matching the Accept mask will be kept. Both masks use the same setup method.

By pressing the dropdown button, you can select different format types to compose your mask. You can even add a field to your mask, so it can change dynamically based on that field value.

1) Clear: clears the mask.

2) My text here: an example text. You can overwrite it with your own text. Use it if your masks consist of fixed characters. It’s also possible to type fixed text directly in the mask’s input box.

3) Any character: shown as { ? }, any character is allowed.

4) A letter: shown as { A }, any letter is allowed, both upper and lower case. If you want to only accept a specific case, you can use a custom character.

5) A letter or digit: shown as { X }, any letter or single digit is allowed. If you also want to allow periods, hyphens, commas, etc., you need to use the { ? } “Any character” type.

6) A digit: shown as { 9 }, any single digit is allowed.

7) A custom character: shown as { C }, only allows a list of defined characters. You can define these in the Custom Character Setup. Press the “…” button next to the Accept or Reject Mask to set up your custom characters.

The Custom Character Setup window opens…

The custom character definition above only allows a “-” or “/” for every C element in your mask.

1) Valid characters: you can choose whether the custom character also allows uppercase letters, lowercase letters or digits.

2) Others: Here you can add, delete or modify specific custom characters. In the example above, a custom character can only be a “-” or “/”.

8) Any 5 … : the number 5 is just an example, replace the 5 with the number of characters you want.

For example: { ?, 6 } means any 6 characters, { A, 2 } means 2 letters, { X, 5 } means 5 letters or digits, etc.
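For readers who know regular expressions, the mask elements listed above map onto character classes quite directly. The following Python sketch shows one possible translation. It is an assumption about equivalent behavior, not MetaServer’s actual implementation, and `mask_to_regex` and its `custom_chars` parameter are hypothetical names.

```python
import re

# One regex character class per mask element type.
CLASSES = {"?": ".", "A": "[A-Za-z]", "X": "[A-Za-z0-9]", "9": "[0-9]"}
# Matches elements such as "{ A }" or "{ 9, 8 }".
ELEMENT = re.compile(r"\{\s*([?AX9C])\s*(?:,\s*(\d+)\s*)?\}")


def mask_to_regex(mask, custom_chars=""):
    """Translate a mask such as '{ A, 2 }{ C }{ 9, 8 }-{ 9 }' into a regex string.

    `custom_chars` lists the characters allowed for every { C } element.
    """
    classes = dict(CLASSES, C="[" + re.escape(custom_chars) + "]")
    parts, pos = [], 0
    for m in ELEMENT.finditer(mask):
        parts.append(re.escape(mask[pos:m.start()]))  # fixed text between elements
        count = m.group(2)
        parts.append(classes[m.group(1)] + ("{" + count + "}" if count else ""))
        pos = m.end()
    parts.append(re.escape(mask[pos:]))  # fixed text after the last element
    return "".join(parts)


pattern = mask_to_regex("{ A, 2 }{ C }{ 9, 8 }-{ 9 }", custom_chars="-/")
for word in ["SB-43964718-1", "AP/22779911-1", "4B/15687945-2"]:
    print(word, bool(re.fullmatch(pattern, word)))
```

The first two words match the translated pattern. The third does not, because its first character is a digit.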

02 – Minimum length: If you only want to read a part of the mask, set the minimum length lower than the total length of the mask.

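To explain how the Minimum length setting works, consider the mask from the example above, { A, 2 }{ C }{ 9, 8 }-{ 9 }, with a minimum length of 13 and “Match whole word” enabled.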

Examples:

AB/15687945-2:
OK because the number of characters is greater than or equal to the defined minimum of 13 and the value starts with 2 letters followed by a custom character (“/” in this case), 8 digits, a dash and a single digit.

AB/157945-2:
NOT OK because the number of characters (11) is smaller than the defined minimum length of 13.

AB/156870945-02:
NOT OK because it is longer (15 characters) than the total length of the defined mask. If you want to accept longer words, you would need to disable “Match whole word”.

4B/15687945-2:
NOT OK because it contains a digit instead of a letter in the first 2 characters and therefore does not comply with the defined mask.
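The four outcomes above can be reproduced with a small Python sketch. It rests on an assumed interpretation of the Minimum length setting: a word is accepted when its length lies between the minimum length and the mask length, and each of its characters fits the mask element at the same position. The names `MASK` and `accepts` are hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

# The mask { A, 2 }{ C }{ 9, 8 }-{ 9 }, expanded to one allowed set per position.
# The custom character { C } is assumed to allow "-" or "/".
MASK = [LETTERS] * 2 + [set("-/")] + [DIGITS] * 8 + [set("-")] + [DIGITS]


def accepts(word, mask=MASK, min_length=13):
    """Assumed behavior with "Match whole word" enabled."""
    if not (min_length <= len(word) <= len(mask)):
        return False  # too short, or longer than the mask
    # Compare each character with the mask element at the same position.
    return all(ch in allowed for ch, allowed in zip(word, mask))


for word in ["AB/15687945-2", "AB/157945-2", "AB/156870945-02", "4B/15687945-2"]:
    print(word, "OK" if accepts(word) else "NOT OK")
```

Under the same assumption, lowering `min_length` to 11 would also accept a truncated value such as “AB/15687945”, because only the first 11 mask positions are then checked.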

06 Find Word with Mask / Words – Words Setup

Here you can specify words that should be extracted (Accept) or skipped (Reject).

Document types or separator keywords, as used in the CB – INSPECTIONS REPORTS example, are common cases where you need to find fixed words.

1) Accept: return the specified word(s).

2) Reject: reject the specified word(s).

Note: Spaces also count as characters.

07 Find Word with Mask / Words – Accept words from database – Setup

With the “Accept words from database” option, you can maintain a list of Accept Words outside MetaServer using an external database.

Enable the “Accept words from database” option and press the Setup button.

The Accept words from database Setup window opens…

01 Database

When opened, the Database tab is selected by default. Here you can select the database table and column containing the Accept Words. Any changes you make in a database table are automatically applied.

01 – Type: select your database type:

SQL Server: when you use a direct connection, you don’t need to set up an ODBC data source. Because the communication with the SQL server is direct, searching and updating SQL tables becomes more efficient.

Note: If you change the connection type from ODBC to Direct SQL and you connect to the same table, the mappings are preserved.

1) Server: enter the SQL server you want to connect to.

2) Database: enter the name of the database you want to access.

3) User name & Password: most SQL databases require you to log in. If so, enter the user name and password in these fields.

4) Extra: allows you to add custom connection string parameters. If you don’t need to use any special options, you can leave that field blank.

5) Timeout: when the database does not respond in the specified time, the action will fail.

6) Table: a database typically stores data in one or more tables, such as a document types table, a suppliers table, a products table, etc. Specify the correct table containing the Accept words you want to use in your rule.

7) Column: a table typically contains one or more columns, such as a name column, address column, phone number column etc. Specify the correct column containing the Accept words you want to use in your rule.

ODBC: this is the default database type. ODBC is a standard to connect to a wide range of databases. Configuring an ODBC data source is straightforward. For detailed instructions on how to define an ODBC data source, please have a look at this guide.

This is an overview of the ODBC settings:

1) Data source: select the data source you want to use. An ODBC data source needs to be defined first using the ODBC Data Source Administrator tool in Windows. For step-by-step instructions on how to define an ODBC Data Source in Windows, have a look here.

Select data source from field: you can use this option to switch databases dynamically using a field value.

To access this setup window, press the “…” button next to Data Source. You can select the field containing your database name by pressing the drop-down arrow. 

Be aware that when you use this feature, all databases that can be loaded must share the same table name and schema.

2) User name & Password: some databases require you to log in. If so, enter the user name and password in these fields.

3) Timeout: when the database does not respond in the specified time, the action will fail.

4) Table: a database typically stores data in one or more tables, such as a document types table, a suppliers table, a products table, etc. Specify the correct table containing the Accept words you want to use in your rule.

5) Column: a table typically contains one or more columns, such as a name column, address column, phone number column etc. Specify the correct column containing the Accept words you want to use in your rule.

MetaServer: a MetaServer database is a shared CSV database. It doesn’t require any definition of ODBC sources on any of the clients and is very easy to deploy. The MetaServer DB settings are very similar to the ODBC settings.
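Whichever database type is selected, the idea is the same: the Accept Words are the values of one column of one table, read when needed so that edits to the table take effect without changing the rule. The Python sketch below illustrates that idea with an in-memory SQLite database as a stand-in. The table name `doc_types`, the column name `name`, the sample rows and the function `load_accept_words` are all hypothetical, and this is not how MetaServer connects to SQL Server or ODBC.

```python
import sqlite3


def load_accept_words(conn, table, column):
    """Read the distinct, non-empty values of one column as the Accept Words."""
    # Table and column names cannot be bound as parameters, so they are quoted here.
    query = f'SELECT DISTINCT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL'
    return {row[0] for row in conn.execute(query)}


# Stand-in database with made-up content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_types (name TEXT)")
conn.executemany("INSERT INTO doc_types VALUES (?)", [("INVOICE",), ("CREDIT NOTE",)])

accept_words = load_accept_words(conn, "doc_types", "name")
print(sorted(accept_words))

# A row added later is picked up the next time the words are loaded.
conn.execute("INSERT INTO doc_types VALUES (?)", ("ORDER",))
print(sorted(load_accept_words(conn, "doc_types", "name")))
```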

08 How to Create a MetaServer Database

To create a MetaServer database, you simply create or copy a CSV file in C:\ProgramData\CaptureBites\Programs\MetaServer\Data\DB

The CSV file needs to meet the following requirements:
1) The first line defines the column names
2) The following lines are data records
3) Fields are separated by “,” (comma) or “;” (semicolon)

Example of a basic CSV:

VENDOR_NAME,VENDOR_ID
ARROW ELECTRONICS,9492785400
Cisco WebEx LLC,8754441234
Dell,9598741234
Evernote Corporation,6584568754
K Software,8595140754
PremiumSoft CyberTech Ltd.,85224983422
Vivify Scrum,5554872315
WPForms LLC,8787775487

The “,” delimiter can also be a “;” delimiter.

Here is the example CSV again as seen in a CSV Viewer:

Note: the ; (semicolon) delimiter is often used in Europe because the comma is commonly used as the decimal separator in European countries.

If you use a comma-delimited CSV and you have values containing a comma, you need to put the value between double quotes. A value like 22500, Broadway would need to be quoted like “22500, Broadway” to prevent the comma in the address from being interpreted as a field separator.
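If you want to check a CSV file against these rules before copying it to the DB folder, a short Python script using the standard csv module can help. This is a sketch under assumptions: the delimiter is guessed from the header line, the function name `read_metaserver_csv` is hypothetical, and the ADDRESS column is added purely to demonstrate quoting.

```python
import csv
import io


def read_metaserver_csv(text):
    """Parse CSV text whose first line holds the column names.

    The delimiter is guessed from the header: ";" if it occurs more often than ",".
    """
    header = text.splitlines()[0]
    delimiter = ";" if header.count(";") > header.count(",") else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


comma_csv = 'VENDOR_NAME,ADDRESS\nDell,"22500, Broadway"\n'
semicolon_csv = "VENDOR_NAME;VENDOR_ID\nDell;9598741234\n"

print(read_metaserver_csv(comma_csv)[0]["ADDRESS"])
print(read_metaserver_csv(semicolon_csv)[0]["VENDOR_ID"])
```

The quoted address is read as a single field, comma included, while the semicolon-delimited file is split on “;”.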

09 Find Word with Mask / Words – Result

01 – Value: You can specify which words you want to keep. There are 3 options:

1) Keep all matches: this will return all words matching the defined Accept Mask or Accept Words and not matching the defined Reject Mask or Reject Word(s).

2) Keep first match: this will return the first word matching the defined Accept Mask or Accept Words and not matching the defined Reject Mask or Reject Word(s).

3) Keep last match: this will return the last word matching the defined Accept Mask or Accept Words and not matching the defined Reject Mask or Reject Word(s).

Note: If you want to output a specific word of many, like the 2nd word of 5 words, extract all words by keeping all matches and create an Edit / Set Field Value rule to extract the specific word using the Extract segments options.

02 – Overwrite: if enabled, the result will overwrite the previous field value. Otherwise, the result will be added to the value that is already in the field.

03 – Clear field if result is blank: if the result is blank, any values already in the selected field are cleared.

04 – Delete duplicates: this will delete all duplicate matches and the result will only return unique values.
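How the four Result options interact can be sketched in Python. Several details are assumptions made for this illustration and are not documented above: duplicates are removed before the first or last match is picked, multiple values are joined with “;”, and a blank result leaves the field untouched unless “Clear field if result is blank” is enabled. The function names are hypothetical.

```python
def build_result(matches, keep="all", delete_duplicates=False):
    """Apply the Value and Delete duplicates options to a list of matches."""
    if delete_duplicates:
        matches = list(dict.fromkeys(matches))  # keeps the first occurrence, in order
    if keep == "first":
        return matches[:1]
    if keep == "last":
        return matches[-1:]
    return matches


def write_field(current, result, overwrite=True, clear_if_blank=False, sep=";"):
    """Apply the Overwrite and Clear field if result is blank options."""
    new_value = sep.join(result)
    if not new_value:
        return "" if clear_if_blank else current
    if overwrite or not current:
        return new_value
    return current + sep + new_value  # add the result to the existing value


matches = ["SB-43964718-1", "AP-22779911-1", "SB-43964718-1"]
print(build_result(matches, delete_duplicates=True))
print(write_field("old value", build_result(matches, keep="first")))
print(write_field("old value", build_result(matches, keep="last"), overwrite=False))
```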

TIP: you can copy the current settings and paste them in another setup window of the same type. Do this by pressing the Settings button in the bottom left of the Setup window and by selecting Copy. Then open another setup window of the same type and select Paste.