120-280 MetaServer Extract – Find Word with Type

A line of text is all the text on the same horizontal line. All the text in the green box below is located on the same line.

Word groups are clusters of words separated by large spaces or TABs. In the example image, the word groups are marked in pink. MetaServer represents a TAB with a → character.

The line of text marked in green contains two word groups and is extracted by MetaServer as follows:
Customer ID: 173002→Req Date/ Time: 01/16/15 UPS

Words are separated with spaces. In our example, we marked some words in blue.

In conclusion, a line consists of 1 or more word groups, and a word group consists of 1 or more words.
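The line / word group / word hierarchy can be illustrated with a short Python sketch. It is not MetaServer code: the function name `split_line` is hypothetical, and the sketch assumes that the large gaps between word groups have already been encoded as TAB characters (shown as → above).

```python
def split_line(line: str) -> list[list[str]]:
    """Split a line into word groups (on TAB), then each group into words (on spaces)."""
    # Assumption: word groups are already delimited by TAB characters.
    groups = [group for group in line.split("\t") if group.strip()]
    # Words inside a word group are separated by ordinary spaces.
    return [group.split() for group in groups]


# The example line from above, with the → written as a real TAB character.
line = "Customer ID: 173002\tReq Date/ Time: 01/16/15 UPS"
groups = split_line(line)

for number, words in enumerate(groups, start=1):
    print(f"word group {number}: {words}")
```

The result is a list with two word groups, the first holding 3 words and the second holding 5 words.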

With MetaServer’s Find Word with Type rule, you can find specific words of a certain type. It is frequently preceded by Replace Text and Remove Spaces rules, which remove redundant spaces, periods or separators and detach the word from colons or commas.

The Find Word with Type rule is very useful when you need to extract a specific type of data from documents that don’t have a fixed format. As an example, let’s take Brazilian receipts. All these receipts contain a Brazilian company registration number (CNPJ), a personal registration number (CPF), the total amount and the date of purchase.

You typically define an Extract Text rule first to place the full text of the document in a field called Full Text. Then, you put the extracted text in the field that will hold the final value by using a Set Field Value rule. After that, you clean up the text with Replace Text and Remove Spaces rules. Finally, you extract the correct data by adding a Find Word with Type rule.

MetaServer’s Find Word with Type rule currently supports the following Types:

– ABN (Australian Business Number)
– Belgian National Number
– Belgian Structured Communication (Credit Transfer)
– Boleto Bancário
– BSN (Burgerservicenummer)
– CPF (Cadastro de Pessoas Físicas)
– CNPJ (Cadastro Nacional de Pessoa Jurídica)
– Chave de Acesso da NF-e
– Dutch SOFI / BSN Number
– INAMI / RIZIV
– Oyster Card
– VIN (Vehicle Identification Number)
– KBC Bank (Mod97 checksum)
– Weighted Mod 11

Luhn Check:
– Custom length
– American Express
– Canadian GST
– Canadian SIN
– IMEI
– Maestro
– MasterCard
– VISA

VAT number for:
– Belgium
– France
– Germany
– Luxembourg
– Netherlands
– Portugal
– South Africa
– Poland
– Spain (CIF)
– UK

You can always contact us if you require an additional type.
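Several of the types listed above (the card types, IMEI and the Canadian numbers) are grouped under the Luhn check. As a rough illustration of what such a check involves, here is a minimal Python sketch of the standard Luhn algorithm. The function name `luhn_valid` is hypothetical, and the sketch does not reproduce MetaServer's implementation or its type-specific length and prefix rules.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the standard Luhn check."""
    if not (number.isascii() and number.isdigit()) or len(number) < 2:
        return False
    total = 0
    # Walk the digits from right to left; double every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111111111111111"))  # widely used VISA test number
print(luhn_valid("4111111111111112"))  # last digit changed
```

A type-specific check would additionally verify the expected length and prefix before accepting a number.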

In our example, we will make use of the “CB – CUPOM FISCAL” workflow. This workflow is automatically installed with CaptureBites MetaServer.

We want to extract the CNPJ number (Cadastro Nacional de Pessoa Jurídica) from receipts. The location varies with each vendor, but the CNPJ has a special fixed format and check digit that MetaServer’s Find Word with Type rule can use to automatically recognize it in the text.

First, you extract the full text with an Extract Text rule, copy that value into the CNPJ field using a Set Field Value rule and remove any unnecessary characters that appear in the text (like “.”, “/”, “-”, etc.) with a Replace Text rule.

Next, you find the CNPJ number by using a Find Word with Type rule and selecting the correct type (CNPJ). MetaServer checks every number in the text against the length and check digit calculation of a CNPJ number and returns only the numbers that match.
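To make the clean-up and check digit steps concrete, here is a minimal Python sketch of the same idea. It is not MetaServer's implementation: the names `cnpj_valid` and `find_cnpj` and the sample receipt text are hypothetical. The check itself is the publicly documented CNPJ rule (14 digits, two modulo 11 check digits). Rejecting numbers made of one repeated digit is a common convention and is added here as an assumption.

```python
import re

# Weights for the first and second CNPJ check digits.
WEIGHTS_1 = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
WEIGHTS_2 = [6] + WEIGHTS_1


def check_digit(digits: list[int], weights: list[int]) -> int:
    """Weighted sum modulo 11; remainders 0 and 1 give check digit 0."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def cnpj_valid(word: str) -> bool:
    """Return True if the word has the length and check digits of a CNPJ."""
    if len(word) != 14 or not (word.isascii() and word.isdigit()):
        return False
    digits = [int(char) for char in word]
    if len(set(digits)) == 1:
        return False  # assumption: reject 00000000000000, 11111111111111, ...
    first = check_digit(digits[:12], WEIGHTS_1)
    second = check_digit(digits[:12] + [first], WEIGHTS_2)
    return digits[12:] == [first, second]


def find_cnpj(text: str) -> list[str]:
    """Strip ".", "/" and "-" (like the Replace Text step), then keep valid words."""
    cleaned = re.sub(r"[./-]", "", text)
    return [word for word in cleaned.split() if cnpj_valid(word)]


# Hypothetical receipt text; 11.222.333/0001-81 is a commonly used test CNPJ.
receipt = "LOJA EXEMPLO LTDA CNPJ: 11.222.333/0001-81 IE: 123456789 TOTAL 45,90"
print(find_cnpj(receipt))
```

Because the check depends only on the digits themselves, the position of the CNPJ on the receipt does not matter.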

We will only focus on the Find Word with Type rule. To see all the rules of the workflow, please have a look at the “CB – CUPOM FISCAL” workflow setup, which is included in the MetaServer installer.

Find Word with Type rules are defined as part of a MetaServer Extract or Separate Document / Process Page action.

To add this rule, press the Add button and select Find –> Word –> with Type.

First, add a description to your rule. Then, select the field that will hold the result. In this case, we select the field “CNPJ”.

01 – Source field: press the drop-down arrow to select the source field. This is the field containing the text you want to parse to find your specific type of word. In our case, the full text was copied in the CNPJ field with a previous Set Field Value rule, so the Source field is also the CNPJ field.

02 – Type: select the type of word you need to find in the text. Some word types, like VAT numbers and Luhn Check, have their own setup window with extra options. When available, you can access this setup by pressing the “…” button.

03 – Value: You can specify which words you want to keep. There are 3 options:

1) Keep all matches: this will return all words matching the specified type.

2) Keep first match: this will return the first word matching the specified type.

3) Keep last match: this will return the last word matching the specified type.

Note: To output one specific match out of several, such as the 2nd of 5 words, keep all matches and add an Edit / Set Field Value rule that keeps only that word using the Extract segments options.

04 – Overwrite: This is only available if the source field is different from the target field. If enabled, the result will overwrite the previous field value. Otherwise, the result will be added to the value that is already in the field.

05 – Clear field if result is blank: if the result is blank, any values already in the selected field are cleared.

06 – Delete duplicates: this will delete all duplicate matches and the result will only return unique values.
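How options 03 to 06 interact can be sketched in a few lines of Python. This is only a model of the behavior described above, not MetaServer code. The function `apply_options` and its parameters are hypothetical. The space separator, the order in which the options are applied, and keeping the existing value when the result is blank and option 05 is off are all assumptions.

```python
def apply_options(matches, keep="all", delete_duplicates=False,
                  current="", overwrite=True, clear_if_blank=False,
                  separator=" "):
    """Combine the matched words into the new value of the target field."""
    if delete_duplicates:
        # 06: keep only unique values, preserving their original order.
        matches = list(dict.fromkeys(matches))
    if keep == "first":
        matches = matches[:1]    # 03: keep first match
    elif keep == "last":
        matches = matches[-1:]   # 03: keep last match
    result = separator.join(matches)
    if not result:
        # 05: a blank result clears the field only when the option is enabled.
        return "" if clear_if_blank else current
    if overwrite or not current:
        return result            # 04: replace the previous field value
    return current + separator + result  # 04 off: add to the existing value


matches = ["A", "B", "A", "C"]
print(apply_options(matches, delete_duplicates=True))
print(apply_options(matches, keep="last"))
print(apply_options(["X"], current="old", overwrite=False))
```

The three calls show unique matches only, the last match only, and a result appended to a value that is already in the field.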

TIP: you can copy the current settings and paste them in another setup window of the same type. Do this by pressing the Settings button in the bottom left of the Setup window and by selecting Copy. Then open another setup window of the same type and select Paste.