120-150 Extract – Extract Text (Azure Form Recognizer)

With MetaServer’s Extract Text (Azure Form Recognizer) rule, you can automatically extract header data, key values and line items from forms, like invoices, and store that extracted data in fields. This is done through Azure Form Recognizer’s prebuilt models. These do not require any training or configuration.

You can specify the pages from which you want to extract information. After the values have been extracted, you can apply other Extract rules to clean up, format or adjust the values before sending them to the next action.

The invoice model, the most noteworthy model of the Azure Form Recognizer engine, currently supports the following languages:

  • English
  • Spanish
  • German
  • French
  • Italian
  • Portuguese
  • Dutch

NOTE: Don’t hesitate to test unsupported languages. For example, we have tested a set of Czech invoices with good results.

You can refer to the Azure Form Recognizer’s Invoice Model documentation for a more detailed list of the supported languages.

Some examples (results are below each sample):

  • US invoice (EN)
  • German invoice
  • Spanish invoice
  • French invoice
  • Belgian invoice (NL)
  • UK invoice (EN)
  • US invoice (EN)
  • French invoice
  • US invoice (EN)
  • US invoice (EN)

NOTE: To validate line items as a table with different columns, you need to merge all line items into 1 field using a Set Field Value rule. The line-item field then contains all line items in CSV format.

For example:

"5138489C","TRAXION MENACE CREW","9.00","EACH","10","","90.00"
"5138489D","TRAXION MENACE CREW ","9.00"," EACH ","2","","18.00"
"5144035","STADIUM II BACKPACK","30.00"," EACH ","1","3","90.00"
"5142723","STRIKER II TEAM BACKPACK","22.50"," EACH ","1","","22.50"
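
As a rough illustration of how such a merged field breaks down into rows and columns, the sketch below parses the four example rows with Python's standard csv module. It is not how the Validate CSV rule is implemented; the function name parse_line_items is hypothetical, and trimming the cells is an assumption made only to tidy the stray spaces in the sample.

```python
import csv
import io

# The merged line-item field from the example above, one CSV row per line item.
LINE_ITEMS = '''"5138489C","TRAXION MENACE CREW","9.00","EACH","10","","90.00"
"5138489D","TRAXION MENACE CREW ","9.00"," EACH ","2","","18.00"
"5144035","STADIUM II BACKPACK","30.00"," EACH ","1","3","90.00"
"5142723","STRIKER II TEAM BACKPACK","22.50"," EACH ","1","","22.50"
'''


def parse_line_items(field_value: str) -> list[list[str]]:
    """Split a merged line-item field into rows of trimmed cell values."""
    reader = csv.reader(io.StringIO(field_value))
    # Skip empty rows and strip the padding that some cells carry.
    return [[cell.strip() for cell in row] for row in reader if row]


rows = parse_line_items(LINE_ITEMS)
for row in rows:
    print(row)
```

Each row comes back as a list of seven cells, which is the shape a table-style validation view needs.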

Using a Validate CSV rule, it would look like this in validation:

NOTE: You need to sign up for the Azure Form Recognizer service. Paid plans for the prebuilt models cost $10 per 1,000 pages (S0 plan for prebuilt document types). There is also a free 12-month plan (F0) with which you can test the engine with prebuilt models for up to 500 pages per month.

IMPORTANT: The free plan is limited to 1 call per 2 seconds and reads only 2 pages of each invoice. The paid plan (S0) allows 15 calls per second, which is 30 times faster than the free plan, and reads all pages of the invoice.
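
The arithmetic behind these figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constants are copied from the notes above, the function names are hypothetical, and the cost estimate assumes simple linear pricing without volume discounts.

```python
# Figures taken from the notes above; treat them as a snapshot, not a price list.
S0_PRICE_PER_1000_PAGES = 10.0   # USD, paid S0 plan
F0_CALLS_PER_SECOND = 1 / 2      # free plan: 1 call per 2 seconds
S0_CALLS_PER_SECOND = 15         # paid plan: 15 calls per second


def estimate_s0_cost(pages: int) -> float:
    """Estimated S0 cost in USD, assuming linear pricing."""
    return pages / 1000 * S0_PRICE_PER_1000_PAGES


def min_seconds_for_calls(calls: int, calls_per_second: float) -> float:
    """Shortest time in which the rate limit allows `calls` calls."""
    return calls / calls_per_second


speedup = S0_CALLS_PER_SECOND / F0_CALLS_PER_SECOND
print(estimate_s0_cost(2500))                           # cost of 2,500 pages
print(min_seconds_for_calls(100, F0_CALLS_PER_SECOND))  # 100 calls on F0
print(speedup)                                          # S0 rate relative to F0
```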

More detailed information about the pricing can be found here:
https://azure.microsoft.com/en-us/pricing/details/form-recognizer/

For more information on how to apply for a key, please refer to the instructions below.

NOTE: For more technical information about how Microsoft’s Azure Form Recognizer engine works (API, OCR, etc.) and how Microsoft handles data privacy and security, please refer to the Microsoft Azure Form Recognizer documentation.

In our example, we will make use of the “CB – INVOICES, FACTURES, RECHNUNGEN” workflow. This workflow is automatically installed with CaptureBites MetaServer.

Extract Text rules are defined in a MetaServer Extract or Separate Document / Process Page action.

To add this rule, press the Add button and select Extract -> Text (Azure Form Recognizer).

How to sign up for a key

1) Log in to the Azure portal: https://portal.azure.com/#home

NOTE: If you don’t have a Microsoft or Azure account yet, you can sign up for free. You can find more info here:
https://azure.microsoft.com/en-us/free/

2) Create a resource:

3) In the “AI + Machine Learning” section, create a “Form Recognizer” Resource:

NOTE: If you don’t have an Azure account, you can start one for free for 12 months. Just select “Start Free Trial” and follow the instructions.

 
4) Select the server of your region, give your instance a unique name (this will be used in the endpoint URL) and select the pricing tier.

You can choose between 2 pricing tiers for the Azure Form Recognizer:

F0 plan (Free): 12 months Free (500 pages/month)

– If you use the invoice model with the free plan, a maximum of 2 pages per invoice is processed.

– The page file size must be less than 4 MB.

– With the 12-month free plan, the Azure server can only handle 1 call every 2 seconds for the recognizer API. Because of this speed limit, we recommend running extraction and separation on only one core with the free plan.

– On the 28th of every month, the counter is reset to 500 pages.

– If you run out of free calls before the 28th of the month, MetaServer moves documents to the Error tab and reports that you must either wait until the 28th of the month to continue processing documents or switch to a paid plan.

– Documents that ended up in the Error tab can be reprocessed with a paid plan or retried when the free counter is reset on the 28th of the month.

– One month before the 12-month period ends, you will receive an email from Microsoft stating that the free service is about to expire and will stop working.

You then have 30 days to switch from your free plan to a “pay-as-you-go” S0 plan (see below) or to stop using the service.
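
To plan when documents in the Error tab can be retried, the monthly reset described in the list above can be sketched as follows. This assumes the reset always happens on the 28th, as stated; the function names are hypothetical and the sketch does not query Azure in any way.

```python
from datetime import date

RESET_DAY = 28  # day of the month on which the free counter is reset, per the text above


def next_reset(today: date) -> date:
    """Return the first reset date strictly after `today`."""
    if today.day < RESET_DAY:
        return today.replace(day=RESET_DAY)
    # On or after the 28th: the next reset falls in the following month.
    year = today.year + (1 if today.month == 12 else 0)
    month = 1 if today.month == 12 else today.month + 1
    return date(year, month, RESET_DAY)


def days_until_reset(today: date) -> int:
    """Number of days to wait before the free page counter is refilled."""
    return (next_reset(today) - today).days


print(next_reset(date(2024, 1, 15)))
print(days_until_reset(date(2024, 1, 15)))
```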

More detailed information about the pricing can be found here:
https://azure.microsoft.com/en-us/pricing/details/form-recognizer/

S0 plan: Pay-as-you-go ($10 per 1000 pages)

– The page file size can be up to 500 MB.

– The Azure server can handle 15 calls per second. You can run Extraction and Separation on multiple cores with a paid plan.

– Microsoft only charges for READ calls. GET calls are free.

– For high volumes, starting from more than 20,000 pages per month, you can find special pricing here.

– You can pay the subscription with a credit card or request to pay by check or wire transfer here:
https://docs.microsoft.com/en-us/azure/cost-management-billing/manage/pay-by-invoice

5) Press “Review + create” to check your Resource details. If they are correct, press the “Create” button.

6) You can find your resource’s Keys and Endpoint in your Microsoft Azure Dashboard. The “Go to Azure Portal” button will open the portal in your default browser.

The example below shows a resource called “MetaServerAFR-DEV”.

First, add a description to your rule. Then, press the Setup button in the upper toolbar to set up the connection to your Azure Form Recognizer resource.

01 – Key, Endpoint, Location: enter your resource key and endpoint, and select your location using the drop-down arrow. You can find this information in your Microsoft Azure Dashboard. The “Go to Azure Portal” button will open the portal in your default browser.

The example below shows a resource called “MetaServerAFR-DEV”.

In your portal, you can also check your remaining calls. This is useful for verifying that you are not exceeding the limits of your current Microsoft Azure Form Recognizer pricing tier.

If you haven’t signed up for a key yet, please refer to the instructions above.
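
For readers who want to see how a key and endpoint are typically used, here is a minimal sketch that builds the URL and headers of an analyze request. It assumes the public Form Recognizer v3.0 REST route and the 2022-08-31 API version; how MetaServer builds its own requests is not documented here. The endpoint and key shown are placeholders, and build_analyze_request is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_analyze_request(endpoint: str, key: str,
                          model_id: str = "prebuilt-invoice",
                          api_version: str = "2022-08-31") -> tuple[str, dict[str, str]]:
    """Return the URL and headers for an analyze call (assumed v3.0 REST route)."""
    base = endpoint.rstrip("/")
    query = urlencode({"api-version": api_version})
    url = f"{base}/formrecognizer/documentModels/{model_id}:analyze?{query}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,    # the resource key from the portal
        "Content-Type": "application/pdf",   # assumes a PDF is sent as the body
    }
    return url, headers


# Placeholder endpoint and key, for illustration only.
url, headers = build_analyze_request(
    "https://example-resource.cognitiveservices.azure.com/", "<your-key>")
print(url)
```

No network call is made here; the sketch only shows where the values from the portal end up.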

02 – Proxy: if you want to connect through a proxy server, press the Proxy button to open the setup window.

1) Type, Host, User name, Password: press the drop-down arrow to choose your proxy protocol and enter the connection settings to your proxy server. When in doubt, contact your IT department.

2) Port: enter the specified port of your Proxy server. When in doubt, contact your IT department.
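
The proxy settings above map onto a conventional proxy URL. The sketch below shows that mapping under the assumption that the proxy accepts credentials embedded in the URL; the host name, user and password are made up, and build_proxy_url is a hypothetical helper rather than anything MetaServer exposes.

```python
from urllib.parse import quote


def build_proxy_url(scheme: str, host: str, port: int,
                    user: str = "", password: str = "") -> str:
    """Combine type, host, port and optional credentials into one proxy URL."""
    credentials = ""
    if user:
        # Percent-encode credentials so characters like "@" or spaces stay valid.
        credentials = quote(user, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"{scheme}://{credentials}{host}:{port}"


proxy = build_proxy_url("http", "proxy.example.local", 8080, "scan user", "p@ss")
# A mapping in the shape that common Python HTTP clients accept.
proxies = {"http": proxy, "https": proxy}
print(proxy)
```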

03 – Prebuilt model: The “Invoice” prebuilt model is set as default. The “Invoice” model automates processing of invoices to extract header data like vendor name, vendor tax id, invoice number, invoice date, due date, payment terms, total amount, etc.

The model also extracts line items like article codes, unit price, quantity, etc. Currently, the model supports English, Spanish, German, French, Italian, Portuguese and Dutch invoices. But don’t hesitate to test it with other languages: we have also seen good results with Czech, Swedish and Danish invoices, among others.

The “Receipt” model automatically extracts merchant name, dates, line items, quantities and totals from printed and handwritten receipts. Version 3.0 also supports single-page hotel receipt processing. The “Receipt” prebuilt model is more limited than the “Invoice” model. In our tests with European tax receipts, we saw better results with the invoice model.

The prebuilt models are documented here:

Invoice model:
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-invoice?view=form-recog-3.0.0

Receipt model:
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-receipt?view=form-recog-3.0.0

Read model:
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-read?view=form-recog-3.0.0

04 – API version: here you can specify the API version for the Azure Form Recognizer. By default, the preview version is selected. However, the preview version is currently only available for Azure Form Recognizer resources in the following locations:
– West Europe
– West US2
– East US

If your Azure Form Recognizer resource is created in another location, please select the Generally Available (GA) version.

Microsoft regularly releases new API versions. New versions add functionality, such as more supported languages or improved recognition.

You can find more details about what is new in Azure Form Recognizer’s current API here:
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/whats-new?view=form-recog-3.0.0&tabs=csharp

05 – Get result after [x] seconds: by default, this time is set to 6 seconds.

We recommend lowering this value only if you have signed up for Microsoft Azure Form Recognizer’s paid plan. Because the Free plan is limited to 1 call every 2 seconds, we recommend running extraction with a Free plan on a single core and keeping “Get result after [x] seconds” at 6 seconds.
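
The idea behind this setting can be sketched as a "submit, wait, then fetch" loop. The code below is a generic illustration, not MetaServer's implementation: the status values "running", "succeeded" and "failed" are assumed to match Azure's asynchronous analyze operation, and the retry_wait and max_attempts parameters are hypothetical additions for the example.

```python
import time
from typing import Callable


def wait_for_result(fetch: Callable[[], dict],
                    first_wait: float = 6.0,
                    retry_wait: float = 2.0,
                    max_attempts: int = 10,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Wait `first_wait` seconds, then poll `fetch` until the job finishes."""
    sleep(first_wait)
    for attempt in range(max_attempts):
        result = fetch()
        status = result.get("status")
        if status == "succeeded":
            return result
        if status == "failed":
            raise RuntimeError("analysis failed")
        if attempt < max_attempts - 1:
            sleep(retry_wait)  # spacing out retries keeps the call rate low
    raise TimeoutError(f"no result after {max_attempts} attempts")


# Simulated service: first answer "running", second answer "succeeded".
responses = iter([{"status": "running"},
                  {"status": "succeeded", "analyzeResult": {}}])
waits: list[float] = []
result = wait_for_result(lambda: next(responses), sleep=waits.append)
print(result["status"], waits)
```

A longer first wait means fewer fetch calls per document, which matters on a plan limited to 1 call every 2 seconds.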

06 – Log: enable this option to create a log file each time the Microsoft Azure Form Recognizer engine is called. This option is typically used when diagnosing issues with Microsoft Azure.

On the client side, you can find the log information after running a Test in Extraction in the following folder:
C:\ProgramData\CaptureBites\Programs\Admin\Data\Log

On the server side, after processing some documents, you can find the log information in the following folder:
C:\ProgramData\CaptureBites\Programs\MetaServer\Data\Log

07 – Apply: choose when to apply the rule. The default option is “Always”, which means that the rule is always applied. Press the drop-down arrow to see all other available conditions.

A good example of conditional extraction is when you first try to extract a value using the Extract Text rule (= standard OCR engine) or the Extract Text (Azure Computer Vision) rule, and it doesn’t return a valid result. Only then do you let the Extract Text (Azure Form Recognizer) rule try to extract the value.

This speeds up the extraction process and only uses calls to your Microsoft Azure Form Recognizer resource when your first search didn’t return a good result.

After selecting your condition, for example, “If field value is blank”, press the “…” button next to the drop-down arrow to open that condition’s setup window.

1) If value of field: press the drop-down arrow to select the field value that needs to be evaluated.

2) is equal to / is not equal to / is greater than /…: enter the other value your field value needs to be compared with. You can also press the drop-down button to select different system and index values to compose your value.
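
The “If field value is blank” pattern described above amounts to a simple fallback. The following sketch shows the logic with two stand-in functions; cheap_ocr and form_recognizer are hypothetical placeholders for the first rule and the Azure Form Recognizer rule, not real MetaServer calls.

```python
from typing import Callable, Optional


def extract_with_fallback(primary: Callable[[], Optional[str]],
                          fallback: Callable[[], Optional[str]]
                          ) -> tuple[Optional[str], bool]:
    """Return (value, used_fallback); call `fallback` only if `primary` is blank."""
    value = primary()
    if value is not None and value.strip():
        return value, False   # first rule succeeded, no billable call was made
    return fallback(), True


calls = {"afr": 0}


def cheap_ocr() -> str:
    return ""                 # simulates a first rule that found nothing


def form_recognizer() -> str:
    calls["afr"] += 1         # count how often the paid engine is used
    return "INV-100"


print(extract_with_fallback(lambda: "INV-7", form_recognizer))
print(extract_with_fallback(cheap_ocr, form_recognizer))
print(calls["afr"])
```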

08 – Page: set the page number(s) where the information is located. The default is page 1.

For example:
– Enter 1 for the 1st page
– Enter -1 for the last page
– Enter 1-3 to extract from page 1 to page 3
– Leave this empty to extract all pages (same as 1--1, that is, page 1 to the last page)
– Etc.

If a document does not contain a specific page, it is ignored. For example, extracting page “2,3” on a 2-page document will only extract page 2.

The free F0 plan, in combination with the invoice model, processes a maximum of 2 pages per invoice. With the page range, you can determine which pages those are. For example, 1,-1 sends the first and last page of each invoice.
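
To make the page notation concrete, the sketch below resolves a page setting against a document of a given length. It is an illustration under assumptions, not MetaServer's parser: negative numbers are taken to count from the end, a range such as 1--1 is taken to run from page 1 to the last page, and resolve_pages is a hypothetical name.

```python
import re

# One item is a page number, or a range "first-last"; both may be negative.
_ITEM = re.compile(r"^(-?\d+)(?:-(-?\d+))?$")


def _resolve(n: int, total: int) -> int:
    """Map a 1-based or negative (counted from the end) page number to 1-based."""
    return n if n > 0 else total + 1 + n


def resolve_pages(spec: str, total: int) -> list[int]:
    """Turn a page setting such as "1,-1" into a list of existing page numbers."""
    spec = spec.replace(" ", "")
    if not spec:
        return list(range(1, total + 1))      # empty means all pages
    pages: list[int] = []
    for item in spec.split(","):
        match = _ITEM.match(item)
        if not match:
            raise ValueError(f"bad page item: {item!r}")
        first = _resolve(int(match.group(1)), total)
        last = _resolve(int(match.group(2)), total) if match.group(2) else first
        for page in range(first, last + 1):
            if 1 <= page <= total and page not in pages:
                pages.append(page)            # pages the document lacks are skipped
    return pages


print(resolve_pages("2,3", 2))    # page 3 does not exist in a 2-page document
print(resolve_pages("1,-1", 5))   # first and last page
```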

01 – Deskew & Rotate: if your documents are skewed or rotated incorrectly, you can enable the Deskew and/or Rotate option to optimize Text extraction. It will also result in a corrected version of the page(s).

02 – Color Dropout: if your documents contain a lot of colored tables, lines or stamps running through the values you want to extract, you can enable the color dropout feature, which allows you to select up to 3 dropout colors.

Press the setup button to specify which colors to drop out.

Each selected color has its own tolerance. With the Test button, you can see the effect of dropping out the selected colors in the preview window on the right.

To reset all dropout colors to white (off), you can use the “Reset All Colors” button.

NOTE: The filtered image is only used temporarily to improve text extraction. The processed image keeps all the original colors.
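
Conceptually, color dropout replaces every pixel that is close enough to a dropout color with white. The sketch below illustrates this on a short list of RGB pixels. The way tolerance is measured here (maximum difference per color channel) is an assumption made for the example; the actual metric used by MetaServer is not described in this page.

```python
Pixel = tuple[int, int, int]
WHITE: Pixel = (255, 255, 255)


def is_close(pixel: Pixel, color: Pixel, tolerance: int) -> bool:
    """True if every RGB channel differs by at most `tolerance` (assumed metric)."""
    return all(abs(p - c) <= tolerance for p, c in zip(pixel, color))


def drop_colors(pixels: list[Pixel],
                dropouts: list[tuple[Pixel, int]]) -> list[Pixel]:
    """Replace pixels matching any (color, tolerance) pair with white."""
    if len(dropouts) > 3:
        raise ValueError("at most 3 dropout colors are supported")
    return [WHITE if any(is_close(px, color, tol) for color, tol in dropouts) else px
            for px in pixels]


row = [(250, 10, 10), (0, 0, 0), (200, 60, 40)]
# Drop out red with a tolerance of 20: only the first pixel is close enough.
print(drop_colors(row, [((255, 0, 0), 20)]))
```

Raising the tolerance widens the range of shades that are removed, which is why checking the preview after each change is useful.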

03 – Confidence: characters with a confidence level lower than the set confidence level are ignored and not returned in the result. If set to 0, all characters are accepted.

This setting is retained for legacy reasons. We recommend using the newer “Check if confidence is lower than” option in the Validate rules instead.

To help you in defining the correct confidence level, you can check the confidence level of each field in the “Confidence” column of your test results.

NOTE: The “Confidence” value shown beneath “OCR” refers to the confidence level of the Azure Form Recognizer’s “Document Content” field value, which is currently mapped to the “Full Text” value in our example.
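
The effect of the Confidence setting can be illustrated with a small filter. This is a sketch only: it assumes confidence values on a 0 to 100 scale and per-character confidences, and the sample data is invented.

```python
def filter_by_confidence(chars: list[tuple[str, int]], minimum: int) -> str:
    """Keep only characters whose confidence is at least `minimum`."""
    return "".join(ch for ch, confidence in chars if confidence >= minimum)


# Invented recognition result: (character, confidence) pairs.
recognized = [("I", 99), ("N", 97), ("V", 40), ("-", 95), ("1", 88)]

print(filter_by_confidence(recognized, 50))  # the low-confidence "V" is dropped
print(filter_by_confidence(recognized, 0))   # 0 accepts every character
```

Because dropped characters silently disappear from the value, a high threshold can shorten a result without any visible warning, which is one reason to prefer a validation check over filtering.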

04 – Tabs: by default, lines are segmented in multiple word groups that are separated by tabs. If you want no tabs at all, press the drop-down button to select the “Remove” option and all the words will be grouped as 1 single word group for each line.

05 – Convert page(s) to searchable PDF: enable this option if you want to save the extracted text as a searchable text layer in the processed PDF. It will only do this for the page(s) set in the Page(s) field. Leaving the Page(s) field empty will convert all pages of the document.

As a result, you will be able to search handwritten, Arabic, Cyrillic or low-quality text in your exported PDF:

This high-quality text layer can also be used during Validation with the Select text tool. To do this, please make sure you also enable the “Use searchable text layer if present” option in the Select text tool setup:

01 – Field: The “Field” column shows your MetaServer fields. You can map them to the corresponding Form Recognizer field values using the drop-down arrow in the “Form Field” column (see below).

02 – Form Field: based on the selected Azure Form Recognizer Model (Invoice, Receipt, Read), you can map your MetaServer fields to the model’s extracted field values.

NOTE: As you will see in the output, some field values, like dates and amounts, are automatically reformatted by the Azure Form Recognizer engine. This is to output consistent, standardized formats, regardless of the input.

Dates are formatted to YYYY-MM-DD format.

For example:

12/06/2023

becomes:

2023-06-12 (on a European invoice)

2023-12-06 (on a US invoice)

Amounts are formatted without thousand separators and with a period (.) as the decimal character. The number of digits after the decimal point follows the digits available in the source, except for the Total amounts of the header data, which are rounded to 2 digits.

For general amounts:

*.00 will become *.0

*.0000 will become *.0

*.20 will become *.2

*.1254 will remain *.1254

For Total amounts of header data:

*.00 will become *.0

*.0000 will become *.00

*.20 will become *.2

*.1254 will become *.13


The currency symbol is also removed from all types of amounts.

For example:

£20,432.625

becomes:

20432.625


Time values are formatted to a HH:mm:ss format.

For example:

07:45 PM

becomes:

19:45:00

If the format needs to be different for your output, you can change it for each field by using the Extract action’s Format and Edit rules.
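
The standardized formats described above can be reproduced with a few lines of Python, which may help when deciding which Format and Edit rules to add afterwards. This sketch only mimics the documented output for dates, times and general amounts; it is not Azure's implementation, it does not cover the rounding of header Total amounts, and it assumes amounts that use a comma as thousand separator and a period as decimal character. All function names are hypothetical.

```python
from datetime import datetime
from decimal import Decimal


def normalize_date(text: str, day_first: bool) -> str:
    """Convert dd/mm/yyyy (European) or mm/dd/yyyy (US) to YYYY-MM-DD."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).strftime("%Y-%m-%d")


def normalize_time(text: str) -> str:
    """Convert a 12-hour time such as "07:45 PM" to HH:mm:ss."""
    return datetime.strptime(text, "%I:%M %p").strftime("%H:%M:%S")


def normalize_amount(text: str) -> str:
    """Strip currency and thousand separators; keep at least one decimal digit."""
    digits = "".join(ch for ch in text if ch.isdigit() or ch in ".-")
    value = format(Decimal(digits).normalize(), "f")  # drops trailing zeros
    return value if "." in value else value + ".0"


print(normalize_date("12/06/2023", day_first=True))
print(normalize_date("12/06/2023", day_first=False))
print(normalize_time("07:45 PM"))
print(normalize_amount("$1,234.50"))
```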

The Invoice model can return the following field values:

 

Name: Document Content
Type: String
Description: The complete extracted text without any formatting. This can be useful if you want to extract any other values that the model was not able to find. You would typically map it to your “Full Text” field, which you can then use as a source in your Extract rules.
Example Processed Value:

INVOICE
CONTOSO LTD.
Contoso Headquarters→INVOICE: INV-100
123 456th St→INVOICE DATE: 11/15/2019
New York, NY, 10001→DUE DATE: 12/15/2019
CUSTOMER NAME: MICROSOFT CORPORATION
(…)

Single Value Fields (see example invoice on the right):

Name | Type | Description | Standardized Output Format | Example Processed Value
Customer Name | String | Invoiced customer | | MICROSOFT CORPORATION
Customer Id | String | Customer reference ID | | CID-12345
Purchase Order | String | Purchase order reference number | | PO-3333
Invoice Id | String | ID for this specific invoice (often “Invoice Number”) | | INV-100
Invoice Date | Date | Date the invoice was issued | YYYY-MM-DD | 2019-11-15
Due Date | Date | Date payment for this invoice is due | YYYY-MM-DD | 2019-12-15
Vendor Name | String | Vendor name | | CONTOSO LTD.
Vendor Tax Id | String | The taxpayer number associated with the vendor | |
Vendor Address | String | Vendor mailing address | | 123 456th St New York, NY, 10001
Vendor Address Recipient | String | Name associated with the Vendor Address | | Contoso Headquarters
Customer Address | String | Mailing address for the Customer | | 123 Other St, Redmond WA, 98052
Customer Tax Id | String | The taxpayer number associated with the customer | |
Customer Address Recipient | String | Name associated with the Customer Address | | Microsoft Corp
Billing Address | String | Explicit billing address for the customer | | 123 Bill St, Redmond WA, 98052
Billing Address Recipient | String | Name associated with the Billing Address | | Microsoft Finance
Shipping Address | String | Explicit shipping address for the customer | | 123 Ship St, Redmond WA, 98052
Shipping Address Recipient | String | Name associated with the Shipping Address | | Microsoft Delivery
Payment Term | String | The terms of payment for the invoice | | 30 NET
Subtotal | Number | Subtotal field identified on this invoice | Integer | 100.0
Subtotal Currency Code | String | The currency code associated with the extracted subtotal amount | | USD
Subtotal Currency Symbol | String | The currency symbol associated with the extracted subtotal amount | | $
Total Tax | Number | Total tax field identified on this invoice | Integer | 10.0
Total Tax Currency Code | String | The currency code associated with the extracted total tax amount | | USD
Total Tax Currency Symbol | String | The currency symbol associated with the extracted total tax amount | | $
Invoice Total | Number (USD) | Total new charges associated with this invoice | Integer | 110.0
Invoice Total Currency Code | String | The currency code associated with the extracted invoice total amount | | USD
Invoice Total Currency Symbol | String | The currency symbol associated with the extracted invoice total amount | | $
Amount Due | Number (USD) | Total Amount Due to the vendor | Integer | 610.0
Amount Due Currency Code | String | The currency code associated with the extracted amount due | | USD
Amount Due Currency Symbol | String | The currency symbol associated with the extracted amount due | | $
Service Address | String | Explicit service address or property address for the customer | | 123 Service St, Redmond WA, 98052
Service Address Recipient | String | Name associated with the Service Address | | Microsoft Services
Remittance Address | String | Explicit remittance or payment address for the customer | | 123 Remit St New York, NY, 10001
Remittance Address Recipient | String | Name associated with the Remittance Address | | Contoso Billing
Service Start Date | Date | First date for the service period (for example, a utility bill service period) | YYYY-MM-DD | 2019-10-14
Service End Date | Date | End date for the service period (for example, a utility bill service period) | YYYY-MM-DD | 2019-11-14
Previous Unpaid Balance | Number | Explicit previously unpaid balance | Integer | 500.0
Previous Unpaid Balance Currency Code | String | The currency code associated with the extracted previous unpaid balance | | USD
Previous Unpaid Balance Currency Symbol | String | The currency symbol associated with the extracted previous unpaid balance | | $
Payment Details IBAN | String | Holds the IBAN Payment Option details | |
Payment Details SWIFT | String | Holds the SWIFT Payment Option details | |
Total Discount | Number | The total discount applied to an invoice | Integer |
Total Discount Currency Code | String | The currency code associated with the extracted total discount amount | | USD
Total Discount Balance Currency Symbol | String | The currency symbol associated with the extracted total discount amount | | $

Line items (see example invoice on the right):

 

In the table below, the values of the example line items are separated by semicolons.

Name | Type | Description | Example Text | Example Processed Value
Line Item Amount | Number | The amount of the line item | $60.00; $30.00; $10.00 | 60.0; 30.0; 10.0
Line Item Currency Code | String | The currency code associated with the extracted line item amount | | USD; USD; USD
Line Item Currency Symbol | String | The currency symbol associated with the extracted line item amount | | $; $; $
Line Item Description | String | The text description for the invoice line item | Consulting Services; Document Fee; Printing Fee | Consulting Services; Document Fee; Printing Fee
Line Item Quantity | Number | The quantity for this invoice line item | 2; 3; 10 | 2; 3; 10
Line Item Unit | String | The unit of the line item, e.g., kg, lb, etc. | hours; pages | hours; pages
Line Item Unit Price | Number | The net or gross price (depending on the gross invoice setting of the invoice) of one unit of this item | $30.00; $10.00; $1.00 | 30.0; 10.0; 1.0
Line Item Unit Price Currency Code | String | The currency code associated with the extracted line item unit price | | USD; USD; USD
Line Item Unit Price Currency Symbol | String | The currency symbol associated with the extracted line item unit price | | $; $; $
Line Item Product Code | String | Product code, product number, or SKU associated with the specific line item | A123; B456; C789 | A123; B456; C789
Line Item Date | Date | Date corresponding to each line item. Often it’s the date the line item was shipped | 3/4/2021; 3/5/2021; 2/6/2021 | 2021-04-03; 2021-05-03; 2021-06-02
Line Item Tax | Number | Tax associated with each line item. Possible values include tax amount and tax Y/N | |
Line Item Tax Rate | Number | Tax Rate associated with each line item | 10%; 5%; 20% | 10%; 5%; 20%
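
The multi-value line-item fields listed above line up by position: the first value of every field belongs to the first line item, and so on. The sketch below zips such value lists into CSV rows, similar in shape to the merged line-item field mentioned in the earlier note on the Set Field Value rule. It is a plain Python illustration with a hypothetical function name, not a description of how that rule works internally.

```python
import csv
import io


def merge_line_items(columns: list[list[str]]) -> str:
    """Zip per-field value lists into CSV text, one quoted row per line item."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in zip(*columns):
        writer.writerow(row)
    return buffer.getvalue()


# Example values from the line-item table above.
codes = ["A123", "B456", "C789"]
descriptions = ["Consulting Services", "Document Fee", "Printing Fee"]
quantities = ["2", "3", "10"]
unit_prices = ["30.0", "10.0", "1.0"]
amounts = ["60.0", "30.0", "10.0"]

merged = merge_line_items([codes, descriptions, quantities, unit_prices, amounts])
print(merged)
```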

The Receipt model can return the following field values:

Name: Document Content
Type: String
Description: The complete extracted text without any formatting. This can be useful if you want to extract any other values that the model was not able to find. You would typically map it to your “Full Text” field, which you can then use as a source in your Extract rules.
Example Processed Value:

INVOICE
CONTOSO LTD.
Contoso Headquarters→INVOICE: INV-100
123 456th St→INVOICE DATE: 11/15/2019
New York, NY, 10001→DUE DATE: 12/15/2019
CUSTOMER NAME: MICROSOFT CORPORATION
(…)

Thermal receipts (General, Meal, Credit Card, Gas, Parking):

Field | Type | Description | Example Value | Example Processed Value
Merchant Name | string | Name of the merchant issuing the receipt | Contoso | Contoso
Merchant Phone Number | phoneNumber | Listed phone number of merchant | 987-654-3210 | 987-654-3210
Merchant Address | address | Listed address of merchant | 123 Main St. Redmond WA 98052 | 123 Main St. Redmond WA 98052
Total | number | Full transaction total of receipt | $14.34 | 14.34
Transaction Date | date | Date the receipt was issued | June 06, 2019 | 2019-06-06
Transaction Time | time | Time the receipt was issued | 4:49 PM | 16:49:00
Subtotal | number | Subtotal of receipt, often before taxes are applied | $12.34 | 12.34
Total Tax | number | Tax on receipt, often sales tax or equivalent | $2.00 | 2.0
Tip | number | Tip included by buyer | $1.00 | 1.0
Line Item Total Price | number | Total price of line item | 7.20 €; 7.80 €; 26.50 €; 23.90 € | 7.2; 7.8; 26.5; 23.9
Line Item Description | string | Item description | Surface Pro 6; Wireless Mouse Model 2 | Surface Pro 6; Wireless Mouse Model 2
Line Item Quantity | number | Quantity of each item | 1; 2; 1; 1 | 1; 2; 1; 1
Line Item Price | number | Individual price of each item unit | $1.00; $0.56; $3.99 | 1.0; 0.56; 3.99

Hotel receipts:

Field | Type | Description | Example Value | Example Processed Value
Merchant Name | string | Name of the merchant issuing the receipt | Contoso | Contoso
Merchant Phone Number | phoneNumber | Listed phone number of merchant | 987-654-3210 | 987-654-3210
Merchant Address | address | Listed address of merchant | 123 Main St. Redmond WA 98052 | 123 Main St. Redmond WA 98052
Total | number | Full transaction total of receipt | $14.34 | 14.34
Arrival Date | date | Date of arrival | 27Mar21 | 2021-03-27
Departure Date | date | Date of departure | 28Mar21 | 2021-03-28
Currency | string | Currency unit of receipt amounts (ISO 4217), or ‘MIXED’ if multiple values are found | | USD; EUR; MIXED
Merchant Aliases | string | Alternative name of merchant | Contoso (R) | Contoso
Line Item Total Price | number | Total price of line item | 7.20 €; 7.80 €; 26.50 €; 23.90 € | 7.2; 7.8; 26.5; 23.9
Line Item Date | date | Item date | 27Mar21 | 2021-03-27
Line Item Description | string | Item description | Room Charge; BBQ Hamburger; Salted Almonds | Room Charge; BBQ Hamburger; Salted Almonds
Line Item Category | string | Item category | Room; Room Service; Mini Bar | Room; Room Service; Mini Bar

01 – Test button / Auto Test Mode: press the Test button to show the extracted values of your invoice in the Result window. The result also shows whether OCR was applied, the confidence level of each field in the “Confidence” column and, above the result table, the confidence level of the Azure Form Recognizer’s “Document Content” field value.

Auto test:  press the drop-down arrow next to the Test button to enable Auto Testing. With this you can automatically test each document as you browse through them using the blue document navigation buttons.

02 – OCR (Yes/No): there are many types of PDFs. The most common PDF types used with MetaServer are text-based PDFs and image-based PDFs.

Electronic / Text-based PDFs are generated by a computer program like MS Word, Invoice / Report creation software, etc.  Text-based PDFs already contain computer text represented by fonts. This text can directly be extracted without any OCR processing.

Scanned / Image-based PDFs contain an image of each of the pages of the document and require OCR (Optical Character Recognition) to convert the images to computer text.

The Azure Form Recognizer automatically switches between electronic text extraction, in case of text-based PDFs, and OCR extraction, in case of a scanned image.

This way, your Microsoft Azure Form Recognizer resource returns electronic text with 100% confidence.

If OCR is applied, the OCR value will indicate Yes.

If the PDF is 100% electronic, then the OCR value will indicate No.

If the PDF is partially electronic but also contains some image information (like logos), then the OCR value will indicate Mixed.

The example below shows a 100% electronic document, as indicated by “No” for OCR.

03 – Confidence: this signifies the confidence level of the Azure Form Recognizer’s “Document Content” field value, which is currently mapped to the “Full Text” value in our example.

NOTE: The confidence level of each individual field can be found in the “Confidence” column of your test results.

TIP: you can copy the current settings and paste them in another setup window of the same type. Do this by pressing the Settings button in the bottom left of the Setup window and by selecting Copy. Then open another setup window of the same type and select Paste.

You can also run the Azure Form Recognizer engine on-premises using containers through the Docker engine.

Running the engine on-premises can be useful for meeting security and data governance requirements.

You can find a detailed guide discussing the prerequisites and how to set up your AFR container here.
