
120-140 Extract – Extract Text (Azure Computer Vision)

With MetaServer’s Extract Text (Azure Computer Vision) rule, you can extract handwritten text, degraded text and text from deformed images (like those produced with a smartphone) from your imported documents and store the extracted data in fields. The results are impressive, as the examples below demonstrate.

You can specify the pages or zones you want to extract information from.

The engine can also read 122 different languages and detects these languages automatically, even in the same text line. Please refer to Azure Computer Vision’s documentation for a complete list of supported languages (“Read” column).

For example (results are below each sample):

A blurred receipt
A passport with security background
A report with noisy background and handwriting

Warped text on a bottle
A faded receipt

Handwriting and red stamps on a summoning document
Faded text on a bank transfer

Worn out transport pass with faint text
A letter with a collection of logos

License plate
Handwritten grocery list
Korean receipt
Faded, Japanese receipt
Chinese receipt

Automatic language detection even on the same line (e.g. English and Arabic)

Use the high quality ICR result to add a searchable full text layer to your PDF file and even search for handwritten text.

NOTE: To use the Azure Computer Vision engine, you also need to sign up for the Azure service itself. Paid plans are available starting from $1 per 1,000 pages (S1 plan). There is also a free, 1-year plan that lets you test the engine on up to 2,500 pages per month (F0 plan).

IMPORTANT: The processing speed in the free plan (F0 plan) is limited to only 1 call per 2 seconds. For the paid plan (S1 plan), the processing speed is 10 calls per second, which is 20 times faster than the free plan.

For more information on how to apply for a key, please refer to the instructions below.

NOTE: For more technical information about how Microsoft’s Azure Computer Vision engine works (API, OCR, etc.) and how Microsoft handles data privacy and security, please refer to the Microsoft Azure Computer Vision documentation.

Extract Text rules are defined in a MetaServer Extract or Separate Document / Process Page action.

To add this rule, press the Add button and select Extract -> Text (Azure Computer Vision).


How to sign up for a key

1) Log in to the Azure portal: https://portal.azure.com/#home

NOTE: If you don’t have a Microsoft or Azure account yet, you can sign up for free. You can find more info here:
https://azure.microsoft.com/en-us/free/

2) Create a resource:
3) In the “AI + Machine Learning” section, create a “Computer Vision” Resource:
NOTE: If you don’t have an Azure account, you can start one for free for 12 months. Just select “Start Free Trial” and follow the instructions.

4) Select the server of your region, give your instance a unique name (this will be used in the endpoint URL) and select the pricing tier.

You can choose between 2 pricing tiers:

F0 plan (Free): free for 12 months, 5,000 calls/month = ~2,500 pages/month (each page uses 1 read call and 1 get call)

– The page file size must be less than 4 MB.

– With the 12-month free plan, the Azure server can only handle 1 call every 2 seconds. Because of this speed limit, we recommend running extraction and separation on only one core with the free plan.

– On the 28th of every month, the counter is reset to 5,000 calls (~2,500 pages).

– If you run out of free calls before the 28th of the month, MetaServer moves documents to the Error tab and reports that you must either wait until the 28th of the month to continue processing documents or switch to a paid plan.

– Documents that ended up in the Error tab can be reprocessed with a paid plan or retried when the free counter is reset on the 28th of the month.

– One month before the 12-month period ends, you will receive an email from Microsoft stating that the 12-month free service is about to expire and will stop working.

You then have 30 days to switch from your free plan to a “pay-as-you-go” S1 plan (see below) or to stop using the service.

The prices are available here:
https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/
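To get a feel for the free plan's limits, here is a minimal Python sketch of the arithmetic described above. It only uses the figures from this page (2 calls per page, 5,000 calls per month, 1 call every 2 seconds for F0 and 10 calls per second for S1). The function names are made up for this illustration, and real processing takes longer than the call limit alone suggests.

```python
CALLS_PER_PAGE = 2  # 1 read call + 1 get call per page, as stated above


def pages_for_calls(calls: int) -> int:
    """Number of pages that fit in a given call budget."""
    return calls // CALLS_PER_PAGE


def min_processing_seconds(pages: int, calls_per_second: float) -> float:
    """Lower bound on processing time imposed by the call rate limit alone."""
    return pages * CALLS_PER_PAGE / calls_per_second


print(pages_for_calls(5000))               # 2500 pages per month on F0
print(min_processing_seconds(2500, 0.5))   # 10000.0 seconds at 1 call per 2 seconds
print(min_processing_seconds(2500, 10))    # 500.0 seconds at 10 calls per second
```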

S1 plan: Pay-as-you-go (more than 2,500 pages/month, $1 per 1,000 pages)

– The page file size can be up to 50 MB.

– The Azure server can handle 10 calls per second. You can run Extraction and Separation on multiple cores with a paid plan.

– Microsoft only invoices READ calls. GET calls are free.

– For high volumes >1M pages per month, find special pricing here.

– You can pay the subscription with a credit card or request to pay by check or wire transfer here:
https://docs.microsoft.com/en-us/azure/cost-management-billing/manage/pay-by-invoice
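As a rough budgeting aid, the sketch below estimates a monthly S1 bill from the $1 per 1,000 pages figure quoted above. It assumes one invoiced READ call per page and a flat rate, and it ignores volume discounts, taxes and currency. The function name is hypothetical, and Microsoft's pricing page remains the authoritative source.

```python
def estimate_monthly_cost(pages: int, price_per_1000_pages: float = 1.0) -> float:
    """Estimated S1 cost in USD, assuming one invoiced READ call per page."""
    return pages / 1000 * price_per_1000_pages


print(estimate_monthly_cost(40000))  # 40.0
print(estimate_monthly_cost(2500))   # 2.5
```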

5) Press “Review + create” to check your Resource details. If they are correct, press the “Create” button.
6) You can find your resource’s Keys and Endpoint in your Microsoft Azure Dashboard. The “Go to Azure Portal” button will open the portal in your default browser.

The example below shows a resource called “MetaServer”.

First, add a description to your rule. Then, select a field to hold the extracted data. In this case, we select the field “Full Text”.

01 – Key, Endpoint, Location: enter your resource key and endpoint, and select your location using the drop-down arrow. You can find this information in your Microsoft Azure Dashboard. The “Go to Azure Portal” button will open the portal in your default browser.

The example below shows a resource called “MetaServer”.

In your portal, you can also check your remaining calls. This is useful to make sure you’re not exceeding the limits of your current Microsoft Azure Computer Vision pricing tier.

If you haven’t signed up for a key yet, please refer to the instructions above.
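If you want to verify your key and endpoint outside MetaServer, the sketch below shows how a request to the Azure Read API (v3.2) is typically put together in Python. It only builds the request and does not send it. The endpoint name and key are placeholders, the helper name is made up, and how MetaServer itself calls the service is not documented here.

```python
import urllib.request


def build_read_request(endpoint: str, key: str, image_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits one image to the Read API."""
    url = endpoint.rstrip("/") + "/vision/v3.2/read/analyze"
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,            # your resource key
            "Content-Type": "application/octet-stream",  # raw image or PDF bytes
        },
    )


# Placeholder values: replace them with the Key and Endpoint from your Azure portal.
request = build_read_request(
    "https://your-resource-name.cognitiveservices.azure.com/", "<your-key>", b"<file bytes>"
)
print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` should return an `Operation-Location` header that points to the URL where the result can be fetched with a GET call. This is the read call and get call pair that counts as 2 calls per page.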

02 – Proxy: if you want to connect to a proxy server, press the Proxy button to open the setup window.

1) Type, Host, User name, Password: press the drop-down arrow to choose your proxy protocol and enter the connection settings to your proxy server. When in doubt, contact your IT department.

2) Port: enter the specified port of your Proxy server. When in doubt, contact your IT department.
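To check your proxy settings independently of MetaServer, you could use a small Python sketch like the one below. It combines the same four values (host, port, user name, password) into a proxy URL and creates a standard-library opener that routes requests through it. The host name and credentials are invented examples, and only an HTTP-type proxy is covered.

```python
import urllib.request
from urllib.parse import quote


def build_proxy_opener(host, port, user=None, password=None, scheme="http"):
    """Return (opener, proxy_url) for a proxy server with optional credentials."""
    credentials = ""
    if user:
        # Percent-encode special characters so they don't break the URL.
        credentials = quote(user, safe="") + ":" + quote(password or "", safe="") + "@"
    proxy_url = f"{scheme}://{credentials}{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler), proxy_url


opener, proxy_url = build_proxy_opener("proxy.example.local", 8080, "scanuser", "p@ss")
print(proxy_url)  # http://scanuser:p%40ss@proxy.example.local:8080
```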

03 – Model / Preview version: The “General Available” model uses ACV’s official, general model.

If you select the “Preview” model, you can enter a specific “Preview version”. By default, this is set to preview version “2022-01-30-preview”. If you want to use another preview version, just enter the correct name of the preview model in the field.

Release info and preview model names are documented here:
https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/cognitive-services/Computer-vision/whats-new.md

All the model versions are listed here:
https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/cognitive-services/Computer-vision/Vision-API-How-to-Topics/call-read-api.md#determine-how-to-process-the-data-optional
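The Read API selects a model through the `model-version` query parameter described in the documentation linked above. The sketch below shows how such a URL could be composed. It assumes that leaving the parameter out corresponds to the general model and that a preview name is passed as-is. How MetaServer maps its Model setting to the API internally is an assumption.

```python
from urllib.parse import urlencode


def read_url(endpoint, model_version=None):
    """Compose the Read API URL, optionally pinning a specific model version."""
    url = endpoint.rstrip("/") + "/vision/v3.2/read/analyze"
    if model_version:
        url += "?" + urlencode({"model-version": model_version})
    return url


base = "https://your-resource-name.cognitiveservices.azure.com"  # placeholder endpoint
print(read_url(base))                        # general model
print(read_url(base, "2022-01-30-preview"))  # preview model named above
```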

04 – Get result after [x] seconds: by default, this time is set to 6 seconds.

We recommend only changing this to a lower number if you have signed up for Microsoft Azure Computer Vision’s paid plan. Because the Free plan is limited to 1 call every 3 seconds, we recommend running extraction with the Free plan on a single core and keeping the “Get result after [x] seconds” setting at 6 seconds.
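The waiting time exists because the Read API works asynchronously: the page is submitted first and the result is fetched with a separate call afterwards. The Python sketch below illustrates that pattern with an initial wait of 6 seconds. The retry interval, the maximum number of tries and the `fetch_status` callback are assumptions for this example and do not describe MetaServer's internal behavior.

```python
import time


def wait_for_result(fetch_status, first_wait=6.0, retry_wait=2.0, max_tries=10,
                    sleep=time.sleep):
    """Wait, then poll fetch_status() until the Read operation has finished."""
    sleep(first_wait)  # give the service time to process the page
    for _ in range(max_tries):
        result = fetch_status()  # one GET call per attempt
        if result.get("status") in ("succeeded", "failed"):
            return result
        sleep(retry_wait)  # still "notStarted" or "running"
    raise TimeoutError("Read operation did not finish in time")


# Demonstration with a fake service that needs two GET calls and no real waiting.
answers = iter([{"status": "running"}, {"status": "succeeded"}])
print(wait_for_result(lambda: next(answers), sleep=lambda seconds: None))
```

A first wait that is too short mainly produces extra GET calls, which count toward the call limits of your plan.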

05 – Log: enable this option to create a log file each time the Microsoft Azure Computer Vision engine is called. This option is typically used when diagnosing issues with Microsoft Azure.

On the client side, you can find the log information after running a Test in Extraction in the following folder:
C:\ProgramData\CaptureBites\Programs\Admin\Data\Log

On the server side, after processing some documents, you can find the log information in the following folder:
C:\ProgramData\CaptureBites\Programs\MetaServer\Data\Log

06 – Apply: choose when to apply the rule. The default option is “Always”, which means that the rule is always applied. Press the drop-down arrow to see all other available conditions.

A good example of conditional extraction is when you first try to extract a value using the Extract Text rule (= standard OCR engine), but it doesn’t return a valid result. Only then do you let the Extract Text (Azure Computer Vision) rule extract the value.

This speeds up the extraction process and only uses calls to your Microsoft Azure Computer Vision resource when your first search didn’t return a good result.

After selecting your condition, for example, “If field value is blank”, press the “…” button next to the drop-down arrow to open that condition’s setup window.

1) If value of field: press the drop-down arrow to select the field value that needs to be evaluated.

2) is equal to / is not equal to / is greater than /…: enter the other value your field value needs to be compared with. You can also press the drop-down button to select different system and index values to compose your value.
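The conditional extraction described above boils down to a simple fallback pattern. Here is a minimal Python sketch of the “If field value is blank” case. The two extraction functions are stand-ins invented for this example and are not MetaServer functions.

```python
def extract_with_fallback(page, standard_ocr, azure_ocr):
    """Try the standard engine first and call Azure only if the value is blank.

    Returns (value, azure_was_called).
    """
    value = standard_ocr(page)
    if value.strip():
        return value, False       # good enough: no Azure call is used
    return azure_ocr(page), True  # blank result: fall back to Azure


# Stand-in engines for the demonstration.
def standard(page):
    return "" if page == "handwritten.tif" else "INV-1001"


def azure(page):
    return "INV-2002"


print(extract_with_fallback("typed.tif", standard, azure))        # ('INV-1001', False)
print(extract_with_fallback("handwritten.tif", standard, azure))  # ('INV-2002', True)
```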

07 – Page: set the page number to where the information is located. The default is page 1.

For example:
– Enter 1 for the 1st page
– Enter -1 for the last page
– Enter 1-3 to extract from page 1 to page 3.
– Leave this empty if you want to extract all pages (same as 1--1, that is, from page 1 to the last page)
– Etc.

If a document does not contain a specific page, it is ignored. For example, extracting page “2,3” on a 2-page document will only extract page 2.
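The page rules above can be sketched in Python as follows. This is an interpretation of the described behavior, not MetaServer's actual parser: negative numbers count from the end, a range uses a hyphen, entries are separated by commas, and pages that don't exist in the document are skipped.

```python
import re

_ENTRY = re.compile(r"^(-?\d+)(?:-(-?\d+))?$")  # a single page or a "first-last" range


def resolve_pages(spec: str, page_count: int) -> list[int]:
    """Translate a page setting into the page numbers to extract."""
    def absolute(n: int) -> int:
        return page_count + n + 1 if n < 0 else n  # -1 is the last page

    if not spec.strip():
        return list(range(1, page_count + 1))  # empty means all pages

    pages = []
    for part in spec.split(","):
        match = _ENTRY.match(part.strip())
        if not match:
            raise ValueError(f"Unrecognized page entry: {part!r}")
        first = absolute(int(match.group(1)))
        last = absolute(int(match.group(2))) if match.group(2) else first
        for page in range(first, last + 1):
            if 1 <= page <= page_count and page not in pages:
                pages.append(page)  # pages outside the document are ignored
    return pages


print(resolve_pages("1", 5))    # [1]
print(resolve_pages("-1", 5))   # [5]
print(resolve_pages("1-3", 5))  # [1, 2, 3]
print(resolve_pages("", 3))     # [1, 2, 3]
print(resolve_pages("2,3", 2))  # [2]
```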

Bottom right alignment of a zone on a portrait-oriented image
Bottom right alignment of a zone on a landscape-oriented image

01 – Deskew & Rotate: if your documents are skewed or rotated incorrectly, you can enable the Deskew and/or Rotate option to optimize Text extraction. It will also result in a corrected version of the page(s).

02 – Color Dropout: if your documents contain a lot of colored tables or lines running through the values you want to extract, you can enable the color dropout feature, which allows you to select up to 3 dropout colors.

Press the setup button to specify which colors to drop out.

Each selected color has its own tolerance. With the Test button, you can see the effect of dropping out the selected colors in the preview window on the right.

To reset all dropout colors to white (off), you can use the “Reset All Colors” button.

NOTE: The filtered image is only used temporarily to improve text extraction. The processed image keeps all the original colors.

01 – Zone: in the Extract Text (Azure Computer Vision) setup window’s toolbar, you will find two Zone tools to specify your extraction zone:

Lasso / Full Page / Top Half / Bottom Half: you can choose to extract the entire page, half of the page or you can specify a custom extraction zone with the Lasso option. You can do this by drawing a zone with the select tool. This is depicted as the orange rectangle in the viewer’s toolbar.

02 – Align: when documents in your workflow are of varying sizes or mixed orientations (portrait and landscape mixed together), you can align your zone in relation to any of the 4 corners of the image: the top left or right corner or the bottom left or right corner. That way, the zone will be positioned correctly on all sizes and orientations.
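To illustrate why corner alignment keeps a zone in the right place, the Python sketch below computes a zone's position from its distance to a chosen corner. The pixel sizes, offsets and function name are invented for this example and don't reflect how MetaServer stores zones.

```python
CORNERS = {"top-left", "top-right", "bottom-left", "bottom-right"}


def place_zone(image_w, image_h, zone_w, zone_h, offset_x, offset_y, corner="top-left"):
    """Return (left, top, right, bottom) of a zone anchored to one image corner."""
    if corner not in CORNERS:
        raise ValueError(f"Unknown corner: {corner}")
    left = offset_x if "left" in corner else image_w - offset_x - zone_w
    top = offset_y if "top" in corner else image_h - offset_y - zone_h
    return left, top, left + zone_w, top + zone_h


# The same zone, 100 pixels away from the bottom right corner, on two orientations.
print(place_zone(2480, 3508, 600, 200, 100, 100, "bottom-right"))  # portrait
print(place_zone(3508, 2480, 600, 200, 100, 100, "bottom-right"))  # landscape
```

In both cases the zone stays at the same distance from the bottom right corner, even though its absolute coordinates differ.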

03 – Extract: press the drop-down arrow to specify if the extraction zone on your document includes printed and/or handwritten text:

Printed and handwritten text: select this option if your extraction zone contains both printed and handwritten text.

Printed text: select this option if your extraction zone only contains printed text.

Handwritten text: select this option if your extraction zone only contains handwritten text.

TIP: while drawing, you can see the zone’s dimensions in cm/inches (depending on your regional settings) and in pixels. Below that you’ll find the page’s resolution in DPI.

In general, we recommend you scan your documents with a resolution of 300 DPI for the best OCR result and compact file size.

04 – Confidence: characters with a confidence level lower than the set confidence level are ignored and not returned in the result. If set to 0, all characters are accepted.

This setting is retained for legacy reasons. We recommend using the new “Check if confidence is lower than” option in the validate rules instead.

To help you in defining the correct confidence level, you can check the confidence level of each word group in your test result using the “Show info” option. You can also see the highest and lowest confidence level displayed above the test result.
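The effect of the confidence setting can be pictured with a short Python sketch. It assumes each character comes with a confidence value on a 0 to 100 scale. The sample values are made up, and the exact scale MetaServer uses is not stated here.

```python
def filter_by_confidence(chars, min_confidence):
    """Keep only the characters whose confidence reaches the set level."""
    return "".join(ch for ch, conf in chars if conf >= min_confidence)


# (character, confidence) pairs: invented sample data.
sample = [("A", 95), ("8", 40), ("C", 88)]

print(filter_by_confidence(sample, 50))  # AC
print(filter_by_confidence(sample, 0))   # A8C (0 accepts everything)
print(min(conf for _, conf in sample), max(conf for _, conf in sample))  # 40 95
```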

05 – Font size: here you can choose to set up a range of acceptable font sizes to only return lines or words containing at least one character within the specified range. You can even choose to only keep the matching characters.

To help you in defining the correct font sizes, you can check the font size of each word group in your test result using the “Show info” option.

You can also see the font size of the smallest and largest character displayed above the test result.
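The font size filter can be sketched in Python like this. A word is kept when at least one of its characters falls within the range, and optionally only the matching characters are kept. The data layout (lists of character and font size pairs) and the sample values are assumptions for this example.

```python
def filter_by_font_size(words, min_size, max_size, keep_only_matching=False):
    """Return the words that contain at least one character within the size range."""
    kept = []
    for word in words:  # each word is a list of (character, font_size) pairs
        in_range = [(ch, size) for ch, size in word if min_size <= size <= max_size]
        if not in_range:
            continue  # no character in range: drop the whole word
        chosen = in_range if keep_only_matching else word
        kept.append("".join(ch for ch, _ in chosen))
    return kept


words = [
    [("T", 24), ("o", 24)],  # entirely in range
    [("x", 8)],              # entirely out of range
    [("A", 24), ("b", 8)],   # mixed sizes
]
print(filter_by_font_size(words, 20, 30))                           # ['To', 'Ab']
print(filter_by_font_size(words, 20, 30, keep_only_matching=True))  # ['To', 'A']
```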

06 – Space length: if the result shows too many spaces, like spaces between individual characters, increase this value. If spaces are missing and words start sticking together, decrease the value. The value is a percentage of the font size of the character following the space.

PDF with text: here, you can adjust the length for spaces coming from electronic / text-based PDFs with an existing text layer.

Scanned image (OCR): here, you can choose to let the OCR engine determine the space length (Automatic) from scanned / image-based PDFs and image files (TIF, JPG, etc) or you can adjust the space length using a custom value.

07 – Tabs: by default, lines are segmented in multiple word groups that are separated by tabs. If you want no tabs at all, press the drop-down button to select the “Remove” option and all the words will be grouped as 1 single word group for each line.

08 – Tab length: define the length of long spaces to convert them to tab characters. If you only want to convert very long spaces to a tab, increase this value. Spaces shorter than the set value will be converted to a single space. The value is a percentage of the font size of the character following the tab character.
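The Space length and Tab length settings both compare the gap between two characters with the font size of the following character. The Python sketch below shows one way to read those rules. The threshold values in the example are made up and are not MetaServer's defaults.

```python
def separator_for_gap(gap, next_char_font_size, space_length_pct, tab_length_pct):
    """Decide what to insert between two characters based on the gap width.

    gap and next_char_font_size must use the same unit (for example, points).
    """
    ratio = 100.0 * gap / next_char_font_size  # gap as a percentage of the font size
    if ratio >= tab_length_pct:
        return "\t"  # a very long space becomes a tab
    if ratio >= space_length_pct:
        return " "   # a normal word gap becomes a single space
    return ""        # too narrow: the characters belong to the same word


# Font size 10, space length 30 %, tab length 200 % (example values).
for gap in (2, 5, 25):
    print(gap, repr(separator_for_gap(gap, 10, 30, 200)))
```

Raising the space length makes the narrowest gaps disappear, which matches the advice above for results with spaces between individual characters.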

09 – Text based PDF: text-based PDFs, also known as electronic PDFs, contain computer text. They are typically generated by text-editing programs like MS Word or Excel, or by invoice or report creation software. By default, we directly extract the original electronic text and don’t need to perform any OCR. This results in a very fast and accurate extraction of the text.

Apply OCR if PDF contains images: some electronic PDFs contain one or more small images that have logos or small header or footer text. These elements are seen as images, not text. If you want to extract the text in these images, enable this option so it automatically converts the full page to a 300 DPI image. It will then apply OCR to extract all the text.

10 – Image based PDF: image-based PDFs, also known as scanned PDFs, are typically generated with a document scanner. They contain an image of each page of that document. By default, we apply OCR to these pages, so the images are converted to text.

Use searchable text layer if present: some scanned PDFs contain an invisible, searchable text layer. If you want to extract this existing searchable text layer instead of applying OCR, enable this option.

01 – Convert page(s) to searchable PDF: enable this option if you want to save the extracted text as a searchable text layer in the processed PDF. It will only do this for the page(s) set in the Page(s) field. Leaving the Page(s) field empty will convert all pages of the documents.

As a result, you will be able to search handwritten, Arabic, Cyrillic or low-quality text in your exported PDF:

This high-quality text layer can also be used during Validation with the Select text tool. To do this, please make sure you also enable the “Use searchable text layer if present” option in the Select text tool setup:

02 – Test button / Auto Test Mode: press the Test button to show the extracted text of the current page in the Result window and get additional information about the result such as if OCR was applied or not, confidence and font size.
Auto test: press the drop-down arrow next to the Test button to enable Auto Testing. With this, you can automatically test each document as you browse through them using the blue document navigation buttons.

03 – OCR (Yes/No): there are many types of PDFs. The most common PDF types used with MetaServer are text-based PDFs and image-based PDFs.

Electronic / Text-based PDFs are generated by a computer program like MS Word, invoice / report creation software, etc. Text-based PDFs already contain computer text represented by fonts. This text can be extracted directly without any OCR processing.

Scanned / Image-based PDFs contain an image of each of the pages of the document and require OCR (Optical Character Recognition) to convert the images to computer text.

MetaServer automatically switches between electronic text extraction, in the case of text-based PDFs, and OCR extraction, in the case of scanned images.

This way, your Microsoft Azure Computer Vision resource is only called when OCR is required. This saves processing time and calls.

If OCR is applied, the OCR value will indicate Yes.
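The decision between direct text extraction and OCR can be summarized in a few lines of Python. This sketch is an interpretation of the options described on this page (“Apply OCR if PDF contains images” and “Use searchable text layer if present”). The parameter names are invented, and the real detection logic inside MetaServer may differ.

```python
def needs_ocr(kind, contains_images=False, has_text_layer=False,
              ocr_if_images=False, use_text_layer=False):
    """Return True when the page has to be sent to the OCR engine.

    kind is "text_pdf", "image_pdf" or "image" (TIF, JPG, etc.).
    """
    if kind == "text_pdf":
        # Electronic text is read directly unless OCR is requested for embedded images.
        return ocr_if_images and contains_images
    if kind == "image_pdf":
        # Scanned pages need OCR unless an existing text layer may be reused.
        return not (use_text_layer and has_text_layer)
    return True  # plain image files always need OCR


print(needs_ocr("text_pdf"))                                             # False
print(needs_ocr("text_pdf", contains_images=True, ocr_if_images=True))   # True
print(needs_ocr("image_pdf"))                                            # True
print(needs_ocr("image_pdf", has_text_layer=True, use_text_layer=True))  # False
```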

If you want to see the text-based PDF detection in action, test the following documents:

C:\META-DEMO\MFP\CMR\CMR-01.pdf (image-based PDF)
C:\META-DEMO\MFP\CMR\CMR-02e.pdf (text-based PDF)

No OCR is applied to the text-based PDF, and its exact text is shown in an instant. Extracting a text-based PDF is much faster than extracting an image-based PDF because the latter needs to have OCR applied.

The screenshot below shows the result of the text-based “CMR-02e.pdf”:

Even though a line goes straight through some of the text, the text extraction is still perfect because the line is just an object completely separate from the text. The OCR value in the Result panel indicates No.

04 – Confidence: this signifies the least confident character’s confidence level in the extracted text zone.

If you set the confidence level higher than this level, characters with a confidence level below the set level will be filtered out of the result.

05 – Smallest / Largest character: this signifies the detected font size of the smallest and largest character in the extracted text zone. Use this as guidance to configure the option to only keep text with a specific font size.

06 – Find: you can directly search for words in your test result by typing them in the “Find” field.

TIP: you can copy the current settings and paste them in another setup window of the same type. Do this by pressing the Settings button in the bottom left of the Setup window and by selecting Copy. Then open another setup window of the same type and select Paste.

You can also run the Azure Computer Vision engine on-premises using containers through the Docker engine.

Running the engine on-premises can be useful for security and data governance requirements.

You can find a detailed guide on the prerequisites and how to set up your ACV container here.
