160-020 MetaServer Convert – Convert to Searchable PDF
IMPORTANT: To enable the searchable PDF action, you first need to install the MetaServer Searchable PDF Module.
With MetaServer’s Convert to Searchable PDF action, you can convert image-based (scanned) PDF, TIF, JPG, PNG, Word and Excel files to searchable PDF files.
It has the unique capability to make a PDF partially searchable to reduce processing time. For example, if you only want to make the first 3 pages searchable, specify pages “1-3” in the settings. More about this later.
To output Searchable PDF:
Step 1: add the Convert to Searchable PDF action just before the Export action(s) used to output the Searchable PDF.
Step 2: in your Export action(s), select “Processed PDF” as File source. More about that later.
To add a Convert to Searchable PDF action, select the action after which you want to insert the Convert to Searchable PDF action and press Add -> Convert -> to Searchable PDF. The Setup window will automatically open.
You can also open the setup window of an existing Convert to Searchable PDF action by double-clicking the action or by pressing the setup button on the right side of the action or in the ribbon, as shown below.
In our example, we will make use of the “CB – DPE” workflow. This workflow is automatically installed with CaptureBites MetaServer.
01 – Conversion cores: by default, the Convert to Searchable PDF action runs 4 converter threads. You can decrease the number of threads to give other actions higher priority.
02 – Page(s): the Convert to Searchable PDF action can make a PDF partially searchable. You can specify page numbers or page ranges separated by commas. Negative numbers count from the end of the document, so -1 is the last page. To convert all pages, leave the field empty.
1-5 = convert the 1st page to the 5th page
1, 3, -1 = convert the 1st, the 3rd and the last page
-1 = convert the last page
2--1 = convert the 2nd page to the last page
1, 3-5, -2 = convert the 1st page, the 3rd to the 5th page and the page before the last.
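As an illustration of how such a page specification can be read, the following Python sketch expands it into page numbers. It is not MetaServer's own code: the function name resolve_pages and the choice to silently ignore pages outside the document are assumptions made for this example.

```python
import re


def resolve_pages(spec: str, page_count: int) -> list[int]:
    """Expand a page specification into sorted, 1-based page numbers.

    Hypothetical helper for illustration only. Negative numbers count
    from the end (-1 is the last page). An empty specification selects
    every page.
    """
    if not spec.strip():
        return list(range(1, page_count + 1))

    def to_page(token: str) -> int:
        n = int(token)
        # -1 is the last page, -2 the page before it, and so on.
        return page_count + 1 + n if n < 0 else n

    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # A range is "number - number". A leading minus sign belongs to
        # the number itself, so "-1" alone is a single page, not a range.
        match = re.fullmatch(r"(-?\d+)\s*-\s*(-?\d+)", part)
        if match:
            first, last = to_page(match.group(1)), to_page(match.group(2))
        else:
            first = last = to_page(part)
        for page in range(first, last + 1):
            # Assumption: pages outside the document are ignored.
            if 1 <= page <= page_count:
                selected.add(page)
    return sorted(selected)


print(resolve_pages("1, 3-5, -2", 10))  # [1, 3, 4, 5, 9]
```

For a 10-page document, the specification “1, 3-5, -2” resolves to pages 1, 3, 4, 5 and 9.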
03 – Extraction engine: if you have a license for the Extract Text (Azure Computer Vision) module, you can choose to convert your documents using the Azure Computer Vision engine. This also makes handwritten and degraded machine-printed text searchable.
If you don’t have a license for the Extract Text (Azure Computer Vision) module, and your documents are scanned in high quality (300 DPI, color) without any handwritten data (or it’s not important that the handwritten data is converted to searchable text), the free, default Tesseract engine is also a good option.
01 – Key, Endpoint, Location: enter your resource key and endpoint, and select your location using the drop-down arrow. You can find this information in your Microsoft Azure Dashboard. The “Go to Azure Portal” button opens the portal in your default browser.
The example below shows a resource called “MetaServer”.
In your portal, you can also check your remaining calls. This is useful to verify that you are not exceeding the limits of your current Microsoft Azure Computer Vision pricing tier.
If you haven’t signed up for an Azure Computer Vision key yet, please refer to the following instructions.
02 – Model / Preview version: The “General Available” model uses ACV’s official, general model.
If you select the “Preview” model, you can enter a specific “Preview version”. By default, this is set to preview version “2022-01-30-preview”. If you want to use another preview version, just enter the correct name of the preview model in the field.
Release info and preview model names are documented here:
All the model versions are listed here:
03 – Get result after [x] seconds: by default, this time is set to 6 seconds.
We recommend only changing this to a lower number if you have signed up for Microsoft Azure Computer Vision’s paid plan. Because the Free plan is limited to 1 call every 3 seconds, we recommend running extraction with a Free plan on a single core and keeping “Get result after [x] seconds” at 6 seconds.
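One possible way to reason about this recommendation is sketched below in Python. The sketch rests on assumptions that are not stated in this guide: that each page costs two calls (one to submit the page and one to get the result), that every core handles one page at a time, and that the waiting time dominates the processing time. The function name minimum_wait_seconds is hypothetical.

```python
def minimum_wait_seconds(cores: int,
                         seconds_per_call: float = 3.0,
                         calls_per_page: int = 2) -> float:
    """Estimate the shortest wait that stays within a call-rate limit.

    Assumptions (not taken from the product documentation):
    - each page costs `calls_per_page` calls (submit + get result);
    - each core processes one page per waiting period;
    - the plan allows one call every `seconds_per_call` seconds.
    """
    return cores * calls_per_page * seconds_per_call


print(minimum_wait_seconds(cores=1))  # 6.0
print(minimum_wait_seconds(cores=4))  # 24.0
```

Under these assumptions, a single core with a 6-second wait matches the Free plan limit of 1 call every 3 seconds, while 4 cores would need a much longer wait to stay within the same limit.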
04 – Log: enable this option to create a log file each time the Microsoft Azure Computer Vision engine is called. This option is typically used when diagnosing issues with Microsoft Azure.
On the client side, you can find the log information after running a Test in Extraction in the following folder:
On the server side, after processing some documents, you can find the log information in the following folder:
05 – Deskew & Rotate: if your documents are skewed or rotated incorrectly, you can enable the Deskew and/or Rotate option to improve text extraction. The page(s) in the resulting PDF will also be corrected.
01 – Languages: press the dropdown arrow to select the language used in most of your documents.
You can select multiple languages, but we recommend doing so only when you have many documents in each of the selected languages. If documents in another language are exceptional, only select the main language. Selecting more languages slows down the conversion process.
IMPORTANT: If you don’t see all the languages listed, please install the MetaServer Searchable PDF Module.
02 – Improve OCR on low resolution images: to create a searchable PDF with the Tesseract engine, we make use of OCR (Optical Character Recognition) technology. This works best with documents scanned at a resolution between 300 and 400 DPI. If you have document scans with a lower resolution, like 100 or 200 DPI, we recommend enabling this option to improve the OCR result. This option does not affect scans of 300 DPI or higher.
TIP: you can copy the current settings and paste them into another setup window of the same type. To do this, press the Settings button in the bottom left of the Setup window and select Copy. Then open another setup window of the same type and select Paste.