060-570 MetaTool Extraction – Find Line with Line Number
MetaTool’s Find Line with Line Number makes it possible to find lines by specifying their position in a list of lines. It’s frequently combined with a Find Line with Mask / Words rule and/or a Replace Text rule.
The Find Line with Line Number rule is very useful when you need to extract data from documents that don’t have a fixed format. A classic example is extracting names from invoices or reports. The data is not always located in the same place; its position depends on the document layout of each supplier or client.
You typically define an OCR extraction rule first to hold the full text of a scanned document in an index field, usually called Text Block or Full Text. Next, you define a Find Line with Mask / Words rule to filter the full text and keep only the relevant lines. Then you use a Replace Text rule to replace a character in that line with a line separator, which puts the value you are interested in on a separate line.
Finally, you define a Find Line with Line Number rule to extract the actual index value you are interested in.
For example, in the case we describe below, we will extract the inspector’s full name from a Building Inspection Report. We first search for the line containing “Inspected by:” with the Find Line with Mask / Words rule. We then replace the “:” character with a line separator using a Replace Text rule. Finally, we select the 2nd line with the Find Line with Line Number rule, which yields the full name of the inspector.
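The chain of rules described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MetaTool's implementation: the function names, the sample report text and the choice to keep only the first matching line are all assumptions made for this example.

```python
LINE_SEP = "\n"  # assumed line separator for this sketch


def find_line_with_words(text, words):
    """Return the first line containing `words`, or "" if none matches."""
    for line in text.splitlines():
        if words in line:
            return line
    return ""


def replace_text(text, old, new):
    """Replace every occurrence of `old` with `new`."""
    return text.replace(old, new)


def find_line_by_number(text, number):
    """Return the line at 1-based position `number`, trimmed, or ""."""
    lines = text.splitlines()
    if 1 <= number <= len(lines):
        return lines[number - 1].strip()
    return ""


# Made-up sample of OCR full text; real reports will look different.
full_text = (
    "Building Inspection Report\n"
    "Inspected by: John Doe\n"
    "Result: Passed\n"
)

line = find_line_with_words(full_text, "Inspected by:")  # keep the relevant line
split = replace_text(line, ":", LINE_SEP)                # label on line 1, name on line 2
inspector = find_line_by_number(split, 2)                # pick the 2nd line
print(inspector)
```

Because the name is isolated on its own line before it is selected, the same three steps work regardless of how many words the name contains.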
01 Find Line with Line Number – Add Rule
Find Line with Line Number is defined in the MetaTool Extract tab.
Press the Add button and select Find – Line – with Line Number to add the find rule.
The Find Line with Line Number Setup window opens.
Of course, this rule needs to be preceded by some other rules. Below, we walk through a complete case that shows how Find Line with Line Number can be used, including all preceding rules.
02 Find Line with Line Number – Setup
In our example we will make use of the CB MetaTool Keyword Doc Sep job. This job is automatically installed when you install CaptureBites MetaTool.
We want to extract the inspector’s full name from the image samples below.
The inspector’s full name has a variable length and can be as simple as “John Doe” or as complex as “Daenerys Stormborn of the House Targaryen, Khaleesi of the Great Grass Sea”. The name is always preceded by the fixed label “Inspected by:”, but its vertical position on the page varies.
With an Advanced OCR Rule we extract the full text from the first page and place the result in a field called FullTextFirstPage.
The result looks like this:
Next, we use a Find Line with Mask / Words rule to extract the line containing the words: “Inspected by:”.
The rule looks like this:
Next, with a Replace Text rule, we split the line containing the inspector’s name into multiple lines by replacing “:” and “License” with a line separator. This forces the complete name onto the second line.
The rule looks like this:
Finally, we extract the inspector’s name by using the Find Line with Line Number rule to select the 2nd line. Select the index field to hold the extracted data. In this case, we select the index field “Inspector”.
Optionally enter a description.
Thanks to this approach, it does not matter whether the name consists of 2, 3 or more elements.
For example, assume the name looks like this:
Line ranges are defined with a hyphen (-), e.g. 1-3. Negative numbers count from the end of the list, so -1 is the last line.
You can combine line ranges and line numbers with commas, e.g. -1, 2-3.
1-3: a range that selects the 1st through the 3rd line.
3: selects the 3rd line.
1,3 or 3,1: selects the 1st and 3rd lines. Both return the same result because the original sequence is always preserved.
-1,2: selects the last line and the 2nd line.
-1-2: a range that selects the last line through the 2nd line.
-3- -1: a range that selects the 3rd-last line through the last line.
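To make the selection syntax concrete, here is a small Python sketch that interprets such a specification. It reflects our reading of the examples above, not MetaTool's actual parser: the function name `select_lines` is hypothetical, and the handling of out-of-range positions (silently ignored) is an assumption.

```python
import re

# One item is either a single position ("3", "-1") or a range ("1-3", "-3- -1").
_ITEM = re.compile(r"^\s*(-?\d+)\s*(?:-\s*(-?\d+))?\s*$")


def select_lines(lines, spec):
    """Return the lines picked by `spec`, always in their original order."""

    def to_index(n):
        # Positive positions count from the top, negative ones from the bottom.
        return n - 1 if n > 0 else len(lines) + n

    wanted = set()
    for item in spec.split(","):
        match = _ITEM.match(item)
        if match is None:
            raise ValueError(f"cannot parse line selection item: {item!r}")
        first = to_index(int(match.group(1)))
        last = first if match.group(2) is None else to_index(int(match.group(2)))
        low, high = min(first, last), max(first, last)
        wanted.update(range(low, high + 1))
    # Filtering by position keeps the original sequence and drops
    # positions that fall outside the list.
    return [line for i, line in enumerate(lines) if i in wanted]


lines = ["A", "B", "C", "D", "E"]
print(select_lines(lines, "1-3"))     # ['A', 'B', 'C']
print(select_lines(lines, "3,1"))     # ['A', 'C']
print(select_lines(lines, "-1,2"))    # ['B', 'E']
print(select_lines(lines, "-1-2"))    # ['B', 'C', 'D', 'E']
print(select_lines(lines, "-3- -1"))  # ['C', 'D', 'E']
```

Collecting positions in a set and then filtering the original list is what makes “1,3” and “3,1” return the same result.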
Match case: makes the search for duplicates case-sensitive. For example, if we’re looking for duplicates of “Wallace D Cosare”, only duplicates in exactly the same case are deleted. Disable the option to also delete “wallace d cosare” or “WALLACE D COSARE”.