Table Benchmarking in KTM

When I was at an Expert Workshop in regard of Kofax KTM last week, we were forming groups and had to pick a problem we’d like to see solved. A participant and I picked one of his projects, where he had several issues with table recognition. We both checked out the project – he had a lot of sample images – and I said: “Well, no problem, let’s create Golden Files first, and then benchmark your extraction algorithm.”

He stared at me and said: “What do you mean, there is no table benchmarking!”. He was right. I hadn’t worked with KTM for a long time – more than three years from now – and I supposed Kofax would have had built-in such a splendid feature by now. First of all, I love the ability to benchmark in KTM – you have a tool for immediately measuring your process when it comes to both extraction and classification. It tells you in almost no time what will change and how much more (or less) user productivity you can gain by altering one specific setting.

Now, we decided that – in order to be able to set up high-quality table recognition, we would need to be able to benchmark our results as well. We quickly found out that benchmarking tables is more challenging than just benchmarking a simple field, as there are only four possible situations:

  • The field value is correct and the field is valid (class A, true positive)
  • The field value is correct and the field is invalid (class B, true negative)
  • The field value is incorrect and the field is invalid (class C, false negative)
  • The field value is incorrect and the field in valid (class C, false positive)

Class D would be the major problem, followed by class B (as the user only has to press enter to validate, this can quickly become a bad habit). But these are just 1-dimensional problems, tables are 2-dimensional. So, we had to come up with the following idea:

KTM Table Benchmarking

The idea is rather simple: both the Golden File and the Table Locator will have a certain sets of rows returned. The set of column has to be the same, otherwise KTM would refuse linking both. Knowing that, we could:

  • Iterate over each row in the Golden File
  • Check if that row can be found in the Table Locator (we check if the row is found on the same page and on the same position)
  • If yes, iterate through every cell and check for validity (A, B, C, D)
  • If no, memorize that row from the Golden File
  • When finished, check if there are still rows left in the Table Locator

At that point we might end up with three possible scenarios:

  1. Both the Golden File and the Table Locator found exactly n rows, all at the intended positions
  2. One row from the Golden File was not found in the Table Locator
  3. The Table Locator had additional rows which were not present in the Golden File

Scenario 2 and 3 could occur at the same time (Table Locator missed one row plus read one additional one). These were the tricky ones and we were arguing on how to deal with them. Finally we came to the conclusion that we had to set each and every cell found in these rows to D (false positives) – in some cases we would be able to set them to C (false negatives) as well, but only if the table was validated (lets say by comparing the table total against another field, or by checking table positions, …).

We wrote a script that we could not integrate in the beautiful benchmarking tools, but the result is the same. To use, you need to do the following:

  1. Prepare your Golden Files including table data
  2. Save your Golden Files
  3. Insert the code provided, preferably in the Batch Open event
  4. Extract all your Golden Files so that the field results are stored in memory as well
  5. Launch your script (CTRL+F11 when using the Batch Open event)
  6. After finishing, open the script editor and move to the immediate window

There you go. If you want to repeat your benchmark, just make sure to extract first (start at statement 4).


final results: 0.9615,0.0000,0.0000,0.0385
final results: 0.7692,0.1538,0.0000,0.0769
final results: 0.6140,0.1228,0.0000,0.2632
final results: 0.8333,0.1667,0.0000,0.0000

You can of course include the XDoc filename in the Debug.Print statement and later use a vlookup in excel to merge these results with the regular extraction benchmark.

Here’s the code you will need:


Private Sub XDoc_RunTableBenchmark(pxdoc As CscXDocument, TableLocatorName As String)

' some baby steps on how to perform table benchmarking
' Christian Dreher & Wolfgang Radl, 2013-04-16 (KTM Field Enablement Expert Training)

' first save the results that are in the field
Dim fieldTableExtracted As CscXDocTable
'Set fieldTableExtracted = pxdoc.Fields.ItemByName("Table").Table
Set fieldTableExtracted = pxdoc.Fields.ItemByName(TableLocatorName).Table
Dim tmp As New CscXDocument
' now load the results from the golden file
tmp.Load(pxdoc.FileName)

Dim fieldTableGoldenFile As CscXDocTable
'Set fieldTableGoldenFile = pxdoc.Fields.ItemByName("Table").Table
Set fieldTableGoldenFile = tmp.Fields.ItemByName(TableLocatorName).Table

Dim locatorRow As CscXDocTableRow

Dim state As String

' three cases:
' a) both the locator and the field have the same amount of rows
' b) the locator has delivered back MORE rows than the field has
' c) the locator has delivered back LESS rows than the field has

' confidence values
Dim iA As Integer
Dim iB As Integer
Dim iC As Integer
Dim iD As Integer
Dim Sum As Double

Dim i As Integer

Dim iRowTableExtracted As Integer
Dim iRowTableExtractedGoldenFile As Integer
Dim iCell As Integer

Dim sRowsAffected As String

' only start if we have at least one row in both the locator and the golden file
If fieldTableExtracted.Rows.Count > 0 And fieldTableGoldenFile.Rows.Count > 0 Then

For iRowTableExtractedGoldenFile = 0 To fieldTableGoldenFile.Rows.Count - 1
' for each row in the golden file
' check if we find a row within the locator which has the same position
'Debug.Print("starting with row " & iRowTableExtracted+1 & " from the golden file")

Set locatorRow = Nothing

For iRowTableExtracted = 0 To fieldTableExtracted.Rows.Count - 1
' do they start on the same page and do they have the same top coordinate?
' probably add a confidence intervall here?
'If (fieldTableExtracted.Rows.ItemByIndex(iRowTableExtracted).StartPage = fieldTableGoldenFile.Rows.ItemByIndex(iRowTableExtractedGoldenFile).StartPage) And (fieldTableExtracted.Rows.ItemByIndex(iRowTableExtracted).Top = fieldTableGoldenFile.Rows.ItemByIndex(iRowTableExtractedGoldenFile).Top) Then

'If (fieldTableExtracted.Rows.ItemByIndex(iRowTableExtracted).Top(fieldTableExtracted.Rows.ItemByIndex(iRowTableExtracted).StartPage) = fieldTableGoldenFile.Rows.ItemByIndex(iRowTableExtractedGoldenFile).Top(fieldTableExtracted.Rows.ItemByIndex(iRowTableExtracted).StartPage)) Then
If (fieldTableExtracted.Rows.ItemByIndex(iRowTableExtracted).TextlineIndexStart) = fieldTableGoldenFile.Rows.ItemByIndex(iRowTableExtractedGoldenFile).TextlineIndexStart Then

' if we have found one:
Set locatorRow = fieldTableGoldenFile.Rows.ItemByIndex(iRowTableExtractedGoldenFile)
'Debug.Print("row " & iRowTableExtractedGoldenFile + 1 & " from the Golden File seems to match row " & iRowTableExtracted + 1 & " from the locator")
' save affected rows index
sRowsAffected += "#" & iRowTableExtractedGoldenFile & "#"
' that's it, we found it
Exit For
Else
' we did not find it yet

End If

Next iRowTableExtracted

If Not locatorRow Is Nothing Then
' we have found the row in the locator!
' for each column
' check: is the value of the field cell the same as the value of the locator cell?
For iCell = 0 To fieldTableExtracted.Rows.ItemByIndex(iRowTableExtracted).Cells.Count - 1

If fieldTableExtracted.Rows.ItemByIndex(iRowTableExtracted).Cells.ItemByIndex(iCell).Text = locatorRow.Cells.ItemByIndex(iCell).Text Then
' if values do match
If locatorRow.Cells.ItemByIndex(iCell).Valid = True Then
' check if the formatter makes the field invalid
' check if any validation makes the field invalid

' is the locator cell valid? if YES -> green (correct valid)
iA += 1
state = "(A) correct valid"
Else
' if NO -> blue (correct invalid)
iB += 1
state = "(B) correct invalid"
End If

Else
If locatorRow.Cells.ItemByIndex(iCell).Valid = True Then
' is the locator cell valid? if YES -> green (correct valid)
iD += 1
state = "(D) incorrect valid"
Else
' if NO -> blue (correct invalid)
iC += 1
state = "(C) incorrect invalid"
End If

End If
' write current state here
'Debug.Print("rId: " & iRowTableExtracted & " cId: " & iCell & " state: " & state)
' if values do NOT match
' is the locator cell valid? if YES -> red (incorrect valid)
' if NO -> yellow (incorrect invalid)

Next

Else
' case 1: the row of the golden files has not been found in the locator!
' validity depends on present rules (e.g. for table sum calculation, ... )
' if there is nothing defined, all cells would be missed and valid (D)
' NOT IMPLEMENTED YET-
iD += fieldTableExtracted.Rows(0).Cells.Count - 1

End If

Next iRowTableExtractedGoldenFile

' case 2: haven't found the row here!
' now deal with additional rows from the locator here ...
' additional rows delivered by the locator can be considered ALWAYS as non-matching (always incorrect)
' so the question is: are they valid or (hopefully) invalid?

' idea: save all rows 'found' earlier on in an array
' now only loop through those rows which have not yet been identified as fitting rows for the golden file
' for each cell

' are there rows left in the locator?
' NOT IMPLEMENTED YET

Dim h As Integer
Dim found As Boolean
found = False

For i = 0 To fieldTableExtracted.Rows.Count - 1
' check: was that row affected (= found and matched to a row of the golden file)
If InStr(sRowsAffected, "#" & i & "#") = 0 Then
iD += fieldTableExtracted.Rows(0).Cells.Count - 1
End If

Next

Sum = iA + iB + iC + iD
' final results
Debug.Print("final results: " & Format(iA / Sum,"0.0000") & "," & Format(iB / Sum,"0.0000") & "," & Format(iC / Sum,"0.0000") & "," & Format(iD / Sum,"0.0000"))
'Debug.Print("rows affected: " & sRowsAffected)

End If

End Sub