Fuzzy Database Queries from Script (revised)

Querying a fuzzy database in Kofax Transformations can be a hassle – if you are to follow the official script sample provided with KTM/KTA. For once, it’s overly complicated, outdated and not all parts of the official script work as intended. There’s a better version provided within Kofax’ Best Practice Guide for KTM that’s easier to comprehend and use, yet there still are issues with it:

  • It does not work with huge databases (KSMS, larger than 4GB),
  • the maximum number of results is being ignored,
  • we can simplify things even more.

So, it is time for a revised version of Database_Search that will be much easier to use. Here’s our sample database:

And there are some sample queries made in script:

Dim dbName As String
Dim minConf As Double
Dim maxResults As Long
Dim results As CscXDocField

dbName = "db-customers"
minConf = .5
maxResults = 15

' search over all columns
Set results = Database_Search(dbName, minConf, maxResults, "claus gruber")
' search over two columns provided by id (note: 1 is the first column; 0 = all columns!)
Set results = Database_Search(dbName, minConf, maxResults, 1, "Gabi", 5, "Wien")
' same as above, but with providing column names (you should prefer this as column ids can change)
Set results = Database_Search(dbName, minConf, maxResults, "FirstName", "Gabi", "city", "Wien")

As you can see, a simple query over all columns still remains simple enough. Just type the words you want to search for as the last parameter, and you’re set. Even when searching multiple columns, Database_Search keeps things simple: First, you will need to provide the column, either by name or number – then, you provide the value to search for – in pairs. You can provide as many pairs as you need, the script automatically takes care of the rest.

So, without any further delays, here’s the revised Database_Search that can work with databases of any size, fixes the max results issue and is easier to use:

Public Function Database_Search(dbName As String, minConfidence As Double, maxResults As Long, ParamArray searchArray() As Variant) As CscXDocField
   ' Revision 3 of Database_Search:
   '   - fixes the problem with maxResults not working
   '   - works with 64bit (e.g. files in KSMS > 4GB)
   '   - simplified queries especially when searching in multiple columns
   ' search examples (searchArray value):
   '   "john doe" -> search for "john doe" in all columns
   '   0, "test 123", 3, "foo" -> search for "test 123" in ALL columns, and for "foo" in column number 3 (= idx 2)
   '   "Surname", "Doe" -> searches for "Doe" in a column named "Surname"

   Dim Fields() As String
   Dim FieldIDs() As Long
   Dim db As CscDatabase
   Dim hits As CscDatabaseResItems
   Dim i As Long
   Dim h As Long
   Dim fieldId As Long
   Dim alt As CscXDocFieldAlternative
   Dim results As New CscXDocField

   ' prepare QText and QField as required by db.Search (i.e. where to search and what to search for)
   If UBound(searchArray) = 0 Then
      ' user wants to search in all columns
      LongArray_Push(FieldIDs, 0)
      StringArray_Push(Fields, searchArray(i))
   Else
      ' assign strings to the respective columns
      For i = 0 To UBound(searchArray)
         If i Mod 2 = 0 Then ' column information
            If TypeName(searchArray(i)) = "String" Then
               ' column idx 0 = search all columns, so we need to add 1
               fieldId = Database_GetFuzzyIndex(dbName, searchArray(i))
            Else
               fieldId = searchArray(i)
            End If
            LongArray_Push(FieldIDs, fieldId)
         Else ' column text
            StringArray_Push(Fields, searchArray(i))
         End If
      Next
   End If

   ' search the fuzzy db
   Set db = Project.Databases.ItemByName(dbName)
   Set hits = db.Search(Fields, FieldIDs, CscEvalMatchQuery, 100)
   For i = 0 To hits.Count - 1
      If hits(i).Score >= minConfidence Then
         Set alt = results.Alternatives.Create
         alt.Confidence = hits(i).Score
         For h = 0 To db.FieldCount - 1
            alt.SubFields.Create(db.FieldName(h))
            alt.SubFields(h).Index = h
            alt.SubFields(h).Text = db.GetRecordData64(hits(i).RecID64)(h)
            alt.SubFields(h).Confidence = hits(i).Score
         Next
      End If
   Next

   ' deal with max results (still broken in KTM 6.2)
   If results.Alternatives.Count > maxResults Then
      For i = results.Alternatives.Count - 1 To maxResults Step - 1
         results.Alternatives.Remove(i)
      Next
   End If

   Return results

End Function


Public Function Database_GetFuzzyIndex(dbName As String, columnName As String) As Long
   ' returns the field index for the column name provided
   ' returns -1 if the field (column) does not exist
   ' never returns 0 (= all columns)
   ' note: if a column is not checked in KTM/KSMS, it is not in the index
   Dim colIdx As Long
   Dim fuzIdx As Long

   fuzIdx = 1

   With Project.Databases.ItemByName(dbName)
      For colIdx = 0 To .FieldCount - 1
         ' check if the field is in the index
         If .FieldType(colIdx) = CscDatabaseFieldTypeDefault Then
            If .FieldName(colIdx) = columnName Then Return fuzIdx
            fuzIdx += 1
         End If
      Next
   End With

   Return -1

End Function


Public Sub StringArray_Push(arr() As String, item As Variant)
   ' no, you can't easily convert variants to strings.
   On Error GoTo e
   ReDim Preserve arr(UBound(arr) + 1)
   arr(UBound(arr)) = item
   Exit Sub
e:
   ' guess the array was empty
   ReDim arr(0)
   arr(0) = item

End Sub



Public Sub LongArray_Push(arr() As Long, item As Variant)
   ' no, you can't easily convert variants to longs.
   On Error GoTo e
   ReDim Preserve arr(UBound(arr) + 1)
   arr(UBound(arr)) = item
   Exit Sub
e:
   ' guess the array was empty
   ReDim arr(0)
   arr(0) = item

End Sub