Document Images Retrieval using Word Shape Coding

tl;dr here is the code

Problem statement:

Given a scanned image, search for a sub-patch containing a particular word without doing OCR.

For example, given this text image (courier-black-300.png),

we have to find the location of the patch containing the word "streaming":

patch.png

Approach

Initially I was thinking of extracting features (using SURF) from the image and the patch and comparing them. But with the sample programs, features were matched all over the image. I then tried creating an individual image patch for each word, extracting features and matching them, but still didn't get any useful results.

I came across some work on searching for terms in document images. The approach there is to segment the image into word patches, extract features based on word shapes, and then compare them. I looked for a library implementation but didn't find one, so I thought I'd give it a try.
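To make the segmentation step concrete, here is a minimal sketch using projection profiles. It is not the code linked above: it assumes the page is already binarized into a list of rows of 0/1 values (1 = foreground), and the names `runs`, `segment_words` and the `min_gap` threshold are made up for illustration.

```python
def runs(flags):
    """Return (start, end) pairs, end exclusive, for each run of truthy values."""
    out, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(flags)))
    return out


def segment_words(img, min_gap=3):
    """Split a binary image into word boxes (top, bottom, left, right), ends exclusive.

    Assumption: text lines are separated by fully blank rows, and words by at
    least `min_gap` blank columns. Smaller gaps are treated as spaces between
    letters of the same word.
    """
    boxes = []
    width = len(img[0])
    row_has_ink = [any(row) for row in img]
    for top, bottom in runs(row_has_ink):
        # Column profile restricted to the current text line.
        col_has_ink = [any(img[y][x] for y in range(top, bottom))
                       for x in range(width)]
        merged = []
        for left, right in runs(col_has_ink):
            if merged and left - merged[-1][1] < min_gap:
                merged[-1] = (merged[-1][0], right)  # same word, extend it
            else:
                merged.append((left, right))         # new word
        boxes.extend((top, bottom, left, right) for left, right in merged)
    return boxes


page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
word_boxes = segment_words(page, min_gap=3)
```

On a real scan the gap threshold would probably have to be tuned per font and resolution, and skewed or noisy pages would need cleaning up first.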

At the moment I am using 6 features. For example, for a word image, for each horizontal pixel position (column), I extract the position of the topmost foreground pixel. The resulting sequence is then compared against the segmented patches using the DTW (dynamic time warping) algorithm to find the best match. The other features I am using are the foreground pixel height, the bottom extreme, and in the same way the left and right extremes for each row. For the above-mentioned image patch of the word "streaming", here are plots of the different features:
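As a rough illustration of the top-extreme feature and the DTW comparison, here is a small sketch. It assumes each word patch is a list of rows of 0/1 values (1 = foreground); the function names, the absolute-difference cost and the handling of empty columns are choices made for this example, not necessarily what the linked code does.

```python
def top_profile(patch):
    """For each column, the row index of the topmost foreground pixel.

    Columns with no foreground pixel get the patch height (an assumption
    made here so that every column yields a number).
    """
    height, width = len(patch), len(patch[0])
    return [next((y for y in range(height) if patch[y][x]), height)
            for x in range(width)]


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def best_match(query, candidates):
    """Index of the candidate profile with the smallest DTW distance to query."""
    distances = [dtw_distance(query, c) for c in candidates]
    return distances.index(min(distances))


patch = [
    [0, 1, 0],
    [1, 1, 0],
]
profile = top_profile(patch)
winner = best_match([1, 2, 3], [[5, 5, 5], [1, 2, 2, 3], [3, 2, 1]])
```

The bottom profile is the same scan from the other end, and the left and right profiles can be obtained by running the same functions on the transposed patch. With several features, one simple option is to sum the per-feature DTW distances, though that weighting is a guess rather than something taken from the post.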

top.png bottom.png left.png right.png Height.png frequency.png

Performance

So far I have got some positive results using the above features. I compiled a list of characters in one particular font and measured how distinctive each character is with respect to the other characters.

alphabets.png

And here is the graph of the distance measure of the letter Q against the rest of the characters:

q.png

I also tested it on words, using a test image of words taken from the sample text image, and tried some fonts to check the robustness:

Text.png

When I tried to locate words from this image in the sample text image

courier-black-300.png

I got 17 word blocks identified (this should improve too), and out of them 15 were correctly matched to the right patch in the text image.

TODO

  • Understand SURF and homography and confirm whether they work in this particular case.
  • Look for more meaningful features (maybe using DFT) which can help improve performance.
  • Use a Sakoe-Chiba band for DTW, and partial matching as mentioned in the paper.
  • Test exhaustively to get better performance statistics.
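For the Sakoe-Chiba item, the idea is to only fill DTW cells close to the diagonal, which cuts the work and stops one sequence from being stretched too far against the other. A minimal sketch follows; the function name `dtw_banded` and the choice to widen the band to the length difference are assumptions made for this example.

```python
def dtw_banded(a, b, window):
    """DTW distance restricted to cells with |i - j| <= window (Sakoe-Chiba band)."""
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, otherwise the end cell is unreachable.
    w = max(window, abs(n - m))
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


a = [0, 0, 0, 1, 0]
b = [0, 1, 0, 0, 0]
narrow = dtw_banded(a, b, 0)      # only the diagonal is allowed
wide = dtw_banded(a, b, len(a))   # band covers the whole matrix
```

In this toy example the narrow band gives a larger distance than the wide one because the peak in `a` can no longer be shifted to line up with the peak in `b`. A good band width for word profiles would have to be found by experiment.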