Ian Ozsvald over at aicookbook has been doing some work using optical character recognition (OCR) to transcribe plaques for the openplaques group. His write-ups have been interesting, so when he posted a challenge to the community to improve on his demo code I decided to give it a try.
The demo code was very much a proof of principle, and its score of 709.3 was easy to beat. I quickly got the score down to 44 and, with a little more work, reached 33.4. The score is a Levenshtein distance metric, so the lower the better. I was hoping to get below 30 but in the end just didn't have time; I suspect it wouldn't take a lot of work to improve on my score. Here's what I've done so far...
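For context, the metric is plain edit distance: the minimum number of single-character insertions, deletions and substitutions needed to turn the recognised text into the human transcription. A standalone sketch of the usual dynamic-programming routine (equivalent to, but not copied from, the levenshtein function in the script at the bottom):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, O(min(n, m)) space
    if len(a) > len(b):
        a, b = b, a  # make a the shorter string
    current = list(range(len(a) + 1))
    for i, cb in enumerate(b, 1):
        previous, current = current, [i] + [0] * len(a)
        for j, ca in enumerate(a, 1):
            current[j] = min(previous[j] + 1,              # deletion
                             current[j - 1] + 1,           # insertion
                             previous[j - 1] + (ca != cb)) # substitution
    return current[len(a)]

print(levenshtein("kitten", "sitting"))  # → 3
```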
Configure the system
All the work I've done was on an Ubuntu 10.04 installation, and the instructions that follow deal only with this environment. Beyond the base install I use three different packages:
- Python Imaging Library
- Used for pre-processing the images before they are submitted to tesseract
- Tesseract
- The OCR software used
- Enchant spellchecker
- Used for cleaning up the transcribed text
Their installation is straightforward using apt-get:
$ sudo apt-get install python-imaging python-enchant tesseract-ocr tesseract-ocr-eng
Fetch images
The demo code written by Ian (available here) includes a script to fetch the images from flickr. It's as simple as running the following
$ python get_plaques.py easy_blue_plaques.csv
Once the images are downloaded I suggest you go ahead and run the demo transcribing script. Again it's nice and simple
$ python plaque_transcribe_demo.py easy_blue_plaques.csv
Then you can calculate the score using
$ python summarise_results.py results.csv
Improving transcription
Ian had posted a number of good suggestions on the wiki for how to improve the transcription quality. I used three approaches:
- Image preprocessing
- Cropping the image and converting to black and white takes the score from 782 (the demo code produced a higher score on my system than it did for Ian) to 44.6
- Restricting the characters tesseract will return
- Restricting the character set used by tesseract to alphanumeric characters and a limited selection of punctuation characters further lowered the score from 44.6 to 35.7
- Spell checking
- Running the results from tesseract through a spell checker and filtering out some common errors brought the score down to 33.4
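The spell-checking step follows a simple check-and-suggest flow: accept a token if the dictionary knows it, otherwise take the first suggestion. A minimal sketch of that flow, using the standard library's difflib with a toy word list in place of enchant's en_GB dictionary (correct_token and DICTIONARY are illustrative names, not part of the real script):

```python
import difflib

# toy stand-in for enchant's en_GB dictionary (illustrative only)
DICTIONARY = ["lived", "here", "physicist", "and", "chemist"]

def correct_token(token):
    """Keep short or valid words; otherwise take the closest suggestion."""
    if len(token) <= 2 or token in DICTIONARY:
        return token
    suggestions = difflib.get_close_matches(token, DICTIONARY, n=1)
    return suggestions[0] if suggestions else token

print(correct_token("phvsicist"))  # → physicist
print(correct_token("here"))      # valid word, returned unchanged
```

The real script uses enchant's Dict.check and Dict.suggest instead, which draw on a full dictionary rather than a hand-picked list.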
I'll post the entire script at the bottom of this post but want to highlight a few of the key elements first.
The first stage, cropping the image to the plaque, is handled by the function crop_to_plaque, which expects a Python Imaging Library image object. The function reduces the size of the image to speed up processing before looking for blue pixels. A blue pixel is assumed to be any pixel where the value of the blue channel is 20% higher than both the red and green channels. The number of blue pixels in each row and column of the image is counted, and the image is cropped down to the rows and columns where that count is greater than 15% of the row's width or the column's height. This cutoff is based solely on experimentation and seemed to give good results for this selection of plaques.
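The cropping heuristic can be sketched in isolation. This is a simplified version that skips the downscaling and blur steps from crop_to_plaque; is_blue and bounding_box are hypothetical helper names, and the pixel grid is a toy example rather than a real plaque image:

```python
def is_blue(pixel):
    # heuristic from the post: the blue channel must be at least
    # 20% higher than both the red and green channels
    r, g, b = pixel[:3]
    return b > r * 1.2 and b > g * 1.2

def bounding_box(pixels, width, height, cutoff=0.15):
    # count blue pixels in each column and row, then keep only the
    # columns/rows where the count clears 15% of the other dimension
    col_counts = [sum(1 for y in range(height) if is_blue(pixels[y][x]))
                  for x in range(width)]
    row_counts = [sum(1 for x in range(width) if is_blue(pixels[y][x]))
                  for y in range(height)]
    cols = [x for x, c in enumerate(col_counts) if c > cutoff * height]
    rows = [y for y, c in enumerate(row_counts) if c > cutoff * width]
    return min(cols), min(rows), max(cols), max(rows)

# toy 10x10 image: a blue block spanning columns 2-7 and rows 3-8
blue, white = (30, 30, 200), (255, 255, 255)
grid = [[blue if 2 <= x <= 7 and 3 <= y <= 8 else white
         for x in range(10)] for y in range(10)]
print(bounding_box(grid, 10, 10))  # → (2, 3, 7, 8)
```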
The next stage, converting the image to black and white, is handled by the function convert_to_bandl, which again expects a Python Imaging Library image object. The function converts any blue pixels to white and all other pixels to black. Ian has pointed out that this approach might be overly stringent and that I might get better results using some grey as well. The result of running these two functions on three of the plaques is shown below.
The next step was limiting the character set used by tesseract. The easiest way to do this is to create a file in /usr/share/tesseract-ocr/tessdata/configs/, which I called goodchars, with the following content.
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.,()-"
That selection of characters seems to include all the characters present in the plaques. To use this limited character set the call to tesseract needs to be altered to
cmd = 'tesseract %s %s -l eng nobatch goodchars' % (filename_tif, filename_base)
Finally I perform a number of small clean-up tasks. First I fix the year ranges, which frequently had extra spaces inserted; occasionally 1s appeared as i or l, and 3 appeared as a parenthesis. These were fixed by a couple of regular expressions, including one with a callback function (clean_years). Then I separate the transcription into individual words and fix a number of other issues, including lone characters and duplicated characters, before checking the spelling of any word of more than two characters.
Where next?
There is still lots of low-hanging fruit on this problem. At the moment the curved text at the top of the plaque and the small symbol at the bottom are handled badly, and I think the bad characters at the beginning and end of the transcriptions could easily be stripped out. The spelling corrections do reduce the error overall, but they introduce some new errors; I suspect that being more selective about where spelling checks are applied would remove some of these introduced errors.
The entire script
import os
import sys
import csv
import urllib
from PIL import Image  # http://www.pythonware.com/products/pil/
import ImageFilter
import enchant
import re

# This recognition system depends on:
# http://code.google.com/p/tesseract-ocr/
# version 2.04, it must be installed and compiled already

# plaque_transcribe_test5.py
# run it with 'cmdline> python plaque_transcribe_test5.py easy_blue_plaques.csv'
# and it'll:
# 1) send images to tesseract
# 2) read in the transcribed text file
# 3) convert the text to lowercase
# 4) use a Levenshtein error metric to compare the recognised text with the
#    human supplied transcription (in the plaques list below)
# 5) write error to file
# For more details see:
# http://aicookbook.com/wiki/Automatic_plaque_transcription


def load_csv(filename):
    """build plaques structure from CSV file"""
    plaques = []
    plqs = csv.reader(open(filename, 'rb'))  # , delimiter=',')
    for row in plqs:
        image_url = row[1]
        text = row[2]
        # ignore id (0) and plaque url (3) for now
        last_slash = image_url.rfind('/')
        filename = image_url[last_slash + 1:]
        filename_base = os.path.splitext(filename)[0]  # turn 'abc.jpg' into 'abc'
        filename = filename_base + '.tif'
        root_url = image_url[:last_slash + 1]
        plaque = [root_url, filename, text]
        plaques.append(plaque)
    return plaques


def levenshtein(a, b):
    """Calculates the Levenshtein distance between a and b
    Taken from: http://hetland.org/coding/python/levenshtein.py"""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n, m)) space
        a, b = b, a
        n, m = m, n
    current = range(n + 1)
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]


def transcribe_simple(filename):
    """Convert image to TIF, send to tesseract, read the file back, clean
    and return"""
    # read in original image, save as .tif for tesseract
    im = Image.open(filename)
    filename_base = os.path.splitext(filename)[0]  # turn 'abc.jpg' into 'abc'
    # Enhance contrast
    # contraster = ImageEnhance.Contrast(im)
    # im = contraster.enhance(3.0)
    im = crop_to_plaque(im)
    im = convert_to_bandl(im)
    filename_tif = 'processed' + filename_base + '.tif'
    im.save(filename_tif, 'TIFF')

    # call tesseract, read the resulting .txt file back in
    cmd = 'tesseract %s %s -l eng nobatch goodchars' % (filename_tif,
                                                        filename_base)
    print "Executing:", cmd
    os.system(cmd)
    input_filename = filename_base + '.txt'
    input_file = open(input_filename)
    lines = input_file.readlines()
    line = " ".join([x.strip() for x in lines])
    input_file.close()

    # delete the output from tesseract
    os.remove(input_filename)

    # convert line to lowercase
    transcription = line.lower()

    # Remove gaps in year ranges
    transcription = re.sub(r"(\d+)\s*-\s*(\d+)", r"\1-\2", transcription)
    transcription = re.sub(r"([0-9il\)]{4})", clean_years, transcription)

    # Separate words
    d = enchant.Dict("en_GB")
    newtokens = []
    print 'Prior to post-processing: ', transcription
    tokens = transcription.split(" ")
    for token in tokens:
        if (token == 'i') or (token == 'l') or (token == '-'):
            pass
        elif token == '""':
            newtokens.append('"')
        elif token == '--':
            newtokens.append('-')
        elif len(token) > 2:
            if d.check(token):
                # Token is a valid word
                newtokens.append(token)
            else:
                # Token is not a valid word
                suggestions = d.suggest(token)
                if len(suggestions) > 0:
                    # If the spellcheck has suggestions take the first one
                    newtokens.append(suggestions[0])
                else:
                    newtokens.append(token)
        else:
            newtokens.append(token)
    transcription = ' '.join(newtokens)
    return transcription


def clean_years(m):
    digits = m.group(1)
    year = []
    for digit in digits:
        if digit == 'l':
            year.append('1')
        elif digit == 'i':
            year.append('1')
        elif digit == ')':
            year.append('3')
        else:
            year.append(digit)
    return ''.join(year)


def crop_to_plaque(srcim):
    scale = 0.25
    wkim = srcim.resize((int(srcim.size[0] * scale),
                         int(srcim.size[1] * scale)))
    wkim = wkim.filter(ImageFilter.BLUR)
    # wkim.show()
    width = wkim.size[0]
    height = wkim.size[1]
    # result = wkim.copy()
    # highlight_color = (255, 128, 128)
    R, G, B = 0, 1, 2
    lrrange = {}
    for x in range(width):
        lrrange[x] = 0
    tbrange = {}
    for y in range(height):
        tbrange[y] = 0
    for x in range(width):
        for y in range(height):
            point = (x, y)
            pixel = wkim.getpixel(point)
            if (pixel[B] > pixel[R] * 1.2) and (pixel[B] > pixel[G] * 1.2):
                lrrange[x] += 1
                tbrange[y] += 1
                # result.putpixel(point, highlight_color)
    # result.show()
    left = 0
    right = 0
    cutoff = 0.15
    for x in range(width):
        if (lrrange[x] > cutoff * height) and (left == 0):
            left = x
        if lrrange[x] > cutoff * height:
            right = x
    top = 0
    bottom = 0
    for y in range(height):
        if (tbrange[y] > cutoff * width) and (top == 0):
            top = y
        if tbrange[y] > cutoff * width:
            bottom = y
    left = int(left / scale)
    right = int(right / scale)
    top = int(top / scale)
    bottom = int(bottom / scale)
    box = (left, top, right, bottom)
    region = srcim.crop(box)
    # region.show()
    return region


def convert_to_bandl(im):
    width = im.size[0]
    height = im.size[1]
    white = (255, 255, 255)
    black = (0, 0, 0)
    R, G, B = 0, 1, 2
    for x in range(width):
        for y in range(height):
            point = (x, y)
            pixel = im.getpixel(point)
            if (pixel[B] > pixel[R] * 1.2) and (pixel[B] > pixel[G] * 1.2):
                im.putpixel(point, white)
            else:
                im.putpixel(point, black)
    # im.show()
    return im


if __name__ == '__main__':
    argc = len(sys.argv)
    if argc != 2:
        print "Usage: python plaque_transcribe_demo.py plaques.csv (e.g. \
easy_blue_plaques.csv)"
    else:
        plaques = load_csv(sys.argv[1])
        results = open('results.csv', 'w')
        for root_url, filename, text in plaques:
            print "----"
            print "Working on:", filename
            transcription = transcribe_simple(filename)
            print "Transcription:", transcription
            print "Text:", text
            error = levenshtein(text, transcription)
            assert isinstance(error, int)
            print "Error metric:", error
            results.write('%s,%d\n' % (filename, error))
            results.flush()
        results.close()