diglib Archive
Date: Tue Feb 06 10:26:00 2001

diglib: OCR results



    
I wonder whether a digitized work with a 70% error rate, based on an original
work copyrighted in 2001, could be construed as a copyright violation?

JQ's graphic demonstration of error rates is, well, pretty graphic.....

Yet I don't disagree with Will's analysis of Acrobat OCR results, since I've
gotten results that looked just like JQ's example of a 70% error rate.
e-Asia, as you might recall, originally aspired to digitize into .pdf using
Acrobat, and the OCR results were very disappointing -- which is to say they
would have required massive proof-reading and correction.  Still, in view of
the nature of the Engebretsen (?) documents, our having just .pdf image
files with some sort of cataloging access might be sufficient.  For a
resource that would truly benefit from a search capability,
Acrobat-produced .pdf image files are not suitable.  In any case, the
file-size overhead for .pdf image files is enormous.

An interesting web article is:  Measuring the Accuracy of the OCR in the
Making of America   ---  http://moa.umdl.umich.edu/moaocr.html  ---
Michigan generated .tiff files, from which it later generated .pdf and text
files.  OCR programs such as OmniPage Pro do a pretty good job of converting
.tiff to text.

By-the-by, the Japanese will under no circumstances accept an OCR accuracy
rate below 98%.  That's a very reasonable demand.
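
For anyone who wants to check a scanned page against a figure like 98%, here
is a minimal sketch of one common way to do it: count character-level edits
between a hand-corrected page and the raw OCR output.  The function names are
my own, and treating "accuracy" as 1 minus (edit distance / reference length)
is an assumption; the Michigan article or a vendor may define it differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete a character from a
                           cur[j - 1] + 1,       # insert a character into a
                           prev[j - 1] + cost))  # substitute (or match)
        prev = cur
    return prev[-1]


def char_accuracy(reference: str, ocr_text: str) -> float:
    """Fraction of the proof-read reference that the OCR got right."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1.0 - errors / len(reference))


def meets_threshold(reference: str, ocr_text: str,
                    threshold: float = 0.98) -> bool:
    """True if the page reaches the required accuracy (98% by default)."""
    return char_accuracy(reference, ocr_text) >= threshold
```

In practice one would run this on a small sample of hand-corrected pages
rather than a whole book, since producing the reference text is the expensive
part.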

The proof-reading routine for e-Asia is as follows:

1) A test is done to see if a book can be OCR'd in B/W, which is the fastest
   scan.  If the B/W scan fails because of yellowed paper or poor-quality
   print, we switch to greyscale, which almost always works.
2) OmniPage Pro has an auto proof-read that we have turned on, so initial
   proof-reading occurs at the scan.
3) Since we scan into Word 2000, we have spell-check turned on, and this
   catches gross errors.  The scan operator usually does a quick proof-read
   of the page before moving on to the next scan.
4) The pages are next proof-read (by human eyes) during the conversion
   process to HTML.  This is the most rigorous proof-read that we do.
5) A certain amount of proof-reading also occurs during the building of HTM
   files into the actual e-book.

Still, mistakes will slip by.  There is an obvious trade-off between
processing time and 100% accuracy, and, I would suppose, the first order of
business in any digital project is to decide what level of accuracy is
desired.  Increased accuracy means increased costs.  In short, considerable
planning needs to go into a digital project before production begins.
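
The B/W-then-greyscale test in the routine above might be sketched roughly as
follows.  This is an illustration only, not how OmniPage Pro works: the `ocr`
callable, the mode names and the confidence score are all hypothetical
stand-ins for whatever the scanning software reports, and the 0.98 default is
simply borrowed from the figure mentioned earlier.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical: a function that scans and OCRs one page in the given mode
# and returns (recognized_text, confidence between 0.0 and 1.0).
OcrFn = Callable[[str], Tuple[str, float]]


def ocr_with_fallback(ocr: OcrFn,
                      min_confidence: float = 0.98,
                      modes: Sequence[str] = ("bw", "greyscale"),
                      ) -> Tuple[str, str, float, bool]:
    """Try the fastest scan mode first and fall back to slower ones.

    Returns (mode, text, confidence, accepted).  If no mode reaches
    min_confidence, the best attempt is returned with accepted=False so
    the page can be flagged for heavier manual proof-reading.
    """
    if not modes:
        raise ValueError("at least one scan mode is required")
    best = None
    for mode in modes:
        text, confidence = ocr(mode)
        if best is None or confidence > best[2]:
            best = (mode, text, confidence)
        if confidence >= min_confidence:
            return mode, text, confidence, True
    return best[0], best[1], best[2], False
```

Keeping B/W first reflects the point that it is the fastest scan; the slower
greyscale pass is only paid for on pages that need it.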

Bob Felsing

PS:  If you thought Microsoft Word 97 blew lots of gunk into its HTML
output, and you *liked it* -- you'll *love* Word 2000.  Word 2000 documents
are basically obese XML files.  They are enormously difficult to cope with
when it comes to Asian fonts (for reasons I won't explain).  If anyone knows
of a straightforward, unadorned word processor that has an excellent
spell-check, please let me know -- I would be more than grateful.