diglib Archive
Date: Sun Feb 04 11:44:50 2001
diglib: error rates
Two somewhat independent thoughts:
1.
I love Bob's "Almost anything that plugs into a library electrical outlet
can be considered a digital library." Other "digital library" projects in
our own library have included archiving ETDs and licensing online materials
such as NetLibrary. At some point in the not too distant future our group
needs to ask itself what sorts of DLI projects we are actually prepared to
comment on (with an answer that can't be "all of them").
2.
Bob also notes "the 70-80% error rate". I was confused during Wil's
presentation about how he was measuring error rate. Could someone clarify?
I usually measure error rate as the % of characters that are incorrect
(wrong, inserted, or deleted) after any postprocessing (e.g. spelling
correction). So it's hard for me to believe that any software one would
use could produce a 70% error rate -- only 3 out of every 10 characters on
a page being correct. That's essentially noise. Even a 10% error rate
(90% accuracy) is unacceptable for most purposes, though it's marginally
acceptable for fuzzy text matching; most OCR software vendors claim
accuracy rates of between 90% and 99.9%.
Here's my previous paragraph with an error rate of 70% (single-char
replacement errors only, so actually cleaner than what a "true" error rate
of this magnitude would produce):
gobV.xsb8,\tPs*"SW!$70T8R%:4'DoK%E#t%L.-[5_h7r8XA:>iCedc..8sn1`7Ll0[Z
:rRK^-tatio2 &$LdT $M- he/w=k,m\as^YTn# QrTVM+2E e. a88UlmGiom>=XG
GlSX#\Tf" l 9G%al*yCMepsSre Srr,Vrra+e6aY gh1$Yoy&G6Na]l'x8gs/,3a3g9rd
J3f=xSebsd8gr&;-2 w,s*rt$oDsWb (ZlmbeN,.D HoGE*'s/vQrD fs+ [mr=@`f>aQepUB
STat avy A#T1T( Qv*Gfb>q!ld ur<j?DGjq poUdM_ir\ .0Ma=rr *Braf5S--[a&q]x33
a!) `NCevTQ5(10fcM]ra\tD`q obTaj!)ce78j/nSLpN>rXct# jt3^IoHNuQ4SOli$JY3
s767[. Dvenl_rw$+Iermo* r4 E %9eH @cLoTD#t)/mm rn^GB';_37lf YoM X]K\4
p<"S-,es9RUh?!gL4X g.ebs 1t*P'aVHMpt4N\\?p"rnfuoOt tMxC0iB_p+^g&;DCaptR
nQV/\ohUg]7eZHe^oRos FLILm9Uc\IrU,yIu'tec oM gJt,een\[x%tonC;(9A91ec
Here it is with an error rate of 10% (90% accuracy):
Bob also notes "thY 70p80f er7or rate". 7 was conMu9ed dWr"ng Wil's
prese_t2tion about how he was mea -.ing error rate. Could someone
clariZy? I ucually measure/e5ror rate as #he % of characte=. that are
incorrect (wrong, inserte6, or deleted). o it's haPd for me to believe
that2a@y softwaqe one woulM use couFd produce a 70% error rate -- only 3+
/ut of ev1ry 15/chaGacters on a pa4e being c;rrewt. ThBt'u essentially
no5s . Even a 10% error rate (90% accuracy) is RnacceptableVfor9mos?
4urposes, tjo.ge I guess it'seacceptab\ekkor fuzzy text Fatching; most
OCR sofrware vendors claim accuracO Gates of betwe3 90% an8 99.9%.
And here it is with an error rate of 1%, which I still consider
unacceptable for text one would actually read:
Bob also notes "the 70-j0% error rate". I tas confused during Wil's
presentation about how he was measuring error rate. Could someone
clarify? I usually measure error rate as the % of characters that are
incorrect (wrong, inse1ted, or deleted). So it's hard for my to believe
that any software one would use could produce a 70% error rate -- only 3
out of every 10 characters on a page being correct. That's essentially
noise. Even a 10% error rate (90% accuracy) is unacceptable for most
purposes, though I guess it's acceptable for fuzzy text matching; most
OCR software vendors claim accuracy rates of between 90% and 99.9%.
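For anyone who wants to try this on their own text, samples like the three
above can be produced roughly as follows. This is only a minimal sketch, not
necessarily how the samples here were generated: the function name corrupt,
the replacement alphabet, and the choice to leave line breaks alone are all
assumptions made for illustration.

```python
import random
import string
from typing import Optional

# Assumption: replacement characters come from printable ASCII
# (letters, digits, punctuation) plus the space character.
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "


def corrupt(text: str, error_rate: float, seed: Optional[int] = None) -> str:
    """Replace each character with a different one with probability error_rate.

    Only single-character replacements are simulated (no insertions or
    deletions), and newlines are kept so the line layout survives.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded generator so runs can be repeated
    out = []
    for ch in text:
        if ch != "\n" and rng.random() < error_rate:
            # Exclude the original character so every hit is a real error.
            out.append(rng.choice(ALPHABET.replace(ch, "")))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "Even a 10% error rate (90% accuracy) is unacceptable."
    for rate in (0.70, 0.10, 0.01):
        print(f"{rate:.0%}: {corrupt(sample, rate, seed=1)}")
```

Because each character is hit independently, the realized error rate of a
short passage will only approximate the requested one.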
If in fact Special Collections is getting a 70% error rate by my
definition, or even a 30% error rate, then I'd have to say that I think the
OCR process is utterly useless. So we must be using different definitions.
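To make the comparison of definitions concrete, the character-based
definition above (characters wrong, inserted, or deleted, as a % of the text)
can be sketched as an edit distance against a known-correct transcription.
This is a minimal sketch under stated assumptions: Levenshtein distance is
one common way to count such errors, the rate is taken relative to the length
of the correct text, and the names edit_distance and char_error_rate are
hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    substitutions, insertions, and deletions that turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i characters of ref into "" takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref deleted
                curr[j - 1] + 1,     # character of hyp inserted
                prev[j - 1] + cost,  # match, or wrong character
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Errors as a percentage of the characters in the correct text ref."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    correct = "wrong, inserted, or deleted"
    ocr_out = "wrong, inse1ted, or deleted"
    print(f"{char_error_rate(correct, ocr_out):.1f}% character error rate")
```

A word-based rate computed on the same output would come out much higher,
since a single bad character spoils the whole word, which is one way two
people could report very different numbers for the same page.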