Show HN: Adventures in OCR
(blog.medusis.com) 125 points by bambax 8 days ago | 47 comments
Hello HN!
In a recent "Ask HN: What are you working on?" thread, I mentioned I was working on OCRing a large book:
https://news.ycombinator.com/item?id=41971614
The post generated some interest so I thought I would keep HN posted.
The book is Saint-Simon’s Memoirs -- an invaluable historical account of the French court under Louis XIV, full of wit and sharp observations, and of incredible literary value. I'm OCRing the reference edition, published between 1879 and 1930, which contains a lot of commentary and footnotes: 45 volumes, ~27,000 pages.
Here's a link to a blog post that describes the techniques used so far (the project is still ongoing):
https://blog.medusis.com/38_Adventures+in+OCR.html
But you may also directly access the result here:
https://divers.medusis.net/boislisle/pub
This web app (not optimized for mobile, sorry) solves a tricky problem: preloading images efficiently. In short, preloading the next image isn't enough, since browsers will repaint an image if it is moved or scaled. Browsers also won't paint an image at all if its visibility is hidden or its opacity is zero; they paint it only when those values change. On an average, slow machine, this takes a visible amount of time. But if an image is simply behind another element, it will be painted, and removing the covering element or changing the z-index will not trigger a repaint.
(Preloading is important because it lets one review results fast; if one has to wait 150-200 ms between images it's simply discouraging).
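A minimal sketch of the stacking idea in TypeScript, assuming two same-size, absolutely positioned `<img>` elements placed on top of each other. The names `StackedImage` and `ImageStack` are hypothetical and are not taken from the actual web app; the sketch only illustrates "keep the next image behind the current one, then swap z-index".

```typescript
// Minimal structural type: a real HTMLImageElement satisfies it,
// and so does a plain object, which keeps the logic easy to test.
interface StackedImage {
  src: string;
  style: { zIndex: string };
}

// Hypothetical helper. The element in front shows the current page; the one
// behind already holds the next page, so the browser can paint it in advance.
class ImageStack {
  private front: StackedImage;
  private back: StackedImage;

  constructor(front: StackedImage, back: StackedImage, firstUrl: string, nextUrl: string) {
    this.front = front;
    this.back = back;
    this.front.src = firstUrl;
    this.back.src = nextUrl;
    this.front.style.zIndex = "2";
    this.back.style.zIndex = "1";
  }

  // Reveal the image that was waiting behind by swapping z-index only
  // (no move, no scale, no visibility or opacity change), then reuse the
  // now-covered element to start loading the page after that.
  advance(followingUrl: string): void {
    [this.front, this.back] = [this.back, this.front];
    this.front.style.zIndex = "2";
    this.back.style.zIndex = "1";
    this.back.src = followingUrl;
  }

  currentUrl(): string {
    return this.front.src;
  }
}
```

In a page, this would be constructed with two elements obtained from something like `document.querySelectorAll("img")`, and `advance` would be called from a key or click handler.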
Would love to hear feedback; happy to answer any question!
pronoiac 7 days ago
Oh wow! I've worked on turning PAIP (Paradigms of Artificial Intelligence Programming) from a book into a bunch of Markdown files, but that's "only" about a thousand pages long, compared to the roughly 27,000 pages of all those volumes. I have advice, possibly helpful, possibly not.
Getting higher quality scans could save you some headaches. Check the Internet Archive. Or, get library copies, and the right camera setup.
Scantailor might help; it lets you semi-automate a chunk of things, with interactive adjustments. I don't know how its deskewing would compare to ImageMagick. The signature marks might be filtered out here.
I wrote out some of my process for handling scans here: https://github.com/norvig/paip-lisp/releases/tag/v1.2 (maybe I should blog about it).
If you get to the point of collaborative proofreading, I highly recommend Semantic Linefeeds - each sentence gets its own line. https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got there by:
* giving each paragraph its own line
* then, linefeed at punctuation, maybe with quotation marks and parentheses? It's been a while
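The two steps above could be sketched roughly like this in TypeScript. This is an illustration under assumptions, not the script that was actually used: the function names are made up, paragraphs are assumed to be separated by blank lines, and a sentence is assumed to end at `.`, `!` or `?`, optionally followed by closing quotation marks or parentheses.

```typescript
// Step 1: join the hard-wrapped lines of each paragraph into a single line.
function unwrapParagraphs(text: string): string[] {
  return text
    .split(/\n\s*\n/) // blank lines separate paragraphs (assumption)
    .map((p) => p.replace(/\s*\n\s*/g, " ").trim())
    .filter((p) => p.length > 0);
}

// Step 2: break after sentence-ending punctuation, keeping any closing
// quotation mark or parenthesis attached to the sentence it closes.
function splitSentences(paragraph: string): string[] {
  return paragraph.replace(/([.!?]["'”’»)]*)\s+/g, "$1\n").split("\n");
}

// One sentence per line, with a blank line between paragraphs.
function semanticLinefeeds(text: string): string {
  return unwrapParagraphs(text)
    .map((p) => splitSentences(p).join("\n"))
    .join("\n\n");
}
```

A naive rule like this also breaks after abbreviations (for example "M. de ..." in French text), so the output would still need a manual pass or a list of exceptions.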