I did post an example in the post you replied to. One of the research centers I got a portion of the database from has been paying staff and students for years to scan these. Just getting the text recognized in columns is laborious, since ABBYY is not efficient when layouts have to be configured manually. Even if a student could process 100 pages an hour, 2 million pages is still 20,000 hours of work. At $10/hr, that’s $200k.
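For anyone who wants to rerun that arithmetic with different assumptions, here is a rough sketch in Python. The function name and parameters are only illustrative; the inputs are the figures quoted above (2 million pages, 100 pages an hour, $10/hr).

```python
def ocr_labor_estimate(pages, pages_per_hour, hourly_wage):
    """Return (hours, cost) for manually processing `pages` scanned pages.

    Hypothetical helper for illustration only; it assumes a constant
    throughput and a flat hourly wage.
    """
    hours = pages / pages_per_hour
    cost = hours * hourly_wage
    return hours, cost


# Figures from the post: 2 million pages, 100 pages/hour, $10/hour.
hours, cost = ocr_labor_estimate(2_000_000, 100, 10)
print(f"{hours:,.0f} hours, ${cost:,.0f}")  # prints "20,000 hours, $200,000"
```

Halving the throughput to 50 pages an hour doubles both numbers, which is why the layout step matters so much.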
I’m a Ph.D. student trying to develop the most comprehensive historical database for my field to maximize research efficiency. So far, I’ve spent 400–500 hours collecting and tagging this ever-growing database. In the grand scheme of things, though, it will save far more time than that. In my own case, it is halving the time necessary to find sources, which is awesome. It is also revolutionizing the workflow of the 20–30 academics I’ve shared it with.
If I ever have a budget or grant to develop a better OCR layout and/or article separation solution, I’d love to do so. But app development is something I’ve paid for before, when I worked for another organization, and it rarely comes cheap. For a project like this, I’d expect it to cost an additional $30–100k to develop a good solution that automates the necessary processing.