I received some comments on my last post where I discussed how I capture paper while mining archives. The vast number of pages and documents requires not just efficient capture but also some way to make use of them. Let me pause and say that I don’t scan willy-nilly. I will quickly review the document and, if it seems interesting, then scan it. Sometimes, something catches my eye, and I look closer at the document, sometimes making notes in Goodnotes on my iPad (also occasionally taking a picture of a particular passage to insert into the notes) with a reference to find this document again later. I don’t have a photographic memory. Far from it, in fact, so I need to be able to search and tag documents. For the former, the search is really what unlocks the documents for future discovery. This means using OCR.
The OCR performed by Adobe’s servers after Adobe Scan uploads a document to the cloud is, in my experience, better than the OCR done by the desktop version of Adobe Acrobat. It certainly is far faster, so I’m happy to start with that output.
Below are examples of OCR output from various systems that I compare with the output of ABBYY, which I use when I find the original OCR is inadequate. One comparison below was scanned and OCR’d by Google during their massive scan-everything project. Another was OCR’d by a system labeled as “HathiTrust Image Server.” Both of these were retrieved from the HathiTrust archive, and in both cases, ABBYY provided a more usable output. There is also an example of the macOS OCR below, which isn’t bad.
A quick pause. I left university in 1992 to pursue a technology career.1 I had been supporting myself with my own tech company, which included building my own machines, network design and setup, and programming. By the turn of the century, I had architected and implemented an enterprise-wide knowledge management system for the firm I then worked for, which had offices in Europe, Asia, and the Americas. This system included, and in fact started with, a process to capture hundreds of bankers’ boxes of paper, some offsite. That required a workflow to identify, capture, tag, and dispose of (shred, return to office or storage, or something else) the paper, with quality control points scattered throughout, including OCR reviews and corrections (i.e., manual corrections for customer-flagged high-priority documents). I evaluated many OCR systems (and a large variety of high-speed scanning solutions) and landed on a system that integrated seven separate OCR platforms from different companies, using each engine’s relative strengths to offset the weaknesses of the others. It also provided a reliable confidence score, which the workflow used to flag documents needing review. Good times. Anyway, I come into the capture and OCR thing with a bit of a past. (In late 2004, I returned to university to no longer be a college dropout, which my wife – with her Ivy League degree and Kellogg MBA – supported. This set me on the path of launching an anonymous blog – mountainrunner.us – to practice writing sentences rather than the code, system design and justification documents, and PowerPoints I had been producing for over a dozen years.)
Back to the matter at hand. Because some of the paper is hard for computer or human eyes to read, OCR can produce gibberish or near gibberish. This effectively locks away the document until it comes up through manual steps (click, scroll, click…), at which time maybe I tag it. Going on my experience, I figured a better OCR solution might be available. For my use, ABBYY is a good solution. Maybe there are better options, but commercial solutions like the one I used at the turn of the century don’t fit. Besides bearing enterprise price tags, these systems are usually modules to be plugged into a larger workflow, like my system back then that was scanning about 10,000 page-sides a day, five days a week. In such workflows, one often has to balance time to OCR with throughput, i.e., balancing quality with quantity. For me, quality trumps.
It’s one thing to say the OCR is different; it’s another to see it. So here are three random examples I plucked out this morning. This isn’t always an apples-to-apples comparison. For example, it is safe to assume an OCR engine in 2013 might produce an inferior output to a 2023 engine (it’s too early for a real 2024 version). For the first example, and only the first example, I compared the Adobe 2013 desktop OCR with the 2023 desktop OCR and with the current version of ABBYY.
I left the formatting as determined by the respective OCR systems, including line breaks and italics.
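For anyone who wants a rough number to go with the eyeball test, here is a minimal sketch using Python’s standard-library difflib. The function name `similarity` is my own, and the two strings are short excerpts from the first example below. Note the assumption: this only measures how far one engine’s output sits from another’s, not from the truth, since no OCR output here is a verified transcription.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two OCR outputs."""
    # Collapse runs of whitespace so differing line breaks do not
    # dominate the score; only the characters themselves are compared.
    norm_a = " ".join(a.split())
    norm_b = " ".join(b.split())
    return SequenceMatcher(None, norm_a, norm_b, autojunk=False).ratio()


# Short excerpts copied from the first example in this post.
adobe_2013 = "Peace is a -v;ay of life which L1Ust be inplanted in the hearts and Minds"
abbyy = "Peace is a way of life which must be implanted in the hearts and minds"

score = similarity(adobe_2013, abbyy)
print(f"Adobe 2013 vs ABBYY similarity: {score:.2f}")
```

A ratio of 1.0 means the two outputs are identical after whitespace is normalized; the lower the number, the more the engines disagree.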
Adobe Acrobat 10.1.5 Paper Capture from 2013:
Washington, D.C. April 27. "Peace is a -v;ay of life which L1Ust be inplanted in the hearts and Minds of our presrnt generation of children whm they are young and safe- ~uard~d by appropriate international ~achinery to guard against hazards of war occur- ~ng when they reach adulthcod11 , declared Representative Karl E. Mundt (R.S.D. ) today
in an address in the House in which he advo~at.ed uearly hearings and favorable action11 on ~is bill, HRes 215, which has for its purpose the establishnent of an International Off~ce of Education. He urged the San Francisco Conference to take acticn providing for a subsequent international c?nfprence of educators to wcrk out the details and plans of such an ofHce.
Adobe Acrobat Pro 23 from today (2024):
Washington, D.C. April 27. "Peace is a -v;ay of life which r:-.ust be inplanted in the
hearts and Minds of our presrnt generation of children whrn they are young and safe~
ardcd by appropriate international machinery to guard against hazards of war occur~
ng when they reach adulthood 11, declared Representative Karl E. Mundt (R.S.D.) today
in an address in the Hcuse in which he advocated 11early hearings and favorable action 11
on ~is bill, H Res 21S, v:hich has for its purpose tl:e estabL.shnent of an Internationa:::.
Office of lliucaticn. He urged the San Francisco Conference to take action providing
for a subsequent international c0nfprence uf educators to wcrk out the details and
plans of such an office
ABBYY:
Washington, D.C. April 27. "Peace is a way of life which must be implanted in the hearts and minds of our present generation of children when they are young and safeguarded by appropriate international machinery to guard against hazards of war occur- ing when they reach adulthood", declared Representative Karl E. Mundt (R.S.D.) today in an address in the House in which he advocated "early hearings and favorable action" ® bill, H Res 215, which has for its purpose the establishment of an International Office of Education. He urged the San Francisco Conference to take action providing f^r a subsequent international conference of educators to work out the details and plans of such an office.
Source image:
From an April 28, 1945, press release from Mundt’s office entitled “South Dakota Representative Urges Congress Now Adopt Resolution Advocating International Office of Education to Train Minds for Peace.”
HathiTrust Image Server:
4. Pocket guide to foreign countries. —Wherever there is an appreciable num
ber of military personnel, a guide is published for distribution to service per
sonnel prior to their departure for that particular country. While these booklets
contain a great deal of useful information on local conditions and customs, as
well as common phrases in the native language, emphasis is on how members
of the Armed Forces should conduct themselves in a foreign country. Pocket
guides have been published on Germany, Japan, Alaska, France, and Great
Britain. Guides on the following are in varying stages of preparation : French
Morocco, Italy, Turkey, Greece, Spain-Portugal, Philippines, Low Countries, and
Austria. Also being planned is A Pocket Guide to Anywhere to be issued to all
military personnel going overseas.
ABBYY:
4. Pocket guide to foreign countries.—Wherever there is an appreciable number of military personnel, a guide is published for distribution to service personnel prior to their departure for that particular country. While these booklets contain a greht deal of useful information on local conditions and customs, as well as common phrases in the native language, emphasis is on how members of the Armed Forces should conduct themselves In a foreign country. Pocket guides have been published on Germany, Japan, Alaska, France, and Great Britain. Guides on the following are in varying stages of preparation: French Morocco, Italy, Turkey, Greece, Spain-Portugal, Philippines, Low Countries, and Austria. Also being planned is A Pocket Guide to Anywhere to be Issued to all military personnel going overseas.
[Wait… Alaska?] The HathiTrust solution did very well, though you can see it inserted line breaks at the end of each line.
Source image:
The source is the 1952 Senate hearing on the “Overseas Information Programs of the United States,” which were not exclusively those of the US Information Service, then under the State Department.
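The line-end breaks in the HathiTrust output above can be repaired after the fact. Here is a minimal sketch in Python, not anything HathiTrust or ABBYY provides. The function name `unwrap` and the tiny `vocabulary` set are hypothetical; real use would need a full word list. It glues two fragments together when the first line ends in a hyphen, or when the fragments only form a known word once joined.

```python
def unwrap(lines, vocabulary):
    """Join OCR lines into one paragraph, repairing words split across lines."""
    out = []
    for line in lines:
        words = line.split()
        if out and words:
            # Candidate word: end of the previous line plus start of this one.
            joined = out[-1].rstrip("-") + words[0]
            core = joined.strip(".,;:").lower()
            ends_with_hyphen = out[-1].endswith("-")
            # Case without a hyphen: the fragment alone is not a word,
            # but the glued result is.
            forms_known_word = core in vocabulary and out[-1].lower() not in vocabulary
            # Caveat: a true compound broken at its hyphen (e.g. "Spain-" then
            # "Portugal") would lose the hyphen under this simple rule.
            if ends_with_hyphen or forms_known_word:
                out[-1] = joined
                words = words[1:]
        out.extend(words)
    return " ".join(out)


# Excerpt of the HathiTrust output above, with its original line breaks.
lines = [
    "Wherever there is an appreciable num",
    "ber of military personnel, a guide is published for distribution to service per",
    "sonnel prior to their departure for that particular country.",
]
vocabulary = {"number", "personnel"}  # hypothetical, deliberately tiny word list

paragraph = unwrap(lines, vocabulary)
print(paragraph)
```

The vocabulary check matters here because the HathiTrust text drops the hyphen at the break (“num” then “ber”), so a hyphen test alone would miss it.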
HathiTrust Image Server:
The post of Assistant Secretary of State in
charge of public and cultural relations is a new
one in the Department. It covers current activ
ities and future problems of great importance to
our foreign relations . To this position the Presi
dent has nominated Archibald MacLeish, Librar
ian of Congress since 1939. I believe that the
new problems involved in making a secure peace
require that much fuller information about United
States foreign policy should be made available
through the established press , radio , and other
media both to the people of this country and the
people of other countries .
ABBYY:
The post of Assistant Secretary of State in charge of public and cultural relations is a new one in the Department. It covers current activities and future problems of great importance to our foreign relations. To this position the President has nominated Archibald MacLeish, Librarian of Congress since 1939. I believe that the new problems involved in making a secure peace require that much fuller information about United States foreign policy should be made available through the established press, radio, and other media both to the people of this country and the people of other countries.
Image:
From the December 10, 1944, issue of the State Department’s Bulletin, “a weekly publication compiled and edited in the Division of Research and Publication, Office of Public Information” to provide “the public and interested agencies of the Government with information on developments in the field of foreign relations and on the work of the Department of State and the Foreign Service.”
Adobe Acrobat Pro 23 Paper Capture:
Fleet Admiral C. W. Nimitz, letter to Mr. Benton, Feb. 18, 1947-
The proposition that the United States must sell its merits to the people
abr.o ad as well as at ho me 1• s tru 1y an i•m portant aspect of nati.o nal defense.
!t IS one which tends to be ovtrlooked. As you say, the use of facts and
ideas can be a potent weapon in time of war. It should be a shield of
de f ense m• ti•m e of peace. We must take every reasonable means to combat
the flood of false and mibleading propaganda in foreign countries.
We must substitute for the false a clear understanding abroad of the
efforts the United Nation_s is making to further international peace and
security. One means to achieve our ends is to promote an understanding
of the work of your office and the vital need for it.
General Omar N. Bradley, broadcast, March 25, 1947-
If we tear down the blackout curtains that so often shroud the minds of
men, we may eventually mobilize the free weight of world opinion to
create and sustain a lasting peace.
ABBYY:
Fleet Admiral C, W. Nimitz, letter to Mr, Benton, Feb, 18,1947— The proposition that the United States must sell its merits to the people abroad as well as at home is truly an important aspect of national defense. It is one which tends to be overlooked. As you say, the use of facts and ideas can be a potent weapon in time of war. It should be a shield of defense in time of peace. We must take every reasonable means to combat the flood of false and misleading propaganda in foreign countries. We must substitute for the false a clear understanding abroad of the efforts the United Nations is making to further international peace and security. One means to achieve our ends is to promate an understanding of the work of your office and the vital need for it.
General Omar N. Bradley, broadcast, March 25, 1947—
If we tear down the blackout curtains that so often shroud the minds of men, we may eventually mobilize the free weighty of world opinion to create and sustain a lasting peace.
Source image:
From an April 26, 1947, 21-page document (“not printed at government expense”) called “Do We Make Ourselves Clear? Quotations from Experienced Observers Who Have Studied the State Department’s Overseas Information Program.”2
That’s it – and enough – for now. I expect the next post will be more relevant and meaningful to “Arming for the war we’re in.”
It was about then. It was certainly after the Wall fell, as I remember that the Danish professor of my European Security Studies class largely tossed out the syllabus and handed out new materials every week. I had designs on going to Czechoslovakia to help the return of capitalism, going so far as to look into the export permit I’d need for my 486/25 computer, which was then classified as a munition.
In April 1947, the State Department was building support for its bill to authorize its international exchange and information programs. Rep. Karl Mundt (R, South Dakota) had introduced it in January 1945 as an educational exchange bill, and Dean Acheson sent it to Archibald MacLeish, the new Assistant Secretary of State for Public and Cultural Relations. MacLeish loved it, leading the Democratic chairman of the House Foreign Affairs Committee to take over Mundt’s bill, making it the Bloom bill. Though it passed the House by an overwhelming majority in July 1946, a lack of stewardship in the Senate meant five or so Republican Senators prevented the bill, seen as a Democratic bill, from even getting reviewed in committee. The State Department desperately needed the authorities in the bill for its basic operations abroad: various ways of supporting foreign nations and US projects abroad, including the ability to provide staff from other agencies requested by foreign governments and to share the burden of joint projects between nations, not to mention support for expansive educational exchanges and international information programs. In May 1947, at the request of the State Department, Mundt reintroduced the Bloom bill, with only minor changes, under his name and that of a Senate co-sponsor, Alexander Smith. It quickly passed the House the next month, largely because it was basically the same bill the House had previously passed.
Wow, my part-time job while in college was at a credit union, being a human cog in one of those processes of converting paper files and even older digital files to a compressed format to be saved to optical drives... which was the cutting-edge tech at the time.
We've come a long way!