I’ve been busy the last few weeks with finishing one dh-project (the digital edition of an early modern German guidebook to complimenting) and starting the digital history Ph.D. course-dissertation-project-joint-venture. My initial idea was to use computer-assisted text analysis (incl. topic modelling) on a corpus of scholarly critical reviews and review articles of scholarly editions of literary works by well-known German writers from the past. The editions are from the last two to three decades. I wanted to see whether or not a computer-assisted, ‘automatic’ distant reading of ca. 150 reviews could give me some ideas or starting points for further in-depth analysis. Basically, I wanted to exploit text mining and topic modelling while still in the ‘context of discovery’. My overall focus in The Thesis is on the normative framework of modern German textual scholarship (“Editionsphilologie”), and I was curious what might be ‘hidden’ in the corpus of critical reviews that could be used within the scope of my survey.
To make a long story short: It did not (and it still does not) work the way I want. At all. I’ll give you a brief description of what went wrong and point out some solutions (in general, not for me because of time…) and thus hopefully provoke a discussion on the issue!
1. The Corpus
145 German scholarly critical reviews of scholarly editions of the works and writings of German-speaking authors. The reviews were published between 1990 and 2012; the scholarly editions were published in more or less the same period. The literary writers include, among others: Franz Kafka, Georg Büchner, Georg Trakl, Paul Celan, Conrad Ferdinand Meyer, Georg Heym, Achim von Arnim, Heinrich von Kleist. The authors of the critical reviews are often scholarly editors themselves or are quite familiar with textual scholarship and editorial theory. Others are ‘experts’ on the writers mentioned, or on literary epochs or genres. The length of the reviews ranges from ‘short’ reviews (1–2 pages in print, ca. 500–1000 words) to ‘long’ review articles (up to 23 pages in print, ca. 11,500 words), while the vast majority are ca. 2500–3000 words (5–6 pages in print) long. They all include footnotes (German scholarly convention) and give all references in footnotes (no in-line references in parentheses).
2. The Data
All of the reviews appeared in printed journals first (editio. Internationales Jahrbuch für Editionswissenschaft (1987–), Arbitrium. Zeitschrift für Rezensionen zur Germanistischen Literaturwissenschaft (1983–), both De Gruyter Publishing Company; Editionen in der Kritik. Editionswissenschaftliches Rezensionsorgan (2005–), Weidler Buchverlag; Text. Kritische Beiträge (1995–), Stroemfeld Verlag; as well as some Author-Yearbooks). Two thirds of the reviews in my corpus have since been digitised and OCRed by the publisher (mainly De Gruyter) and are available as pdf files. The others were digitised manually by me (i.e. scanned, or first photocopied and then scanned) and then OCRed with either Adobe Acrobat Professional® or Abbyy Fine Reader Express® (for Mac); they are also pdfs.
The publisher-generated pdfs are partly retrodigitised (i.e. the publisher did the same as I did) and partly generated from the .doc or .docx files the authors of the reviews submitted for publication. (Note: each pdf file from De Gruyter costs 39,95€ if you’re unfortunate and your university library doesn’t provide full access!)
I assumed that the pdfs produced by the publishers were proof-read after the OCR had been done. I also assumed that the quality of the pdfs I produced was very good. I was wrong both times. These were the issues:
i) All retrodigitised pdfs from the publisher were a) erroneous (letters not recognised correctly, additional spacing within words, no distinction between the main text and the text of the footnotes, letters and sometimes whole words not recognised at all, i.e. left as blank space), or b) incomplete (text had been cut off at the beginning or the end; text had not been scanned, or had been scanned so badly that it was unrecognisable).
ii) All pdfs I OCRed with Abbyy Fine Reader were total crap: not ONE word was recognised correctly.
iii) All pdfs I OCRed with Adobe Acrobat Pro were more or less ok but still with too many errors.
iv) In ALL pdfs (mine and the publishers’; new and old ones), hyphenated words were recognised not as one word but as two distinct ‘words’, and there was no distinction between the text proper and the text of the footnotes.
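The hyphenation problem in iv) is one of the few issues on this list that can sometimes be patched after text extraction. What follows is only a minimal sketch in Python, not something I ran on this corpus. It assumes the text has already been exported from the pdf as plain text and that the split words end a line with a hyphen. The function name `dehyphenate` and the lowercase-continuation rule are my own illustrative choices.

```python
import re

# Hypothetical helper, not part of any OCR tool: matches a word fragment
# followed by a hyphen at the end of a line and a lowercase continuation
# at the start of the next line.
_LINE_END_HYPHEN = re.compile(r"(\w+)-[ \t]*\n[ \t]*([a-zäöüß]\w*)")


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at a line break.

    Only joins when the continuation starts with a lowercase letter, so
    genuine compounds such as 'Kafka-' + 'Ausgabe' keep their hyphen
    (and their line break).
    """
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "Editions-\nphilologie und Kafka-\nAusgabe"
    print(dehyphenate(sample))
```

The lowercase rule is a heuristic and will make mistakes: German suspended hyphens (e.g. "Ein-" at the end of a line followed by "und Ausgabe") would be joined wrongly, so the output still needs spot checks. It also does nothing about the footnote problem.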
Before I learned of the problems with the pdfs, I tried out the papermachines add-on for Zotero. Nothing useful came out of it, not even real words! So my conclusion and ‘solution’ for the moment would be: be careful what kind of digital text(s) you use, especially when it’s not ‘plain text’ but pre-formatted text (pdfs etc.), because the formatting will most likely ruin your text data. If you decide to do the scanning and optical character recognition yourself, keep in mind that the results depend heavily on the quality of
a) the print (colour, paper, font),
b) the scan (low-res, high-res, grayscale, black/white, colour, TIFF, JPEG, PDF etc.),
c) the overall formatting of the text (are there footnotes? or marginalia? or pictures? or strange fonts?),
d) the language or languages of the text (English works ok, but try Polish or Russian, or even Danish!),
e) the performance of your software (the ‘professional’ Abbyy Fine Reader is crap; my old Adobe Pro works better but is still not good enough) and how well it works with your hardware and platform of choice.
And last but not least: considering all of this, are the work, time and nerves you have to put into ‘cleaning’ your data before you can do anything cool with it worth the (potential) outcome of the whole endeavour?
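One cheap way to get a first answer to that question, before investing in any cleaning, is to measure how many of the OCRed tokens are real words at all. The following is a rough sketch under stated assumptions, not a method I used: the function name `known_word_ratio` is hypothetical, and the vocabulary is whatever word list you have at hand for the language of your texts (here just a toy set).

```python
import re


def known_word_ratio(text: str, vocabulary: set) -> float:
    """Return the share of alphabetic tokens in `text` found in `vocabulary`.

    `vocabulary` is assumed to be a set of lowercase word forms supplied
    by the user. Digits and punctuation split tokens, so OCR noise such
    as 'D1e' counts as the unknown fragments 'd' and 'e'.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


if __name__ == "__main__":
    toy_vocabulary = {"die", "edition", "ist", "gut"}
    print(known_word_ratio("Die Edition ist gut", toy_vocabulary))
    print(known_word_ratio("D1e Ed it ion i5t gvt", toy_vocabulary))
```

A low ratio on a sample page suggests the file is not worth cleaning by hand. A high ratio proves little, since inflected forms, names and scholarly vocabulary missing from the word list pull the number down, and correctly spelled words in the wrong place (footnotes mixed into the main text) are not detected at all.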
For now I won’t continue the work on this specific set of texts for the reasons I mentioned above. Nevertheless, I am looking forward to your comments and suggestions!
P.S. In another blog post I’m going to tell you what it’s like to have your website hacked and infested with phishing content, about the legal stuff that comes with that, and above all about not having a well-groomed online presence for almost 6 weeks. Soon.