
Post-project work reflections

There are a few reasons why I chose data visualization as the theme to explore during my project work for the #UMEDH course. Firstly, I assumed it could be a very rewarding tool, particularly for improving the visual presentation of the study. Secondly, I took it for granted that the historical data I am using would more or less automatically convert into colorful graphs. This fitted well with my lack of programming experience, which the other tools and methods seemed to require. Finally, it has always struck me how comprehensible figures can be compared to dozens of layers of analytical text. Having said that, the #UMEDH course helped me to verify some of these assumptions, if not to understand how wrong I was. The processes of selecting a tool, converting the data, and working out the results test a scholar’s patience. Still, I would argue that there is a lot of fun involved in trying out these tools, even if the results are not always what we were hoping for.

Lukasz

Data visualizations – to be continued…

Hi!

In this post I thought I would share with you some of the experiences I have had while working with the visualizations. The first thing I have learned is that a successful visualization requires simple data structures. I only came across this catchphrase recently, after dozens of attempts to create or transform numbers into individual cells, rows and columns, but also NumberLists, OrderedNumbers and other constellations. This step is necessary to make the numbers recognizable by the system, but the problem is that there is no easy way to understand how they should be transformed. In addition, the data converters designed to help either do not respond or point to some unidentified errors. The beginning was very promising. I had chosen the Quadrigram tool, which promised not only to create custom data visualizations in “an intuitive way” but also to do it “painlessly”. There is no doubt that this system was also chosen because of its superbly polished visualizations. So, as a diligent student, I went through the ‘Getting started’ course, analyzed the templates and started to import a local file with data. This all went fine, but as soon as I reached the point of conversion, my data seemed to be too complicated. The transformation into a different structure did not work either. I understand that it all comes down to individual motivation, competence and experience, but… come on, I did try. I am sure it would be easier for someone with a more advanced understanding of programming. In the end, I gave up all hope of ever having super fancy, shiny, blinking visualizations and turned towards a simple but well-tried solution. Thanks to Finn Arne, IBM Many Eyes became the new choice, and it hit me almost like a ton of bricks how easy it is to upload data and choose a visualization there. This saved me a great deal of time. Here are some of the results, which I would like to share with you:

– The share of nationalities selected by the Swedish delegation from the UNHCR camps between 1968 and 1972

– The occupational composition of migrants selected by the Swedish delegation from the UNHCR camps between 1968 and 1972

– Directions of departures from refugee camps for the period from October 1969 to May 1970
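For anyone wondering what “a simple data structure” means in practice: tools like Many Eyes essentially want a flat table with one header row and one value per cell. Here is a minimal sketch of reshaping a “wide” yearly table into that form; the nationalities and numbers below are invented for illustration, not my actual archive figures:

```python
import csv

# Hypothetical counts per year and nationality ("wide" layout).
selected = {
    "1968": {"Czechoslovak": 120, "Polish": 45, "Hungarian": 10},
    "1969": {"Czechoslovak": 95, "Polish": 60, "Hungarian": 15},
}

# Reshape into the flat "one observation per row" layout that
# visualization tools tend to accept without complaint.
with open("selections_long.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["year", "nationality", "selected"])
    for year, counts in selected.items():
        for nationality, n in counts.items():
            writer.writerow([year, nationality, n])
```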

Lukasz

The Other Side of the Coin, Or: When It Just Won’t Work.

Hei Folks!

I’ve been busy the last few weeks with finishing one dh-project (the digital edition of an early modern German guidebook to complimenting) and starting with the digital history Ph.D. course-dissertation-project-joint-venture. My initial idea was to use computer-assisted text analysis (incl. topic modelling) for a corpus of scholarly critical reviews and review articles of scholarly editions of literary works by well-known German writers from the past. The editions are from the last two to three decades. I wanted to see whether or not a computer-assisted, ‘automatic’ distant reading of ca. 150 reviews could give me some ideas or starting points for further in-depth analysis. Basically, I wanted to exploit text mining and topic modelling while still in the ‘context of discovery’. My overall focus in The Thesis is on the normative framework of modern German textual scholarship (“Editionsphilologie“) and I was curious what might be ‘hidden’ in the corpus of critical reviews that could be used in the scope of my survey.
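To make the plan a bit more concrete: what I had in mind was roughly something along these lines – a minimal sketch that assumes the 150 reviews are already available as clean plain-text files (which, as you will see below, is exactly what I did not have). File paths and parameters are purely illustrative:

```python
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative layout: one plain-text file per review.
texts = [p.read_text(encoding="utf-8")
         for p in sorted(Path("reviews_txt").glob("*.txt"))]

# Bag-of-words matrix; a real run would need a German stop word list
# and probably lemmatisation.
vectorizer = CountVectorizer(max_df=0.9, min_df=5)
dtm = vectorizer.fit_transform(texts)

# Fit a small topic model and print the top words per topic.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(dtm)
vocab = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [vocab[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```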

To make a long story short: it did not (and it still does not) work the way I want. At all. I’ll give you a brief description of what went wrong, point out some solutions (in general, not for my own case, for lack of time…) and thus hopefully provoke a discussion on the issue!

1. The Corpus

145 German scholarly critical reviews of scholarly editions of the works and writings of German-speaking authors. The publication dates of the reviews lie between 1990 and 2012, and the scholarly editions have been published in more or less the same period. The literary writers include, among others: Franz Kafka, Georg Büchner, Georg Trakl, Paul Celan, Conrad Ferdinand Meyer, Georg Heym, Achim von Arnim, Heinrich von Kleist. The authors of the critical reviews are often scholarly editors themselves or are quite familiar with textual scholarship and editorial theory. Others are ‘experts’ on the writers mentioned, or on literary epochs or genres. The length of the reviews ranges from ‘short’ reviews (1–2 pages in print, ca. 500–1,000 words) to ‘long’ review articles (up to 23 pages in print, ca. 11,500 words), while the vast majority are ca. 2,500–3,000 words (5–6 pages in print) in length. They all include footnotes (a German scholarly convention) and also list any kind of reference in footnotes (no in-line references in parentheses).

2. The Data

All of the reviews appeared in printed journals first (editio. Internationales Jahrbuch für Editionswissenschaft (1987–) and Arbitrium. Zeitschrift für Rezensionen zur Germanistischen Literaturwissenschaft (1983–), both De Gruyter Publishing Company; Editionen in der Kritik. Editionswissenschaftliches Rezensionsorgan (2005–), Weidler Buchverlag; Text. Kritische Beiträge (1995–), Stroemfeld Verlag; as well as some author yearbooks). Two thirds of the reviews in my corpus have since been digitised and OCRed by the publisher (mainly De Gruyter) and are available as PDF files. The others had been manually digitised (i.e. scanned, or first photocopied and then scanned) by me and then OCRed with either Adobe Acrobat Professional® or Abbyy Fine Reader Express® (for Mac), and are also PDFs.
The publisher-generated PDFs are partly retro-digitised (i.e. they did the same as I did) or have been generated from the .doc or .docx files the authors of the reviews submitted for publication. (Note: each PDF file from De Gruyter costs 39.95 € if you’re unfortunate and your university library doesn’t provide full access!)
I assumed that the PDFs produced by the publishers had been proof-read after the OCR was done. I also assumed that the quality of the PDFs I produced was very good. I was wrong both times. These were the issues:

i) All retro-digitised PDFs from the publisher were a) erroneous (letters not recognised correctly, additional spacing within words, no distinction between the main text and the text of the footnotes, letters and sometimes whole words not recognised at all, i.e. blank space), or b) incomplete (text had been cut at the beginning or the end; text had not been scanned, or had been scanned so badly that it was unrecognisable).

ii) All PDFs I OCRed with Abbyy Fine Reader were total crap: not ONE word was recognised correctly.

iii) All PDFs I OCRed with Adobe Acrobat Pro were more or less OK, but still contained too many errors.

iv) In ALL PDFs (mine and the publisher’s, new and old) hyphenated words were not recognised as one word but as two distinct ‘words’, and there was no distinction between the text proper and the text of the footnotes.

Before I learned of the problems with the PDFs, I tried out the Paper Machines add-on for Zotero. Nothing useful came out of it, not even real words! So, my conclusion and ‘solution’ for the moment would be: be careful what kind of digital text(s) you use, especially when it’s not ‘plain text’ but pre-formatted text (PDFs etc.), because it will most likely ruin your text data. If you decide to do the work yourself, with scanning and optical character recognition, keep in mind that the results depend heavily on the quality of
a) the print (colour, paper, font),
b) the scan (low-res, high-res, grayscale, black/white, colour, TIFF, JPEG, PDF etc.),
c) the overall formatting of the text (are there footnotes? or marginalia? or pictures? or strange fonts?),
d) the language or languages of the text (English works OK, but try Polish or Russian, or even Danish!),
e) the performance of your software (the ‘professional’ Abbyy Fine Reader is crap; my old Adobe Pro works better but is still not good enough) and how well it works with your hardware and platform of choice.
And last but not least: considering all of this, is the work, time and nerves you have to put into ‘cleaning’ your data before you can do anything cool with it worth the (potential) outcome of the whole endeavour?
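That said, some of the purely mechanical damage can be patched up before any analysis. Here is a minimal sketch of a post-OCR cleanup step, assuming the PDFs have first been exported to plain text; the patterns are illustrative, they will not catch every case, and they do nothing about the footnotes mixed into the text:

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Very rough post-OCR cleanup: rejoin words hyphenated at line
    breaks and normalise whitespace. Separating the footnotes from the
    text proper still has to be done by hand."""
    # "Editions-\nphilologie" -> "Editionsphilologie"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Remaining line breaks become plain spaces.
    text = re.sub(r"\s*\n\s*", " ", text)
    # Collapse the extra spacing OCR likes to insert inside lines.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

# Illustrative usage (the file name is a placeholder):
# with open("review_scan.txt", encoding="utf-8") as f:
#     cleaned = clean_ocr_text(f.read())
```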

For now I won’t continue the work on this specific set of texts for the reasons I mentioned above. Nevertheless, I am looking forward to your comments and suggestions!

P.S. I’m going to tell you what it’s like to have your website hacked and infested with phishing content, about the legal stuff that comes with that, and above all: about not having a well-groomed online presence for almost six weeks – in another blog post. Soon.

Digital visualization of study results

The graphical representation of study results is not the first thing that comes to mind when one thinks of historical research. It is, however, an interesting alternative for historians dealing with a mixture of qualitative and quantitative data.

My thoughts about the visualization of study results evolved after a visit to the Labour Market Board archives. I knew that in order to stress the importance of the diverse material found in the archive, the figures and texts would have to be combined and presented in diagrams, tables or other digitally generated graphical structures. The inspiration to employ visualization methods came from Martyn Jessop’s article Digital visualization as a scholarly activity and Alan Liu’s When Was Linearity?: The Meaning of Graphics in the Digital Age, both listed in the PhD course on Digital History. Both authors stress that the graphic presentation of numeric data and of large volumes of text have a lot in common, and they present several examples of how visualization in the humanities can be achieved. The TAPoR text-analysis portal and the IBM Many Eyes data-visualization tools are two examples presented for visualizing both the results of an analysis and the texts themselves.

Inspired by these examples, I will try to transform my archive material into a dataset and then present the results through digitally generated designs. As easy as it may sound to some of you, I am entering a completely unknown field. Thus, I have prepared a short overview of the data, together with a number of problems which I hope can be addressed with the help of graphical representation. Any advice, or references to previous examples, will be highly appreciated.

– Reports after five intakes of migrants.

Problem: The diversification of the social composition across intakes (each intake had a different number of intellectuals, blue-collar workers etc.; the data consist of a list of professions and the number of migrants).


– Reports after visits to eleven migration centers

Problem: Work placement (in each report the officials reflected on the problems with work placement; this can be related to the social composition of the center at the time of the visit; the data consist of quantitative information on the social composition and descriptions of the work placement problems)


– Weekly registers of the departures from the camps

Problem: Changes in destinations over time (the number of migrants who were placed in work or were sent on to other destinations; quantitative data)


– Accommodation after a certain time interval (the movements of migrants after their departure from the reception centers; the address and the date of arrival at the new place)
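To make the first of these problems a bit more concrete, here is a rough sketch of how I imagine turning the intake reports into a dataset and a first chart. The professions and numbers below are invented placeholders, not figures from the archive:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts per intake and occupational group.
records = [
    {"intake": 1, "profession": "intellectuals", "migrants": 12},
    {"intake": 1, "profession": "blue-collar workers", "migrants": 40},
    {"intake": 2, "profession": "intellectuals", "migrants": 25},
    {"intake": 2, "profession": "blue-collar workers", "migrants": 30},
]
df = pd.DataFrame(records)

# One row per intake, one column per profession -> stacked bars showing
# how the social composition shifts from intake to intake.
table = df.pivot(index="intake", columns="profession", values="migrants").fillna(0)
table.plot(kind="bar", stacked=True)
plt.ylabel("Number of migrants")
plt.title("Social composition per intake (illustrative data)")
plt.tight_layout()
plt.savefig("composition_per_intake.png")
```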

Lukasz

Project update

OK, finally a sign of life from this silent horizon… I have been busy moving myself, my family and loads of research material to Rome, but I am now starting to get on top of things little by little. The wifi situation here is generally pretty bad, which is frustrating for a wifi addict, but we are working on improvements (hopefully).

Brief note about my project:
I have met other scholars here working on the Grand Tour in premodern times, though none working as early as my period (the 17th century), and they are also very much into historical maps etc. – however, not at all using digital tools (yet! I have a mission!).
Fiddling around with Omeka and Neatline, I am still on track with trying to visualize one day or one stay for a couple of travelers – I think I will pick the two most informative ones and create a comparison based on maps, routes, highlights described, and comments/reactions (if any). To this I will add the vademecums or guidebooks used by visitors to Rome at the time – also quite fun material visually – and perhaps some information on the local guides conducting the tours. Still quite premature, but I hope to develop this during the coming weeks!
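To give an idea of what the underlying data for one such day might look like, independently of which mapping tool ends up drawing it, here is a small sketch. The stops, times, dates and coordinates are rough placeholders, not taken from my travelers’ accounts:

```python
import json

# One day of a hypothetical itinerary: ordered stops with approximate coordinates.
stops = [
    {"name": "Piazza del Popolo", "time": "09:00", "lon": 12.4764, "lat": 41.9109},
    {"name": "Pantheon", "time": "11:30", "lon": 12.4768, "lat": 41.8986},
    {"name": "St. Peter's", "time": "15:00", "lon": 12.4539, "lat": 41.9022},
]

day_route = {
    "type": "FeatureCollection",
    "features": [
        # The day's route as a line...
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[s["lon"], s["lat"]] for s in stops],
            },
            "properties": {"traveler": "Traveler A", "date": "1675-04-12"},
        },
        # ...and each stop as a point with its own label.
        *[
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [s["lon"], s["lat"]]},
                "properties": {"name": s["name"], "time": s["time"]},
            }
            for s in stops
        ],
    ],
}

with open("day_route.geojson", "w", encoding="utf-8") as f:
    json.dump(day_route, f, ensure_ascii=False, indent=2)
```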

Regarding the essay, I am quite tempted to take up the possibility of publishing something, as offered by Finn Arne, but I want to make sure I have something interesting enough to say first, and that I have sufficient time during my stay here in Rome. Will get back on that, too!

I have blogged on various themes, often with a bearing on our course in one way or another – please comment if you like! holysmokephdblog.blogspot.se

This in a hurry from hot and humid Rome, now facing a thunderstorm (don’t even want to think about what that means for wifi…). A presto!
Helena

Better late than never…

I am sure this introduction comes all too late, but as the saying goes, better late than never.

So, my name is Lukasz Gorniok and I am a PhD candidate in history at Umeå University. My research deals with the emergence of Sweden’s active orientation in world politics in the late 1960s and early 1970s. The study aims to review and evaluate Swedish foreign and migration policies by examining the politics which shaped the Swedish response to the events in Czechoslovakia and Poland, and to the refugees fleeing these communist countries. It is based on multi-archival research in the National Archives of Sweden: the archives of the Swedish Ministry for Foreign Affairs and of the Swedish embassies in Warsaw and Prague. The material comprises diplomatic correspondence between these institutions, public and confidential reports, memoranda and minutes of meetings. In principle, these records require purely qualitative methods, and until now my work has focused on various analyses of these data.

The presentation of these results is another story. I hope to use the Digital History course as a springboard for the digital presentation of my research. In other words, my aim is to improve the quality of the study’s presentation. One of these days I will give a more detailed overview of how this will be done.

Lukasz

Week 39: Text Mining History

This first week of the “Close and Distant Reading in the Historical Practice” section, we’ll mostly focus on the theory and purpose of text mining – we’ll get to try out more later. Several of our texts explore the 19th century and how digital methods have reshaped the study of the Victorian era.

I apologize for the somewhat messy formatting of this list, by the way. I would be very grateful if someone could properly enter these references in our Zotero bibliography.

What do you think of these texts? Did something surprise you? Provoke you? Can you see potential applications to your own research? What are the obvious and not so obvious limits to text mining as a research method?

Please use the comment field on this post for the discussion – and suggest other texts as well.

I would also like you to install Zotero if you haven’t already, and then Paper Machines – we will be doing some work with it next week.

Introducing myself (finally)

I realized that I never formally introduced myself here on the blog. Just being rude and Norwegian as usual.  My name is Ola Nordal, and I’m a historian of technology, science and knowledge turned musicologist. I did university history for some years, but now I combine my interests in technology, history and music in a PhD project about the Norwegian composer Arne Nordheim (1931-2010). A nice starting point for checking him out is this stunning video.

My approach is mainly historical, and I focus on a period where Nordheim was working in an electronic music studio in Warsaw in the 1960s and 70s. For the historical side of the project I use a combination of online material (online archives, books etc) and traditional material (paper archives, newspaper archives and endless hours in front of the microfilm apparatus). The archival situation for Nordheim-related stuff is bad, so I get to use my skills as a document archaeologist – something I enjoy tremendously. If academia fails me, I would love to be a private detective.
I also make analyses of the music I study, and for that I use a broad palette of digital tools, like spectrograms for visualization of sound, smart filters to isolate single sounds in complex textures, and custom software for performing so-called “sonological” analyses or transcriptions.
Sonogram and waveform for the work The Paper Bird (1967). Isn’t it pretty?

 

Sonological transcription of Pace (1970) – in progress. Doesn’t it look wonderfully complicated?
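For anyone curious about the mechanics: a sonogram of this kind can be produced with quite ordinary tools. Here is a minimal sketch in Python using scipy and matplotlib – the file name is a placeholder, and this is of course not the dedicated software I use for the actual analyses:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy import signal

# Placeholder file name; any WAV file will do.
rate, samples = wavfile.read("nordheim_excerpt.wav")
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # mix a stereo file down to mono

# Short-time Fourier transform: energy per frequency band over time.
freqs, times, spec = signal.spectrogram(samples, fs=rate, nperseg=2048)

# Plot in decibels; musical material reads better on a log-like frequency axis.
plt.pcolormesh(times, freqs, 10 * np.log10(spec + 1e-12), shading="auto")
plt.yscale("symlog")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram (sketch)")
plt.tight_layout()
plt.savefig("spectrogram.png")
```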
I’ve been blogging since 2005, but some years ago I took a step out of the Internet domain. When I started my PhD I started a new blog, but it hasn’t been used much until lately. I plan to blog more systematically in the time to come. I recently did a small overhaul of the site, partly inspired by Annika’s excellent presentation site.
And yes: I play guitar. I also sing pretty badly. Never ever ask me to play piano.