Digitisation

Platform and Interface(s)

The BLT19 Project website was built using WordPress and a variety of free plugins. While the platform has its limitations in terms of asset size and management, its flexibility, cheapness and ease of use made it suitable for the project.

We wanted to give users more than one way to interact with and download the materials, including an embedded PDF viewer, downloadable single-page images, and downloadable PDFs. The site also offers an alternative interface (Metabotnik, a project funded by the Netherlands Organisation for Scientific Research), which allows one to zoom and browse a single, high-quality image of the entire periodical run.

By offering different kinds of remediated versions of paper periodicals, we wanted to show in a very practical, simple, repeatable way how easy it is to create different stories with different technologies. The stories we are inclined to tell using a single image, which we can copy and paste into a document alongside others of our choosing, are very different from those suggested by the PDFs of single issues, where the sequence of pages and images is fixed unless we make a concerted effort to change it. Both are different again from the huge sequence of pages we can see in Metabotnik.

Scanning the periodicals

Most of the periodical images on the BLT19 site were produced by Hollingworth and Moss, a commercial print and digitisation provider. They used non-destructive scanning technology that automatically turns pages and does not require de-binding. A variety of file formats were produced, including JPEG, TIFF, and PDF.

Images on the screen

Obviously, paper texts are very different from electronic ones. You can’t do this kind of thing so easily with an electronic text, and what would be the equivalent of a fold-out?

Paper Periodicals: dynamic, fragile, easy to browse, harder to search, laid out in a hierarchy to try to control our bodies so that we’ll look at some things more than others.

Listen to the sound the paper makes and imagine the pressure of the paper on fingers and hands, its textures and smell (maybe you will now pay attention to the resistance of the computer keyboard and to its clicks – very different from paper). And look how very fragile, how tearable the copies are. Electronic copies don’t last for ever either, but unless you hack into sites it’s harder to destroy them (unless they are your own and you forget to save them, which we have done all too often).

What’s even less clear from the screen is the size of the pages of each periodical. Who would guess from the website alone how big the British Workman is compared to other periodicals? Seeing the images online just doesn’t generate the haptic shock of the paper.

The British Workman is 41cm tall; the Swan Lane Gazette and the Caterer are 25cm, while the Teachers’ Assistant is just 19cm.

The different sizes of the different periodicals suggest different kinds of use and users: the British Workman front-page print begs to be displayed on the wall; the humble Teachers’ Assistant can be put away into a bag after a quick swot up before a lesson, while the other two are designed for longer periods of perusal – for a reader who wants to find out the latest ways of making a profit by titillating customers’ appetites in new ways. It’s not really possible to understand these distinctions as quickly in digital reproductions as we do when we have a paper copy in our hands.

The kinds of knowledge that paper technology allows are also very different from the electronic. There is a lot of research on this and this is not the place to explore it. We have more material about this here.

One contrasting aspect of paper versus digital platforms will do: searching for information.

Online it seems very easy to find a word – we are so used to telling our digital assistants to “find BLT19” and so on. By contrast, it’s a lot of boring hard work to do a word search in paper texts. That said, if we aren’t talking about films or recorded sound (music or recorded speech) but only written words, we seem to prefer (and take in better) long-form narratives in paper format – that applies to those brought up as “digital natives” as much as to those of us who were brought up on analogue media. But how do people search in a paper publication? The most obvious thing to do is to use the Index and the Contents list. These really can help searches – but they aren’t word searches as such: they are based on what the indexer – a real person – thinks is important. There are several indexes and contents lists in the periodicals and it’s interesting to reflect on the choices these human beings made compared to what a computer might do with the same material.

With all the above in mind, therefore, it won’t surprise you to know that we have been very careful indeed in how we have digitally reproduced paper periodicals on the site.

We have used four main kinds of image reproduction on this site with the express intention of helping the user to reflect on the advantages and disadvantages of each and above all, to think about what kinds of knowledge – what kinds of conclusions – we tend to draw when the software changes, even if the digitised materials themselves don’t.

1. One of the most spectacular ways we chose to display the periodicals is via the ZOOM & BROWSE INTERFACE Metabotnik, software developed by researchers at the University of Amsterdam and the publishers Brill, and funded by the Netherlands Organisation for Scientific Research. We got in touch with them early in the process of developing this site (and quite early in their lifetime too). Besides the real pleasures metabotniks can give (we won’t pretend we don’t love playing with the zoom), we are also concerned to reflect on what we can learn from this kind of reproduction that we can’t from the others. What kind of “searches” can we perform? It is also technically much less work to generate a metabotnik than the following, and in a few cases we have decided to make a metabotnik available before uploading the rest.

the PDF followed by JPEGs of individual pages of the Meat Trades Journal

2. In reproductions of separate or single numbers we have provided JPEG images of individual pages as well as…

3. PDFs of the whole issue. PDFs connect the pages together and give a different experience from disconnected individual pages. They also suggest ready-made contexts for individual images and texts. We wanted to give both JPEGs of individual pages and PDFs of whole issues – where possible – because they serve different purposes. For a start, it’s easier to copy and paste a single JPEG image of a page into another website, article or story than to do the same from a PDF, where you have to copy the page or the segment of text or an image. The quality of the image is much sharper too. Yet what are the implications of reading a single page or a single article isolated from the rest of the periodical? It may be that readers of a single page get completely the wrong idea (I’m afraid a lot of people who just do word searches in documents without looking at the wider context do just that).

4. The 4th form we use consists of uncorrected OCR of the PDFs: that is complicated so we have a separate section below on it.

5. There is also a 5th method used just for issue 1 of the British Workman: a kind of “tagging” of sections that many digital projects use. It proved too time-consuming to continue given its likely usefulness: however interesting it was, its cost-benefit ratio was just too poor for the kind of low-cost-maximum-return enterprise we are.

OCR

OCR (optical character recognition) is an automated process whereby a computer translates squiggles (letters, numbers, words etc) in one text into versions of those squiggles in another that a computer can recognise when we search for them.

A lot of the time the OCR is “invisible” under the image we see on the screen but – with a lot of luck – a Google search might pick it up. We have given an example of what the OCRd text of a Victorian periodical looks like here in our digitisation of the first issue of the British Workman. It’s immediately obvious how very inaccurate it is. While the accuracy of OCR is constantly being improved (and has certainly improved since the National Library of Australia’s 2009 report), the claims made by commercial companies to 100% accuracy only apply to clean, crisply-printed recent texts: certainly not nineteenth-century newspapers and periodicals!
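One common way to put a number on how inaccurate OCR is uses the character error rate: the number of single-character edits needed to turn the OCR output back into the correct text, divided by the length of the correct text. The sketch below is only an illustration, not the method used on this site. The function names are our own, and the "OCR" line is an invented example of typical misreadings (h read as b, i read as 1, m read as rn) rather than real output from the British Workman.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the correct (reference) text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Hypothetical example: what a human reads versus what the software "saw".
reference = "The British Workman"
ocr_output = "Tbe Brit1sh Workrnan"

print(levenshtein(reference, ocr_output))
print(round(character_error_rate(reference, ocr_output), 3))
```

Even three small misreadings in a three-word title give an error rate of roughly one character in five, which is enough to stop every one of those words being found by an exact search.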

The OCRd text on some of the issues of the British Workman on the BLT19 site was produced using Abbyy FineReader Pro for Mac software. While FineReader can run automatic scans, it also allows manual scan ordering, which was essential when it came to the British Workman’s inconsistent layout. Because of the pilot project’s limited timeframe and very limited budgets, the OCR’d text has either been only lightly corrected to minimise inconsistent line breaks or not corrected at all.

Leaving the incorrect text in a visible manner as we have done in some issues shows very clearly how inaccurate OCR generally is when applied to nineteenth-century periodicals, even in 2020. That, of course, has huge implications for what we can find when we search online: we can only tell stories about the world based on what the software can find for us. We do not necessarily need wilful interference by someone to miss or misinterpret important information: it may well be that the software has simply not recognised what we really needed to know and given us the wrong answer instead.
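The effect on searching can be sketched in a few lines. In the illustration below, an exact word search over a line of uncorrected OCR finds nothing, while an approximate ("fuzzy") comparison recovers the misread word. This is a sketch under assumptions, not a description of how Google or the BLT19 site searches: the sample line is invented, the function names are hypothetical, and the similarity threshold of 0.75 is an arbitrary choice. The comparison uses SequenceMatcher from Python's standard difflib module.

```python
from difflib import SequenceMatcher


def exact_hits(query: str, words: list[str]) -> list[str]:
    """Words that match the query exactly, ignoring upper and lower case."""
    return [w for w in words if w.lower() == query.lower()]


def fuzzy_hits(query: str, words: list[str], threshold: float = 0.75) -> list[str]:
    """Words whose similarity to the query is at least the threshold.
    The threshold is an assumption chosen for this illustration."""
    return [
        w for w in words
        if SequenceMatcher(None, query.lower(), w.lower()).ratio() >= threshold
    ]


# Invented line of uncorrected OCR with typical misreadings.
ocr_words = "Tbe Brit1sh Workrnan is 41cm tall".split()

print(exact_hits("Workman", ocr_words))  # the exact search misses the word
print(fuzzy_hits("Workman", ocr_words))  # the fuzzy search recovers it
```

Fuzzy matching is no cure: lower the threshold and it begins to return words that have nothing to do with the query, so it swaps one kind of wrong answer for another.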

That means we need to be very careful with our searches without even thinking about whether the answers and stories we are given are deliberately skewed in some way!


BLT19: AK with huge thanks to AMH
