Friday 3 January 2014

Judging a book by its URLs

It will sound odd, but I have recently had a great time editing URLs.  Robert Shoemaker and I have just finished a book for CUP, derived from the London Lives project, and called London Lives: Poverty, Crime and the Making of a Modern City, 1690-1800.  It is a long book (170,000 words) and each quote and reference in it is linked via a URL to the original document or article, book or web-resource used as evidence or to contextualize the argument.  It will be published as both an ebook and in hard copy, and the links need to be robust, and secure.  My estimate is that there are in the region of 4,000 URLs included in the manuscript (which was written collaboratively in PMWiki).  In the end, I found that I could identify an appropriate link for 98% of all footnote references, but then had to eliminate around 10% of these, as the relevant URL was just not useable.  The book took some nine years, and I am glad it is finished.

One of my final jobs was editing those 4000 URLs.  It took about three months' work, spread over the last year, and I have just finished spending a week or so confirming what I hope will be their final form.  When I have told people about this work many have looked incredulous and suggested that this is the sort of technical implementation process that should be left to others.  A couple of otherwise nice people have suggested I dump this job on the shoulders of the nearest PhD student.  But for myself, it is precisely the kind of thing that an author should do for themselves.  And in doing it, two things kept coming to mind.  First was how the role of the scholar in creating a rigorous academic apparatus is a central part of the intellectual journey that academic writing involves - and that we should see the implementation of the online version of this in the light of the precise writing of footnotes and references that mark out good scholarship.  And second, that URLs encode a system of design and intent, online architecture and system of access, that signal the quality and permanence (the academic credibility and perceived audience) of historical materials online.  And that just as we have always sorted and judged scholarship by its form, we should think a bit harder about how the form of a URL can let us interrogate online materials.

On the first point, I do not know of much discussion of the joys of this kind of academic slog.  There is a lot of good writing on research and archives (by Carolyn Steedman and Arlette Farge among many others), on writing and thinking, but no-one talks much about the painstaking labour that goes into turning a rough draft into a final finished piece of scholarship.  And here I am really talking about generating accurate and fully comprehensive footnotes that reflect both the material cited, and the research journey that resulted in the main text.  This has become much easier with online catalogs and citation management packages, but nevertheless remains laborious and a reflection of our collective and individual commitment to a particular kind of evidenced discussion.  But for me it also represents my favourite compromise.  The writing of history is a wonderfully imaginative and creative process.  And in some respects we wish to judge the product of history writing as art.  Is it enjoyable to read? Is it convincing?  Does it do the job of good writing in liberating the readers' imagination?  In making these judgements we tend to appeal to a notion of 'value' that is cultural and that privileges dominant forms of authority.  This aspect of judgement is essentially romantic; with all the implications for western and elite hegemony embedded in that idea.  At the same time history writing is the result of simple hard work of a more technical kind - in the archives, in collating and collecting, re-ordering and interrogating data.  And it is valuable because it encompasses that hard work.  The beauty of the academic apparatus is that it evidences this and in the process generates a different measure of value.  In other words it is where quality is tied to a 'labour theory of value'.  I love the academic slog because it is where un-moored judgement is tied down to hard labour; and where value can be universalized in a common human experience (work).
In other words I really enjoyed editing 4000 URLs precisely because in them and their associated footnotes lies a claim to and evidence of the hard labour that underpins the book itself.

At the same time, the process also taught me to read URLs differently.  Clearly coders and web designers do this as a matter of course.  But I am a historian and want to read URLs as a scholar, rather than as a programmer or designer.  And for me, the important thing is that URLs embed the structure of a site, making it plain to see for anyone willing to look hard; and that they are made up of both the character of a library reference, and a command directed at the new technology of discovery - the Internet.  There are just lots of different types of URL.

There are 'Search URLs' that include all the elements that  take the user past a collection to a specific object, but don't let you go directly there without the query.  And there are URLs that encode a cataloging hierarchy.  There are URLs that sift data, or work in your browser to change the data delivered, highlighting phrases or sifting material.  And there are URLs that encode licensing, passwords, and access information.  It is easy enough to find that the whole search journey that took you from a library catalog to an individual item is encoded directly in the URL, and even personalized to you, the machine you are using, or the forms of access you can deploy.  It is easy to find URLs that run on for hundreds of characters, each element divided by a '&' or a '%', or such.

But in creating robust reproducible links to credible historical materials most of these URLs are at least problematic if not useless.  If they include details for institutional access, or session information, they cannot be re-used by someone else.  These URLs are friable and fragile things and not fit for scholarly purposes.  And as a result, for the London Lives book we have been forced to eliminate all the links we originally hoped to include to forty or fifty different sites.  To take a single example, most archives structure their online collections with search in mind, making it difficult to link to a single item.  I spent a lot of time finding the catalog entry for every manuscript we cited in the London Metropolitan Archives, and Westminster Archives Centre, only to regretfully strip out the links when confronted by a complex URL that just did not look credible as a long-term citation of the item itself.

Even in its simplest form, the one recommended by the site for sharing a link, a London Metropolitan Archives URL looks like this:

http://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+P69~2FBRI~2FB~2F001~2FMS06554~2F004?SESSIONSEARCH
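For anyone who would rather let a machine do the first pass of this reading, here is a minimal Python sketch of the kind of check I was making by eye.  It is only a rough heuristic: the function name, the 80-character threshold and the list of tell-tale substrings are my own assumptions for illustration, not a rule taken from any site or standard.

```python
from urllib.parse import urlsplit

# Assumptions for illustration only: a length threshold and a handful of
# substrings that often signal session or access information.
MAX_LENGTH = 80
SESSION_HINTS = ("session", "sid=", "token", "login", "proxy")


def fragility_reasons(url):
    """Return a list of reasons a URL may be a poor long-term citation."""
    if "://" not in url:
        url = "http://" + url  # allow bare forms such as estc.bl.uk/T174945
    parts = urlsplit(url)
    reasons = []
    if len(url) > MAX_LENGTH:
        reasons.append("longer than %d characters" % MAX_LENGTH)
    # Look for session or access markers in the path and the query string.
    tail = (parts.path + "?" + parts.query).lower()
    if any(hint in tail for hint in SESSION_HINTS):
        reasons.append("contains session or access information")
    return reasons


lma = ("http://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/"
       "REFD+P69~2FBRI~2FB~2F001~2FMS06554~2F004?SESSIONSEARCH")
print(fragility_reasons(lma))                   # two reasons reported
print(fragility_reasons("estc.bl.uk/T174945"))  # empty list
```

A check like this can only flag candidates for a second look; whether a link is credible as a citation remains a matter of judgement.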


Since we had consulted these items in their physical form in any case, it did not seem too problematic to leave out these links, but a shame nevertheless.  And likewise, with paywall material there seemed little point in dangling real access, and the promise of credible evidence, before the eyes of readers who would not be able to go beyond the login screen.  It seemed better to cite a specific item in combination with a general (unlinked) URL and date of consultation as reflecting our own research journey, rather than to promise access when we could not deliver it.

With few exceptions the URLs that have been retained (and there are still 4000 of them) address specific items with a specific ID, and usually run to 20 to 40 characters.  DOIs are not bad once you figure out their structure and reformulate them as they should be, rather than the way they are normally cited on journal web pages.

dx.doi.org/10.1353/sec.2010.0268
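Reformulating a DOI in this way is mechanical enough to sketch in a few lines of Python.  This is a minimal illustration under stated assumptions: the function name is hypothetical, the regular expression is a common rule of thumb for spotting a DOI inside a citation rather than the full DOI syntax, and the resolver host is simply the one used in the link above.

```python
import re

# Rule-of-thumb pattern (an assumption, not the official DOI grammar):
# "10.", a registrant code of 4 to 9 digits, a slash, then the suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_link(citation, resolver="dx.doi.org"):
    """Pull a DOI out of a citation string and return a short resolver link."""
    match = DOI_PATTERN.search(citation)
    if match is None:
        return None
    doi = match.group(0).rstrip(".,;")  # drop trailing sentence punctuation
    return f"{resolver}/{doi}"


print(doi_link("DOI: 10.1353/sec.2010.0268"))
# dx.doi.org/10.1353/sec.2010.0268
```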

And Google Books creates a very nice URL once you strip out all the complex formatting instructions that are normally generated as part of a search and inserted after the main ID.  This is what a Google Books URL looks like if you were to use the 'search' version:

 http://books.google.co.uk/books?id=1sMJGt7_rTAC&printsec=frontcover&dq=%22Prosecution+and+Punishment:+Petty+Crime+and+the+Law%22&hl=en&sa=X&ei=rrzGUq_aDsSy7Aa_9YGQCg&redir_esc=y#v=onepage&q=%22Prosecution%20and%20Punishment%3A%20Petty%20Crime%20and%20the%20Law%22&f=false

And this URL will take you to the same book:

 books.google.co.uk/books?id=1sMJGt7_rTAC
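The step from the long Google Books URL to the short one can be sketched in Python.  This is a minimal illustration rather than a definitive tool: the function name is hypothetical, and it assumes that the 'id' parameter alone identifies the book, which is what the two URLs above suggest.

```python
from urllib.parse import urlsplit, parse_qs


def short_google_books_url(url):
    """Keep only the host, the path and the 'id' parameter of a Google Books URL."""
    if "://" not in url:
        url = "http://" + url  # allow URLs given without a scheme
    parts = urlsplit(url)      # separates the query from the '#...' fragment
    ids = parse_qs(parts.query).get("id")
    if not ids:
        raise ValueError("no 'id' parameter found in " + url)
    return f"{parts.netloc}{parts.path}?id={ids[0]}"


long_url = ("http://books.google.co.uk/books?id=1sMJGt7_rTAC&printsec=frontcover"
            "&dq=%22Prosecution+and+Punishment:+Petty+Crime+and+the+Law%22"
            "&hl=en&sa=X&ei=rrzGUq_aDsSy7Aa_9YGQCg&redir_esc=y"
            "#v=onepage&q=%22Prosecution%20and%20Punishment%3A%20Petty%20Crime"
            "%20and%20the%20Law%22&f=false")
print(short_google_books_url(long_url))
# books.google.co.uk/books?id=1sMJGt7_rTAC
```

Everything discarded here (the search phrase, the language setting, the page view) records the journey to the book rather than the book itself.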

And the English Short Title Catalogue generates some of the most elegant URLs I have found:

estc.bl.uk/T174945

And to a lesser extent, so does the EThOS collection of doctoral theses at the British Library:

ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.354762

And London Lives and the Old Bailey Online do pretty well on this score:

www.londonlives.org/browse.jsp?div=LMSMPS501980014
http://www.oldbaileyonline.org/browse.jsp?ref=t17910413-19


In part, I suspect that these issues would all disappear if I had a better sense of the layer of structure that lies beneath the WWW.  But for the moment I am keen to have a short, human-readable URL that looks like it will last longer than the session I am currently logged on for.   All of which simply takes me back to the joy of academic slogging and the importance of the academic apparatus as something that evidences hard work and opens up scholarship to credible criticism that goes beyond simple romantic appreciation and prejudice.

I know all too well that one of the skills of an academic is the ability to judge a book by its cover and the form of the text it contains.  For online materials we need to embed URLs into precisely this process - and the joy of all that editing was that at the end of it, I feel I have learned to do just that.