Sunday, 19 June 2011

Culturomics, Big Data, Code Breakers and the Casaubon Delusion

Suddenly it seems as if 'big data' humanities is all the crack, with quantitative biologists and mathematicians diving in where previously only historians, literary critics and linguists dared to swim.  Digital humanists have been slowly engineering a new field from history and linguistics (aided and abetted by library science) for over a decade, gradually building new bodies of evidence, and road testing new methodologies.  But in just the last year or so, the biologists and mathematicians, with Google's help, have stolen a march on all their puny efforts.  In particular, it seems that Science and Nature have fallen head over heels in love with 'culturomics' and the heady enthusiasms of Erez Lieberman Aiden and Jean-Baptiste Michel, and their Google ngram viewer.  To read the most recent issue of Nature is to be confronted with a potent mix of big science and gushing Hello Magazine prose that works to mythologise the new 'science' of culturomics and its creators.  It feels like the birth of a myth and of a brand.

This is all rather wonderful, and I am a huge fan of the Google ngram viewer, and the playful way it allows scholars and students to engage with the 'infinite archive' of inherited texts.  I think Aiden and Michel (and Google) have done the humanities a huge service.   But their real achievements do not quite explain the cloud of hyperbole that seems to be rising around them.

And this made me wonder what is really at issue here.  What is it about culturomics that turns on the reporters from Nature?  At its heart, the use of word frequency with a reasonably sized (if problematic) data set simply provides one more form of evidence to be added to all the rest.  Knowing that the term 'electricity' peaks between 1870 and 1900 is useful evidence, but does not provide either an explanation for why, or a description of how it is being used.  Historians will no doubt look this particular gift horse in the mouth, and worry at the condition of its teeth; but they will also happily use the ngram viewer as one more component in a complex landscape of evidence.  This use may be delayed by the peculiar lack of any guidance on how to cite the results of a search, but it will be normalised in due course.

But simply providing a new body of evidence is not what seems to get Nature going.  Instead, it is the claim that the ngram viewer lays the basis for a new 'science', and that the results make other forms of historical analysis redundant.  In the words of Aiden and Michel, somehow this data is uniquely available for 'scientific purposes', in contrast to other forms of evidence.

It is not, therefore, the mechanics of the ngram viewer that are at issue.  Instead it is the underlying intellectual paradigm that Aiden and Michel bring to its use.  They appear to claim to be able to read history from the patterns the ngram viewer exposes - to decipher significant patterns from the data itself.  Their great party tricks (and they are particularly impressive in live performance) include the reduction of the decline of irregular verbs to a describable mathematical pattern, an equation, and the charting of the rise of 'celebrity' as measured by the number of times an individual is mentioned in print.  These imply that all historical development can, like irregular verbs, be described in mathematical terms, and that 'human nature', like the desire for fame, can be used as a constant to measure the changing technologies of culture.

In some respects, we have been here before.  In the demographic and cliometric history so popular through the 1970s and 80s, extensive data sets were used to explore past societies and human behaviour.  The aspirations of that generation of historians were just as ambitious as are those of the parents of culturomics.  But demography and cliometrics started from a detailed model of how societies work, and sought to test that model against the evidence, revising it in light of each new sample and equation.

The difference with culturomics is that there is no pretence to a model.  Instead, its practitioners will simply seek to discover patterns in the entrails of human speech, hoping to find the inherent meanings encoded there.  What I think the scientific community finds so compelling is that, as in quantitative biology and DNA analysis, Aiden and Michel are using one of the controlling metaphors of 20th-century science, 'code breaking', and applying it to a field that has hitherto resisted the siren call of analytical positivism.

Since the 1940s the notion that 'codes' can be cracked to reveal a new understanding of 'nature' has formed the main narrative of science.  With the re-description of DNA as just one more code in the 1950s, wartime computer science became a peacetime biological frontier (cashing in on big-pharma, as military expenditure declined).  That Aiden comes from a background in DNA analysis should clue us in to the fact that culturomics is an attempt to apply the same kind of code breaking to human society as a whole.

I strongly suspect that the project will fail, just as naive readings of DNA as a code for life have largely failed to fulfil their promise.  But much more importantly, this attempt to repurpose a 'scientific' approach to historical analysis simply misunderstands the function of history itself.  These large-scale visualisations of language may be the raw material of history, the basis for an argument, the foundation for a narrative, the evidence put in the appendix in support of a subtle point, but they do not serve as a work of history.

Historians interpret the past to the present.  They marshal evidence and use all the tools of genre writing to allow a modern reader to engage with the past.  And the questions they ask are not driven by the evidence, but by the needs of a modern society.  Gender history, the history of sexuality, and of race, have been created by two generations of historians not because the archives are groaning under the weight of relevant evidence, but because our society needs to understand the role of these forces in the present.  The fundamental flaw with culturomics is that it assumes that history is about the past; that what historians seek to achieve is an ever more accurate description of everything.  Instead, history is about the present.  Ironically, Aiden and Michel have rediscovered the 'Casaubon delusion', and believe, like George Eliot's tragic figure, that they can create a new 'Key to all Mythologies'.  They need to listen to the Dorotheas of this world.