Sunday 9 November 2014

Big Data, Small Data and Meaning

This post was originally written as the text for a talk I gave at a British Library Labs event in London in early November 2014. In the nature of these things, the rhythms of speech and the verbal tics of public speaking remain in the prose. It has been lightly edited, but the point remains the same.

In recent months there has been a lot of talk about big stuff. Between 'Big Data' and calls for a return to ‘Longue durée’ history writing, lots of people seem to be trying to carve out their own small bit of 'big data'. This post represents a reflection on what feels to me to be an important emerging strategy for information interrogation driven by the arrival of 'big data' (a 'macroscope'); and a tentative step beyond that, to ask what is lost by focusing exclusively on the very large. 

And the place I need to start is with the emergence of what feels to me like an increasingly commonplace label – a ‘macroscope’ – for a core aspiration of a lot of people working in the Digital Humanities.

As far as I can tell, the term ‘macroscope’ was coined in 1969 by Piers Jacob (writing as Piers Anthony), and used as the title of his science fiction/fantasy work of the same year – in which the ‘macroscope’, a large crystal, able to focus on any location in space-time with profound clarity, is used to produce something like a telescope of infinite resolution. In other words, a way of viewing the world that encompasses both the minuscule, and the massive. The term was also taken up by Joel de Rosnay and deployed as the title of a provocative book on systems analysis, first published in French in 1975 and in English translation in 1979. The label has also had a long and undistinguished afterlife as the trademark for a suite of project management tools – a ‘methodology suite’ – supported by the Fujitsu Corporation.

But I think the starting point for interest in the possibility of creating a ‘macroscope’ for the Digital Humanities comes out of computer science, and the work of Katy Börner from around 2011.
Her designs and advocacy for the development of a ‘Plug and Play Macroscope’ seem to have popularised the idea to a wider group of Digital Humanists and developers. To quote Börner:

'Macroscopes provide a "vision of the whole," helping us "synthesize" the related elements and detect patterns, trends, and outliers while granting access to myriad details. Rather than make things larger or smaller, macroscopes let us observe what is at once too great, slow, or complex for the human eye and mind to notice and comprehend.' (Katy Börner, ‘Plug-and-Play Macroscopes’, Communications of the ACM, Vol. 54 No. 3, pp. 60-69, doi:10.1145/1897852.1897871)

In other words, for Börner, a macroscope is a visualisation tool that allows a single data point to be both visualised at scale in the context of a billion other data points, and drilled down to its smallest compass. This was not a vision or project initially developed in the humanities. Instead it was a response to the conundrums of ‘Big Data’ in both STEM academic disciplines, and the wider commercial world of information management. But more recently, a series of ‘macroscope’ projects have begun to emerge from within the humanities, tied to their own intellectual agendas, and subtly recreating the idea with a series of distinct emphases.

Perhaps the project most heavily promoted recently is Paper Machines, created by Jo Guldi and Chris Johnson-Roberson – and the metaLAB at Harvard. This forms a series of visualisation tools, built to work with Zotero, and ideally allowing the user to both curate a large scale collection of works, and explore its characteristics through space, time and word usage. In other words, it is designed to allow you to build your own Google Books, and explore. There are problems with Paper Machines, and most people I know have struggled to make it work consistently. But it rather nicely builds on the back of functionality made available through Zotero, and effectively illustrates what might be described as a tool for ‘distant reading’ that encompasses elements of a ‘macroscope’.

What is most interesting about it, however, is the use its creators make of it in seeking to shift a wider humanist discussion from one scale of enquiry to another. Last month, to great fanfare, CUP published Jo Guldi and David Armitage’s History Manifesto, which argues that once armed with a ‘macroscope’ – Paper Machines in their estimation – historians should pursue an analysis of how ‘big data’ might be used to re-negotiate the role of the historian – and the humanities more generally. Basically, what Guldi and Armitage are calling for through both the Manifesto and through Paper Machines is the re-invention of ‘Longue durée’ history – telling ever larger narratives about grand sweeps of historical change, encompassing millennia of human experience. And to do this in pursuit of taking on the mantle of a public intellectual, able to speak with greater authority to ‘power’.

In the process they explicitly denigrate notions of ‘micro-history’ as essentially irrelevant. At one and the same time, they seem to me to celebrate the possibility of creating a ‘macroscope’, while abjuring half its purpose. What we see in this particular version of a ‘macroscope’ is a tool that privileges only one setting on the scale between a single data point, and the sum of the largest data set we can encompass. In other words, by seeking the biggest of big stories, it is missing the rest. 

Perhaps the other most eloquent advocate for a ‘macroscope’ at the minute is Scott Weingart. With Shawn Graham and Ian Milligan, he is writing a collective online ‘book’ entitled Big Digital History: Exploring Big Data through a Historian’s Macroscope. The book is a nice run through of digital humanist tools, but the important text from my perspective is a blog post Weingart published on 14 September 2014. The post was called: The moral role of DH in a data-driven world; and in it, Weingart advocates a very specific vision of a ‘macroscope’, in which the largest scale of reference and view is made intelligible through the application of a formal version of network analysis.

Weingart is a convincing advocate for network analysis, performed in light of some serious and sophisticated automated measures of distance and direction. And his work is a long way ahead of much of the naïve and unconvincing use of network visualisations current in large parts of the Digital Humanities. Weingart also makes a powerful case for where a limited number of DH tools – primarily network analysis and topic modelling – could be deployed in re-engaging the ‘humanities’ with a broader social discussion.

Again, like Guldi and Armitage, Weingart seeks in 'Big Data' a means through which the Humanities can ‘speak to power’. As with the work of Armitage and Guldi, the pressing need to turn Digital Humanities to political account appears to motivate a search for large scale results that can be deployed in competition with the powerful voices of a positivist STEM tradition. My sense is that Weingart, Armitage and Guldi are all essentially scanning the current range of digital tools, and selectively emphasising those that feel familiar from the more ‘Social Science’ end of the Humanities. And that having located a few of them, they are advocating we adopt them in order to secure our place at the table. 

In other words, there is a cultural/political negotiation going on in these developments and projects that is driven by a laudable desire for ‘relevance’, but which effectively moves the Humanities in the direction of a more formal variety of Social Science. 

Others still are arguably doing some of the same work, but using a different language, or at the least seeking a different kind of audience. Jerome Dobson, for example, has recently begun to describe the use of Geographical Information Systems (GIS) in historical geography as a form of ‘macroscope’. This usage doesn’t come freighted with the same political claims as are current in Digital Humanities, but seems to me an entirely reasonable way of highlighting some of the global ambitions – and sensitivity to scale – that are inherent in GIS. The notion – perhaps fostered most fully by Google Earth – that you can both see the world in its entirety, as well as zoom in to the smallest detail, seems at one with a data driven ‘macroscope’. But, again, the scale most geographers want to work with is large – patterns derived from billions of data points. And again, the siren call of GIS tends to pull humanist enquiry towards a specific form of social science.

And finally, we might also think of the approach exemplified in the work of Ben Schmidt as another example of a ‘macroscope’ approach – particularly his ‘prochronism’ projects. These take individual words in modern cinema and television scripts that purport to represent past events – things like Downton Abbey and Mad Men – and compare them to every word published in the year they are meant to represent.

Building on Google Books and Google Ngrams, Schmidt is effectively mixing scales of analysis at the extremes of ‘big data’, on the one hand – all words published in a single year – and small data, on the other. Of all the examples mentioned so far, it is only Schmidt who is actually using the functionality of a ‘macroscope’ effectively, making it all the more ironic that he doesn’t adopt the term. 
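The comparison described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not Schmidt's own code or method: the function names, the smoothing constant, and the tiny invented word counts standing in for a year's worth of Google Books data are all hypothetical. It simply asks, for each word in a script, how much more often the script uses it than the print record of the period does.

```python
from collections import Counter
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def overuse_ratios(script_text, period_counts, smoothing=1.0):
    """Return {word: ratio} for every word in the script.

    A ratio above 1 means the script uses the word more often, relative
    to its length, than the period corpus does. `period_counts` is a
    hypothetical stand-in for real per-year word counts.
    """
    script_counts = Counter(tokenize(script_text))
    script_total = sum(script_counts.values())
    period_total = sum(period_counts.values())
    vocab_size = len(set(script_counts) | set(period_counts))

    ratios = {}
    for word, count in script_counts.items():
        script_rate = count / script_total
        # Additive smoothing, so a word never seen in the period corpus
        # yields a large ratio rather than a division by zero.
        period_rate = (period_counts.get(word, 0) + smoothing) / (
            period_total + smoothing * vocab_size
        )
        ratios[word] = script_rate / period_rate
    return ratios


# Invented toy counts, NOT real data for any year.
period_counts = Counter(
    {"the": 500, "to": 400, "a": 300, "i": 200,
     "need": 40, "call": 30, "telephone": 20}
)
ratios = overuse_ratios("I need to leverage the telephone", period_counts)

# List the script's words from most to least over-used.
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} {ratio:8.2f}")
```

The word absent from the period counts floats to the top of the list, which is the small-data end of the exercise: a single word in a single line, read against everything else.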

And almost uniquely in the Digital Humanities – a field equally remarkable for its febrile excitement, and lack of demonstrable results – Schmidt’s results have been starkly revealing. My favourite example is his analysis of the scripts of Mad Men, which illustrates that early episodes referencing the 1950s overuse language associated with the ‘performance’ of masculinity – words that reflect ‘behaviour’. And that later episodes, located in the 1970s, overuse words reflecting the internalised emotional experience of masculinity. For me this revealed beautifully the larger narrative arc of the programme in a way that had not been obvious prior to his work. Schmidt has little of the wider agenda to influence policy and politics evident in that of Armitage, Guldi and Weingart, but ironically, it is his work that is having some of the greatest extra-academic impact, via the anxiety it has created in the script writers of the shows he analyses.

All of which is simply to say that playing with and implementing ideas around a ‘macroscope’ is quite popular at the moment. And a direction of travel which, with caveats, I wholly support. But it also leaves me in something of a conundrum.

Each of these initiatives, with the possible exception of Schmidt’s work, seems to locate itself somewhere other than the Humanities I am familiar with. And this seems odd. Issues of scale are central to this. Claiming to be doing ‘big history’ sounds exciting; while claiming that more formal ‘network analysis’ will answer the questions of a humanist enquiry appears to create a bridge between disciplines – allowing Humanists and more data driven parts of the Social Sciences to share a methodology and a conversation. But with the exception of Schmidt’s work, these endeavours seem to be privileging particular types of analysis – Social Science types of analysis – over more traditionally Humanist ones.

In some ways, this is fine. I have discovered to my own benefit, that working with ‘Big Data’ at scale and sharing methodologies with other disciplines is both hugely productive, and hugely fun. To the extent that ‘big stories’ and new methodologies provide the justification for collaborating with researchers from a variety of disciplines – statisticians, mathematicians and computer scientists – they are wholly positive, and a simple ‘good thing’. 

And yet… I find myself feeling that in the rush to define how we use a ‘macroscope’, we are losing touch with what humanist scholars have traditionally done best. 

I end up feeling that in the rush to new tools and ‘Big Data’, Humanist scholars are forgetting what they spent much of the second half of the twentieth century discovering – that language and art, cultural construction, human experience, and representation are hugely complex – but can be made to yield remarkable insight through close analysis. In other words, while the Humanities and ‘Big Data’ absolutely need to have a conversation, the subject of that conversation needs to change, and to encompass close reading and small data.

The Stanford Humanities Centre defines the ‘Humanities’ as: 

'…the study of how people process and document the human experience. Since humans have been able, we have used philosophy, literature, religion, art, music, history and language to understand and record our world.'

Which makes the Humanities sound like the most un-exciting, ill-defined, unsalted, intellectual porridge ever. And yet, when I think about the scholarly works that have shaped my life, there is none of this intellectual cream of wheat. 

Instead, there are a series of brilliant analyses that build from beautifully observed detail at the smallest of scales. I look back to the British Marxist tradition in history – to Raphael Samuel and Edward Thompson – and what I see are closely described lives, built from fragments and details, made emotionally compelling by being woven into ever more precise fabrics of explanation. 

A gesture, a phrase, a word, an aching back, a distinctive tattoo. 'My dearest …. Remember when…' 

The real power of work in this tradition lay in its ability to deploy emotive and powerful detail in the context of the largest of political and economic stories. And the political project that underpinned it was not to ‘speak to power’, but to mobilise the powerless, and democratise identity and belonging. With Thompson’s liquid prose, a single poor, long dead framework knitter effected more change than any amount of more formal economic history.

Or I think of the work of Pierre Bourdieu, Arlette Farge and de Certeau, and the ways in which they again use the tiny fragments of everyday life – the narratives of everyday experience – to build a compelling framework illustrating the currents and sub-structures of power.

Or I think of Michel Foucault, who was able to turn on its head every phrase and telling line – to let us see patterns in language – discourses – that controlled our thoughts. Foucault profoundly challenged us to escape the limits of the very technologies of communication and analysis we used; and to see in every language act, every phrase and word, something of politics. 

By locating the use of a ‘macroscope’ at the larger scale, seeking the Longue durée, and the ear of policy makers, recent calls for how we choose to deploy the tools of the Digital Humanities appear to deny the most powerful politics of the Humanities. If today we have a public dialogue that gives voice to the traditionally excluded and silenced – women, and minorities of ethnicity, belief and dis/ability – it is in no small part because we now have beautiful histories of small things. In other words, it has been the close and narrow reading of human experience that has done most to give voice to people excluded from ‘power’ by class, gender and race. 

Besides simply reflecting a powerful form of analysis, when I return to those older scholarly projects I also see the yearning for a kind of ‘macroscope’. Each of these writers strives to locate the minuscule in the massive; the smallest gesture in its largest context; to encompass the peculiar and eccentric in the average and statistically significant.

What I don’t see in modern macroscope projects is a recognition of the power of the particular; or as William Blake would have it: 

To see a World in a grain of sand, 
And a Heaven in a wild flower...
                               Auguries of Innocence (1803, 1863).

Current iterations of the idea of a macroscope, with all their flashy, shock and awe visualisations, probably score over these older technologies of knowing in their sure grasp of data at scale, but in the process they seem to lose the ability to refocus effectively. 

For all the promise of balancing large and small scales, the smaller and particular seem to have been ignored. Ever since Apollo 17 sent back its pictures of earth as a distant blue marble, our urge towards the all-inclusive, global and universal has been irresistible. I guess my worry is that in the process we are losing the ability to use fine detail in the ways that make the work of Thompson and Bourdieu, Foucault and Samuel, so compelling.

So, by way of wending towards some kind of inconclusive conclusion, I just want to suggest that if we are to use the tools of 'Big Data' to capture a global image, it needs to be balanced with the view from the other end of the macroscope (along with every point in between).

In part this is just about having self-confidence as humanist scholars, and ironically serving a specific role in the process of knowing, that people in STEM are frequently not very good at. 

Several recent projects I was privileged to participate in involved some hugely fun work with mathematicians and information scientists exploring the changing linguistic patterns found in the Old Bailey trials – all 127 million words’ worth. And after a couple of years of working closely with a bunch of brilliant people, what I gradually realised was that while mathematicians do a lot of ‘close reading’ – of formulae and algorithms – like most scientists, they are less interested than I am in the close reading of a single datum. In STEM cleaning data is a chore. Geneticists don’t read the human genome base by base; and our knowledge of the Higgs Boson is built on a probability only discovered after a million rolls of the dice, with no one really looking too carefully at any single one.

In many respects ‘big data’ actually reinforces this tendency, as the assumption is that the ‘signal’ will come through, despite the noise created by outliers and weirdness. In other words, ‘Big Data’ supposedly lets you get away with dirty data.  In contrast, humanists do read the data; and do so with a sharp eye for its individual rhythms and peculiarities – its weirdness. 

In the rush towards 'Big Data' – the Longue durée, and automated network analysis; towards a vision of Humanist scholarship in which Bayesian probability is as significant as biblical allusion, the most urgent need seems to me to be to find the tools that allow us to do the job of close reading of all the small data that goes to make the bigger variety. This is not a call to return to some mythical golden age of the lone scholar in the dusty archive – going gradually blind in pursuit of the banal. This is not about ignoring the digital; but a call to remember the importance of the digital tools that allow us to think small; at the same time as we are generating tools to imagine big. 

In relation to text, you would think this is easy enough. Easy enough to, like Ben Schmidt, test each word against its chronological bed-fellows; or measure its distance from an average for its genre. When I am reading a freighted phrase from the 1770s, like ‘pursuit of happiness’, I want to know that till then, ‘happiness’ was almost exclusively used in a religious context – ‘Eternal Happiness’ - and that its use in a secular setting would have caught in a reader’s mind as odd and different - new. We should be able to mark the moment when Thomas Jefferson allowed a single word to escape from one ‘discourse’ and enter another – to read that word in all its individual complexity, while seeing it both close and far. 
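A very crude first step in this direction can be sketched in Python. This is an illustration under loud assumptions rather than an existing tool: the function names, the hand-picked list of 'religious' marker words, and the four invented sentences are all hypothetical, and a short word list is a poor proxy for a 'discourse'. The sketch measures what share of the occurrences of a word have one of the marker words within a few tokens on either side, so that two bodies of text can be compared.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def context_share(documents, target, marker_words, window=5):
    """Share of occurrences of `target` with a marker word nearby.

    An occurrence counts as 'in context' when at least one word from
    `marker_words` appears within `window` tokens on either side.
    Returns 0.0 when the target never occurs.
    """
    hits = 0
    total = 0
    for text in documents:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token != target:
                continue
            total += 1
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if any(word in marker_words for word in neighbours):
                hits += 1
    return hits / total if total else 0.0


# Hypothetical marker list: a hand-picked assumption, not a derived discourse.
RELIGIOUS_MARKERS = {"eternal", "heaven", "soul", "pray", "god"}

# Invented toy sentences, NOT historical evidence.
earlier = [
    "eternal happiness awaits the faithful in heaven",
    "pray for the happiness of thy soul",
]
later = [
    "the pursuit of happiness and of property",
    "eternal happiness in heaven",
]

print(context_share(earlier, "happiness", RELIGIOUS_MARKERS))
print(context_share(later, "happiness", RELIGIOUS_MARKERS))
```

Run over real, dated texts rather than toy sentences, a falling share would be one rough signal that a word is escaping one setting for another, while each individual occurrence remains there to be read closely.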

I know of no work designed to define the content of a ‘discourse’, and map it back in to inherited texts. I know of no projects designed with this notion in mind. And if you want to take home a message from this post, it is a simple call for ‘radical contextualisation’.

To do justice to the aspirations of a macroscope, and to use it to perform the Humanities effectively – and politically – we need to be able to contextualise every single word in a representation of every word, ever. Every gesture contextualised in the collective record of all gestures; and every brushstroke, in the collective knowledge of every painting.

Where is the tool and data set that lets you see how a single stroll along a boulevard, compares to all the other weary footsteps? And compares it in turn to all the text created along that path, or connected to that foot through nerve and brain and consciousness. Where is the tool and project that contextualises our experience of each point on the map, every brush stroke, and museum object? 

This is not just about doing the same old thing – of trying to outdo Thompson as a stylist, or Foucault for sheer cultural shock. My favourite tiny fragment of meaning – the kind of thing I want to find a context for – comes out of Linguistics. It is new to me, and seems a powerful thing: Voice Onset Time (VOT) – that breathy gap between when you open your mouth to speak, and when the first sibilant emerges. This apparently changes depending on who you are speaking to – a figure of authority, a friend, a lover. It is as if the gestures of everyday life can also be seen as encoded in the lightest breath. Different VOTs mark racial and gender interactions, insider talk, and public talk.

In other words, in just a couple of milliseconds of empty space there is a new form of close reading that demands radical contextualisation (I am grateful to Norma Mendoza-Denton for introducing me to VOT). And the same kind of thing could be extended to almost anything. The mark left by a chisel is certainly, by definition, unique, but it is also freighted with information about the tool that made it, the wood and the forest from which it emerged; the stroke, the weather on the day, and the craftsman. 

One of the great ironies of the moment is that in the rush to big data – in the rush to encompass the largest scale, we are excluding 99% of the data that is there. And if we are going to build a few macroscopes, I just want to suggest that, along with the blue marble views, we keep hold of the smallest details. And if we do so, looking ever more closely at the data itself – remembering that close reading can be hugely powerful - Humanists will have something to bring to the table, something they do better than any other discipline. They can provide a world of ‘small data’ and more importantly, of meaning, to balance out the global and the universal – to provide counterpoint in the particular, to the ever more banal world of the average.