Tuesday, October 4, 2011

Interactive word frequency cloud

Following the data visualisation unit, I was lucky enough to have the opportunity to work over summer as a research assistant for Andrew MacKenzie to develop a tool to explore survey responses from residents, architects and builders who had rebuilt in Duffy after the 2003 Canberra bushfires. The word cloud was built with supervision from Mitchell Whitelaw and is based on code he developed for the A1 Explorer.

Word frequency cloud (architects only, responses to all questions) with substantial control panel for filtering at right
Word frequency cloud with correlations to 'wanted' highlighted and all occurrences of 'wanted' listed on right
The data can be filtered by response to particular questions, by category of respondent (resident who rebuilt, new resident, architect, builder, etc.) and by individual respondent - so it is possible to see a cloud of everything, of any subgroup of responses or of an individual response. A list of standard 'stop' words and any words with fewer than three characters have been removed. Further words can be added to an exclusion list by clicking on them, which helps in looking beyond boring or extremely frequent words that can obscure the differences between less frequent words.
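The filtering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind the tool: the stop list here is a tiny made-up one, and the names `word_frequencies`, `STOP_WORDS` and `excluded` are my own assumptions.

```python
import re
from collections import Counter

# Hypothetical, very short stop list; the list used by the tool is not given here.
STOP_WORDS = {"the", "and", "that", "was", "with", "for", "but", "had", "were", "this"}

def word_frequencies(responses, excluded=frozenset(), min_length=3):
    """Count words across responses, skipping stop words, short words and exclusions."""
    counts = Counter()
    for text in responses:
        # Lower-case the text and pull out runs of letters (and apostrophes).
        for word in re.findall(r"[a-z']+", text.lower()):
            word = word.strip("'")
            if len(word) < min_length:
                continue  # drop words with fewer than three characters
            if word in STOP_WORDS or word in excluded:
                continue  # drop stop words and anything the user has excluded
            counts[word] += 1
    return counts

# Made-up responses, purely for illustration.
responses = [
    "We wanted the house to be safe and we wanted light",
    "The garden was what I wanted back",
]
freqs = word_frequencies(responses)
without_wanted = word_frequencies(responses, excluded={"wanted"})
```

Passing a word in `excluded` mimics clicking it out of the cloud: the counts are simply recomputed without it, so the remaining words can be rescaled against each other.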

All of these filtering options end up in a large control panel, which took a bit of juggling to fit on screen. It might have been neater to hide it in drop-down or pop-up menus. However, I think it was important to show where the current view sits within the entire data set.

Mousing over a word highlights the other words that occur in proximity to it and brings up a scrollable list of every occurrence of the highlighted word, each shown in the fragmentary context of the five words before and after it.
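As a rough sketch of that mouse-over behaviour, the Python below collects the context fragments for a word and counts the words that appear near it. It is not the tool's code. In particular, I am assuming that 'proximity' means the same five-word window used for the fragments, and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case a response and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def contexts(responses, target, window=5):
    """Return each occurrence of target with up to `window` words either side."""
    fragments = []
    for text in responses:
        words = tokenize(text)
        for i, word in enumerate(words):
            if word == target:
                start = max(0, i - window)
                fragments.append(" ".join(words[start:i + window + 1]))
    return fragments

def neighbours(responses, target, window=5):
    """Count the words that fall within `window` words of any occurrence of target."""
    counts = Counter()
    for text in responses:
        words = tokenize(text)
        for i, word in enumerate(words):
            if word == target:
                start = max(0, i - window)
                nearby = words[start:i] + words[i + 1:i + window + 1]
                # Assumption: the target itself is not counted as its own neighbour.
                counts.update(w for w in nearby if w != target)
    return counts

# Made-up responses, purely for illustration.
responses = [
    "We wanted a house that felt safe again after the fire",
    "I wanted light",
]
fragments = contexts(responses, "wanted")
nearby_words = neighbours(responses, "wanted")
```

The words with the highest counts in `nearby_words` would be the ones to highlight in the cloud, and `fragments` is the list that would go in the scrollable panel.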

An appropriate way to understand and navigate data?

So this is another example of a 'show everything and zoom in' visualisation. However, my main reason for posting it is to make a brief observation about how appropriate particular visualisation techniques are for understanding and navigating data. The distinction between understanding and navigation is perhaps important.

In the case of Mitchell Whitelaw's A1 Explorer, the word cloud visualises item titles in the National Archives A1 Series. Titles are generally specific, succinct and considered. The A1 Explorer is a visualisation that reveals some of the topics and relationships in the series, but it is also an interface to the digitised items themselves.

Similarly, a word cloud of a carefully crafted speech, such as Obama's inauguration speech, succinctly reveals some of its themes. It is probable that some speeches are written with word cloud analysis in mind. Political rhetoric noticeably employs frequently repeated, memorable mantras. Of course, as Jodi Dean writes, a word cloud is in many ways a very superficial analysis that ignores sentences, stories and narratives.

A different example, designed specifically for visualisation as a word cloud, was curated by the ABC, which marked Julia Gillard's first year as Prime Minister by calling for the public to submit three words that characterised their perceptions of Gillard and of opposition leader Tony Abbott. Not surprisingly, the most frequently submitted words aligned closely with the rhetoric that had been most prominent in the media.

Even if visualising words by themselves is appropriate, a critical challenge for word clouds and similar visualisation techniques is being able to locate the small, hidden items, because they are perhaps the most interesting or important. It might be that quantitative data analysis can only ever take us so far, and that curation is necessary to go beyond it. However, when it comes to big data, quantitative analysis might be our only way in - a starting point for exploration.

Andrew MacKenzie has said that the word clouds were very helpful as a research tool and that what they revealed supports both his observations during the interviews and his other analysis after them. My feeling is that there was substantial noise because of the nature of the raw survey data. The responses were not carefully crafted like an Obama speech, or even considered like a title or a three-word perception of Gillard - they were spontaneous and people thought as they spoke. The word cloud doesn't distinguish an initial response from a more considered closing remark. It doesn't take account of rambles, tangents or the emphasis placed on particular ideas. That said, the quantitative analysis also ignores any bias the researcher might have had in looking for particular ideas.
