This post has three parts:
1) I map topics about Stalin to illustrate how this approach can be used to visualise topic models
2) I go through a function to shape data for use in d3 illustrations (a sketch of the idea follows this list)
3) I end with variations on how to show complexity in topic models
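To give a flavour of part 2, here is a minimal sketch of the reshaping idea, not the actual function from the post: I assume a document-topic matrix and use jsonlite to produce the array-of-objects form d3 likes.

```r
# Minimal sketch (not the function from the post): flatten a
# document-topic matrix into the long key-value form d3 expects.
library(jsonlite)  # assumption: jsonlite handles the JSON step

# doc_topics: hypothetical matrix, rows = documents, cols = topics
doc_topics <- matrix(runif(12), nrow = 4,
                     dimnames = list(paste0("doc", 1:4),
                                     paste0("topic", 1:3)))

# as.data.frame(as.table(...)) is a compact wide-to-long conversion
long <- as.data.frame(as.table(doc_topics), stringsAsFactors = FALSE)
names(long) <- c("document", "topic", "proportion")

# d3 typically reads an array of objects
cat(toJSON(long, pretty = TRUE))
```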
For details of what topic models are, read Ted Underwood's blog posts here, and Matthew Jockers' Macroanalysis. I wrote a little about them elsewhere, so I will go straight for the jugular:
Topic models are discussed really well elsewhere, and rather superficially by me here. In my topic model of the Russian media over the period 2003-2013 I found seven or eight topics about history and memory. One of them was clearly about Katyn and Stalinist repression.
Following on from my guide to making R play nice with UTF-8, here is a seven-step guide to understanding Python's handling of Unicode. Trust me, if you work with non-Latin characters, you need to know this stuff:
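For reference, the R half of that story boils down to being explicit about encodings at every step; a minimal sketch (the file name is hypothetical):

```r
# Minimal sketch: being explicit about encodings in R.
# The file name is hypothetical.
x <- readLines("russian_news.txt", encoding = "UTF-8")

Encoding(x)       # check what R thinks each string is
x <- enc2utf8(x)  # normalise everything to UTF-8
validUTF8(x)      # flag any strings that are not valid UTF-8
```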
So familiar…Dealing w/ R's habit of choking on not-even-medium data. MT @RolfFredheim: Shutting up R: http://t.co/XKr0gIGoNz via @dmimno
— Andrew Goldstone (@goldstoneandrew) October 30, 2013
Not even medium-sized? But... but... my archive is really big! I am working on more than a million texts! Of course, he is right - and it occurs to me that medium-sized data such as mine is in its own way quite tricky to handle: small enough to be archived on a laptop, too big to fit into memory. An archive of this size creates the illusion of being both Big and easily manageable - when in reality it is neither.
This post, the first in a series of three, explains why I decided to use a database of texts. The second post will explore how to archive and retrieve data from a SQL database, while the third will introduce how to use indexes to keep textual data at arm's length and facilitate quick information retrieval.
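As a preview of the second post, here is a minimal sketch of the store-then-query pattern using RSQLite; the table and column names are my invention, not the schema from the post:

```r
# Minimal sketch of the store-then-query pattern with RSQLite.
# Table and column names are invented for illustration.
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "articles.sqlite")

# write a data frame of texts once...
articles <- data.frame(id = 1:2,
                       date = c("2004-09-01", "2004-09-02"),
                       text = c("first article", "second article"))
dbWriteTable(con, "articles", articles, overwrite = TRUE)

# ...then pull back only what fits in memory
res <- dbGetQuery(con,
  "SELECT id, text FROM articles WHERE date >= '2004-09-01'")

dbDisconnect(con)
```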
This is a graph mapping grammatical similarities between 4,000 random Russian news articles; links in gold occurred at election time, while dark red links mark all other articles. It seems to form a single long chain of connectivity that makes no sense except on a grammatical level, and even there the links are pretty spurious.
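For the curious, the general recipe was along these lines; this is a minimal sketch with igraph, where the similarity matrix and the election flag are random stand-ins for the real features:

```r
# Minimal sketch: turn a pairwise similarity matrix into a graph.
# 'sim' and 'election' are random stand-ins for the real features.
library(igraph)

set.seed(1)
sim <- matrix(runif(100), 10, 10)      # hypothetical similarities
sim[lower.tri(sim, diag = TRUE)] <- 0  # keep one triangle only
sim[sim < 0.9] <- 0                    # threshold away weak links

g <- graph_from_adjacency_matrix(sim, mode = "upper", weighted = TRUE)

election <- sample(c(TRUE, FALSE), ecount(g), replace = TRUE)
E(g)$color <- ifelse(election, "gold", "darkred")
plot(g, vertex.size = 4, vertex.label = NA)
```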
Not that I made the front page. I'll tell myself it was only a beauty contest, anyway!
Spotting international conflict is very easy with the GDELT data set, combined with ggplot and R. The simple gif above shows snapshots of Russian/Soviet activity in January 1980 and January 2000. I think it also illustrates how Russia nowadays looks more to the east and the south than during the Cold War. The trend, though not very strong above, becomes even clearer by the end of the 2000s.
I wanted to go one step further than the gif above, so I made an animation of all the events in the GDELT dataset featuring Russia. That's 3.3 million entries, each mapped 12 times (for blur).
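To give an idea of how one frame of such an animation can be drawn, here is a minimal sketch with ggplot2, assuming the GDELT extract is already loaded as a data frame with the standard column names (and that the maps package is installed for the world outline):

```r
# Minimal sketch of one animation frame, assuming 'gdelt' is a data
# frame with the standard GDELT column names already loaded.
library(ggplot2)

frame <- subset(gdelt,
                MonthYear == 200001 &
                (Actor1CountryCode == "RUS" |
                 Actor2CountryCode == "RUS"))

ggplot(frame, aes(ActionGeo_Long, ActionGeo_Lat)) +
  borders("world", colour = "grey60", fill = "grey90") +
  geom_point(alpha = 0.1, size = 1, colour = "red") +  # low alpha ~ blur
  coord_quickmap() +
  ggtitle("Events involving Russia, January 2000")
```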
In this post I show how to select relevant bits of the GDELT data in R and present some introductory ideas about how to visualise it as a network map. I've included all the code used to generate the illustrations, so if you're here for the shiny visualisations, you'll have to scroll way down.
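As a taster, the selection step amounts to something like this sketch; the file name is hypothetical, and I assume column names have already been attached, since the raw GDELT downloads ship without a header row:

```r
# Minimal sketch of the selection step; the file name is hypothetical
# and I assume the documented column names are already attached.
gdelt <- read.delim("gdelt_sample.txt", stringsAsFactors = FALSE)

rus <- subset(gdelt, Actor1CountryCode == "RUS" |
                     Actor2CountryCode == "RUS")

# An edge list of who-did-what-to-whom is the raw material for
# the network maps further down.
edges <- na.omit(rus[, c("Actor1CountryCode", "Actor2CountryCode")])
head(edges)
```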
The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650MB zip file (4.6GB uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.
Below I briefly outline why Pandoc is an essential part of my research workflow, and demonstrate how to seamlessly integrate it with a bibliographic system and code written in R to produce high-quality Word or PDF documents. I also include all the functions needed to get this working fast.
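The core of the integration is a one-line shell call wrapped in an R function; here is a minimal sketch with hypothetical file names (newer pandoc versions want the --citeproc flag for the bibliography step; older ones handled it via --bibliography alone):

```r
# Minimal sketch: wrap pandoc in an R helper. File names are
# hypothetical; adjust the flags to your pandoc version.
to_docx <- function(md, bib, out = sub("\\.md$", ".docx", md)) {
  cmd <- paste("pandoc", shQuote(md),
               "--bibliography", shQuote(bib),
               "--citeproc",
               "-o", shQuote(out))
  system(cmd)
}

to_docx("draft.md", "refs.bib")
```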
Wordclouds such as Wordle are pretty rubbish, so I thought I'd try to make a better one - one that actually produces (statistically) meaningful results. I was so happy with the outcome that I decided to make it interactive, so go on, have a play!
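For the statistically curious: one standard way to make word comparisons "statistically meaningful" is Dunning's log-likelihood (G2). Whether or not that is exactly what powers the cloud, a minimal sketch of the statistic for a single word in two corpora looks like this:

```r
# Minimal sketch of Dunning's log-likelihood (G2) for one word
# observed in two corpora; a keyness measure of this kind is my
# assumption about what "statistically meaningful" sizing means.
g2 <- function(a, b, n1, n2) {
  # a, b: counts of the word in corpus 1 and 2; n1, n2: corpus sizes
  e1 <- n1 * (a + b) / (n1 + n2)  # expected count in corpus 1
  e2 <- n2 * (a + b) / (n1 + n2)  # expected count in corpus 2
  2 * (ifelse(a > 0, a * log(a / e1), 0) +
       ifelse(b > 0, b * log(b / e2), 0))
}

# a word used 120 times in 1e5 tokens vs 40 times in 1.2e5 tokens
g2(120, 40, 1e5, 1.2e5)
```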
In this post I outline how count data may be modelled using a negative binomial distribution, which presents trends in a time series more accurately than linear methods do. I also show how to use ANOVA to identify the point at which one model gains explanatory power, and how confidence intervals may be calculated and plotted around the predicted values. The resulting illustration gives a robust visualisation of how the Beslan hostage crisis has taken on features of a memory event.
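Here is a minimal sketch of those steps, with invented data standing in for the real article counts:

```r
# Minimal sketch of the steps above, with invented data standing
# in for the real counts of articles about Beslan.
library(MASS)  # glm.nb

set.seed(1)
d <- data.frame(year  = 2004:2013,
                count = rnbinom(10, mu = 50, size = 2))

m1 <- glm.nb(count ~ 1, data = d)     # intercept-only baseline
m2 <- glm.nb(count ~ year, data = d)  # adds a time trend
anova(m1, m2)  # does the trend term add explanatory power?

# predictions with approximate 95% intervals, back-transformed
# from the link scale to the response scale
p <- predict(m2, newdata = d, type = "link", se.fit = TRUE)
d$fit <- exp(p$fit)
d$lo  <- exp(p$fit - 1.96 * p$se.fit)
d$hi  <- exp(p$fit + 1.96 * p$se.fit)
```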