Mapping the GDELT data in R (and some Russian protests, too)


In this post I show how to select relevant bits of the GDELT data in R and present some introductory ideas about how to visualise it as a network map. I've included all the code used to generate the illustrations. Because of this, if you here for the shiny visualisations, you'll have to scroll way down

The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650mb zip file (4.6gb uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.

Meanwhile I headed over to David Masad's writeup on getting started with GDELT in python

The data comes bundled with a few python and R scripts, which I quickly set about shaping to my needs. Although hampered by my lack of python knowledge I got the extract script running easily enough, and soon was pulling out all events associated with Russia. It was still a bit unclear what these 'events' were.

According to David Masad, there are four types of event: 1. Material Cooperation 2. Verbal Cooperation 3. Verbal Conflict 4. Material Conflict

I was still not sure what these exactly represented, but decided to press on and look at the data more closely. The events recorded are very obviously are skewed towards the present, so any time-series conclusions should be taken with a pinch of salt, or at the very least adjusted somehow.

I was a bit worried about how R would handle a file that size, so I tried reading in a few lines at a time. The first line looked like this.

[1] "19790101\tATH\tRUS\t120\t3\t-4.0\t52.1256\t140.381\t52.1256\t140.381\t52.1256\t140.381"

Clearly the data is tab delimited. The first column is the date, the second and third actor codes, then three columns I later discovered were 'event code, quad category, and Goldstein Scale'. The last six columns are longitude and latitude, representing the location of the actors and the event.

By the way, for anyone preferring to do this in Python, which frankly makes a lot of sense, here is a great tutorial for finding exactly the data you need.

Now for loading the data in. As it turned out I had enough memory to load the data, but would run out quite quickly if I kept it all in its raw shape. This was when I realised just how massive this data set is: more than 3 million events featuring Russia. Wow.

First I renamed the columns, fixed the dates, and saved the file in R's native Rdata format (this got the data down to a positively miniscule 37mb)

library(lubridate)
t <- read.table("GDELTproject/data/select.outfile.txt", sep = "\t")
colnames(t) <- c("Day", "Actor1Code", "Actor2Code", "EventCode", "QuadCategory", 
    "GoldsteinScale", "Actor1Geo_Lat", "Actor1Geo_Long", "Actor2Geo_Lat", "Actor2Geo_Long", 
    "ActionGeo_Lat", "ActionGeo_Long")
t$Day <- ymd(t$Day)
save(t, file = "gdeltRus.Rdata")

So now for some exploratory analysis - who is involved, and when did these events take place?

print(head(t))
         Day Actor1Code Actor2Code EventCode QuadCategory GoldsteinScale
1 1979-01-01        ATH        RUS       120            3           -4.0
2 1979-01-01        MOS        RUS        40            2            1.0
3 1979-01-01        MOS        RUS        43            2            2.8
  Actor1Geo_Lat Actor1Geo_Long Actor2Geo_Lat Actor2Geo_Long ActionGeo_Lat
1         52.13         140.38         52.13         140.38         52.13
2         55.75          37.62         55.75          37.62         55.75
3         52.13         140.38         52.13         140.38         52.13
  ActionGeo_Long
1         140.38
2          37.62
3         140.38

tail(sort(table(t$Actor1Code)), 20)

    COP     CHN     UKR  RUSMED  RUSMIL     MIL     MED     USA     GOV 
  31773   35332   35857   35904   36555   40957   50677   89275  128319 
 RUSGOV     RUS 
 283648 1422931 

The list of actors looks promising: Russia, Russian government, a generic 'GOV' (what's that about?), USA, MED and MIL which I imagine are medical and military respectively, followed by the Russian military etc etc.

The actor2 columns looked quite similar, but featured some intriguing entries, such as IGOWSTNAT. Ideas, anyone? [Turns out this IGO refers to international or regional inter-governmental organisations ]

As for date distribution, it is reasonably sparse until the mid 90s, since which there has been a general rise, with sharp increases in events during the crises of the late 90s and late 2000s.

library(lubridate)
library(ggplot2)
x <- data.frame(table(year(t$Day)))
ggplot(x, aes(x = Var1, y = Freq)) + geom_bar(stat = "identity") + theme_bw() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now, what about plotting some on a map? For this initial plot i ignored actor and date data, and just used the final two columns of event location, grouping them by count. This left 30000 distinct geo locations in which events with Russian involvement have been recorded.

On to the plotting:

library(RgoogleMaps)
library(ggmap)
library(mapproj)
library(plyr)

# Reshape the data so we get the counts of events for each location
t$count <- 1
df <- ddply(t, .(ActionGeo_Lat, ActionGeo_Long), summarize, count = sum(count))

# lets define the scope of our map
lat <- c(40, 70)
lon <- c(20, 100)
center = c(lat = mean(lat), lon = mean(lon))
zoom <- min(MaxZoom(range(lat), range(lon)))

# And download a map file!
map <- get_map(location = c(lon = mean(lon), lat = mean(lat)), zoom = 3, maptype = "terrain", 
    source = "google")
print(ggmap(map) + geom_point(data = df, aes(x = ActionGeo_Long, y = ActionGeo_Lat, 
    size = count), alpha = 0.3))

Clearly there is a lot going on here. Maybe focusing on one type of activity would be more productive. This is where it really gets interesting. The data set uses the CAMEO coding scheme , which includes quite detailed information about event type. Let's isolate examples of civil disobedience: protests, rallies, hunger strikes, strikes, boycotts, obstructions, violent protest, etc.

This reduces the data to 25 000 events. Still a lot.

But the CAMEO specifications include information about actors, e.g. the NGOs involved. Hence the mysterious REB, SPY, JUD, etc.

By keeping only entries where at least one agent is a representative of so-called civil society, we are left with 6443 entries.

t$count <- 1
# selecting the events we are interested in
t2 <- t[t$EventCode > 139 & t$EventCode < 146, ]

# Keep only rows with agents that interest us. Combined these
t3a <- t2[grep("OPP|EDU|REL|CVL|ELI|LAB|IGO|NGO|NGM", t2$Actor1Code), ]
t3b <- t2[grep("OPP|EDU|REL|CVL|ELI|LAB|IGO|NGO|NGM", t2$Actor2Code), ]
t3 <- rbind(t3a, t3b)
# agents: OPP, EDU, REL, CVL, ELI, LAB, IGO, NGO, NGM

df2 <- ddply(t3, .(ActionGeo_Lat, ActionGeo_Long), summarize, count = sum(count))
map <- get_map(location = c(lon = mean(lon), lat = mean(lat)), zoom = 3, maptype = "terrain", 
    source = "google", color = "bw")
print(ggmap(map) + geom_point(data = df2, aes(x = ActionGeo_Long, y = ActionGeo_Lat, 
    size = count), alpha = 0.8, pch = 21, colour = "black", fill = "red2") + 
    scale_size(range = c(2, 7)))

Let's zoom in on Moscow during the recent Russian protests and see what detail we can see.

t3$recent <- 0
t3$Day <- as.Date(t3$Day)
t3$recent[t3$Day > ("2011-07-01")] <- 1
tryCatch(detach("package:Hmisc"), error = function(e) NULL)

df2 <- ddply(t3, .(ActionGeo_Lat, ActionGeo_Long, recent), summarize, count = sum(count))
mos <- get_map(location = "moscow", zoom = 10, maptype = "terrain", source = "google")
print(ggmap(mos) + geom_point(data = df2, aes(x = ActionGeo_Long, y = ActionGeo_Lat, 
    size = count, fill = recent), alpha = 0.8, pch = 21, colour = "black") + 
    scale_size(range = c(4, 20)) + facet_wrap(~recent))

On the left is all civil society protest activity from 1979 until July 2011, and on the right in the following 12 months. I've not analysed the plot in great detail, but to the naked eye it looks like there were half the number of (reports of) protests and acts of civil disobedience in Moscow in 2011-12 as for the preceding 30 years, which you may or may not find surprising.


On the subject of protest movements, Alex Hanna has already looked at what GDELT has to say about the Arab spring.

Back to Russia: a more interesting plot can be achieved by drawing lines between the geo location of actors. For the plot below I have only included events between July 2011 and mid 2012, when the dataset ends.

(most of the code for the visualisation is plundered from here)

#Limit the data to the last year recorded
t3$recent <- 0
 t3$Day <- as.Date(t3$Day)
t3$recent[t3$Day>("2011-07-01")] <- 1
t3 <- t3[t3$recent==1,]

#to avoid conflicts between plyr and Hmisc. If anyone knows a better way of doing this, please let me know!
tryCatch(detach("package:Hmisc"),error=function(e) NULL)


df2 <- ddply(t3,.(Actor1Geo_Lat,Actor1Geo_Long,Actor2Geo_Lat,Actor2Geo_Long,ActionGeo_Lat,ActionGeo_Long),summarize, count=sum(count))
df2 <- df2[complete.cases(df2),]


#remove links with America and southern hemisphere
df2 <- df2[df2$Actor1Geo_Lat>0&df2$Actor2Geo_Lat>0&df2$Actor1Geo_Long>0&df2$Actor2Geo_Long>0,]
#remove Generic Russia
df2 <- df2[df2$Actor2Geo_Lat!=60&df2$Actor2Geo_Long!=100,]
df2 <- df2[df2$Actor1Geo_Lat!=60&df2$Actor1Geo_Long!=100,]


#place points and edges in separate data frames
pointsDf <- df2[,5:7]
colnames(pointsDf)[3] <- "count2"
edgesDf <- df2[,c(1:4,7)]

a <- paste0(edgesDf[,1],edgesDf[,2])#remove points were start and end are the same
b <- paste0(edgesDf[,3],edgesDf[,4])
edgesDf <- edgesDf[!a==b,]


library(sna)
library(Hmisc)

edgeMaker <- function(whichRow, len = 1, curved = TRUE){
  fromC <- c(edgesDf[whichRow,2],edgesDf[whichRow,1])  # Origin
  toC <- c(edgesDf[whichRow,4],edgesDf[whichRow,3]) # Terminus
  weight <- edgesDf[whichRow, 5]  # Terminus

  # Add curve:
  graphCenter <- c(50,50)#colMeans(edgesDf[,1:2])  # Center of the overall graph
  bezierMid <- c(fromC[1], toC[2])  # A midpoint, for bended edges
  distance1 <- sum((graphCenter - bezierMid)^2)
  if(distance1 < sum((graphCenter - c(toC[1], fromC[2]))^2)){
    bezierMid <- c(toC[1], fromC[2])
    }  # To select the best Bezier midpoint
  bezierMid <- (fromC + toC + bezierMid) / 3  # Moderate the Bezier midpoint
  if(curved == FALSE){bezierMid <- (fromC + toC) / 2}  # Remove the curve

  edge <- data.frame(bezier(c(fromC[1], bezierMid[1], toC[1]),  # Generate
                            c(fromC[2], bezierMid[2], toC[2]),  # X & y
                            evaluation = len))  # Bezier path coordinates
  edge$Sequence <- 1:len  # For size and colour weighting in plot
  edge$weight <- weight
  edge$Group <- whichRow
  return(edge)
}
allEdges <- lapply(1:nrow(edgesDf), edgeMaker, len = 100, curved = TRUE)
allEdges <- do.call(rbind, allEdges)  # a fine-grained path ^, with bend ^

#     col <- rep(gray(8/10),length(labels))
#     col2 <- rep(gray(1/10),length(labels))
#     col[labels==term] <- "red3"
#     col2[labels==term] <- "red4"

new_theme_empty <- theme_bw()
new_theme_empty$line <- element_blank()
new_theme_empty$rect <- element_blank()
new_theme_empty$strip.text <- element_blank()
new_theme_empty$axis.text <- element_blank()
new_theme_empty$axis.title <- element_blank()
new_theme_empty$legend <- element_blank()
new_theme_empty$plot.margin <- structure(c(0, 0, 0, -1), unit = "lines",
                                         valid.unit = 3L, class = "unit")

map <-get_map(location=c(lon=mean(lon), lat=mean(lat)),zoom=3,maptype="terrain",color="bw")

p <- ggmap(map)
#p <- ggplot(allEdges,environment=environment())  
p <- p + geom_path(data=allEdges, aes(x = x, y = y,group = Group,  # Edges with gradient
                         size=(weight-1),colour=Sequence),alpha=.6)  # and taper
p <- p + geom_point(data = data.frame(pointsDf),  # Add points
                        aes(x = ActionGeo_Long, y = ActionGeo_Lat,size=log(count2)), pch = 21,
                        colour = "black", fill = "red2")  # Customize gradient 
p <- p + scale_colour_gradient(low = "red3", high = "white", guide = "none")
p <- p + scale_size(range = c(.6, 5), guide = "none")  # Customize taper
p <- p + new_theme_empty  # Clean up plot
p <- p + xlim(min(allEdges$x-1),max(allEdges$x+1))
p <- p + ylim(20,65)
print(p)
ggsave("latestImage.png", p, h = 4, w = 8, type = "cairo-png")

In the plot below the circles represent the number of events occurring in a given location, while the shaded lines represent events involving two actors in different locations. The red end of the edge is the origin, the white the destination. I removed all links to the USA or the southern hemisphere, as these obstruct the map. I was interested to note how few events link Russia and the former East European Satellite states (Poland, Hungary, etc.), while noting that there seems to be an extraordinary amount of activity linking Russia and Israel, and possibly also Syria. Finally, it seems that events taking place in the Russian regions, especially the Caucasus, very rarely elicit an international response:


8 comments:

  1. Great work! Excuse me for the noobish question, but how do you extract just the Russian events from the R scripts provided? I'm still fairly new to all this coding business, any advice would be helpful.

    ReplyDelete
  2. When you download the data a series of python scripts are provided. I'm a bit of a noob in python myself, but what I did was open the file GDELT.select.py, and hardcoded it to give me what I wanted. To do this I blanked out lines 88-138 (these are for the command line functionality) and added the lines

    srccode = "RUS"
    tarcode = "ALL"

    Then I just ran the script through Idle.

    Whatever you do, don't try to load all the data into R directly!
    Good luck, R

    ReplyDelete
  3. Thank you, Rolf, for sharing this great work, and even codes for it.

    Timothy's question is also my confusion. I'm not familiar with Python codes. I guess GDELT.select.py is doing something similar to "grep" in R.

    Take "RUS" for example, this function will retrieve all related actors, like "RUS", "RUSGOV", "RUSEDU"... as well as "MNCRUS".

    My question is, in the later analysis, did you distinguish these variants of queried "RUS"? Or you just treats all of them as one?

    Thank you.

    ReplyDelete
    Replies
    1. Hi Eric, great question. Yes, it functions like grep, which is quite handy. Later in the analysis I took different approaches depending what I was looking for. So when I zoomed in on the protests I refined my search using grep to keep only events where at least one of the actors involved was relevant to me. The coding table is here: http://web.ku.edu/~keds/cameo.dir/CAMEO.CDB.09b5.pdf

      When I tried to map all the Russian events (http://quantifyingmemory.blogspot.co.uk/2013/04/big-geo-data-visualisations.html) I just kept everything. In the link I just posted I also write a bit more about the difficulty of slicing the data according to the coding schemes used - it works great if the problem is big enough, but if you're looking for unusual stuff, e.g. actors who feature rarely, Cold War espionage assassinations, or similar, then the errors in the coding become a real issue.

      Delete
    2. Thank you, Rolf, for your thorough reply. The time lapse video is awesome by the way.

      I also think that, without precise classification of actor codes and event codes, some errors are inevitable when we try to aggregate information from multiple aspects. Focused study with controllable subsets of codes may be more precise. kinda tradeoff, imo.

      Again, this is quite an inspiration. Will go deep and play with it. Thank you.

      /E

      Delete
  4. Thanks very much for posting this. I just cribbed your code to produce some maps of mass atrocities in Syria according to GDELT, and you saved me a lot of time.

    ReplyDelete
  5. Hi Rolf,
    Very wonderful job!
    Btw, I wonder if there is a way to draw the point with different colors? I have made the color parameter to some factor variable but I have the error below:
    Error: Discrete value supplied to continuous scale
    Have you also tried this?

    ReplyDelete
    Replies
    1. This sounds like a ggplot thing - stackoverflow will be the place to look, but in a nutshell, if you map colour to a variable, it goes within the aesthetics (e.g. aes(x=yourVar,y=yourVar2,colour=yourGroupVar), whereas if you want to name a colour you move it outside the aesthetics: (aes(x,y),colour="red"). Hope that helps? Best, R

      Delete