I recently had the honor of speaking at a Hacks/Hackers meetup about working with the Wikileaks data. I talked about some of the work I had done with Sabrina Tavernise and Kevin Quealy looking at the sectarian violence in Baghdad during the occupation. Sabrina’s final story, “Mix Of Trust And Despair Helped Turn Tide In Iraq,” is our data-driven look back at the violence, and “A Deadly Day in Baghdad” is the graphic we ran alongside it, mapping the violence both on a single day and over the entire course of the occupation.
You can read my slides with their notes here:
- Reporting Wikileaks (10mb PDF)
My talk was organized around three principles of data journalism. These are just three rules that popped into my head while planning the talk out. I am sure there are more than three, and that there are more artful ways to express them, but these three rules articulated the things I think about when working with data. And they helped me explain why I hate word clouds with a violent passion too. So, what are these rules?
Find A Narrative. There is nothing that earns my scorn more than when a site does “data reporting” by just posting a raw PDF or a data table without any context. Word clouds are equally bad. If you want to report on data, you must first find a narrative in it. It doesn’t have to be the narrative (rich data sets have many narratives), but you need to find at least one. This is essential for several reasons. It allows you to focus your inquiries around a single story and investigate specific questions. It also provides the means for the reader to understand and get more involved. Finally, it gives us a way of narrowing down large data sets to the information necessary to report the story, without overwhelming the reader with distractions. Human beings understand the world through narrative; find the narrative if you want to understand your data.
Provide Context. The other reason I am usually dismissive of calling raw data dumps journalism is their lack of context. Most data is complex, with its own nuances, jargon, and quirks. This was especially true of the Wikileaks War Logs, which were heavily laden with military jargon and inside information. It became obvious from the beginning that we would need to do more to explain these documents and what was happening in them, or we would overwhelm our readers. And so, when charting out the violence in Iraq, we provided some information about why there were concentrations of violence in certain neighborhoods. And when presenting the raw field reports, we worked to add an inline jargon translator that would help decipher that an LN is a local national or a COP is a combat outpost (my personal favorite bit of inexplicable jargon: A “monica” in Iraq is a white Toyota Land Cruiser).
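A minimal sketch of how such an inline translator might work, in Python. The glossary contents beyond the two terms quoted above, the function name, and the bracketed output format are all assumptions for illustration, not the newsroom's actual implementation.

```python
import re

# Illustrative glossary. Only these two definitions come from the post;
# a real glossary would be far larger and curated by hand.
GLOSSARY = {
    "LN": "local national",
    "COP": "combat outpost",
}

# Match whole words only, so "COP" does not fire inside "COPY".
# Matching is case-sensitive, on the assumption that reports use upper case.
_TERMS = re.compile(r"\b(" + "|".join(re.escape(t) for t in GLOSSARY) + r")\b")


def annotate(report_text):
    """Return the report with a plain-English gloss after each known term."""
    return _TERMS.sub(
        lambda m: "{} [{}]".format(m.group(1), GLOSSARY[m.group(1)]),
        report_text,
    )
```

On a web page the gloss would more likely be a tooltip or a link than bracketed text; the whole-word match is the part that matters, since acronyms this short collide with ordinary words easily.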
Work The Data. I always have to laugh to myself when some technology X is declared the savior of journalism. Or when someone writes a script to scrape Twitter and declares it investigative journalism. So far, technology has done a far better job of collecting data than of analyzing it, and analysis remains very difficult. There is rarely a magic technology that will figure the data out on its own; you’re going to have to work with it (I continue to feel everybody should read Brooks’ essay “No Silver Bullet” for its notions of accidental complexity and essential complexity). Furthermore, no data is perfect; when working with data, you must be aware of its inherent flaws and limitations. In the case of the Wikileaks data, there were many questions without answers. How much duplication was there in the database? How was the data on civilian homicides collected? Could we definitively say whether the Surge worked, or was the decline in violence mainly because there was nobody left to kill and less will to endure the atrocities any longer? Having an idea of the methodology would’ve helped, but in this data, we didn’t even know who was collecting the homicide reports. Were there forces with the responsibility to assemble an accurate record? Or was it all just happenstance? When the killings waned, was it an actual decline in fatalities, or an illusion caused by units being too busy in the Surge to record the violence still happening? Could the leaker have forgotten to copy some records?
From a journalistic standpoint, this data was troubling. But luckily we had a few things to compare it against. Sabrina Tavernise reported from Baghdad while the sectarian violence raged, so she was able to name neighborhoods where we would expect to see the worst violence. To help out with this, I figured out how to extract MGRS coordinates from the reports and plot them on the map. This gave us a more accurate view of where the homicides were reported, although that too was not perfect (some coordinates did not geocode, and there were 40+ reports that geocoded to Antarctica). To get a feel for duplicates, we picked a single day, and Sabrina read through each homicide report, flagging roughly 20% of the reported deaths as duplicates (and finding cases where fatalities were not counted correctly as KIAs), so we could accurately show the violence of a single day.
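To make the extraction step concrete, here is a rough Python sketch of pulling MGRS grid references out of free text. It is not the original code: the regular expression, the function names, and the assumption that easting and northing are written as one unbroken run of digits are all illustrative. It only finds and normalizes the references; turning them into latitude and longitude would need a separate conversion library.

```python
import re

# An MGRS reference is: UTM zone (1-60), latitude band letter (C-X, no I or O),
# a two-letter 100 km square (no I or O), then an even number of digits
# split equally between easting and northing.
MGRS_RE = re.compile(
    r"\b(?P<zone>\d{1,2})(?P<band>[C-HJ-NP-X])\s?"
    r"(?P<square>[A-HJ-NP-Z]{2})\s?"
    r"(?P<digits>\d{2,10})\b"
)


def extract_mgrs(text):
    """Return normalized MGRS strings found in text, skipping malformed ones."""
    found = []
    for m in MGRS_RE.finditer(text):
        zone = int(m.group("zone"))
        digits = m.group("digits")
        # Reject impossible zones and odd digit counts (unequal easting/northing).
        if not 1 <= zone <= 60 or len(digits) % 2 != 0:
            continue
        found.append("{}{}{}{}".format(zone, m.group("band"), m.group("square"), digits))
    return found


def grid_zone(coord):
    """Return the grid zone designator (e.g. '38S') of a normalized reference."""
    m = re.match(r"\d{1,2}[C-HJ-NP-X]", coord)
    return m.group(0) if m else None
```

Baghdad falls in grid zone 38S, so comparing `grid_zone(coord)` against the zones you expect is one cheap way to catch strays before they land somewhere implausible on the map. Whether that would have caught the Antarctica reports is a guess; it is a sanity check, not a diagnosis.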
On a parallel track, I really wanted to show how the violence rolled across Baghdad’s neighborhoods, so I hacked up a Ruby script to output all homicides within a bounding box for Baghdad to a KML file, with a colored circle and count that indicated the number of deaths. I added a little extra logic to split up roundup reports listing bodies found throughout Baghdad into individual points on the map. Finally, I found a map of Baghdad’s neighborhoods to use as a background. When the whole timeframe of the data was stepped through as an animation, you could see the violence surge in religiously mixed neighborhoods like Dora and Ghazaliya, which Sabrina indicated from first-hand experience were where the worst violence was happening. To confirm we weren’t missing any data in the leak, we matched up the weekly homicide chart against the publicly released SIGACTS III numbers to verify the curves looked the same (the counts were larger in the Wikileaks data because of duplicates). This data was ultimately used to visualize the year-by-year chart of the violence at the bottom of the graphic.
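The original Ruby script is not reproduced here. As a sketch of the same general idea, the Python below filters reports to a bounding box and writes one time-stamped KML placemark per report. The bounding box values, the color thresholds, the input field names, and the sample records are all assumptions made up for illustration; the roundup-splitting logic is not included.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Rough, assumed box around Baghdad: (south, west, north, east) in degrees.
BAGHDAD_BBOX = (33.15, 44.15, 33.50, 44.60)

# Made-up sample records, not real report data.
SAMPLE_REPORTS = [
    {"lat": 33.26, "lon": 44.39, "deaths": 12, "date": "2006-07-09"},
    {"lat": 36.34, "lon": 43.13, "deaths": 4, "date": "2006-07-09"},  # outside box
]


def in_bbox(lat, lon, bbox=BAGHDAD_BBOX):
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


def color_for(deaths):
    """KML colors are aabbggrr hex. Thresholds here are arbitrary."""
    if deaths >= 10:
        return "ff0000ff"  # red
    if deaths >= 3:
        return "ff0080ff"  # orange
    return "ff00ffff"      # yellow


def _sub(parent, tag, text=None):
    el = ET.SubElement(parent, "{%s}%s" % (KML_NS, tag))
    if text is not None:
        el.text = text
    return el


def homicides_to_kml(reports, bbox=BAGHDAD_BBOX):
    """Return a KML document string with one placemark per in-box report."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element("{%s}kml" % KML_NS)
    doc = _sub(kml, "Document")
    for r in reports:
        if not in_bbox(r["lat"], r["lon"], bbox):
            continue
        pm = _sub(doc, "Placemark")
        _sub(pm, "name", str(r["deaths"]))            # the count shown on the map
        _sub(_sub(pm, "TimeStamp"), "when", r["date"])  # enables time animation
        icon_style = _sub(_sub(pm, "Style"), "IconStyle")
        _sub(icon_style, "color", color_for(r["deaths"]))
        # KML coordinates are longitude,latitude,altitude (in that order).
        _sub(_sub(pm, "Point"), "coordinates", "{},{},0".format(r["lon"], r["lat"]))
    return ET.tostring(kml, encoding="unicode")
```

Calling `homicides_to_kml(SAMPLE_REPORTS)` yields a document that a KML viewer can step through by date. The `TimeStamp` element is what makes the animation possible, and the longitude-first coordinate order is the detail most likely to trip you up.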
For the original version of the graphic, Kevin included numbers of civilian deaths for each year. Since each report included a summary table of people wounded and killed, it took only a trivial SQL statement to get body counts for each year in Baghdad. But the exactitude of those numbers implied a certainty that didn’t exist. We simply didn’t know how much duplication and omission there was in the raw reports, and so we decided to pull the numbers rather than use them (for a more detailed exploration of the new data’s limitations, see the Iraq Body Count project). Just because it’s easy to derive a number from the database doesn’t mean that number is correct. This may all seem academic, when we can just run a query and say “this is what the data contains.” But we don’t report on data itself. We report on reality, and we must assess how accurate a model of reality the data is.
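To show how easy the query is and how quietly duplicates inflate it, here is a small Python sketch using an in-memory SQLite database. The table name, columns, duplicate flag, and every row are hypothetical; nothing here reflects the actual schema or figures of the leaked data.

```python
import sqlite3

# Made-up rows: (id, report_date, city, civilian_kia, flagged_duplicate).
# Row 2 repeats row 1, the way a second unit might re-report the same incident.
DEMO_ROWS = [
    (1, "2006-07-09", "Baghdad", 12, 0),
    (2, "2006-07-09", "Baghdad", 12, 1),
    (3, "2006-11-23", "Baghdad", 5, 0),
    (4, "2007-03-01", "Baghdad", 4, 0),
    (5, "2006-07-10", "Mosul", 7, 0),
]


def build_demo_db():
    """Create an in-memory database holding the made-up rows above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE reports (
               id INTEGER PRIMARY KEY,
               report_date TEXT,
               city TEXT,
               civilian_kia INTEGER,
               flagged_duplicate INTEGER
           )"""
    )
    conn.executemany("INSERT INTO reports VALUES (?, ?, ?, ?, ?)", DEMO_ROWS)
    return conn


def yearly_deaths(conn, city, include_flagged_duplicates=True):
    """Sum reported civilian deaths per year for one city."""
    sql = """
        SELECT strftime('%Y', report_date) AS year, SUM(civilian_kia)
        FROM reports
        WHERE city = ? AND (? = 1 OR flagged_duplicate = 0)
        GROUP BY year
        ORDER BY year
    """
    return conn.execute(sql, (city, int(include_flagged_duplicates))).fetchall()
```

With these rows, the naive 2006 total for Baghdad is 29 and the total excluding the flagged duplicate is 17. Both come back as exact-looking integers, and the query alone cannot tell you which is closer to the truth. A duplicate flag like this one exists only where someone has read the reports by hand, which is why a single reviewed day could be shown accurately while the yearly totals could not.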
So yes, it’s work to report on data, but it’s thrilling work at times. Hopefully, you’ll be inspired to work with data yourself.
Postscript: I forgot to include an interview with the CJR about the graphic last fall.