Working With Wikileaks

I recently had the honor of speaking at a Hacks/Hackers meetup about working with the Wikileaks data. I talked about some of the work I had done with Sabrina Tavernise and Kevin Quealy looking at the sectarian violence in Baghdad during the occupation. Sabrina’s final story “Mix Of Trust And Despair Helped Turn Tide In Iraq” is our data-driven look back at the violence, and A Deadly Day in Baghdad is the graphic we ran alongside it, mapping the violence both on a single day and over the entire course of the occupation.

You can read my slides with their notes here:

My talk was organized around three principles of data journalism. These are just three rules that popped into my head while planning the talk. I am sure there are more than three, and that there are more artful ways to express them, but these three rules articulate the things I think about when working with data. They also helped me explain why I hate word clouds with such a violent passion. So, what are these rules?

Find A Narrative. Nothing earns my scorn more than when a site does “data reporting” by just posting a raw PDF or a data table without any context. Word clouds are equally bad. If you want to report on data, you must first find a narrative in it. It doesn’t have to be the narrative — rich data sets have many narratives — but you need to find at least a narrative. This is essential for several reasons. It allows you to focus your inquiries around a single story and investigate specific questions. It also gives readers a way to understand the data and get involved with it. Finally, it gives us a way of narrowing a large data set down to the information needed to report the story, without overwhelming readers with distractions. Human beings understand the world through narrative; find the narrative if you want to understand your data.

Provide Context. The other reason I am usually dismissive of calling raw data dumps journalism is their lack of context. Most data is complex, with its own nuances, jargon, and quirks. This was especially true of the Wikileaks War Logs, which were heavily laden with military jargon and inside information. It became obvious from the beginning that we would need to do more to explain these documents and what was happening in them, or we would overwhelm our readers. And so, when charting the violence in Iraq, we provided some information about why there were concentrations of violence in certain neighborhoods. And when presenting the raw field reports, we added an inline jargon translator to help decipher that an LN is a local national or a COP is a combat outpost (my personal favorite bit of inexplicable jargon: a “monica” in Iraq is a white Toyota Land Cruiser).

Work The Data. I always have to laugh when some technology X is declared the savior of journalism, or when someone writes a script to scrape twitter and declares it investigative journalism. So far, technology has done a far better job of collecting data; analyzing it still remains very difficult. There is often no magic technology that will figure the data out on its own; you’re going to have to work with it (I continue to feel everybody should read Brooks’ essay “No Silver Bullet” for its notions of accidental complexity and essential complexity). Furthermore, no data is perfect; when working with data, you must be aware of its inherent flaws and limitations. In the case of the Wikileaks data, there were many questions without answers. How much duplication was there in the database? How was the data on civilian homicides collected? Could we definitively say whether the Surge worked, or was the decline in violence mainly because there was nobody left to kill and little will to endure the atrocities any longer? Having an idea of the methodology would’ve helped, but with this data, we didn’t even know who was collecting the homicide reports. Were there forces with the responsibility to assemble an accurate record? Or was it all just happenstance? When the killings waned, was it an actual decline in fatalities, or an illusion caused because units were too busy in the Surge to record the violence still happening? Could the leaker have forgotten to copy some records?

From a journalistic standpoint, this data was troubling. But luckily we had a few things to compare it against. Sabrina Tavernise reported from Baghdad while the sectarian violence raged, so she was able to name the neighborhoods where we would expect to see the worst violence. To help with this, I figured out how to extract MGRS coordinates from the reports and plot them on the map. This gave us a more accurate view of where the homicides were reported, although that too was not perfect (some coordinates did not geocode, and there were 40+ reports that geocoded to Antarctica). To get a feel for duplicates, we picked a single day, and Sabrina read through each homicide report, flagging roughly 20% of the reported deaths as duplicates (and finding cases where fatalities were not counted correctly as KIAs), so we could accurately show the violence of that day.

On a parallel track, I really wanted to show how the violence rolled across Baghdad’s neighborhoods, so I hacked up a Ruby script to output all homicides within a bounding box for Baghdad to a KML file, with a colored circle and count indicating the number of deaths at each location. I added a little extra logic to split roundup reports listing bodies found throughout Baghdad into individual points on the map. Finally, I found a map of Baghdad’s neighborhoods to use as a background. When the whole timeframe of the data was stepped through as an animation, you could see the violence surge in religiously mixed neighborhoods like Dora and Ghazaliya, which Sabrina knew from first-hand experience were where the worst violence had happened. To confirm we weren’t missing any data in the leak, we matched the weekly homicide chart against the publicly released SIGACTS III numbers to verify the curves looked the same (the counts were larger in the Wikileaks data because of duplicates). This data was ultimately used to visualize the year-by-year chart of the violence at the bottom of the graphic.
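For the curious, here is a minimal sketch of the kind of Ruby script I’m describing. The CSV file, column names, and bounding box values below are hypothetical stand-ins for the real War Logs data, and the color-coded circles are reduced to bare placemarks to keep things short:

    require 'csv'

    # Rough bounding box around Baghdad (approximate values, for illustration only).
    BAGHDAD = { min_lat: 33.18, max_lat: 33.45, min_lon: 44.20, max_lon: 44.55 }

    # Assume the homicide reports were exported to CSV with hypothetical columns:
    # date, latitude, longitude, civilian_kia.
    reports = CSV.read('baghdad_homicides.csv', headers: true)

    in_box = reports.select do |r|
      lat, lon = r['latitude'].to_f, r['longitude'].to_f
      lat.between?(BAGHDAD[:min_lat], BAGHDAD[:max_lat]) &&
        lon.between?(BAGHDAD[:min_lon], BAGHDAD[:max_lon])
    end

    File.open('baghdad_homicides.kml', 'w') do |kml|
      kml.puts '<?xml version="1.0" encoding="UTF-8"?>'
      kml.puts '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
      in_box.each do |r|
        deaths = r['civilian_kia'].to_i
        next if deaths.zero?
        kml.puts "<Placemark><name>#{deaths} killed</name>" \
                 "<TimeStamp><when>#{r['date']}</when></TimeStamp>" \
                 "<Point><coordinates>#{r['longitude']},#{r['latitude']}</coordinates></Point>" \
                 '</Placemark>'
      end
      kml.puts '</Document></kml>'
    end

The real script also grouped nearby reports and styled the circles by death count, but the overall shape was about this simple.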

For the original version of the graphics, Kevin included numbers of civilian deaths for each year. Since each report included a summary table of people wounded and killed, it took only a trivial SQL statement to get body counts for each year in Baghdad. But the exactitude of those numbers implied a certainty that didn’t exist. We simply didn’t know how much duplication and omission there was in the raw reports, and so we decided to pull the numbers rather than use them (for a more detailed exploration of the new data’s limitations, see the Iraq Body Count project). Just because it’s easy to derive a number from the database doesn’t mean that number is correct. This may all seem academic when we can just run a query and say “this is what the data contains.” But we don’t report on data itself. We report on reality, and we must assess how accurate a model of reality the data is.
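For what it’s worth, here is roughly the kind of trivial query I mean, with hypothetical table and column names standing in for the actual War Logs schema, and assuming the reports were loaded into a local SQLite database. Easy to run, but as I said, easy does not mean correct:

    require 'sqlite3'

    db = SQLite3::Database.new('warlogs.db')

    # Hypothetical schema: reports(report_date, region, civilian_kia, civilian_wia, ...)
    sql = <<-SQL
      SELECT strftime('%Y', report_date) AS year,
             SUM(civilian_kia)           AS civilians_killed
      FROM   reports
      WHERE  region = 'BAGHDAD'
      GROUP  BY year
      ORDER  BY year
    SQL

    db.execute(sql) { |year, killed| puts "#{year}: #{killed}" }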

So yes, it’s work to report on data, but it’s thrilling work at times. Hopefully, you’ll be inspired to work with data yourself.

Postscript: I forgot to include a link to an interview I did with the CJR about the graphic last fall.

Going to CAR 2011 Next Week

One of my perennial complaints about the whole “future of journalism” topic is that it seems to be dominated less by people who actually create things and more by people who either come up with trivial ideas they proclaim will save journalism or observe from a remove and discuss what it all means (no, I’m not naming names). Many conferences leave me similarly unenthused. Here are the journalism panels for SXSW this year. And every other week there is another panel on what Wikileaks means for journalism, but rarely one on how people have used it for journalism (tasteless self-promotion there).

Which is why I find myself strangely excited to be attending the 2011 Computer Assisted Reporting conference this year. For those of you unfamiliar with the term, Computer Assisted Reporting is what database/web journalism was called before it got cool (and before they could put data on the web). And this year a surprisingly large number of data journalists I know — but have never met — will be attending. I suppose there will be a little culture clash between the young punks and the old guard, but I also know it’ll be a conference full of people discussing how to do things, and that is what makes me excited.

Are you going to be at the CAR Conference too? If so, please say hi.

Using Varnish To Keep Web Surges Tamed

For those of you who might be interested, I wrote a post for the NY Times Open blog on the Varnish web cache titled Using Varnish So News Doesn’t Break Your Server. For those of you who prefer watching video, I also talked to Webpulp.tv about Varnish. That is all.

Waves of Annotation

Two weeks after the announcement (and some big waves made by Facebook), there still seems to be serious interest in Twitter annotations. And yet, we have to wonder how much annotations will catch on, especially since adoption will be built from loosely coordinated actions among a large number of players.

I am admittedly no futurist, but I’ll take a stab at prognostication. For starters, I’d avoid waxing rhapsodic about our glorious semantic-web future. I would say it’ll be at least six months until we see widespread client usage of twitter annotations, and it may be that only a small fraction of tweets will be annotated even in the next few years. Progress may be fitful and sporadic, driven in the early days by some major adopters. Most users will remain unaware of the functionality or opt out.

How can I guess the future? By looking at the past. Twitter has already introduced a level of annotation to tweets: geolocation. And it didn’t exactly spread like wildfire. Five months after its launch, geotagged tweets are still a rarity in the general stream (a few hours of collection from the Twitter sampling API revealed that only about 0.85% of tweets had locations). Of course, annotations are not exactly the same as geolocation: users may be more interested in sharing what books they’re talking about than where they are; but geolocation suggests a general model for adoption that will probably hold. I’m guessing we’ll see the following waves of annotation:
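In case you’re wondering how I eyeballed that 0.85% figure, here is a minimal sketch, assuming you’ve already saved a few hours of the sampling stream to a file with one tweet’s JSON per line (the capture itself and its authentication are left out):

    require 'json'

    total = geotagged = 0

    # sample.json: one status per line, captured earlier from the sampling stream.
    File.foreach('sample.json') do |line|
      tweet = JSON.parse(line) rescue next
      next unless tweet['text']            # skip deletes and other non-status messages
      total     += 1
      geotagged += 1 if tweet['geo'] || tweet['coordinates']
    end

    puts "#{geotagged} of #{total} tweets had locations " \
         "(#{(100.0 * geotagged / total).round(2)}%)"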

Wave 0: Early Hacking When Twitter Annotations launch, expect to see an initial burst of one-off hacks as developers put the feature through its paces and play with possibilities both silly and sublime. Annotations will probably roll out with a developer wiki or mailing list, which means some standards will coalesce pretty quickly, although some may fall out of favor in later stages.

Wave 1: Automated Tweets (1-3 months later) The next major wave of annotated tweets will come when the various websites that post automated tweets start adding the metadata they already have to their messages. For instance, we’ll probably see the following annotations right off the bat (a rough sketch of one such payload follows the list):

  • MediaRSS or similar extensions for describing photographs on photo-posting services like TwitPic or YFrog
  • Product information alongside “Tweet This Product” links from retailers like Amazon. This will probably include UPC or ISBN as well as links to a purchasing page.
  • Book-specific metadata from sites like Goodreads or Readernaut where users rate books they have just read. Many of these sites will likely use Facebook’s Open Graph Protocol, since they will already want Facebook integration as well.
  • “Like” actions on various websites. Here we’ll probably see a tussle between the Open Social Graph and Activity Streams standards.
  • News-specific metadata from news feeds. This might include generalized concepts like urgency or the byline or publisher-specific metadata. Linked Data would be helpful to connect stories across various taxonomies.
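Since the annotations API hasn’t shipped yet, any example is speculative, but here is a rough Ruby guess at what a photo service’s automated tweet might carry; both the annotations parameter format and these media keys are invented for illustration:

    require 'json'

    # Speculative: a photo-sharing service annotating its automated tweet with
    # MediaRSS-ish metadata. Neither the "annotations" parameter nor these keys
    # are final; this is just the general shape of a namespace/key/value payload.
    annotations = [
      { 'media' => {
          'url'           => 'http://twitpic.com/abc123',
          'thumbnail_url' => 'http://twitpic.com/show/thumb/abc123',
          'type'          => 'image/jpeg',
          'width'         => '600',
          'height'        => '400'
      } }
    ]

    status = {
      'status'      => 'Lunch. http://twitpic.com/abc123',
      'annotations' => JSON.generate(annotations)
    }
    # The status hash would then be POSTed to statuses/update as usual.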

Automated services may very well remain the predominant source of annotated tweets. For instance, remember that figure about geotagged tweets earlier? Foursquare checkins accounted for 32% of them, and I imagine it would be higher if I ran my sample at night. More on that below.

Wave 2: General Standards Emerge (2-4 months) In the beginning, everybody will probably annotate alone: Amazon will use an amazon namespace, the New York Times will use nyt, and every twitter photo-sharing site will use the media namespace in widely different ways. In other words, everybody tweets alone, using their own terminology and taxonomies, so searches for things like tweets about books or Barack Obama will return only a fraction of annotated tweets. Ultimately, certain conventions will predominate (whether adapted from existing standards or created entirely new), precisely because it’s easier to build tools against a few standards than against many.

Wave 3: Client Support (6-9 months) Eventually, some API clients will add support for annotation within the client. I doubt clients or the twitter website will just add direct fields for entering annotation triples; that would be a user-experience disaster. Instead, we will probably see the following features rolled out once the automated tweeting services have established some standards:

  • Display of extended URL information contained within tweets.
  • Displaying thumbnails or other metadata from annotated picture tweets.
  • ISBN and UPC lookup against Amazon or other retailers (with purchasing links, naturally).
  • Allowing the user to explicitly annotate tweets about books or movies via a pop-up wizard.
  • If Facebook adds Open Social Graph annotation for user statuses, then many clients that support posting to both Twitter and Facebook will likely add annotations using the Open Social Graph namespace.

I would guess, though, that automated services will always be the predominant producers of annotated tweets, and that clients will mostly focus on displaying annotations, simply because the user experience of a third-party service will almost always be more fun than the bare information sharing of a twitter client. This is why Foursquare accounts for a hefty chunk of geotagged tweets. Similarly, a user on GoodReads gets cumulative summaries, search, and other features beyond what would be available in most twitter clients. Ultimately though, the reason there won’t be many people manually annotating their tweets is that it runs counter to the reason we joined twitter in the first place: unlike a blog, it’s just one simple text entry box, and any additional input (no matter how seamless) will most likely never get used. And so, from here on out, we’ll probably see decreasing adoption rates for annotations on twitter.

Wave 4: Automated Entity Extraction and Aggregation (9-12 months). So, it’s unlikely that many users will manually annotate their tweets in clients (although they may be willing to use third-party sites that will). But what if annotation were mostly automatic? Consider semantic analysis in the client that watches your typing and suggests annotations below the box, to be reviewed or deleted, the same way Twitter.com suggests your neighborhood. This might increase adoption to some degree, but I remain skeptical. More likely, we’ll see a larger uptick from the many retweeting bots on twitter when they add semantic analysis and annotations to their bags of tricks. So, the @nytwriters account might retweet and add the annotation og:organization='The New York Times' to tweets.
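To sketch what such a bot might look like, here is a toy version in Ruby, with naive keyword matching standing in for real semantic analysis, placeholder handles, and a speculative annotation payload, since none of this API is live yet:

    require 'json'

    # Hypothetical handles the bot watches for; real entity extraction would be
    # much smarter than a lookup table.
    TIMES_WRITERS = %w[writer_one writer_two writer_three]

    def retweet_with_annotation(tweet)
      return nil unless TIMES_WRITERS.include?(tweet['user']['screen_name'].downcase)
      {
        'status'      => "RT @#{tweet['user']['screen_name']}: #{tweet['text']}",
        # Speculative payload format for an organization annotation.
        'annotations' => JSON.generate([{ 'og' => { 'organization' => 'The New York Times' } }])
      }
    end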

Final Thoughts So, what’s the ultimate takeaway from this lengthy piece? Mainly, annotations will probably be adopted in distinct waves, with the largest bump in the middle resulting from automated services. And, despite some hopes to the contrary, annotations will probably be used on only a very small fraction of tweets, the bulk of which will come from automated services like TwitPic, YouTube, or Foursquare. I am not saying annotations will be useless; on the contrary, nothing is more powerful than giving sites that have metadata the tools to share it. But as with most other forms of media, freeform text (with all of its ambiguities) will always remain predominant on twitter; for instance, if your tools just focus on retrieving tweets annotated movie:title="Avatar", you’ll indeed avoid all those tweets about people’s twitter avatars, but you’ll also miss almost all of the conversation about the movie (the precision/recall curve strikes again).

But, I could be wrong. Thoughts?

The Appeal of Annotations

Last week, I had the honor of speaking at Twitter’s Chirp conference, where I talked a little about some of the New York Times’ upcoming integration of @anywhere and shared some fun statistics I had unearthed along the way. For instance, someone tweets a link to a New York Times story once every 4 seconds. Like most developers there, I will admit to some mixed feelings about the conference beforehand, but I was pleasantly surprised by some of the features on the twitter roadmap (and I enjoyed meeting some of the developers there face-to-face). The most exciting feature? Annotations.

This upcoming feature allows programs to submit up to 2K worth of annotations with a tweet. Annotations themselves are triples of a namespace, key, and value, and there are relatively few restrictions beyond that. Presumably multiple values for a given key are allowed, although it’s unclear yet how they will be represented in the format. Beyond that, twitter is stepping back, preferring to see what standards and conventions emerge from the developer community rather than dictating the usage of annotations.
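Since the format isn’t final, take this as a guess rather than gospel, but here is roughly how I picture a client assembling and size-checking an annotations payload in Ruby; the namespaces, keys, and the exact 2K check are all assumptions:

    require 'json'

    # Each annotation is a namespace pointing at key/value pairs, and several can
    # ride along with a single tweet. These namespaces and keys are invented.
    annotations = [
      { 'book'   => { 'title' => 'An Example Novel', 'isbn' => '0000000000' } },
      { 'review' => { 'rating' => '4' } }
    ]

    payload = JSON.generate(annotations)

    # Twitter has said annotations will be capped at roughly 2K per tweet, so a
    # client ought to check the serialized size before posting.
    raise 'annotations payload too large' if payload.bytesize > 2048

    puts payload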

It’s a smart decision, but not one without drawbacks; too much passivity could lead to early fragmentation and confusion: for instance, we’ll probably see a proliferation of annotation formats for the url, info, and possibly even media namespaces. Such confusion might delay the adoption of annotations by twitter clients, since inconsistent annotations are harder to support than none at all. Of course, it is also possible that the most prominent clients may carve out their own specific functionality via privately documented annotations, and so there might be a tweetdeck or seesmic namespace before there are general ones. And so, while I think Twitter, Inc. should indeed stay away from specifying annotations, I think it would be very helpful if they provided an official space (perhaps a wiki on dev.twitter.com) for developers to document and discuss possible uses for annotations. And it would be great if a few suggestions were already there for developers looking to get started before annotations launch. So, what kind of annotations might make sense? Here are some thoughts of mine:

  • Url Information – I don’t think annotations will eliminate tiny URLs from tweets (at least not in the next few years), but I do think it makes sense for annotations to contain expanded or canonical URLs for the tiny URLs mentioned in the body. This would allow researchers to analyze URLs years after the shortener vanishes (although it could also create an opportunity for phishing where the tiny URL does NOT go to the canonical destination provided).
  • Media – the MediaRSS standard seems like a natural fit for services like Twitpic and YFrog, allowing tweets to specify thumbnail URLs, dimensions, etc.
  • Product information – ReadWriteWeb wrote an extensive piece about annotations in which they suggested that existing product identifiers like ISBNs would make for excellent annotation material. I agree.
  • Urgency – some sort of mechanism for indicating that some tweets are breaking news or urgent in some other way has been suggested by a few developers I spoke with. Of course, this could be abused by that one guy you know who always sends all his emails with maximum priority.
  • Meta – A namespace for specifying information about the annotations in the document. It’s unclear whether we would want one place for this or a meta key in each namespace.
  • CSS – Finally, with the font-family key, you CAN tweet in Comic Sans. This is me being silly, but one could theoretically see some artsy use cases for tweet-specific CSS styling specified via annotations.
  • Linked Data – The most intriguing potential use case of annotations is Linked Data. Most tweets from the @nytimes twitter feed are stories with internal cataloging information that says what the story is about (the people, the places, organizations, etc.). This information is useful in itself, but when it is linked to an outside taxonomy like DBPedia, it becomes compatible with global taxonomies. Meaning, you can search for tweets tagged ‘Twitter (Organization)’ and find stories tagged with any of the linked taxonomies.

These are just a few ideas off the top of my head, and I’m looking forward to what proposals other developers might make. Of course, the real killer use of annotation is not in displaying tweets on the timeline but search and the streaming API. Imagine being able to retrieve tweets tagged with a certain annotation key, or key/value, or even with a certain namespace. Combine this with a generalized annotation scheme like Linked Data and it’s suddenly possible to search for all tweets with images or links to books or stories about fine dining. Or so the hopes go. Reality will likely be messier.

For starters, adoption of annotations will probably be halting and inconsistent (that’s the subject of another post), largely dependent on the ability of twitter client developers to make it happen. Text search is not going away anytime soon. And annotations will not eliminate confusion by themselves. For instance, it seems to me that the majority of annotation use cases will describe what a tweet is linking to rather than the contents of the tweet itself. Similarly, I imagine we will see automated annotation in some clients where the word Paris in the text leads to the tweet being annotated as being about the city of Paris when it really isn’t. Of course, this is all just quibbling that would make a research librarian proud, but it matters to some.

Still, let’s not lose our enthusiasm for annotations. This could be big, and I’m excited to see where it will wind up. Let’s get out there and start coding.

Update: And then, just like that, Facebook announced the Open Graph Protocol, which among other things specifies an annotation namespace that can include things like book information, people mentioned, etc. It’s an attractive standard defined for Facebook’s “Like” mechanism that will probably be quickly adopted within Twitter annotations as well. Unfortunately, with the exception of UPCs and ISBNs, all Open Graph fields are arbitrary text. This means there will still be a lot of ambiguity within the annotations compared to Linked Data (for instance, people might tag “Barack Obama”, “Obama”, “Barack Hussein Obama”, and “President Obama” for tweets about Obama), but it also means there might be greater usage.

Like a Floppy Disk in the Sky

Back in the heady days of 2008, one of my nicknames in the newsroom was Mr. Twitter, due to my work on the @nytimes twitter account (and other automated pages) and my work (with others here at the NYT) to coax more members of the Times into signing up for twitter on their own. I have always found Twitter’s spartan user experience fascinating: the short message sizes, the basic social networking layer, the unrelenting emphasis on communication instead of social or professional networks. As I explained in a few talks I delivered here at the Times, Twitter is compelling precisely because it offers so little functionality on the surface: it leaves people to do what they do best and talk away the silences one status update at a time.

So perhaps unsurprisingly, Twitter’s brevity has been an impetus for user-created invention. Now-official conventions like the @reply and the hashtag (I am perhaps the only one disappointed they aren’t called octothorpetags) originated as user inventions to scaffold useful functionality onto twitter’s limitations. But text markup can only do so much. For richer communication like photos, videos, and music, twitter falls short and can only send users elsewhere to pick up the pieces of the conversation.

For users who want to attach photos or other media to their messages, the only option has been to post the content on a third-party website and embed the link in a tweet. Furthermore, given the space constraints of twitter, it is often necessary to use a URL shortening service that acts as a proxy, providing a compact URL that is expanded into the full URL upon request. The limitations of such services have been noted many times by now; they include obfuscation, fragility, impermanence, and a disconcerting association with the nation of Libya.

But the biggest drawback of using third-party websites to host photos and videos is that it wrenches readers out of the firehose of the status update stream and forces them to visit the rest of the web. And it’s scary out there. Such transitions can be jarring (not to mention impossible for users of SMS or the Peek twitter devices), and they have been tolerated thus far by the status update services only because there has been no alternative. Until now.

Introducing TweetFTP (the Tweet File Transfer Protocol), a revolutionary new approach for sharing files within the contents of tweets themselves. It works by encoding the file to be transferred as a series of tweets (plus a few header and footer tweets). No more guessing whether to use Twitpic or YFrog. No more having to wrench yourself out of the stream just to see a picture of someone’s sandwich. No more having to post your photos to some randomly named photo service that’ll run out of funding in 3 months. TweetFTP instead stores them on twitter’s servers, which operate in the cloud somewhere. And everything important is in the Cloud these days, so tweetftp is like a floppy disk in the sky! And it fully embraces all of Twitter’s conventions: want to send a file to one person instead of the world? Prefix the tweets with their username. Want to tag a public file with keywords? Use all the hashtags you want to describe it. Don’t work against twitter, but embrace it; send that photo to Grandma in only 35,693 tweets!
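In the spirit of the joke, here is a bare-bones Ruby sketch of the encoding half; the header wording and the 100-character chunk size are inventions of this sketch rather than anything from the actual tweetftp spec:

    require 'base64'

    # Split a file into tweet-sized Base64 chunks, bracketed by header and footer
    # tweets. Consult the real tweetftp spec for the actual format; this is only
    # the general idea.
    def tweetftp_encode(path, chunk_size = 100)
      data   = Base64.strict_encode64(File.binread(path))
      chunks = data.scan(/.{1,#{chunk_size}}/)

      tweets = ["tweetftp BEGIN #{File.basename(path)} #{chunks.size}"]
      chunks.each_with_index do |chunk, i|
        tweets << "tweetftp #{i + 1}/#{chunks.size} #{chunk}"
      end
      tweets << "tweetftp END #{File.basename(path)}"
    end

    tweets = tweetftp_encode('grandma.jpg')
    puts "That photo for Grandma comes to #{tweets.size} tweets."
    # Each one would then be posted via statuses/update -- at 150 an hour, patience required.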

However, there are some caveats. Currently, the only language support is a hastily coded Ruby driver, although the standard is Open Source and I expect other drivers will be created shortly. In addition, twitter’s API currently places some artificial constraints on transmission speed. Twitter allows API clients to post only up to 150 messages per hour, which practically limits this mechanism to a maximum speed of 0.019 kbps. But do not despair! Serious users of twitter may apply for an elevated rate limit of 20,000 requests per hour, which increases throughput to a comparatively blazing 2.51 kbps. A bigger problem is that this mechanism might be used for illegal file sharing. And indeed, since it takes a mere 60,000 or so tweets (or about 17 days) to upload an MP3, it would be pretty hard to resist that temptation. I can only appeal to users’ moral judgment to ensure tweetftp is used only for the right reasons.

So there you have it. To learn more, please read the tweetftp specification and let me know if you implement support for it in your own favorite language. I’ll send you a badge for your blog. Via tweetftp!

Next post: Will tweetftp save journalism?

The Long And Short Of It

Recently, Jay Rosen pointed out that Bitly analytics seemed to report many more hits than other URL shorteners. The problem looked eerily familiar, and then I realized it was the same issue I had observed a few months back and had even half-written a blog post about that never got posted (if you read my blog, this should come as no surprise). In the interest of explaining things thoroughly, here finally is a post on the matter.

Bitly has become the preeminent URL shortening service, largely due to the fact that it provides easy access to analytics on usage (you merely need to append a plus sign to the shortened URL). This is the reason I converted all of the automated New York Times twitter accounts over to bitly 3-4 months ago, in the hopes of sharing analytics of twitter user behavior with the world. However, when I collected a month’s worth of link usage statistics from Bitly, I noticed that the quoted totals seemed unusually high: 3-5x the hits recorded by our page analytics software. At first I thought it might be a result of comparing apples to oranges: the bitly counts might include other users shortening the same URLs, and our internal analytics might be undercounting twitter traffic, since very few twitter users come to the site via twitter.com itself.

Fortunately, though, I have a way to check apples directly against apples. When my automated script posts a URL to twitter, it appends a query string to the URL before shortening it, to indicate that the click is coming from a twitter client (and also from which account). So, if you look at the URL for this movie review of District 9 posted to the nytimes twitter account last week, you’ll notice it has some additional arguments on the end:

http://movies.nytimes.com/2009/08/14/movies/14district.html?src=twt&twt=nytimes

You can see it has ?src=twt&twt=nytimes appended to the end. This allows me to tag all hits from my automated accounts and directly compare the numbers from bitly (because nobody else is shortening that specific URL) to the numbers reported by internal analytics (because those hits are coming solely from the automated accounts). The really savvy among you might notice I can also distinguish traffic from the different accounts on twitter. But this is where the discrepancy became apparent. Here are the counts for that URL in bitly and in our internal analytics:

  • WebTrends: 928 hits
  • Bitly: 1505 hits (162%)

The difference can vary depending on the link. Some counts can be very close. But a few months back, I saw overcounts of up to 4-5x for Bitly. So, what’s happening here? I contacted Bitly support about it, and they explained that Bitly counts expansions of the URL and not necessarily clickthroughs. At first, this seemed like a design flaw in the service, but I realized it’s an unavoidable limitation of every URL shortener out there, and the problems are probably most apparent with Bitly because it is the most prominent one in use. Let me explain.

A URL shortener works like this: a user sees a short URL like http://bit.ly/TWf3E. The web browser sends a request to the bitly web server for that page. The bitly server counts that as a hit and sends an HTTP redirect reply back to the web browser with the expanded URL, which the browser then hops to. This is how it should work, and in an ideal world the subsequent request would register a hit on the real destination. But bitly has no way to know whether the user followed the redirect. It could be that the user’s machine crashed. It could be that the browser doesn’t follow redirects. It could be a web crawler that gets counted by bitly but is ignored by the real site’s analytics. Or it could be a tool that makes the request only to display the expanded URL to a user. All of these scenarios get counted as a “hit” at the shortener without being a clickthrough to the remote site.
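To make the mechanics concrete, here is a small Ruby sketch that expands a short URL the way a browser (or a lazy tool) would: request it and read the Location header of the redirect. Every request like this registers as a hit on bitly’s side, whether or not anyone ever fetches the destination page.

    require 'net/http'
    require 'uri'

    # Expand a short URL by requesting it and reading the redirect's Location
    # header, without following it. Bitly counts this request as a "hit" even
    # though the destination page is never loaded.
    def expand(short_url)
      response = Net::HTTP.get_response(URI.parse(short_url))
      response['location'] || short_url   # 301/302 responses carry a Location header
    end

    puts expand('http://bit.ly/TWf3E')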

To their credit, bitly has recognized the problem and has been working on various mechanisms to reduce this overcounting, and I’ve seen a definite “drop” in bitly counts as more filters are added to remove bogus hits. This has included filtering out bots by recognizing the signatures of their actions. It has also included the addition of an expand method in the API for developers who want to expand URLs without requesting the web page and being counted (it also returns statistics on clicks). But no countermeasures will be perfect. Some bots may be hard to detect. And there are several reasons why developers might prefer to expand Bitly URLs by making an HTTP GET request like web browsers do instead of calling the API: it’s simpler to use, it doesn’t require an API key, and it’s all but guaranteed to be more reliable (if bitly is having catastrophic issues, which service do you think is going to get the most attention: the public website or the niche developer API?). Apart from doing the right thing, there’s not much incentive for any developer to use a shortener’s API just to expand URLs. And so, it’s likely that some overcounting will always be with us for any URL shortener.

So, who cares? The ugly truth about web analytics is that nearly all of it involves some amount of error. The phrase should be “lies, damned lies, and web analytics.” Analysis of web server logs can overcount automated traffic from web crawlers. Modern analytics programs that embed javascript on pages to thwart bots can undercount clients that don’t load the entire page or don’t execute javascript (a particular concern with mobile browsers). Panel-driven approaches like Nielsen can severely undercount because they extrapolate from a focus group that might not closely represent the actual Internet audience. Some techniques might be more accurate than others, but the central problem of web analytics is that you never know for certain how many of your visitors are actually human. And it gets even fuzzier when you attempt to discern unique visitors from the flow of hits that is your site’s traffic; any estimation of visitors is built on assumptions about web usage and thus adds its own levels of distortion and error. I would hope many web analytics experts understand this and approach stats with the appropriate grains of salt. For instance, if I see a surge in bitly clickthroughs for a New York Times article about twitter, is that incontrovertible evidence that twitter-themed stories are eagerly consumed by twitter users? Or is it more a case of automated bots grabbing and expanding any link tagged twitter? Or is it both? You’ll never know with absolute certainty.

This point matters because bitly is one of the only truly open sources of analytics out there, and people quote its numbers like they come straight from God, often because they have no alternatives. Most companies (my employer included) will not share their statistics with the general public, although I have not yet chased down the reason why. And bitly hits are recorded at the origin of following a link, rather than at the destination, making them essential for understanding a service like twitter. But such ubiquity also deserves some skepticism. That certain argument about the worth of twitter followers suffers if actual traffic is half or a tenth of what is reported. That outside audit of site traffic might include many small errors adding up to a big deviation from internal numbers. Bitly remains a highly useful service, but it’s important for all of us to remember how the hits are counted.
