Q&A: EveryBlock's Adrian Holovaty on the LAPD and accuracy standards [UPDATED]
Updated April 9, 11:30 a.m.: EveryBlock's Adrian Holovaty has posted about the changes his site has made to mitigate the geocoding errors.
Pictured above is a screenshot from EveryBlock showing how the use of flawed data from the LAPD turned the relatively peaceful block between City Hall and the Los Angeles Times into a large, foreboding cluster of crime. Careful readers will note that the address listed is not downtown, but in the heart of Hollywood.
Adena Schutzberg at Directions magazine blogs:
My main question is this: EveryBlock took the same data feed for its L.A. maps, and it seems, ended up with same inaccuracies. Is that because they use the same geocoding and data against which to geocode? That's not clear from the article.
In hopes of answering Schutzberg's question, and others like it, here is the full transcript of my interview with EveryBlock's founder, Adrian Holovaty. At EveryBlock's request, we conducted the interview by e-mail on April 3. I sent a list of questions; Holovaty's responses are posted below.
What standards for accuracy must a data source meet before it can be republished on EveryBlock?
In the public records on our site, we publish what the government makes available to us. We have to assume at some fundamental level that the governments aren't feeding us data that is complete garbage -- and if any problems are pointed out to us, we make an effort to investigate.
Before we publish a set of public records on EveryBlock, we make an effort to contact the relevant government agency to learn the intricacies of the data and any of its quirks; we try to describe any uncommon terminology in the data; and we give it a good, old-fashioned human "sniff test" of whether the data appears to be spread out in ways that make sense. In a number of cases, we've uncovered some peculiar trends in data and have reported them back to the government agencies so they can investigate and potentially fix their systems and processes.
Obviously, we can't check every single crime record against original paper reports at the police station, because we're dealing with thousands of records. In many ways, we're in uncharted territory here. News sites like EveryBlock, the New York Times and the Los Angeles Times are only in the early days of publishing databases online, and the rules, strategies and policies are still being hashed out. I don't know if anybody has a great solution to this, and we have a very experimental stance, out of necessity. We're constantly tweaking and improving our processes, as I'm sure you guys are, too.
Finally, I want to mention that we try to make it clear our public records come directly from specific government agencies. It's the database equivalent of attributing a quote to a person: "According to the LAPD, ..." I realize that republishing an error is not ideal, but at least we're upfront about where the information comes from and we do our best to fix errors when they're reported to us.
What standards for accuracy must a data set meet before it can qualify for aggregation into EveryBlock's charts and rankings?
We consider our charts navigation, and, as such, almost every type of information on our site gets the "chart" treatment -- by date, by most common lookups, by neighborhood, by ZIP code, etc.
Why is the sort of disclaimer you include with your site necessary? What is its origin?
There was no specific incident that caused us to write a disclaimer; we just proactively decided it would be the prudent thing to do. There's a simple reason for the disclaimer: every database is flawed.
Please describe the "soup to nuts" process the data goes through on its way from the LAPD to EveryBlock. Specifically, what alterations, if any, are made?
In the case of our Los Angeles crime data, we retrieve the crime data from LAPD's "Crime Maps" website every day, and it's imported into our Los Angeles site. No alterations are made to the data beyond presentational changes.
Was EveryBlock aware of the patterns of inaccuracy we identified in the LAPD data?
No, we weren't aware that some of the LAPD's crime data was being reported with inaccurate longitude and latitude figures. You have my genuine thanks for pointing this out to us; we've begun working on a solution (see below).
How will an inaccuracy like the one we've identified in the LAPD data affect that data set's status on EveryBlock?
Generally speaking, there are two types of inaccuracies -- inaccuracies in specific records and patterns of inaccuracies.
- If somebody identifies an inaccuracy in a specific record and we can't verify it, we make an effort to contact the appropriate government agency and verify it.
- If somebody identifies a pattern of inaccuracies (as happened in Charlotte with the police calls data), we also make an effort to contact the government agency and, when appropriate, make an effort to change our system such that the [in]accuracy can be avoided in the future. Sometimes we can solve this with technology; other times, the best we can do is add descriptions and explanations that we hope help people understand the data.
In the case of the LAPD data, we're working on improving our system to cross-check the LAPD's longitude/latitude with the address that they provide, and raise a flag in case of substantial discrepancy.
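To illustrate the kind of cross-check described above, here is a minimal sketch in Python. It is not EveryBlock's code. It assumes the address has already been geocoded independently by some other means, and the function names, the 1 km threshold and the sample coordinates (rough points for downtown Los Angeles and Hollywood) are all hypothetical choices made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flag_discrepancy(reported, geocoded, threshold_km=1.0):
    """Return True when the agency's reported point and the point obtained
    by geocoding the agency's address are farther apart than threshold_km.

    threshold_km is a hypothetical cutoff, not a value from the interview.
    """
    return haversine_km(*reported, *geocoded) > threshold_km


# Illustrative, approximate coordinates only (assumptions, not LAPD data).
reported_point = (34.0537, -118.2427)   # roughly downtown, near City Hall
geocoded_point = (34.1016, -118.3267)   # roughly central Hollywood

if flag_discrepancy(reported_point, geocoded_point):
    print("Flag record for review: coordinates do not match the address")
```

A record whose two points fall within the threshold passes silently; anything farther apart is set aside for a human to review rather than corrected automatically, since either the coordinates or the address could be the wrong one.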
The Charlotte blog post indicates to us that you intervened and took action to fix problem data. In contrast, your disclaimer implies that you take no responsibility for accuracy and directs readers to forward questions to the data provider. Print and broadcast media outlets expect their readers to bring questions about accuracy directly to them. Why don't you do the same?
The main goal of our disclaimer is to let people know that we're not the generators of the data. In each instance, we provide information on which agency is responsible for each particular data set. When people *do* contact us, however, we make an effort to track down the truth.
We cannot guarantee that our data is 100% accurate, because, like I said above, every database is flawed. The disclaimer is necessary because despite any effort we put into fixing problem data, there will always be dirty data.