NEWS ANALYSIS: Having lots of data isn't enough to discover truthful information; first you have to make sure it's the right data, and that it's the data most relevant to the question you're asking.
NEW ORLEANS—It was clear when I stood up to speak at the session on big data at the Society of Professional Journalists' Excellence in Journalism 2016 conference here that I wasn't addressing your average trade show audience.
The hundred or so people in front of me were all professional journalists, which meant that they were expecting a no-nonsense, practical look at how they could use vast data archives to find the truth on a wide variety of topics, ranging from political corruption to the spread of the Zika virus.
With me at the front of the room were Pam Baker, the highly respected author of Data Divination: Big Data Strategies, and Louis Lyons, chief operating officer of ICG Solutions, the company that created the LUX2016 data analysis engine and that helped with our examination of viewer reaction to last year's Democratic and Republican primary debates.
Our discussion started with my description of how we used data analysis to figure out who won last year's debates well before the major news organizations had polling data to release. But as valuable as that first effort to use big data to support a feature article was, data analysis goes far beyond what I was able to do in eWEEK's first attempt.
This is no surprise because the analysis of large data sets is still in its infancy, and while data analysis can give most news organizations important insights, figuring out how to get them is still hard.
Fortunately, I had Pam Baker at my side in this effort, and she was able to part the seas of confusion and explain what big data can do and what it can't. The first lesson was that big data analysis isn't magic: just because you have a lot of data doesn't mean it's useful data.
What really matters, Baker said, is that your big data archive contains accurate data that is most relevant to the information you hope to discover.
In addition, Baker pointed out that it's critical to know the origin of the data you're planning to analyze and to be comfortable with how that data was collected, so that you can be more confident you are working with valid information. One example Baker cited was the difference between the influenza infection rates reported by the Centers for Disease Control and Prevention and those reported by Google Flu Trends.
Google Flu Trends was an effort to estimate infection rates using only signals from online activity, primarily aggregated web searches. The CDC, on the other hand, compiled influenza infection rates from a variety of sources, including social media, but also government sources, health care providers and more.
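To make that lesson concrete, here is a minimal, hypothetical sketch of the kind of cross-source sanity check Baker's example suggests: comparing a single-source estimate against a multi-source benchmark before trusting either in a story. This is not from the session, and every number in it is an invented placeholder rather than a real CDC or Flu Trends figure.

```python
# A minimal, hypothetical sketch of a cross-source check: compare a
# single-source flu estimate against a multi-source benchmark before
# trusting either series. All numbers are invented placeholders, not
# real CDC or Google Flu Trends figures. Requires Python 3.10+ for
# statistics.correlation.
from statistics import correlation, mean

# Hypothetical weekly influenza-like-illness rates (% of doctor visits)
single_source = [1.2, 1.5, 2.1, 3.4, 4.8, 5.9, 5.1, 3.7]  # online signals only
multi_source = [1.1, 1.4, 1.9, 2.8, 3.6, 4.2, 3.9, 3.1]   # blended surveillance data

# Pearson correlation: do the two series move together?
r = correlation(single_source, multi_source)

# Mean absolute gap: how far apart are the actual estimates?
gap = mean(abs(a - b) for a, b in zip(single_source, multi_source))

print(f"correlation: {r:.2f}")
print(f"mean absolute difference: {gap:.2f} percentage points")

# A high correlation with a sizable gap means the single-source feed tracks
# the trend but misstates the level -- a reason to check its provenance
# against better-sourced data before building a story on it.
```

The point of a check like this isn't the arithmetic; it's that a reporter can see in two numbers whether a convenient data feed merely follows the shape of reality or actually matches it.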