Hadoop and beyond: A primer on Big Data for the little guy
Have you heard the news? A “data lake” overflowing with information about Hadoop and other tools, data science and more threatens to drown IT shops. What’s worse, some Big Data efforts may fail to stay afloat if they don’t prove their worth early on.
“Here’s a credible angle on why Big Data could implode,” began Gary Nakamura, CEO of Concurrent, which makes Cascading, an open-source data application development platform that works with Hadoop, and Driven, a tool for visualizing data pipeline performance. “A CTO could walk into a data center, and when they say, ‘Here is your 2,000-node Hadoop cluster,’ the CTO says, ‘What the hell is that and why am I paying for it?’ That could happen quite easily. I predicted last year that this would be the ‘show me the money’ year for Hadoop.”
While plenty can go wrong, Nakamura is bullish on Hadoop. With companies like his betting robustly on the Hadoop file system (and its attendant components in the Big Data stack), now is a strategic moment to check your data pipelines for leaks. Here’s a primer on where the database market stands, what trends will rock the boat, and how to configure your data science team for success.
Follow the leader
Risks aside, no one, not even the federal government, is immune to the hope that Big Data will bring valuable breakthroughs. Data science has reached presidential heights, with the Obama administration’s appointment of former LinkedIn and RelateIQ quantitative engineer DJ Patil as the United States’ first Chief Data Scientist in February. If Patil’s slick talks and books are any indication, he is at home in a political setting. Though building on government data isn’t new for many companies offering services in real estate (Zillow), employment (LinkedIn), small business (Intuit), mapping (ESRI) or weather (The Climate Corporation), his role should prompt many more to innovate with newly opened data streams via the highly usable data.gov portal.
“I think it’s wonderful that the government sees what’s happening in the Big Data space and wants to grow it. I worked at LinkedIn for three years, and for a period of time [Patil] was my manager. It’s great to see him succeed,” said Jonathan Goldman, director of data science and analytics at Intuit. (Goldman cofounded Level Up Analytics, which Intuit acquired in 2013.)
Defining the kernel, unifying the stack
“In the last 10 years we’ve gone through a massive explosion of technology in the database industry,” said Seth Proctor, CTO of NuoDB. “Ten years ago, there were only a few dozen databases out there. Now you have a few hundred technologies to consider that are in the mainstream, because there are all these different applications and problem spaces.”
After a decade of growth, however, the Hadoop market is consolidating around a new “Hadoop kernel,” similar to the Linux kernel. The industry-standard Open Data Platform, announced in February, is designed to reduce fragmentation and accelerate Apache Hadoop’s maturation. Similarly, the Algorithms, Machines and People Laboratory (AMPLab) at the University of California, Berkeley is now halfway through its six-year DARPA-funded Big Data research initiative. It is beginning to move up the stack and focus on a “unification philosophy” around more sophisticated machine learning, according to Michael Franklin, director of AMPLab and associate chair of computer science at UC Berkeley.
“If you look at the current Big Data ecosystem, it started off with Google MapReduce, and things that were built at Amazon and Facebook and Yahoo,” he said at the annual AMPLab AMP Camp conference in late 2014. “The first place most of us saw that stuff was when the open-source version of Hadoop MapReduce came out. Everyone thought this is great, I can get scalable processing, but unfortunately the thing I want to do is special: I want to do graphs, I want to do streaming, I want to do database queries.