You can’t have a conversation about Big Data for very long without talking about Hadoop.
Hadoop is open source software platform managed by the Apache Software Foundation has proven to be very helpful in storing and managing vast amounts of data cheaply and efficiently.
But what exactly is Hadoop, and what makes it so special? Basically, it’s a way of storing enormous data sets across distributed clusters of servers and then running “distributed” analysis applications in each cluster.
It’s designed to be robust, in that your Big Data applications will continue to run even when individual servers — or clusters — fail. And it’s also designed to be efficient, because it doesn’t require your applications to shuttle huge volumes of data across your network.
Here’s how Apache formally describes it:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Look deeper, though, and there’s even more magic at work. Hadoop is almost completely modular, which means that you can swap out almost any of its components for a different software tool. That makes the architecture incredibly flexible, as well as robust and efficient.
Hadoop Distributed Filesystem (HDFS)
If you remember nothing else about Hadoop, keep this in mind: It has two main parts – a data processing framework and a distributed filesystem for data storage. There’s more to it than that, of course, but those two components really make things go.
The distributed filesystem is that far-flung array of storage clusters noted above – i.e., the Hadoop component that holds the actual data. By default, Hadoop uses the cleverly named Hadoop Distributed File System (HDFS), although it can use other file systems as well.
HDFS is like the bucket of the Hadoop system: You dump in your data and it sits there all nice and cozy until you want to do something with it, whether that’s running an analysis on it within Hadoop or capturing and exporting a set of data to another tool and performing the analysis there.
Data Processing Framework & MapReduce
The data processing framework is the tool used to work with the data itself. By default, this is the Java-based system known as MapReduce. You hear more about MapReduce than the HDFS side of Hadoop for two reasons:
- It’s the tool that actually gets data processed.
- It tends to drive people slightly crazy when they work with it.
In a “normal” relational database, data is found and analyzed using queries, based on the industry-standard Structured Query Language (SQL). Non-relational databases use queries, too; they’re just not constrained to use only SQL, but can use other query languages to pull information out of data stores. Hence, the term NoSQL.
But Hadoop is not really a database: It stores data and you can pull data out of it, but there are no queries involved – SQL or otherwise. Hadoop is more of a data warehousing system – so it needs a system like MapReduce to actually process the data.
MapReduce runs as a series of jobs, with each job essentially a separate Java application that goes out into the data and starts pulling out information as needed. Using MapReduce instead of a query gives data seekers a lot of power and flexibility, but also adds a lot of complexity.
There are tools to make this easier: Hadoop includes Hive, another Apache application that helps convert query language into MapReduce jobs, for instance. But MapReduce’s complexity and its limitation to one-job-at-a-time batch processing tends to result in Hadoop getting used more often as a data warehousing than as a data analysis tool.