Submitted By Binbin Li
Senior Data Analyst, RESA
IBM estimates that we create over 2.5 quintillion bytes of data each day, and that 90% of the world's data was created in the last two years alone(1). Let's take a moment to appreciate that number. Quintillion. We don't make a habit of using the quantity quintillion, probably because a quintillion is equal to a billion billion, or 1,000,000,000,000,000,000 (yes, that would be 18 zeros). Multiply that amount by 2.5, and that's about one new Google (i.e., the amount of data handled by Google search) every four days. All of the earth's oceans contain 352 quintillion gallons of water; if bytes were buckets, it would only take about 20 weeks of information gathering to fill the seas, based on estimates from Bloomberg(2). Okay, I am not going to beat a dead horse, but that's a tremendous amount of information.
With more data come more challenges. Data provides invaluable insights, but only if we handle it appropriately. For the most part, that means processing a large volume of data in a short period of time. Our world does not have the luxury of waiting hours to sift through data; we need insights ten minutes ago. We are seeing a massive shift in the industry: companies are moving away from models driven by IT toward models driven by business insights (AKA data-driven decisions). Businesses across all industries need to know what trends are occurring, why those trends are happening, and what changes to make going forward. Companies that understand their data make better business decisions and become more resilient.
Developers at Apache have heard our cries for more efficient data processing and answered our prayers with Hadoop, an open source framework whose storage layer is modeled on the Google File System. Hadoop is designed to distribute and process big data sets swiftly. How does Hadoop do this? Simple: by parallel computing across multiple nodes (the individual machines in a cluster). Okay, maybe not so simple. Let's unpack that idea a little bit.
Imagine you are in a supermarket, completing your weekly shopping. When you approach the checkout line, you realize you are living an actual nightmare: there's only one cashier open. This is how bytes of information feel every time they are funneled through a single-processor machine, like your personal computer. However, I'd like to think bytes have more patience. Now, imagine you are in the same supermarket, but this time there are six cashiers open and ringing up shoppers simultaneously. This time, you proceed through the checkout line quickly. Hadoop capitalizes on the second supermarket's model by employing multiple cashiers, or processors, to handle the volume of shoppers, or bytes, simultaneously through multiple checkout lines, or parallel computers.
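To make the checkout analogy concrete, here is a minimal sketch in plain Python (not Hadoop itself, which is a Java framework) of six "cashiers" counting words from the same pile of data at the same time. The lane count and the toy text are my own invention, purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # One "cashier": tally the words in its own share of the data.
    return len(chunk.split())

# The pile of shoppers: a toy body of text (6,000 words).
text = ("big data needs parallel processing " * 1200).strip()
words = text.split()

# Deal the words out to six lanes, like six open registers.
n_lanes = 6
chunks = [" ".join(words[i::n_lanes]) for i in range(n_lanes)]

# All six lanes ring up their shoppers simultaneously.
with ThreadPoolExecutor(max_workers=n_lanes) as pool:
    counts = list(pool.map(count_words, chunks))

total = sum(counts)  # combine the per-lane tallies
print(total)
```

Real Hadoop jobs follow the same split-work-then-combine shape, just spread across many physical machines instead of threads on one.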
Now that we are all familiar with the idea of multiple node processing and parallel computing, we need to look critically at Hadoop.
- Cost effective. Hadoop is free, open source software that uses readily available commodity computers to store and analyze data. Additionally, Hadoop can be used in a heterogeneous cluster. For those of us who speak English, that means the nodes (machines) in the same cluster do not have to be identical; they can differ in hardware and operating system, each with its own benefits.
- Scalability and large storage capacity. Using multiple nodes to store data means increasing storage capacity is as simple as adding another node ("computer"). Hadoop's storage is limited only by the number of nodes in the cluster. Additionally, Hadoop is built so that it can increase its processing power by adding more nodes. In our grocery store model, opening more registers (nodes) means more people can be served, and customers with larger amounts of groceries can be processed.
- Data resiliency and adaptability. Apache has integrated automatic failover management into Hadoop. If a node fails during processing, Hadoop automatically reassigns its work to a healthy node. I get that this does not mean a whole lot to most of us, but it is critical to handling big data. Let me take you back to the grocery store for another analogy. Ever been in line when the customer in front of you needs a price check? It usually takes just short of forever for the price to be confirmed. Let's consider this a failure of the cashier (or node). To combat this failure, Hadoop automatically detects that a register has stalled and sends another cashier to take over; the failed system is replaced with a working one, no intervention required. As if that were not enough of a failsafe, Hadoop also keeps copies of the data on multiple nodes (three by default) to prevent data loss due to system interruption.
- Computing power. Traditionally, data is sent to the computer that has the appropriate software. However, when data sets begin to exceed a terabyte (think about all the data in your smartphone, then multiply it by about 20; that's a terabyte of data) in size, moving the data itself becomes the bottleneck. Hadoop solves this problem by sending the computation to the nodes where the data already lives, rather than shipping the data to the software.
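The failover-and-replication idea in the resiliency bullet can be sketched in a few lines. This is a toy model, not Hadoop's actual API; the node names, the `store`/`read` helpers, and the five-node cluster are all invented for illustration:

```python
import random

REPLICATION = 3  # mirrors HDFS's default replication factor
nodes = {f"node-{i}": {} for i in range(5)}  # five toy storage nodes

def store(block_id, data):
    """Copy one block of data onto REPLICATION distinct nodes."""
    for name in random.sample(sorted(nodes), REPLICATION):
        nodes[name][block_id] = data

def read(block_id, failed_nodes):
    """Failover read: skip dead nodes and use any surviving replica."""
    for name, blocks in nodes.items():
        if name not in failed_nodes and block_id in blocks:
            return blocks[block_id]
    raise IOError(f"all replicas of {block_id} are gone")

store("block-42", "customer purchase records")
# Knock out one node mid-read: with three replicas on five nodes,
# at least two copies always survive a single failure.
data = read("block-42", failed_nodes={"node-0"})
print(data)
```

With three copies spread across the cluster, losing any single machine never loses the block, which is exactly the point of the price-check analogy.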
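The "move the code to the data" idea in the computing power bullet can be sketched the same way. Again, every name here is illustrative, not part of Hadoop:

```python
# Each "node" holds its own shard of the data locally.
node_shards = {
    "node-0": [3, 1, 4, 1, 5],
    "node-1": [9, 2, 6, 5, 3],
    "node-2": [5, 8, 9, 7, 9],
}

def local_task(shard):
    # A small function shipped out to each node; it runs where the
    # data already lives, so no shard ever crosses the network.
    return sum(shard)

# Each node reduces its own shard and sends back only a tiny result.
partial_results = {name: local_task(shard) for name, shard in node_shards.items()}

# The final combine step moves kilobytes, not terabytes.
total = sum(partial_results.values())
print(total)
```

Shipping a few bytes of code and a few bytes of results is vastly cheaper than dragging terabytes of raw data to a central machine, which is why this pattern scales.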
While Hadoop has tremendous strengths, I will admit that I am disappointed in Apache when it comes to data security. Hadoop ships with few security measures enabled by default. As a result, developing security policies is a critical and necessary step when deploying Hadoop. Now, it's not all bad news, because companies like Zettaset have built security products on top of Hadoop to provide the necessary protocols(3).
In an industry that Forbes projects will surpass $200 billion in revenue by 2020, it is critical that we start handling our data swiftly and intelligently(4). Tools like Hadoop are pioneering the big data frontier by changing the way we consider and process data. We are on the edge of a big data revolution. Are you diving in?