What is Big Data and why is it buzzing around so much?
The term big data refers to a collection of data so massive, fast-moving and complex that it is next to impossible to process with existing or traditional methods.
But the most widely cited definition of Big Data is Gartner's, which states that "Big Data is data that contains greater variety, arriving in increasing volumes and with ever-higher velocity." These are known as the three 'V's: the large Volume of data in many systems, the wide Variety of data types stored in big data systems, and the Velocity at which the data is created, gathered and processed.
Given current requirements, the use of big data enables companies to become progressively more customer-centric.
When was the term Big Data first used, and why?
According to researchers at the Oxford English Dictionary, sociologist Charles Tilly was among the very first to use the term Big Data, in a 1980 article.
Later, in 1990, Peter Denning wrote about building machines that could recognise or forecast patterns in data.
Then, in 1997, Michael Cox and David Ellsworth used the term big data for the first time in an ACM paper. In 1998, John Mashey of SGI was credited as the first to spread the term and concept of Big Data.
Later, in 2000, Francis Diebold described big data as "the explosion in quantity and sometimes quality of available and potentially relevant data".
In 2001, industry analyst Doug Laney came up with the three 'V's of Big Data. In 2005, Tim O'Reilly published "What Is Web 2.0?", where the term Big Data appeared in its modern context.
Also in 2005, Hadoop was created at Yahoo!, built on top of Google's MapReduce.
Later in 2008 Google processed 20 petabytes of data in a single day.
In 2013, 4.4 zettabytes of information was produced worldwide.
From 2016 until now, businesses have been adopting the latest big data technologies, such as in-memory technologies, to take advantage of big data.
The term big data is no longer what it used to be. It has become more important than ever, and this fact demands an answer to a major question…
What size of data is considered big enough to be termed Big Data?
There is no absolute threshold for the term Big Data; it is relative. It is possible that 50-60 terabytes of data counts as big data for a start-up but not for multinational technology companies like Google and Facebook, because they already have the storage and processing infrastructure to handle that amount. In short, Big Data is data just beyond a particular technology's or company's capacity to store, manage and process efficiently.
The three ‘V’s of Big Data…
According to Doug Laney, velocity is becoming more of a factor as organisations look to automate more and more operational decisions in real time.
Volume: As the name suggests, volume is the amount of data, and the amount of data matters. For some bigger organizations, this might be hundreds or even thousands of petabytes. With big data, high volumes of data need to be stored that cannot be processed on a single machine.
Velocity: Velocity is the rate at which data is received, which can be very high and may demand a response in real time.
Variety: The data arrives in many formats or types and from a lot of different sources.
Why is Big Data a problem for bigger companies, and what challenges do they face?
Storing such an enormous amount of data is itself a big challenge, and on top of that, this data needs to be served in real time. Big companies like Google, Facebook and Amazon receive and send, on average, hundreds of petabytes, even exabytes, of data.
So is it possible to have such massive capacity in a single device? It may be possible to build a hard drive of such vast capacity, but the rate of data transfer also matters. Imagine searching for a pasta recipe online and the results showing up after days. Would that be of any use to you? No, right? Similarly,
if we use a single device to store a huge amount of data, it will slow down sending and receiving data to an extreme degree. The data stored by these companies must be processed at a very fast rate, so a single device is not the ideal solution to this problem.
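A rough back-of-envelope sketch makes the single-device bottleneck concrete. The disk speed and node count below are assumed figures for illustration, not numbers from any real system:

```python
# Back-of-envelope sketch (assumed figures): time to read 1 PB
# sequentially from one disk vs. in parallel from many nodes.

PB = 10**15                      # 1 petabyte in bytes
disk_speed = 200 * 10**6         # assumed single-disk rate: ~200 MB/s

single_disk_days = PB / disk_speed / 86400
print(f"1 disk:     {single_disk_days:.0f} days")       # roughly 58 days

nodes = 1000                     # assumed cluster of 1,000 nodes reading in parallel
parallel_minutes = PB / (disk_speed * nodes) / 60
print(f"{nodes} nodes: {parallel_minutes:.0f} minutes") # roughly 83 minutes
```

Even with generous assumptions, one disk takes weeks where a modest cluster takes about an hour, which is why the next section turns to distributed storage.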
So what do we do now? Is there any other way to handle this problem? The answer is yes: it can be solved with the distributed storage concept, using the master-slave architecture.
Distributed storage concept
Distributed storage is a set of computers where information is stored across various physical servers, or nodes. This concept is implemented with the help of the master-slave architecture. In this architecture, many physical computers or servers, known as slaves, are connected to a main server computer, known as the master, and run in parallel with each other.
Each slave computer, or data node, contributes its storage capacity to the cluster, while the master computer, or name node, keeps track of where each piece of data resides. This is how the storage adds up to the required capacity, and the rate of data transfer also becomes faster, because in the same amount of time a large volume of data can be transferred from multiple nodes at once. Software such as Hadoop performs this coordination.
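The master-slave idea above can be sketched as a toy in-memory model. This is an illustration of the concept only, not real HDFS; the class names, the 4-byte block size and the round-robin placement are all assumptions made for brevity:

```python
# Toy sketch of master-slave storage (not real HDFS): a "name node"
# records which "data node" holds each block of a file.

BLOCK_SIZE = 4  # bytes per block; tiny on purpose, for illustration

class NameNode:
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes      # node name -> that node's block store
        self.block_map = {}               # filename -> [(node, block_id), ...]

    def write(self, filename, data):
        # Split the file into fixed-size blocks and spread them round-robin.
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        names = list(self.data_nodes)
        locations = []
        for i, block in enumerate(blocks):
            node = names[i % len(names)]
            self.data_nodes[node][(filename, i)] = block   # slave stores the data
            locations.append((node, i))
        self.block_map[filename] = locations               # master stores only metadata

    def read(self, filename):
        # Reassemble the file by fetching each block from its data node.
        return b"".join(self.data_nodes[node][(filename, i)]
                        for node, i in self.block_map[filename])

nodes = {"node1": {}, "node2": {}, "node3": {}}
nn = NameNode(nodes)
nn.write("recipe.txt", b"pasta with tomato sauce")
print(nn.read("recipe.txt"))   # file comes back whole, though no single node has it all
```

The key design point mirrors the text: the master holds a small map of block locations, while the bulk of the bytes lives on the slaves and can be read from several of them at the same time.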
What is Hadoop and how does it work?
Hadoop is a collection of open-source software for solving big data storage and processing problems. It works on the master-slave topology, or architecture.
Hadoop lets you easily use all the storage and processing capacity of a cluster of computers.
Hadoop uses HDFS (the Hadoop Distributed File System). HDFS provides better data throughput than conventional file systems: because data is spread across many slave nodes, reads and writes proceed in parallel, so the aggregate transfer rate is much higher. Many big companies like Amazon and Facebook use Hadoop or some other kind of distributed file system.
Amazon EMR is based on Apache Hadoop.
With millions of users and more than a billion page views every day, Facebook accumulates a huge amount of data. Facebook has the largest Hadoop cluster in the world, consisting of more than 4,000 connected computers.
Apache Hadoop's MapReduce and HDFS components were inspired by Google's papers on MapReduce and the Google File System.
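The MapReduce model that Hadoop implements can be illustrated with a minimal in-memory word count. This is a sketch of the programming model only, not Hadoop's actual API; the function names and sample lines are invented for the example:

```python
from collections import defaultdict
from itertools import chain

# Minimal in-memory MapReduce sketch (illustration only, not Hadoop's API):
# counting words across several lines of text.

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)

lines = ["Big data needs big storage", "big clusters process big data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])   # "big" appears four times across the two lines
```

In a real cluster, the map calls run on the nodes that hold each chunk of input, and the shuffle moves grouped pairs between machines; the three-phase structure, however, is exactly the one shown here.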