Imagine you have 100 MB of data stored in a structured way (an RDBMS) and you need to process it. The easiest option is to process it on your personal computer, because a PC has no trouble with data at this scale; in fact, a PC can comfortably handle up to a few GB of data.
But what happens when:
1. Data grows exponentially and you approach the limits of your computer?
2. Data arrives in unstructured form?
3. Data becomes a burden to your IT infrastructure?
Management wants to derive information from both relational and unstructured data. The answer is Hadoop. Hadoop is an open source project of the Apache Software Foundation, written in Java, and created by Doug Cutting, who named it after his son's toy elephant. Hadoop uses MapReduce and the Google File System (GFS) as its technological foundations. It is designed for distributed processing across a cluster of machines rather than parallel processing on a single machine. It is optimized to handle massive quantities of data, which could be structured (like an RDBMS), unstructured (tweets, Facebook comments, etc.) or semi-structured, using commodity hardware, that is, relatively inexpensive computers. Hadoop replicates its data across different computers, so that if one goes down, the data can still be processed on one of the replicated copies.
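To make the MapReduce idea concrete, here is a minimal sketch of the classic word count job written against Hadoop's Java MapReduce API. The class name WordCount and the input/output paths passed on the command line are illustrative, not something defined above; the point is simply how work is split into a map phase (emit a count per word on each chunk of data) and a reduce phase (sum the counts for each word across the cluster).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper reads a chunk of the input and emits (word, 1).
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive together and are summed.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Once packaged into a jar, a job like this is typically submitted with something along the lines of `hadoop jar wordcount.jar WordCount /input /output`, where the input and output are directories in HDFS; Hadoop then distributes the map tasks to the nodes holding the data blocks and handles replication and failures transparently.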