Big data offers a new style of data analysis that differs from traditional business intelligence (BI). Let me explain. Traditional BI projects required business users to know the questions they wanted to ask before they started a project. Their questions drove the data model for the data warehouse and determined how the data would be stored. The data model also determined which data was collected and by which mechanism. Creating the enterprise data model this way could be a lengthy process. In many cases, it resulted in a system that was slow to adapt to changes within the business.
Now, with big data, data analysis is happening from the bottom up. Organizations are collecting as much data as they can from many sources without knowing beforehand exactly what questions they are going to ask of it. This means that they don’t need to transform data into the standard data model of the corporate data warehouse at the point of collection. Instead, the data is stored in the form in which it was originally captured and is given an appropriate structure only by the analysis process that uses it. This flexibility makes data analysis more dynamic and lets organizations react quickly to rapid changes within their business.
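To make the idea of structuring data only at analysis time more concrete, here is a minimal sketch in Python. The sample events, field names, and helper functions (`read_with_schema`, `revenue_by_user`, `page_views`) are all hypothetical and chosen for illustration; a real deployment would read from files or a distributed store rather than an in-memory list.

```python
import json

# Raw events kept exactly as captured (hypothetical sample data).
# Note that the records do not share a common set of fields.
RAW_EVENTS = [
    '{"user": "alice", "action": "click", "page": "/home", "ms": 120}',
    '{"user": "bob", "action": "purchase", "amount": 19.99}',
    '{"user": "alice", "action": "purchase", "amount": 5.00, "coupon": "X1"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a structure at read time by projecting each raw record
    onto the fields that a particular analysis cares about."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record come back as None.
        yield {name: record.get(name) for name in fields}


def revenue_by_user(raw_lines):
    """One analysis: total purchase amount per user."""
    totals = {}
    for row in read_with_schema(raw_lines, ["user", "action", "amount"]):
        if row["action"] == "purchase":
            totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


def page_views(raw_lines):
    """A different analysis over the same raw data: pages that were clicked."""
    rows = read_with_schema(raw_lines, ["action", "page"])
    return [row["page"] for row in rows if row["action"] == "click"]


if __name__ == "__main__":
    print(revenue_by_user(RAW_EVENTS))
    print(page_views(RAW_EVENTS))
```

Both analyses read the same untouched records, and each one imposes only the structure it needs. A new question can be asked later without redesigning how the data was stored.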
- The first rule of the process is to assess real-time needs versus batch needs. Three Vs have driven Hadoop’s adoption: volume, when data grows faster than a database can handle; variety, when correlations or signals must be found across many sources, including video, audio, and other non-relational data types; and velocity, when responses are required in under a millisecond. In these use cases, the real-time need has to be reviewed again as new data sources are added, which leads to the next rule.
- The second rule is to be prepared to ingest new streams of data into Hadoop. Data capture needs to be scalable, which requires not only more than one server to handle the workload but also the right tools to handle hundreds or thousands of sourcing mechanisms.
- The third rule is to manage how data is ingested and merged into Hadoop. When collecting at high data rates or volumes, it’s not realistic to store first and process later, because processing one day of data can take more than one day. People live in a world where search engines and other systems give them real-time analytics, sampling, and aggregations in a web browser. Businesses are no different. After the landing zone for the data is built, they need to leverage data wrangling and transformation tools.
- The fourth rule is to define how to ingest and transform data in real time. This process is often associated with the data warehouse in structured analytical scenarios. In the data modeling methods and structures required by big data environments, however, data vault modeling and a focus on largely denormalized data structures have prevailed.
- The fifth and final rule is to analyze, enhance, and leverage the data.
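
The third rule's point about not storing everything first can be sketched as aggregation at ingest time. The example below is a minimal illustration in plain Python, not a Hadoop API: the class name `TumblingWindowCounter`, the 60-second window, and the source labels are all assumptions made for the sketch. It keeps only per-window counts, so the running summary is available while data is still arriving.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per source in fixed, non-overlapping time windows.

    Only the counts are retained, not the raw events, so memory use
    depends on the number of windows and sources rather than on the
    number of events ingested.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # Maps window start time -> source -> event count.
        self.counts = defaultdict(lambda: defaultdict(int))

    def ingest(self, timestamp, source):
        """Assign one event to its window and update the running count."""
        window_start = int(timestamp // self.window_seconds) * self.window_seconds
        self.counts[window_start][source] += 1

    def snapshot(self):
        """Return the current aggregates as plain dicts, ordered by window."""
        return {start: dict(per_source)
                for start, per_source in sorted(self.counts.items())}


if __name__ == "__main__":
    counter = TumblingWindowCounter(window_seconds=60)
    # Hypothetical (timestamp in seconds, source) pairs.
    stream = [(0.5, "web"), (30, "web"), (59.9, "mobile"),
              (60, "web"), (125, "sensor")]
    for timestamp, source in stream:
        counter.ingest(timestamp, source)
    print(counter.snapshot())
```

In a real pipeline the same pattern would typically run inside a stream-processing tool in front of the landing zone, with the raw data still written to storage for later bottom-up analysis.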