There is one remaining technical challenge to overcome, and it becomes especially acute in a big data environment where there is a continuous need to share information among different analytical platforms. Business analysts seeking to answer an iterative set of questions will probably need simultaneous access to analytical results, source data sets, and profiles stored in a data warehouse. However, with different latencies associated with different systems, there are bound to be consistency and synchronization issues. This reflects yet another of the technical objectives for information delivery: maintaining consistency.
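To make the synchronization issue concrete, here is a minimal sketch (in Python, with hypothetical platform names and timestamps) of one way to detect when different platforms' views have drifted apart: compare each system's latest-change watermark against the freshest copy and flag anything that lags beyond a tolerance.

```python
from datetime import datetime, timedelta

# Hypothetical latest-change timestamps reported by each platform.
# In practice these would come from each system's metadata or audit columns.
watermarks = {
    "source_oltp": datetime(2013, 5, 1, 12, 0, 5),
    "data_warehouse": datetime(2013, 5, 1, 11, 45, 0),
    "analytics_cluster": datetime(2013, 5, 1, 10, 30, 0),
}

def drift_report(watermarks, tolerance=timedelta(minutes=15)):
    """Flag platforms whose view lags the freshest copy by more than tolerance."""
    freshest = max(watermarks.values())
    return {
        name: freshest - ts
        for name, ts in watermarks.items()
        if freshest - ts > tolerance
    }

print(drift_report(watermarks))
# Flags data_warehouse and analytics_cluster: their views are stale
# relative to the source, so cross-platform queries may be inconsistent.
```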
In our previous entry, we discussed how data latency issues can be remediated using replication and change data capture (CDC), which addresses the need to speed information delivery to the analytical platform, especially in the world of big data. However, we still need to address the other side of the process: broadening access to the variety of data sources that feed our large-scale analytics. This is a particularly thorny issue when considering data embedded in older legacy systems whose architectures predate relational database systems. In addition, the growing demand for streamed data and for raw data sets stored in files adds a further requirement: absorbing many different types of source data.
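As a rough illustration of what broadening accessibility entails, the sketch below (with hypothetical file names and field layouts, not any particular product's API) normalizes records from two very different source formats into a single stream that one loading pipeline can consume.

```python
import csv
import json

def read_csv_source(path):
    """Yield records from a flat-file export, e.g. from a legacy system."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row

def read_json_lines(path):
    """Yield records from a newline-delimited JSON feed."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def load_all(sources):
    """Consume every source through the same uniform interface."""
    for reader in sources:
        for record in reader:
            yield record

# Usage (hypothetical files):
# records = load_all([read_csv_source("legacy_export.csv"),
#                     read_json_lines("events.jsonl")])
```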
Today, big data environments are the beneficiaries of data replication and CDC technologies. As we have discussed, the most significant bottleneck for big data applications is moving the data into the environment. Replication provides a simple yet rapid method for initial loads (thereby addressing our first technical objective, reducing data latency), while CDC ensures that the big data environment's view remains consistent with the source data systems (addressing our third technical objective, maintaining consistency).
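Schematically, the two phases work as in the sketch below, which assumes a hypothetical change feed of (operation, key, row) tuples; real CDC products read changes from the database transaction log rather than from an in-memory list.

```python
def initial_load(source_rows):
    """Phase 1: bulk-replicate the source into the target, keyed by id."""
    return {row["id"]: row for row in source_rows}

def apply_changes(target, change_feed):
    """Phase 2: keep the target consistent by applying captured changes in order."""
    for op, key, row in change_feed:
        if op in ("insert", "update"):
            target[key] = row
        elif op == "delete":
            target.pop(key, None)
    return target

target = initial_load([{"id": 1, "qty": 10}, {"id": 2, "qty": 5}])
apply_changes(target, [("update", 1, {"id": 1, "qty": 12}),
                       ("delete", 2, None)])
print(target)  # {1: {'id': 1, 'qty': 12}} -- target now mirrors the source
```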
In our previous set of entries, we concluded that to support the emerging data availability requirements of big data, the corresponding technical objectives for high performance information delivery must be driven by three key goals: reducing latency, broadening data accessibility, and maintaining data consistency. In this series of notes, we will look at the technologies that help achieve those objectives by reducing or smoothing out data latency, enabling access to many data sources, and generally providing a seamless capability for provisioning large data sets to the right targets within a reasonable time frame.
In my previous posts, we looked at the two sets of issues that drive the technical objectives for high performance information delivery: data latency and data accessibility. In this final post, we wrap up the series by looking at data consistency and synchronization.
There is a wide assortment of data sources, ranging from legacy, difficult-to-extract systems all the way to unstructured data streams whose characteristics can change rapidly, so improving access to these different sources is critical. Even without considering the opportunities for absorbing and making sense of unstructured data, the variety of potential sources of structured data is quite wide. Organizations have been capturing and archiving data for decades, using a succession of data storage models and organization schemes of ever-increasing complexity.
If delays in information delivery pose the bottleneck, then reducing that latency should ease it. This implies determining the root causes of data latency and then proposing ways to eliminate them. I think the causes are fairly straightforward, and I will list some of them in the order in which data moves from the source to its target; the sketch below shows one way to locate where that latency accumulates.
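As a starting point for that diagnosis, a sketch like the following (with stubbed stages standing in for real extract, transfer, and load steps) times each leg of the journey so you can see which stage contributes most to the delay.

```python
import time

def extract():
    time.sleep(0.2)   # stand-in for querying the source system
    return list(range(1000))

def transfer(data):
    time.sleep(0.5)   # stand-in for moving data across the network
    return data

def load(data):
    time.sleep(0.1)   # stand-in for writing into the target platform
    return len(data)

def timed(stage):
    """Run a stage and report how long it took."""
    start = time.perf_counter()
    result = stage()
    return result, time.perf_counter() - start

data, t_extract = timed(extract)
data, t_transfer = timed(lambda: transfer(data))
count, t_load = timed(lambda: load(data))
print(f"extract={t_extract:.2f}s transfer={t_transfer:.2f}s load={t_load:.2f}s")
```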
In our last series of blog posts, we examined the business implications of a bottleneck in delivering information to the target locations for analysis, especially in the context of big data applications. A summary review of the business impacts highlights how they are directly related to three key characteristics of the information delivery bottleneck: data latency, data accessibility, and data consistency.
The appearance of “big data” in the pages of the New York Times and the Wall Street Journal signals the mainstreaming of high performance computing. Yet the focus continues to be on the magic, not the mechanics, and clumsy attitudes about information availability will kill the “big data buzz” long before the movement gains any traction.
To a large extent, the messages of the big data movement are clearly targeted at aspirants in the analytics space. By that I mean the expectation that engaging some data scientists to build a big data analytics platform, and setting them loose on analysis projects by scraping data sets from various sources, will open new vistas for business improvement.