
Hadoop Quickstart


HDFS is a distributed, Java-based file system typically used to store large volumes of data. It is one of several subsystems that make up Hadoop, a framework that provides for parallel and distributed computation on large datasets:

HDFS, a distributed file system that utilizes a cluster of machines to provide high-throughput access to data for Big Data applications.

MapReduce, the distributed processing framework that manages and controls processing across the cluster.

Together, these subsystems allow for the distributed processing of large data sets, scaling from single servers to thousands of machines. Each machine provides local computation and storage managed by the Hadoop software, delivering high availability and high performance without relying on costly hardware-based high availability.
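
For readers new to HDFS, the short Python sketch below illustrates what basic interaction with the file system looks like from a client application. It is only an illustration, not part of Connect CDC SQData, and it assumes the open source WebHDFS-based "hdfs" Python package plus a hypothetical NameNode address, user and path.

# Minimal sketch: write and read a file over WebHDFS with the Python "hdfs" package.
# The NameNode URL, user name and paths are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write a small file; HDFS splits large files into blocks and replicates them
# across DataNodes to provide throughput and availability.
client.write("/data/quickstart/hello.txt", data=b"hello from HDFS", overwrite=True)

# Read the file back.
with client.read("/data/quickstart/hello.txt") as reader:
    print(reader.read())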

This Quickstart illustrates two methods provided by Connect CDC SQData to replicate changes in virtually any source datastore to HDFS targets. Connect CDC SQData treats HDFS as a simple Target Datastore for information captured by any of Connect CDC SQData's Capture agents. The first method utilizes our time-tested, high-performance Capture agents and Apply Engine technology, while the second introduces a new capability that eliminates maintenance of the Apply Engine when table structures change in relational source datastores, beginning with Db2 z/OS.

Both methods simplify the creation, configuration and execution of the replication process by utilizing an Engine that applies data to an HDFS file system. Precisely's Connect CDC SQData product architecture provides for the direct point-to-point (source to target) transfer of captured data without the use of any staging area. When properly configured, captured data does not require, or even land on, any intermediate storage device before being written to the target HDFS file system.

While Connect CDC SQData can be used to implement a solution that customizes the data written to HDFS, neither we nor the industry recommends it. There are several reasons, but the most obvious is the ongoing maintenance requirement. Streaming data to HDFS is fundamentally different from replicating data, for example from mainframe Db2 to Oracle on Linux.

In a Db2 to Oracle use case, it is understood that both source and target table schemas will change and that they may never be identical or even share identical column names. DBA maintenance of the Db2 schemas will be scheduled to minimize disruption of source applications, and the downstream Oracle applications will have to decide if and how to absorb those changes. Oracle DBAs will have to coordinate changes to their schemas with the Db2 DBAs and the Oracle application developers. Consequently, the Apply Engine in the middle will need to have its source Descriptions updated, and possibly the Oracle target table Descriptions updated as well. Mapping Procedures may also have to be changed. Solid configuration management procedures are required to fully implement these changes.

Implementing an HDFS architecture can change all that. Modern HDFS data consumers read JSON formatted data from an HDFS record that also contains the schema describing that data. The biggest issue with this technique, however, is the JSON schema included in the payload of every single HDFS record produced. That problem is solved by the AVRO data serialization system, which separates the data from the schema. Data is read by the consumer using the schema that described the data at the time it was written. This of course assumes that the tools and languages used by the producer and consumer are AVRO "aware".
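
The contrast between the two formats can be sketched in a few lines of Python. The example below is illustrative only; it assumes the open source fastavro package and a made-up "Customer" record, and it shows a JSON payload that must carry its schema on every record versus an AVRO encoding whose schema is held outside the data.

# Illustrative sketch: JSON with an embedded schema versus AVRO with an external schema.
# The record layout and field names are assumptions made for this example.
import io, json
from fastavro import schemaless_writer, schemaless_reader

schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}
record = {"id": 42, "name": "Acme"}

# JSON approach: every record repeats a copy of the schema in its payload.
json_payload = json.dumps({"schema": schema, "payload": record})

# AVRO approach: only the encoded data travels; the schema is kept separately
# (for example in a schema registry) and supplied again when the record is read.
buf = io.BytesIO()
schemaless_writer(buf, schema, record)
buf.seek(0)
decoded = schemaless_reader(buf, schema)

print(len(json_payload), "bytes as JSON,", len(buf.getvalue()), "bytes as AVRO:", decoded)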

Connect CDC SQData, while still supporting HDFS data in JSON format, also supports AVRO as yet another way to improve performance and minimize latency between a captured change and its availability to downstream consumers. With Connect CDC SQData Version 4, we have embraced Confluent's Schema Registry for managing the schema versions of AVRO formatted records, and the Version 4 Apply Engine benefits from the automatic registration of those schemas.
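
To make the registration step concrete, the sketch below shows roughly what registering an AVRO schema with Confluent's Schema Registry looks like over its REST API; this is the kind of call that automatic registration performs on your behalf. The registry URL, subject name and schema are illustrative assumptions, and the example uses the Python requests package.

# Illustrative sketch: register an AVRO schema with a Confluent Schema Registry.
# The registry URL and subject name below are hypothetical.
import json
import requests

REGISTRY = "http://schema-registry.example.com:8081"
SUBJECT = "CUSTOMER-value"

schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

# Register (or look up) the schema under the subject; the registry returns its id.
resp = requests.post(
    f"{REGISTRY}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema)}),
)
print("schema id:", resp.json()["id"])

# A consumer can later fetch the exact version that described the data when it was written.
latest = requests.get(f"{REGISTRY}/subjects/{SUBJECT}/versions/latest").json()
print("latest version:", latest["version"])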

Apply Engines that utilize the REPLICATE function, for Db2 as well as IMS and VSAM source data, will still require manual intervention to replace the source DESCRIPTION parts that correspond to altered schemas or changed "copybooks". Once that has been accomplished, however, the Apply Engine need only be parsed and started, with the registration of the updated AVRO schemas performed automatically. Even Version 4 Apply Engines that have "customized" target DESCRIPTIONS and mapping PROCEDURES will benefit, because the target DESCRIPTIONS used to create the AVRO schemas are automatically validated and registered, if necessary, when the Engine is started.

Apply_Engine_Db2_HDFS

Finally, Connect CDC SQData Version 4 also introduces our revolutionary Replicator Engine for relational source databases, beginning with Db2 12 for z/OS. The Replicator Engine fully automates the propagation of source schema changes and the production of HDFS data using AVRO and the Confluent Schema Registry. The Replicator also supports parallel processing of the replication workload through multiple producer threads, with the number of threads, or workers, specified at run-time. This means that Connect CDC SQData becomes a utility function within the enterprise architecture, reacting to relational schema changes without interruption and without maintenance of the Replicator Engine configuration running in your Linux environment.

Replicator_Engine_Db2_HDFS

Notes:

1. HDFS is only supported on the Linux OS platform. Connect CDC SQData's architecture, however, enables data captured on virtually any platform and from most major database management systems to be loaded into HDFS.

2. To learn more, we recommend both Apache's AVRO documentation and Confluent's Schema Registry Tutorial, both available on the web.