Google’s new product (Is it really?) – Datastream

Google introduced Datastream yesterday during the “Data Cloud Summit”. Google’s official documentation defines Datastream as:

“Datastream is a serverless and easy-to-use change data capture (CDC) and replication service. It allows you to synchronize data across heterogeneous databases and applications reliably, and with minimal latency and downtime.”

https://cloud.google.com/datastream/docs/overview

Reliability, performance (minimal latency), and availability (minimal downtime) are attributes that can be taken for granted with the major cloud providers. In fact, I urge all of them to stop leaning on these adjectives/quality attributes when defining their services, because these attributes have become basic expectations. Dropping them helps developers focus on what exactly the service does. Alright! Enough ranting here; let me move on to my next rant.

I actually experimented with Datastream before writing this article and let me tell you – it’s extremely simple to use, unlike many other Google Cloud products. And it does the job perfectly well. What job? It lets developers connect to source databases (Oracle or MySQL) and replicate the data into Google Cloud Storage. That’s it. The beauty lies in the simplicity. Every time an insert, update, or delete occurs in your source, Datastream captures that change and replicates it to Cloud Storage (the destination). You may ask, “why store data in Google Cloud Storage?”. There are plenty of use cases, and I’ll probably explore that subject some other time. For the time being: Datastream captures data changes in your source database and replicates them to Cloud Storage. You can also configure Datastream to take a snapshot (backfill) of the entire database and replicate it to Cloud Storage.
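To give a concrete feel for what lands on the destination side, here is a minimal sketch of reading the replicated change events back out of Cloud Storage with the standard google-cloud-storage client. The bucket name and prefix are placeholders of my own, and it assumes the stream was configured with JSON (rather than Avro) output:

```python
# A minimal sketch of reading Datastream's output back out of Cloud Storage.
# Assumptions: the stream was configured with JSON output, and the bucket /
# prefix below are placeholders you would have chosen when creating the stream.
import json
from google.cloud import storage

BUCKET = "my-datastream-bucket"      # placeholder destination bucket
PREFIX = "orders/"                   # placeholder path for one source table

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    # Each file contains newline-delimited change events captured from the source.
    for line in blob.download_as_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        print(event)                 # inspect the change payload and its metadata
```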

Some cool features – create, pause, restart, delete, or simply monitor your streams from a visually rich dashboard.

Now, coming to the tone of skepticism in the title of this post: there are a bunch of third-party products, including open-source ones, that do exactly the same thing. Pull data from one source and push it to the next. Of course, you will probably need to spend a lot of time on configuration, and probably write scripts to provision more resources when required. Google also has an amazing product (service) called “Dataflow” that is widely used to build data pipelines. It connects to data sources and lets you write transformation operations before storing the results in Cloud Storage. It’s one of those all-powerful distributed, fault-tolerant, fully managed ETL tools. In some respects, Datastream’s functionality is a subset of Dataflow’s.
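For contrast, this is roughly what a “just move and lightly transform data” job looks like as a Dataflow (Apache Beam) pipeline. It is a sketch, not a production pipeline: the project, region, gs:// paths, and the trivial uppercase transform are placeholders of my own.

```python
# Sketch of a minimal Dataflow (Apache Beam) pipeline: read, transform, write.
# Project, region, and gs:// paths are placeholders, not real resources.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Transform" >> beam.Map(str.upper)   # stand-in for any per-record transform
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part")
    )
```

Even this trivial pipeline involves code, a runner, and worker provisioning – exactly the overhead Datastream sidesteps when all you want is replication.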

So, why should one really use Datastream and not Dataflow? As of now, I see a few reasons:

  1. Datastream is the best option if all you care about is capturing data changes and backfilling from Oracle or MySQL into Cloud Storage. Maybe in the future they will expand to other source databases, and I hope Google is altruistic enough to allow the data to be moved to non-Google storage like AWS S3.
  2. It’s simpler, as you’re not really building a pipeline. No coding required.
  3. It’s way cheaper. Dataflow pricing is based on hourly usage of the compute environment and the amount of data processed. Datastream pricing is based on the amount of data transferred from the source system. Both products also charge for storage. A quick back-of-the-envelope calculation (sketched below) suggests Datastream comes out cheaper than Dataflow for the same workload (10 GB+ of data). I will upload a proper price calculator in a different post.
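For the curious, here is the shape of that back-of-the-envelope calculation. The rates are deliberately made-up placeholders (not actual GCP prices); the point is only that the two services are metered on different dimensions:

```python
# Illustrative cost comparison only -- the rates below are placeholders.
# Substitute the current numbers from the Dataflow and Datastream pricing pages.
GB_PER_MONTH = 10                      # the 10 GB+ workload from the example above

# Dataflow is metered on worker resources per hour (plus data processed).
DATAFLOW_WORKER_HOURS = 24 * 30        # assume one small worker running all month
DATAFLOW_RATE_PER_HOUR = 0.07          # placeholder $/worker-hour
dataflow_cost = DATAFLOW_WORKER_HOURS * DATAFLOW_RATE_PER_HOUR

# Datastream is metered on data transferred from the source system.
DATASTREAM_RATE_PER_GB = 1.00          # placeholder $/GB
datastream_cost = GB_PER_MONTH * DATASTREAM_RATE_PER_GB

print(f"Dataflow   ~${dataflow_cost:.2f}/month")
print(f"Datastream ~${datastream_cost:.2f}/month")
# Storage costs apply to both and are left out of this sketch.
```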