Data Engineering on Google Cloud – An Automated Approach to Replicating Data from Cloud SQL to BigQuery Using Python
Google Cloud offers rich features and products around SQL when your apps are built on MySQL or PostgreSQL. But if you need to replicate data and schema from Cloud SQL (SQL Server) to BigQuery, the options are limited.
- Google Cloud Dataflow doesn’t support Cloud SQL (SQL Server) as a source, per the official docs
- The BigQuery Data Transfer Service doesn’t directly support Cloud SQL (SQL Server) as a source, per the official docs
- Cloud SQL Import/Export doesn’t support BigQuery as a destination, per the official docs
- Google Cloud Data Fusion does support Cloud SQL to BigQuery data replication, but running the process daily/on a schedule becomes very expensive operationally, since it incurs costs for Compute Engine (GCE), Data Fusion, and Dataproc clusters. Please see this very well-written blog here if your project managers are happy to shell out $$$ for this approach :-).
A few more things to note:
- Cloud SQL needs the Cloud SQL proxy to connect to and extract data from the MSSQL instance
- If the Cloud SQL instance is deployed inside a VPC, you need a VPC connector to be able to connect to the MSSQL instance
Why I think Python-based data replication is a better option:
- Operational efficiency, logging, tracing, and overall management can be handled effectively from a long-term perspective.
- Libraries such as SQLAlchemy and pyodbc are available in Python.
- Deployment of the service with the Cloud SQL proxy enabled is achievable through container images and Google Cloud Run. Please have a look at my previous blog on how this can be achieved.
- SQL schema changes can also be replicated into BigQuery with an effective implementation
- SQL queries (inserts/updates/deletes) on BigQuery are cheaper than running the data replication on Data Fusion clusters
- A Python implementation makes the process cloud-agnostic, avoiding dependency on cloud-specific products.
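As an illustration of the library support mentioned above, here is a minimal sketch of building a SQLAlchemy/pyodbc connection URL for a Cloud SQL (SQL Server) instance reached through the Cloud SQL proxy, which listens on a local TCP port. The credentials, database name, and driver version are placeholders — substitute your own.

```python
def build_mssql_url(user: str, password: str, database: str,
                    host: str = "127.0.0.1", port: int = 1433) -> str:
    """Build a SQLAlchemy URL for MSSQL via pyodbc.

    The Cloud SQL proxy exposes the instance on a local host:port,
    so the app connects as if the database were local.
    """
    return (
        f"mssql+pyodbc://{user}:{password}@{host}:{port}/{database}"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

# Pass the result to sqlalchemy.create_engine(...) to open connections.
```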
Implementation steps:
- Create a configuration of tables in a config.py file with a few attributes:
- source and destination table names
- the primary key in the source table, used for updates
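A minimal sketch of what such a config.py could look like — the table and dataset names below are hypothetical examples:

```python
# config.py — one entry per replicated table, mapping the source MSSQL
# table to its BigQuery destination, plus the primary key used to match
# rows for updates and deletes.
TABLES = [
    {
        "source_table": "dbo.customers",       # table in Cloud SQL (SQL Server)
        "dest_table": "replica_ds.customers",  # dataset.table in BigQuery
        "primary_key": "customer_id",          # key column used for updates
    },
    {
        "source_table": "dbo.orders",
        "dest_table": "replica_ds.orders",
        "primary_key": "order_id",
    },
]
```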
- Create templated queries for inserts/updates/truncates/deletes from SQL Server to BigQuery
- Create a templated query that builds the table schema dynamically in BigQuery using the INFORMATION_SCHEMA.COLUMNS view in MSSQL
- Build Python code to dynamically read the schema from Cloud SQL and apply the same to BigQuery
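One way to sketch the schema-introspection step: a templated query against MSSQL's INFORMATION_SCHEMA.COLUMNS view, filled in from a `schema.table` name. The function name and template are illustrative, not the exact code from this project.

```python
# Templated metadata query: every SQL Server database exposes its table
# definitions through the INFORMATION_SCHEMA.COLUMNS view.
SCHEMA_QUERY = """
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '{schema}' AND TABLE_NAME = '{table}'
ORDER BY ORDINAL_POSITION
"""

def build_schema_query(source_table: str) -> str:
    """Split a 'schema.table' name and render the templated query."""
    schema, table = source_table.split(".", 1)
    return SCHEMA_QUERY.format(schema=schema, table=table)
```

The rows this query returns (column name, data type, nullability) are what the replication code translates into a BigQuery table definition.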
- Build a data type mapping from SQL Server to BigQuery using a simple mapping function
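A sketch of such a mapping function, covering common SQL Server types and their BigQuery standard SQL counterparts; extend the table for any additional types your schema uses.

```python
# Common SQL Server data types mapped to BigQuery standard SQL types.
MSSQL_TO_BQ = {
    "int": "INT64",
    "bigint": "INT64",
    "smallint": "INT64",
    "bit": "BOOL",
    "decimal": "NUMERIC",
    "numeric": "NUMERIC",
    "float": "FLOAT64",
    "char": "STRING",
    "varchar": "STRING",
    "nvarchar": "STRING",
    "text": "STRING",
    "date": "DATE",
    "datetime": "DATETIME",
    "datetime2": "DATETIME",
    "uniqueidentifier": "STRING",
}

def to_bq_type(mssql_type: str) -> str:
    # Fall back to STRING for unmapped types rather than failing the load.
    return MSSQL_TO_BQ.get(mssql_type.lower(), "STRING")
```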
- Prepare some sample templated insert/delete/update queries for reference. The attributes from the config.py file above are passed into these templates as parameters to build a dynamic SQL query that can be executed on the MSSQL instance and also on BigQuery.
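The templated DML described above could be sketched like this, with the `dest_table` and `primary_key` attributes coming straight from the config entries (the template strings and helper names are illustrative):

```python
# Templated DML rendered from a per-table config entry.
UPDATE_TMPL = "UPDATE {dest_table} SET {assignments} WHERE {primary_key} = {pk_value}"
DELETE_TMPL = "DELETE FROM {dest_table} WHERE {primary_key} = {pk_value}"

def render_update(cfg: dict, columns: list, pk_value) -> str:
    # Assign every non-key column from a named query parameter (@col).
    assignments = ", ".join(
        f"{c} = @{c}" for c in columns if c != cfg["primary_key"]
    )
    return UPDATE_TMPL.format(dest_table=cfg["dest_table"],
                              primary_key=cfg["primary_key"],
                              assignments=assignments,
                              pk_value=pk_value)

def render_delete(cfg: dict, pk_value) -> str:
    return DELETE_TMPL.format(dest_table=cfg["dest_table"],
                              primary_key=cfg["primary_key"],
                              pk_value=pk_value)
```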
Deployment:
- Any deployment pipeline that can push the Python code as a container image to Google Container Registry.
- Use a Cloud Run service to deploy the container image and run the service as an API
- Use a VPC connector so the Cloud Run service can connect to Cloud SQL through the Cloud SQL proxy running in the Docker container
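As a rough sketch of those deployment steps, assuming a hypothetical project `my-project`, service name `sql-to-bq-replicator`, and a pre-created VPC connector named `sql-connector`:

```shell
# Build the container image and push it to Google Container Registry.
docker build -t gcr.io/my-project/sql-to-bq-replicator .
docker push gcr.io/my-project/sql-to-bq-replicator

# Deploy to Cloud Run, attaching the VPC connector so the service can
# reach the Cloud SQL (SQL Server) instance via the SQL proxy.
gcloud run deploy sql-to-bq-replicator \
  --image gcr.io/my-project/sql-to-bq-replicator \
  --region us-central1 \
  --vpc-connector sql-connector
```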
Overall, the requirement may look like a trivial task, but implementing it in Python gives us the benefit of making the process both operationally effective and cost-effective. It also makes moving the same process to another cloud easy, since it has no dependency on cloud-specific tooling.
Hope this is helpful. Happy Learning!!