Data Engineering on Google Cloud: Python, Docker, and Cloud SQL for SQL Server

It feels good to be back blogging again! I was recently fortunate to get a chance to work on Google Cloud, and I found it super interesting, hence the urge to start writing again. I hope this is useful for anyone looking for a serverless data engineering solution on GCP.

Google Cloud offers three flavors of Cloud SQL: PostgreSQL, MySQL, and SQL Server, all fully managed relational databases in the cloud. If your application or data engineering pipelines use SQL Server as the backend and you are looking to automate the process using Python, this post can help.

If you plan a long-term investment in Google Cloud and want a data engineering solution that is independent of cloud-specific data transfer tooling for Cloud SQL and BigQuery, implementing a Python-based data transformation layer helps a lot.

Before we move to the intricacies of how data engineering can be implemented on GCP, let's take a look at how to connect to Cloud SQL for general testing.

Local Connection using Cloud SQL Proxy

Having worked on Microsoft technologies for a long time, when I first looked at Cloud SQL I assumed it would work the same way as Azure SQL (PaaS), but there are some notable differences in how the DB instances are securely accessed and connected.

A few points to note about connecting to a DB instance from your local PC:

  • Cloud SQL doesn’t require you to whitelist IP addresses or IP ranges, which removes the need for a static IP.
  • The Cloud SQL Auth proxy (which runs as a background process) provides secure access to your instances without the need for authorized networks or for configuring SSL.
  • Cloud SQL can still be accessed through SQL Server Management Studio (SSMS), but this requires the Cloud SQL Auth proxy.
  • Enable Trusted connection in the SSMS client connection window.
  • The Cloud SQL Auth proxy automatically encrypts traffic to and from the database using TLS 1.2 with a 128-bit AES cipher.

While the Cloud SQL Auth proxy can listen on any port, it only creates outgoing connections to your Cloud SQL instance on port 3307. If your client machine has an outbound firewall policy, make sure it allows outgoing connections to port 3307 on your Cloud SQL instance’s IP.

[Diagram: how the Cloud SQL Auth proxy connects client software to the Cloud SQL instance]
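With the proxy listening locally, the client simply connects to localhost. A pyodbc connection string for that setup can be sketched as below; the host, port, and credential values are placeholder assumptions, not from the original setup:

```python
# Sketch: build a pyodbc connection string for Cloud SQL for SQL Server
# reached through the Cloud SQL Auth proxy listening on localhost.
def proxy_conn_str(database: str, user: str, password: str,
                   host: str = "127.0.0.1", port: int = 1433) -> str:
    # The proxy exposes the instance locally, so no instance IP,
    # authorized network, or SSL configuration is needed here.
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={host},{port};"
        f"DATABASE={database};"
        f"UID={user};PWD={password}"
    )

def connect(database: str, user: str, password: str):
    # Deferred import: needs the msodbcsql17 driver and pyodbc installed.
    import pyodbc
    return pyodbc.connect(proxy_conn_str(database, user, password))
```

The port here assumes the proxy was started listening on 1433; it can listen on any local port as long as the connection string matches.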

Actual Implementation

So imagine we are developing:

  • a data engineering pipeline using serverless services and Python,
  • which extracts data from the Cloud SQL database into Google BigQuery.
  • Having your data in BigQuery enables low-cost data transformation as well as building reporting solutions.
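The extract step above can be sketched as a small batching loop. The two callables are injected so the flow stays testable; in the real pipeline they would wrap pyodbc and the BigQuery client, and their names here are assumptions, not a fixed API:

```python
# Sketch: stream rows from SQL Server into BigQuery in batches.
from typing import Callable, Iterable, List, Mapping

Row = Mapping[str, object]

def extract_to_bigquery(
    run_query: Callable[[str], Iterable[Row]],  # e.g. a pyodbc cursor wrapper
    load_rows: Callable[[str, List[dict]], int],  # e.g. a BigQuery insert wrapper
    source_sql: str,
    target_table: str,
    batch_size: int = 500,
) -> int:
    """Pull rows from the source query, load them in batches, return row count."""
    loaded = 0
    batch: List[dict] = []
    for row in run_query(source_sql):
        batch.append(dict(row))
        if len(batch) >= batch_size:
            loaded += load_rows(target_table, batch)
            batch = []
    if batch:  # flush the final partial batch
        loaded += load_rows(target_table, batch)
    return loaded
```

Batching keeps memory bounded for large tables and maps naturally onto BigQuery's row-insert or load-job APIs.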

The building blocks for connecting to Cloud SQL (SQL Server) from Python are:

  • the pyodbc or SQLAlchemy Python package (there is an excellent GitHub post on pyodbc usage available here);
  • the Microsoft ODBC Driver for SQL Server;
  • a custom Docker image, since the ODBC driver must be installed for the pyodbc package to connect to SQL Server;
  • deploying the Python app to a serverless GCP service such as Cloud Run.
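One Cloud Run detail worth noting before deploying: a Cloud Run service must listen for HTTP requests on the port given in the PORT environment variable. A minimal stdlib-only entrypoint sketch (the handler logic and names are assumptions for illustration) could look like:

```python
# Sketch: minimal HTTP entrypoint for a Cloud Run service.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class PipelineHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Placeholder: a real service would kick off the
        # Cloud SQL -> BigQuery extract here.
        body = b"pipeline triggered"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging noise

def make_server(port=None) -> HTTPServer:
    # Cloud Run injects the listening port via the PORT env var (default 8080).
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return HTTPServer(("0.0.0.0", port), PipelineHandler)
```

A production service would more likely use a framework like Flask behind gunicorn, but the PORT contract is the same.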

The next challenge in building the solution is creating a custom Docker image, based on a Debian 10 image, that installs:

  • msodbcsql17
  • unixodbc
# load python 3.8 dependencies using slim debian 10 image.
FROM python:3.8-slim-buster

# build variables.
ENV DEBIAN_FRONTEND=noninteractive

# install Microsoft SQL Server requirements.
RUN apt-get update \
  && apt-get install -y --no-install-recommends curl gcc g++ gnupg unixodbc-dev

# add the Microsoft repository for SQL Server ODBC Driver 17 on Debian 10.
RUN curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
  && curl -fsSL https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list \
  && apt-get update \
  && ACCEPT_EULA=Y apt-get install -y --no-install-recommends msodbcsql17 mssql-tools \
  && echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bash_profile \
  && echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc

# upgrade pip and install requirements.
COPY requirements.txt /requirements.txt
RUN pip install --upgrade pip \
  && pip install -r /requirements.txt

# clean the install.
RUN apt-get -y clean && rm -rf /var/lib/apt/lists/*

# copy all files to /app directory and move into directory.
COPY . /app
WORKDIR /app

# replace main.py with your application's entrypoint script.
ENTRYPOINT ["python", "main.py"]
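The Dockerfile copies in a requirements.txt; a minimal version for this pipeline might look like the following (the exact package set is an assumption for this sketch):

```text
# requirements.txt - assumed packages for the extract pipeline
pyodbc
SQLAlchemy
google-cloud-bigquery
```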

Deployment into Google Cloud Run

  1. Build a container image using the Dockerfile created above.
  2. The image is built and automatically uploaded to Google Cloud Container Registry.
  3. Deploy the container image to a Cloud Run service using the commands below.
# build the image using the above Dockerfile
gcloud builds submit --tag gcr.io/${PROJECT_ID}/${IMAGE_NAME}

# deploy to Cloud Run using a Serverless VPC Access connector
gcloud run deploy dev-pyconnect-mssql \
  --image gcr.io/${PROJECT_ID}/${IMAGE_NAME} \
  --vpc-connector ${VPC_CONNECTOR_NAME}

I found that the gcloud SDK commands made automating the whole process simple and easy to implement.

As always.. Happy Learning!!
