My task is to migrate data from a remote Microsoft SQL Server over to a Google Cloud BigQuery table. The data in question comes from joining two tables on a common key and filtering with a WHERE clause. An example of the query as a Python f-string is given below:
query = f'''
SELECT dbo.SalesHeaders.*
FROM dbo.Nodes WITH (NOLOCK)
INNER JOIN dbo.SalesHeaders WITH (NOLOCK)
ON dbo.Nodes.node = dbo.SalesHeaders.node
AND CONVERT(DATE, dbo.SalesHeaders.DateTime) BETWEEN '{start.strftime('%Y-%m-%d')}' AND '{end.strftime('%Y-%m-%d')}'
WHERE dbo.Nodes.companytype = 1
'''
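As an aside, the same query can also be assembled with pyodbc-style `?` placeholders so that the date bounds are bound as parameters rather than interpolated into the SQL text. This is a sketch; it assumes `start` and `end` are `date`/`datetime` objects, and `build_query_and_params` is a hypothetical helper name:

```python
from datetime import date

def build_query_and_params(start, end):
    """Return (sql, params) suitable for cursor.execute(sql, params)."""
    sql = '''
    SELECT dbo.SalesHeaders.*
    FROM dbo.Nodes WITH (NOLOCK)
    INNER JOIN dbo.SalesHeaders WITH (NOLOCK)
    ON dbo.Nodes.node = dbo.SalesHeaders.node
    AND CONVERT(DATE, dbo.SalesHeaders.DateTime) BETWEEN ? AND ?
    WHERE dbo.Nodes.companytype = 1
    '''
    # The dates are passed as parameters instead of being formatted into the string
    return sql, (start.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d'))

sql, params = build_query_and_params(date(2024, 1, 1), date(2024, 1, 31))
```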
The current approach to achieving the above task is a Python script that runs on my local machine. A critical part of its success is the presence of an installed SQL driver, which the pyodbc library requires in order to establish a connection. An example is given below:
import pyodbc
import json

CREDENTIALS = 'blah_blah.json'

def establish_connector():
    with open(CREDENTIALS, 'r') as file:
        credentials = json.load(file)
    connector = pyodbc.connect(
        'DRIVER={SQL Server};'
        f'SERVER={credentials["server"]};'  # e.g. uploads.gaap.com,5143
        f'DATABASE={credentials["database"]};'
        f'UID={credentials["username"]};'
        f'PWD={credentials["password"]}'
    )
    return connector
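The connection string that `establish_connector` assembles can be factored out and checked without a live server. The sketch below uses a hypothetical helper name (`build_connection_string`) and made-up credential values apart from the host/port example from the comment above:

```python
def build_connection_string(credentials: dict) -> str:
    # Assemble the same ODBC connection string used by establish_connector();
    # adjacent string literals inside the parentheses concatenate automatically.
    return (
        'DRIVER={SQL Server};'
        f'SERVER={credentials["server"]};'  # e.g. host,port such as uploads.gaap.com,5143
        f'DATABASE={credentials["database"]};'
        f'UID={credentials["username"]};'
        f'PWD={credentials["password"]}'
    )

# Hypothetical credential values for illustration only
conn_str = build_connection_string({
    'server': 'uploads.gaap.com,5143',
    'database': 'SalesDB',
    'username': 'reader',
    'password': 'secret',
})
```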
In a previous question of mine, I attempted to replicate this approach so that the script runs either as a Google Cloud Function or as a Dataflow pipeline using the Apache Beam Python SDK. Due to the lack of installed SQL drivers in serverless environments, I have not had much success.
My previous question constrained itself to either a Cloud Function or a Dataflow approach. In this question, I am open to a broader scope of solutions. In other words, please suggest and outline any approach that would migrate the data returned by the given query over to Google Cloud Platform such that it is ultimately accessible from BigQuery.
I have already started to consider the following which could be expanded upon:
- Schedule an instance of a virtual machine (VM) to run the Python script. This would work because one can install the SQL driver on a VM.
- Create a SQL Server instance and use the Database Migration Service. I suspect this would migrate far more data than is needed; one could then build a BigQuery table by replicating the above query against the new SQL Server instance. I am unsure about the costs involved and am therefore hesitant.
As far as pyodbc is concerned, the Ubuntu 18.04/22.04 base images used for Cloud Functions include neither the unixodbc-dev package that pyodbc requires nor any ODBC driver suitable for use with SQL Server. Ref: System Packages Included in Cloud Functions. Creating a VM to run your Python script would probably be a better way to go.
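For the VM route, the missing pieces can be installed along the lines of Microsoft's published instructions. This is a sketch for Ubuntu 22.04; the repository URL and driver version (msodbcsql18) should be checked against the current Microsoft docs for your distro and release:

```shell
# Add Microsoft's package repository (Ubuntu 22.04 shown; adjust for your release)
curl -sSL https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl -sSL https://packages.microsoft.com/config/ubuntu/22.04/prod.list \
  | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update

# msodbcsql18 is the SQL Server ODBC driver; unixodbc-dev is what pyodbc builds against
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev
```

With the driver installed, the connection string's DRIVER value would typically change to `{ODBC Driver 18 for SQL Server}`.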