
My task is to migrate data from a remote Microsoft SQL Server over to a Google Cloud BigQuery table. The data in question comes from joining two tables on a common key and filtering with a WHERE clause. An example of the query as a Python f-string is given below:

query = f'''
    SELECT dbo.SalesHeaders.* 
    FROM dbo.Nodes WITH (NOLOCK) 
    INNER JOIN dbo.SalesHeaders WITH (NOLOCK) 
    ON dbo.Nodes.node = dbo.SalesHeaders.node 
    AND CONVERT(DATE , dbo.SalesHeaders.DateTime) BETWEEN '{start.strftime('%Y-%m-%d')}' AND '{end.strftime('%Y-%m-%d')}'
    WHERE dbo.Nodes.companytype = 1
    '''
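As an aside, the same query can be written with `?` placeholders so that pyodbc passes the dates as parameters rather than interpolating them into the SQL text (the DB-API `execute(sql, params)` form). A minimal sketch, assuming hypothetical `start`/`end` dates in place of the originals:

```python
from datetime import date

# Hypothetical date range; in the original script these come from `start` and `end`.
start, end = date(2023, 1, 1), date(2023, 1, 31)

# Same query with `?` placeholders instead of f-string interpolation.
query = '''
    SELECT dbo.SalesHeaders.*
    FROM dbo.Nodes WITH (NOLOCK)
    INNER JOIN dbo.SalesHeaders WITH (NOLOCK)
    ON dbo.Nodes.node = dbo.SalesHeaders.node
    AND CONVERT(DATE, dbo.SalesHeaders.DateTime) BETWEEN ? AND ?
    WHERE dbo.Nodes.companytype = 1
    '''
params = (start.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d'))

# Later, against an open pyodbc connection:
#   cursor.execute(query, params)
```

This keeps the query text constant and avoids quoting/injection issues with the interpolated dates.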

The current approach to achieving the above task is a Python script that runs on my local machine. A critical part of its success is the presence of an installed SQL driver, which is required by the pyodbc library in Python to establish a connection. An example is given below:

import pyodbc
import json

CREDENTIALS = 'blah_blah.json'

def establish_connector():
    with open(CREDENTIALS,'r') as file:
        credentials = json.load(file)
    connector = pyodbc.connect(
        'DRIVER={SQL Server};'  # uploads.gaap.com,5143
        f'SERVER={credentials["server"]};'
        f'DATABASE={credentials["database"]};'
        f'UID={credentials["username"]};'
        f'PWD={credentials["password"]}'
        )
    return connector
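For context, once the connector exists the script iterates the result set and pushes rows onward to BigQuery. A chunked-fetch helper is a common pattern for large result sets; the sketch below relies only on the DB-API `fetchmany` method (which pyodbc cursors provide), and the name `iter_chunks` is my own:

```python
def iter_chunks(cursor, size=50_000):
    """Yield lists of rows from a DB-API cursor until the result set is exhausted."""
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        yield rows

# Hypothetical usage once a connection exists (BigQuery side requires the
# google-cloud-bigquery package; `client` and `table` are assumed):
#
#   cursor = establish_connector().cursor()
#   cursor.execute(query)
#   for chunk in iter_chunks(cursor):
#       client.insert_rows(table, chunk)  # streaming inserts; load jobs also work
```

Chunking keeps memory bounded regardless of how many rows the query returns.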

In a previous question of mine I attempted to replicate this approach such that my script runs either as a Google Cloud Function or as a Dataflow pipeline using the Apache Beam Python SDK. Due to the lack of installed SQL drivers in serverless environments, I have not had much success.

My previous question constrained itself to using either a Cloud Function or a Dataflow approach. In this question, I am open to a broader scope of solutions. In other words, please suggest and outline any approach that would manage to migrate the data from the given query over to the Google Cloud Platform such that it would ultimately be accessible by BigQuery.

I have already started to consider the following which could be expanded upon:

  1. Schedule an instance of a virtual machine (VM) to run the Python script. This would work because one can install the SQL driver on a VM.
  2. Create a SQL Server instance and use the Database Migration Service. I think that this would migrate a lot more data than is needed. One could then build a BigQuery table by replicating the above query against this new SQL Server instance. I am unsure about the costs involved and am hesitant as such.
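Regarding option 1, a sketch of preparing an Ubuntu 22.04 VM follows Microsoft's published package repository. The package names (`msodbcsql18`, `unixodbc-dev`) and repository URL come from Microsoft's install docs; check the current instructions for your distribution, and note the driver name in the connection string then becomes `ODBC Driver 18 for SQL Server`:

```shell
# Sketch: install Microsoft's ODBC driver for SQL Server on Ubuntu 22.04.
# Verify against Microsoft's current install docs before relying on this.
curl -sSL https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl -sSL https://packages.microsoft.com/config/ubuntu/22.04/prod.list \
    | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev
pip install pyodbc
```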
  • Google Cloud Functions would be the wrong tool for the job, at least as far as Python and pyodbc are concerned, since the Ubuntu 18.04/22.04 base images used for Cloud Functions do not include the unixodbc-dev package that's required by pyodbc, nor do they include any ODBC drivers suitable for use with SQL Server. Ref: System Packages Included in Cloud Functions. Creating a VM to run your Python script would probably be a better way to go. Commented Oct 30, 2023 at 9:48
  • @AlwaysLearning Thank you for confirming this to be the better approach. Knowing this makes it easier to commit to the project. Thank you. Commented Oct 30, 2023 at 9:53

1 Answer


Use the Create Job from Template feature in Dataflow to transfer data from SQL Server to BigQuery. Supply the necessary parameters, such as the JDBC URL, username, and password, then execute the job.
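The template referred to here appears to be Dataflow's JDBC-to-BigQuery template. A hypothetical invocation might look like the following; all `<bracketed>` values are placeholders, parameter names should be checked against the current template documentation, and a SQL Server JDBC driver jar staged in Cloud Storage is additionally required via `driverJars`/`driverClassName`:

```shell
# Hypothetical run of the JDBC-to-BigQuery Dataflow template for SQL Server.
# gcloud's ^~^ prefix changes the --parameters delimiter to ~ so the JDBC URL
# may contain commas/semicolons. Verify names against current template docs.
gcloud dataflow jobs run sqlserver-to-bq \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates/latest/Jdbc_to_BigQuery \
  --parameters=^~^connectionURL='jdbc:sqlserver://<host>:<port>;databaseName=<db>'~username=<user>~password=<password>~driverClassName=com.microsoft.sqlserver.jdbc.SQLServerDriver~driverJars=gs://<bucket>/mssql-jdbc.jar~query='SELECT ...'~outputTable=<project>:<dataset>.<table>~bigQueryLoadingTemporaryDirectory=gs://<bucket>/tmp
```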
