I have a dataframe that I want to merge back to a SQL table - not merge in the pandas sense, which would be a join, but a SQL merge operation to update/insert records into the table based on a comparison between the dataframe and the table.

There are a few workarounds I can see, such as writing the dataframe to a new table and doing the merge in SQL, or deleting any existing records based on the primary keys and appending the whole dataframe to the table using to_sql - but is there any built-in function that would do this sort of merge in Python directly?
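
For reference, the first workaround might look roughly like the sketch below. It uses SQLite from the standard library as a stand-in database. The table names `target` and `staging`, the columns, and the sample rows are all made up for illustration, and the rows are inserted by hand where a dataframe would normally be loaded with to_sql. The `ON CONFLICT` upsert syntax is SQLite-specific (it needs SQLite 3.24 or newer); other databases would use their own `MERGE` or upsert statement.

```python
import sqlite3

# In-memory database standing in for the real SQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "a"), (2, "b")])

# Hypothetical rows standing in for the dataframe's contents.
incoming = [(2, "B"), (3, "c")]
conn.execute("CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", incoming)

# Commit everything in one transaction; on an exception the context
# manager rolls back and the target table is left unchanged.
with conn:
    conn.execute(
        """
        INSERT INTO target (id, value)
        SELECT id, value FROM staging WHERE true
        ON CONFLICT(id) DO UPDATE SET value = excluded.value
        """
    )

rows = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()
print(rows)
```

Here the existing row with id 2 is updated and the new row with id 3 is inserted, while id 1 is left alone.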

  • Personally, I'd do the merge in DataFrames: one with your current dataset and another with your existing SQL table. Merge those and then push the result into a new temp SQL table. (Then handle the table drop/replace in SQL.) Commented May 16, 2020 at 19:07
  • I'm also looking for a reference to do the same, but I don't think it's a good idea to delete/drop records or tables at your target. What if something goes wrong partway through and you end up losing the data? Commented Apr 12, 2021 at 6:41
  • I think it's better to dump the data to an intermediate table in the SQL DB, and then do a MERGE (or an SP with MERGE) by calling it from Python. Commented Apr 12, 2021 at 6:43
  • Databricks has a neat implementation - docs.databricks.com/delta/… - it'd be great if there was something like this in pandas. Commented May 31, 2022 at 8:39

1 Answer

The concat function in pandas can be used to get something similar to a SQL merge (an upsert) between two dataframes.

I encountered a similar issue and was able to develop a function that "merges" two dataframes assuming a shared index. In my case, I turned a field "PrimaryKey" into the index for each dataframe and then merged the two dataframes using this field.

    import pandas as pd


    def merge_dataframe_diff(target_df, source_df, merge_on_fields=None, drop_index=True):
        """Upserts source_df into target_df based on shared key columns

        Assumptions:
            1) The dataframes share all the same columns
            2) There is a "PrimaryKey" column to merge on; pass merge_on_fields to use other or multiple columns
        Args:
            target_df [dataframe]: target dataframe (to merge/upsert into)
            source_df [dataframe]: source dataframe
            merge_on_fields [list]: field(s) to merge the two dataframes on; defaults to ["PrimaryKey"]
            drop_index [bool]: whether to drop the merge_on_fields from the columns of the returned dataframe (they remain available as its index)

        Returns:
            full_df [dataframe]: merged dataframe, indexed by merge_on_fields
        """
        if merge_on_fields is None:
            merge_on_fields = ["PrimaryKey"]
        # Set the key as the index on copies (no inplace) so the caller's dataframes are not modified;
        # verify_integrity raises if a key value is duplicated
        source = source_df.set_index(merge_on_fields, drop=drop_index, verify_integrity=True)
        target = target_df.set_index(merge_on_fields, drop=drop_index, verify_integrity=True)
        # SQL merge aka upsert: keep target rows whose key is not in source, then append all source rows
        full_df = pd.concat([target[~target.index.isin(source.index)], source])
        return full_df
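
To make the behaviour of the concat-based approach concrete, here is a small self-contained sketch with sample data. The helper name `upsert_frames`, the `value` column, and the sample rows are hypothetical and only for illustration. It assumes a single key column and, like the function above, that both dataframes share the same columns.

```python
import pandas as pd


def upsert_frames(target_df, source_df, key="PrimaryKey"):
    """Return target_df with rows from source_df inserted or overwritten by key."""
    # set_index returns new frames, so the inputs are left unmodified;
    # verify_integrity raises if a key value appears more than once.
    target = target_df.set_index(key, verify_integrity=True)
    source = source_df.set_index(key, verify_integrity=True)
    # Keep target rows whose key is absent from source, then append every source row.
    kept = target[~target.index.isin(source.index)]
    return pd.concat([kept, source]).sort_index()


# Hypothetical sample data: key 2 exists in both frames, key 4 is new.
target_df = pd.DataFrame({"PrimaryKey": [1, 2, 3], "value": ["a", "b", "c"]})
source_df = pd.DataFrame({"PrimaryKey": [2, 4], "value": ["B", "d"]})

merged = upsert_frames(target_df, source_df)
print(merged)
```

The row with key 2 takes its value from the source, key 4 is added, and keys 1 and 3 are kept from the target. Note that this only merges in memory; writing the result back to the SQL table is still a separate step.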