I have a Google Cloud SQL PostgreSQL instance with 1 read replica. The replication lag is usually 100-200 ms. I have the same setup on 2 VM instances that are also on Google Cloud, and there the lag is usually under 1 ms. What can I do to improve this?

I measure replication lag by looking at Cloud SQL's dashboard and by running this query on the replica:

SELECT CASE
         WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS log_delay;

I tried to find the cause with this query on the primary:

SELECT application_name,state,sync_state,client_addr,client_hostname,
pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) AS sent_lag,
pg_wal_lsn_diff(sent_lsn,flush_lsn) AS receiving_lag,
pg_wal_lsn_diff(flush_lsn,replay_lsn) AS replay_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) AS total_lag,
now()-reply_time AS reply_delay
FROM pg_stat_replication;
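
The query above reports lag in bytes of WAL. As a complementary sketch, and assuming PostgreSQL 10 or later (which the pg_wal_lsn_diff call already implies), pg_stat_replication also exposes time-based columns that may map more directly onto the 100-200 ms figure:

```sql
-- Run on the primary; assumes PostgreSQL 10 or later.
SELECT application_name,
       state,
       sync_state,
       write_lag,   -- time until the standby reported the WAL as written
       flush_lag,   -- time until the standby reported the WAL as flushed to disk
       replay_lag   -- time until the standby reported the WAL as replayed
FROM pg_stat_replication;
```

These intervals are measured from the standby's feedback messages, so on a mostly idle system they can show the last measured value for a short time and then NULL. If write_lag is already large, the delay is probably accumulating before or during transfer rather than during replay.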

sent_lag is high, which according to the docs indicates that the primary is under heavy load. However, when I check the primary, CPU usage is only ~1% and there are no long-running transactions, DDL statements, or queries.

Both the primary and the replica are in the same Google Cloud region, so the network can't be the issue either.

  • Please have a look at "Monitor replication lag" to find out whether the cause of the lag is in the primary database, the network, or the replica. Commented Aug 27, 2024 at 14:05
  • Perhaps there is a high network latency between the servers. Commented Aug 27, 2024 at 15:10
  • @SupriyaBharti I followed the suggestions in the link and it seems the primary cannot send changes fast enough to the replica. What can cause this? As mentioned in the post, there is no heavy load/locks/long-running query on the primary. I have PostgreSQL running on Google Cloud VMs with the same CPU/RAM receiving more traffic and replication lag is sub 1ms. Commented Aug 28, 2024 at 5:29
  • @LaurenzAlbe That was one of my guesses too, but both servers are in the same Google Cloud region, so latency can't be that bad (I hope). I have 2 servers running on Google Cloud VMs in the same region too, and their replication lag is much better than Cloud SQL's. Commented Aug 28, 2024 at 5:29
  • I wouldn't hope and guess, but measure. Commented Aug 28, 2024 at 6:40

1 Answer

According to the documentation, there are several things you can try to troubleshoot this issue:

  • Edit the instance to increase the size of the replica.

  • Reduce the load on the database.

  • Send read traffic to the read replica.

  • Index the tables.

  • Identify and fix slow write queries.

  • Recreate the replica.

Also check this documentation, which lists a few more possible causes of the lag; working through them may help you resolve the issue. Additionally, you can have a look at the section on the monolithic primary database.

1 Comment

None of these work. The instances are quite big and receive virtually no traffic. I did try checking network latency as @LaurenzAlbe suggested, and it looks like cloudsql.googleapis.com/database/replication/network_lag is always 1 second, which is quite strange since both instances are in the same region.
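
One possible way to cross-check that network_lag value from inside PostgreSQL is to look at the WAL receiver's message timestamps on the replica. This is only a sketch: it assumes the connecting role is allowed to read pg_stat_wal_receiver on Cloud SQL, and that the primary's and replica's clocks are closely synchronized.

```sql
-- Run on the replica.
SELECT status,
       last_msg_send_time,      -- send time of the last message, per the primary's clock
       last_msg_receipt_time,   -- receipt time of that message, per the replica's clock
       last_msg_receipt_time - last_msg_send_time AS transit_time,
       latest_end_time          -- time of the last WAL position reported back to the primary
FROM pg_stat_wal_receiver;
```

transit_time compares two different clocks, so it is only a rough indication. A value consistently near 1 second would be in line with the metric, while a value of a few milliseconds would suggest the metric is rounded or sampled coarsely.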
