How to reduce replication lag for read replica in Google Cloud SQL for PostgreSQL?

Question

I have a Google Cloud SQL PostgreSQL instance with 1 read replica. The replication lag is usually 100~200ms. I have the same setup with 2 VM instances that are also on Google Cloud and the lag is usually sub 1ms. What can I do to improve this?

I measure replication lag by looking at Cloud SQL's dashboard and run this query:

SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;

I tried to find the cause with this query on the primary:

SELECT application_name,state,sync_state,client_addr,client_hostname,
pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) AS sent_lag,
pg_wal_lsn_diff(sent_lsn,flush_lsn) AS receiving_lag,
pg_wal_lsn_diff(flush_lsn,replay_lsn) AS replay_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) AS total_lag,
now()-reply_time AS reply_delay
FROM pg_stat_replication;

sent_lag is high, which the docs suggests that it indicates the primary is under heavy load. However, when I check the primary, CPU usage is only ~1% and there is no long running transaction/ddl/query.

Both the primary and replica are in the same region on Google Cloud so network can't be an issue either.

Please have a look at Monitor replication lag to know whether the cause of the lag is in the primary database, the network, or the replica. — Supriya Bharti
– Supriya Bharti, Commented Aug 27, 2024 at 14:05
Perhaps there is a high network latency between the servers. — Laurenz Albe
– Laurenz Albe, Commented Aug 27, 2024 at 15:10
@SupriyaBharti I followed the suggestions in the link and it seems the primary cannot send changes fast enough to the replica. What can cause this? As mentioned in the post, there is no heavy load/locks/long-running query on the primary. I have PostgreSQL running on Google Cloud VMs with the same CPU/RAM receiving more traffic and replication lag is sub 1ms. — Nguyen Nguyen
– Nguyen Nguyen, Commented Aug 28, 2024 at 5:29
@LaurenzAlbe That was 1 of my guesses too but both servers are on the Google Cloud region so latency can't be so bad (I hope). I have 2 servers running on Google Cloud VMs in the same region too and replication lag is much better than Cloud SQL. — Nguyen Nguyen
– Nguyen Nguyen, Commented Aug 28, 2024 at 5:29

Supriya Bharti · Accepted Answer · 2024-08-28 11:05:20Z

0

According to documentation, there are some possible solutions to troubleshoot this issue. These are the following solutions you can follow.

Edit the instance to increase the size of the replica.
Reduce the load on the database.
Send read traffic to the read replica.
Index the tables.
Identify and fix slow write queries.
Recreate the replica.

Also check this documentation, which suggests a few more possible reasons for this lag. Check all the reasons mentioned here which might help you to resolve your issue. Additionally you can have a look at this Monolithic primary database.

edited Aug 28, 2024 at 11:05

answered Aug 28, 2024 at 9:48

Supriya Bharti

3331 silver badge6 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Nguyen Nguyen Over a year ago

None of these work. The instances are quite big and receive virtually no traffic. I did try check network latency like @LaurenzAlbe suggested and and it looks like cloudsql.googleapis.com/database/replication/network_lag is always 1 second, which is quite strange since both instances are in the same region.

Collectives™ on Stack Overflow

How to reduce replication lag for read replica in Google Cloud SQL for PostgreSQL?

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related