I have a Google Cloud SQL PostgreSQL instance with one read replica. The replication lag is usually 100-200 ms. I have the same setup on two VM instances, also on Google Cloud, and there the lag is usually under 1 ms. What can I do to reduce the lag on Cloud SQL?
I measure replication lag by looking at Cloud SQL's dashboard and by running this query:
SELECT CASE
         WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS log_delay;
I tried to find the cause with this query on the primary:
SELECT application_name,state,sync_state,client_addr,client_hostname,
pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) AS sent_lag,
pg_wal_lsn_diff(sent_lsn,flush_lsn) AS receiving_lag,
pg_wal_lsn_diff(flush_lsn,replay_lsn) AS replay_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) AS total_lag,
now()-reply_time AS reply_delay
FROM pg_stat_replication;
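For reference, here is a minimal Python sketch of the arithmetic behind those byte-lag columns, as I understand it. It assumes the usual textual LSN format (two hexadecimal numbers separated by a slash, the first being the high 32 bits). The helper names `lsn_to_bytes` and `lag_breakdown` and the sample LSN values are hypothetical, made up only for illustration; they are not taken from my instance.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN string such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lag_breakdown(current: str, sent: str, flush: str, replay: str) -> dict:
    """Mirror the byte-lag columns of the query: each value is a difference in bytes."""
    cur, snt, fl, rp = (lsn_to_bytes(x) for x in (current, sent, flush, replay))
    return {
        "sent_lag": cur - snt,       # WAL generated on the primary but not yet sent
        "receiving_lag": snt - fl,   # WAL sent but not yet flushed on the replica
        "replay_lag": fl - rp,       # WAL flushed on the replica but not yet replayed
        "total_lag": cur - rp,       # sum of the three components above
    }


# Hypothetical LSN values, for illustration only.
example = lag_breakdown("0/3000400", "0/3000100", "0/3000100", "0/30000C0")
print(example)
```

The three components always add up to total_lag, so a high sent_lag with small receiving_lag and replay_lag means the bytes are waiting on the primary side of the connection rather than on the replica.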
sent_lag is high, which according to the docs indicates that the primary is under heavy load. However, when I check the primary, CPU usage is only ~1% and there are no long-running transactions, DDL statements, or queries.
Both the primary and the replica are in the same region on Google Cloud, so I would not expect the network to be the bottleneck either.