I have a TEXT keyvalues column in Postgres:
select * from test5 limit 5;
id | keyvalues
----+------------------------------------------------------
1 | ^ first 1 | second 3
2 | ^ first 1 | second 2 ^ first 2 | second 3
3 | ^ first 1 | second 2 | second 3
4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
5 | ^ first 2 | second 3 ^ first 1 | second 3
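For reproducibility, data of roughly this shape can be generated like so (a sketch; my real data has more keys and a different distribution, but the delimiter format is the same):

```sql
-- Hypothetical generator for ~5M rows in the "^ key | value" format shown above.
create table test5 (id serial primary key, keyvalues text);

insert into test5 (keyvalues)
select '^ first ' || (1 + floor(random() * 2))::int
    || ' | second ' || (1 + floor(random() * 3))::int
from generate_series(1, 5000000);
```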
My queries must exclude the ^ character from the middle of the match, so I'm using regular expressions:
explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
-> Gather (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
-> Parallel Seq Scan on test5 (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Rows Removed by Filter: 1666668
Planning Time: 0.068 ms
Execution Time: 7337.184 ms
The query works (zero rows match), but is way too slow at > 7 seconds.
I thought indexing with trigrams would help, but no luck:
create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);
explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
-> Bitmap Heap Scan on test5 (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Rows Removed by Index Recheck: 5000005
Heap Blocks: exact=47620
-> Bitmap Index Scan on test5_keyvalues_idx (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Planning Time: 0.412 ms
Execution Time: 23734.722 ms
The query with the trigram index is 3x slower! It still returns the correct result (zero rows). I expected the trigram index to determine immediately that the string "second 0" appears nowhere in the table, and return quickly.
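To check whether the regex itself is what defeats trigram extraction, I can compare against plain substring patterns on the missing token (a sketch; I haven't pasted these plans, but pg_trgm handles LIKE and simple regexes differently from character-class regexes):

```sql
-- Pure substring match on the token that never occurs:
explain analyze select count(*) from test5 where keyvalues ~* 'second 0';

-- Same test via LIKE, which has its own trigram operator support:
explain analyze select count(*) from test5 where keyvalues like '%second 0%';
```

If those are fast while the `[^\^]+` version is slow, that would suggest the index scan degenerates to matching every row for the character-class regex, as the "Rows Removed by Index Recheck: 5000005" line above hints.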
(Motivation: I want to avoid normalizing the keyvalues into another table, so I'm looking to encode the matching logic in a single TEXT field using text indexing and regexps instead. The logic works, but is too slow, as is JSONB.)
As far as I know, ILIKE over TEXT arrays only works via the slow unnest() (a sequential scan). But if you know a way, let me know!
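The unnest() variant I mean is something like this, sketched against a hypothetical test6 table with a text[] column named arr (names are illustrative only):

```sql
-- Hypothetical array layout: one array element per "^ ..." group.
-- This works logically but forces a sequential scan, since the ILIKE
-- is applied to the unnested elements rather than an indexed column.
select count(*)
from test6 t
where exists (
    select 1
    from unnest(t.arr) as kv
    where kv ilike '%first 1%second 0%'
);
```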