
I have a query that runs hourly, and I process the dataset it returns. While processing it, I need to ignore some IDs. I currently do this with NOT IN, and the number of IDs to ignore is around 50.

I write the processed data to a text file in a certain pattern. For better performance, should I do the ignore step directly in the query or inside the foreach loop?

The query returns around 5,000-7,000 rows from a table of 10M records, and I need to exclude around 50 IDs from the result set.

Let's say:

$blacklist_arr = array(1,10,20,30,40,50,60,70,80,90,100); // around 50 elements in the real array

What I use now:

...QUERY...
resultSet.ID NOT IN (\'' . implode( "', '" , $blacklist_arr ) . '\')
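
As a side note on building that fragment: the sketch below shows one possible way to generate the NOT IN list with placeholders instead of concatenating quoted values. It assumes the IDs are integers and that the query is executed as a prepared statement (for example through PDO); `buildNotInClause` is a hypothetical helper name, not something from the code above.

```php
<?php
// Hypothetical helper: returns a "NOT IN (?, ?, ...)" fragment and the values to bind.
function buildNotInClause(string $column, array $ids): array
{
    // Assumption: IDs are integers. Cast them and drop duplicates.
    $ids = array_values(array_unique(array_map('intval', $ids)));

    if ($ids === []) {
        // An empty list must not produce "NOT IN ()", which is invalid SQL.
        return ['TRUE', []];
    }

    // One "?" per ID, e.g. "?, ?, ?, ?" for four IDs.
    $placeholders = implode(', ', array_fill(0, count($ids), '?'));

    return ["$column NOT IN ($placeholders)", $ids];
}

$blacklist_arr = array(1, 10, 20, 30);
[$clause, $params] = buildNotInClause('resultSet.ID', $blacklist_arr);
```

Here `$clause` would be appended to the WHERE part of the query and `$params` passed along when the statement is executed.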

What I'm planning to use:

foreach ($final_dataset as $final_data) {
    ...
    if (!in_array($final_data, $blacklist_arr)) {
        // write to file
        ...
    }
}

Edit: the query structure is below:

SELECT * 
FROM
    (
        (
        SELECT DISTINCT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a`
            INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 ) 
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR ) 
            AND ( a.col3 < DATE_HERE ) 
        ) UNION
        (
        SELECT  a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a` 
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL ) 
            AND ( a.col3 < DATE_HERE ) 
        ) 
    ) AS resultSet 
WHERE
    resultSet.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 ) 
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC
  • Try checking the performance of just the SQL, once with the NOT IN clause and once without it. See how much difference it makes. Commented Jan 1, 2022 at 8:27
  • @NigelRen Sorry, I forgot to mention that. Query time with/without NOT IN is around 0.080 - 0.0100 seconds, so right now it doesn't seem to make any difference, but the number of elements in the array will grow daily/weekly. I expect 300-500 elements in this array each year, and it will be reset once a year. Commented Jan 1, 2022 at 8:33
  • Try running it with a larger number of elements. If the difference is negligible, stick with this. Running a loop with in_array() is not particularly efficient. Commented Jan 1, 2022 at 8:41
  • Thank you @NigelRen, I will test that too. So I'd better stick with the current logic. Commented Jan 1, 2022 at 8:49
  • P.S. Some stats on the above: Looping 100K random numbers / 1K random number blacklist: 0.03s vs 0.25s for isset vs in_array. 100K / 10K: 0.03s vs 2.21s. 1M / 1K: 0.15s vs 2.10s. 1M / 10K: 0.15s vs 35.5s. 10M / 100K: 1.58s for isset. The isset check adds < 5% to the loop baseline runtime, so it's quite economical. Less so for in_array. Would be curious to see comparable NOT IN stats for MySQL. Commented Jan 1, 2022 at 11:47
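
For reference, the isset variant compared in the last comment could look roughly like the sketch below. It is only an illustration: the sample rows and the 'ID' key are assumptions about the shape of the result set, and `$blacklist_lookup` and `$kept` are names made up for this example.

```php
<?php
$blacklist_arr = array(1, 10, 20, 30, 40, 50);

// Flip once so the IDs become array keys. isset() on a key is a hash lookup,
// whereas in_array() scans the list again for every row.
$blacklist_lookup = array_flip($blacklist_arr);

// Stand-in rows; the 'ID' key is an assumption about the result set.
$final_dataset = array(
    array('ID' => 5,  'name' => 'kept'),
    array('ID' => 10, 'name' => 'ignored'),
    array('ID' => 7,  'name' => 'kept too'),
);

$kept = array();
foreach ($final_dataset as $final_data) {
    if (isset($blacklist_lookup[$final_data['ID']])) {
        continue; // blacklisted ID, skip this row
    }
    $kept[] = $final_data; // the real code would write the row to the file here
}
```

The flip costs one pass over the blacklist, after which each row check no longer depends on how long the blacklist is.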

3 Answers

2

A variety of points:

  • I have a "Rule of Thumb": "If a possible optimization is estimated to improve things by less than 10%, move on. That is, don't spend extra effort on it. Instead, look for something better to work on." According to your numbers, the optimization decreases the result set by only about 1%.

  • There is a standard programming rule: "KISS". Which is simpler to code -- the NOT IN or the PHP filtering? A variant: "Which approach is fewer keystrokes?" That comes from "A programmer's time is much more valuable than computer time."

  • Moving the NOT IN into each subquery may speed it up slightly. This is because it would decrease (slightly) the intermediate tables involved in the query. (However, this fails the 10% and KISS rules.) On the other hand, it could eliminate the outermost Select. Note: This works: (SELECT ...) UNION (SELECT ...) ORDER BY....

  • Potential bug: The innermost Select may be picking a date & time from one of the excluded col1's.

  • UNION defaults to UNION DISTINCT, which is slower than UNION ALL. Consider this as a bigger optimization.

  • ON ( sub.X = a.col4) probably needs to mention col2.

  • Is DATE_HERE somehow related to NOW()? Perhaps you need TIMESTAMP instead of DATETIME or vice versa?

  • I suspect that the DISTINCT is not needed. Anyway, it is redundant with the UNION.

  • Consider whether the "blacklist" should be a table, not a config file. As a table, NOT EXISTS(..) or LEFT JOIN .. IS NULL would need to be added to the query. This would be slower than what you have now but might be "cleaner".

  • WHERE 1=1 is an artifact of lazy programming; it is not an optimization; the Optimizer will simply toss it.

  • Often, better indexes provide the most improvement. Maybe the following would help. Note: Separate, single-column indexes are not as good. Also, when adding INDEX(a,b), drop INDEX(a).

    a (as b):  INDEX(col2,  col4)  -- this order
    a:  INDEX(col4, col3, someColumn)  -- col4 first
    

1 Comment

I found much more than I asked for in this answer; it will be a great help in both my current and future projects. Thank you so much for taking the time to explain in detail! I used UNION instead of UNION ALL to eliminate any possible duplicate rows.
1

From a performance point of view, I recommend:

  1. Remove DISTINCT from the first subquery. One sort is better than two.
  2. Filter your rows in the subqueries, not in the combined rowset. This reduces the number of rows that UNION has to sort.
SELECT * 
FROM
    (
        (
        SELECT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a`
            INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 ) 
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR ) 
            AND ( a.col3 < DATE_HERE ) 
            AND a.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 ) 

        ) UNION
        (
        SELECT  a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a` 
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL ) 
            AND ( a.col3 < DATE_HERE ) 
            AND a.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 ) 

        ) 
    ) AS resultSet 
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC

2 Comments

Thank you @akina. If I move the NOT IN part into the subqueries, should I add WHERE 1=1 before ORDER BY for a faster result?
@Pelin You may, but you don't have to. It doesn't affect anything.
1

If your t.col_black_elem is obtained by another query, you could try a LEFT JOIN and check for non-matching values:

   SELECT a.col1,..., a.coln
   from table1 a 
   LEFT JOIN (
        select col_black_elem from tablex 
   ) t on t.col_black_elem = a.colx 
   WHERE  t.col_black_elem  is null

And for your query:

SELECT * 
FROM
    (
        (
        SELECT DISTINCT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a`
            INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 ) 
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR ) 
            AND ( a.col3 < DATE_HERE ) 
        ) UNION
        (
        SELECT  a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a` 
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL ) 
            AND ( a.col3 < DATE_HERE ) 
        ) 
    ) AS resultSet 
LEFT JOIN (
        select col_black_elem from tablex 
   ) t on t.col_black_elem =  resultSet.col1
WHERE  t.col_black_elem  is null
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC

Otherwise, if your t.col_black_elem is not obtained by another query, you could populate a temporary table, or build one dynamically from several SELECTs combined with UNION.
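
A minimal sketch of the "several SELECTs combined with UNION" idea, assuming the blacklist is a PHP array of integer IDs: `buildBlacklistSelect` is a hypothetical helper, and it uses UNION ALL on the assumption that duplicates are already removed in PHP.

```php
<?php
// Hypothetical helper: turns an array of IDs into the body of a derived table, e.g.
// "SELECT 1 AS col_black_elem UNION ALL SELECT 10 UNION ALL SELECT 20".
function buildBlacklistSelect(array $ids, string $alias = 'col_black_elem'): string
{
    // Assumption: IDs are integers. Cast them and drop duplicates.
    $ids = array_values(array_unique(array_map('intval', $ids)));

    if ($ids === []) {
        // Nothing to select from; let the caller skip the join instead.
        throw new InvalidArgumentException('Blacklist is empty');
    }

    $parts = array();
    foreach ($ids as $i => $id) {
        // Only the first SELECT needs the column alias.
        $parts[] = $i === 0 ? "SELECT $id AS $alias" : "SELECT $id";
    }

    return implode(' UNION ALL ', $parts);
}

$blacklist_arr = array(1, 10, 20);
$join = 'LEFT JOIN (' . buildBlacklistSelect($blacklist_arr) . ') t'
      . ' ON t.col_black_elem = resultSet.col1';
```

The resulting `$join` string would take the place of the LEFT JOIN on tablex in the query above, with the `WHERE t.col_black_elem is null` condition left unchanged.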

4 Comments

Thank you @scaisedge. I edited my question and added my query template. So from what I understand, sticking with NOT IN is better for my current query. col1 to col4 have indexes.
@pelin How do you get the blacklist? Do you have a query that fetches these IDs? Anyway, the answer is updated with your code sample.
I have a config file and these IDs are stored as an array in it; I fetch the blacklist IDs from there, so I'm not running an extra query for them. Thank you for the updated answer, checking it out!
Then you could either use a temp table populated with the blacklist, or build a subquery from a set of SELECTs joined with UNION, one for each element in the blacklist.
