Friday, June 1, 2012

Join Cardinality Changes?

I've recently stumbled upon a curios behaviour of Oracle Optimizer when tried to optimize a query performing poorly. It looks like Oracle has changed the way CBO estimates cardinality of a join, which according to all the sources I've seen should be calculated as Cardinality of the first table after filtering, multiplied by the Cardinality of the second table after filtering, and divided by the Greatest number of distinct values for the columns used in the join condition. Quite mouthful so let's better set up a demo (the code taken from invaluable Cost Based Oracle Fundamentals by Jonathan Lewis):

 drop table t1 purge;   
 drop table t2 purge;  
   
 create table t1   
  as   
  SELECT MOD(rownum, 25) + 1 AS filter,   
      MOD(rownum, 30) + 1 AS join1,   
      lpad(rownum, 10) v1,   
      rpad('x', 100) padding   
   FROM dual   
  CONNECT BY LEVEL <= 10000   
  ;  
   
 create table t2   
  as   
  SELECT MOD(rownum, 50) + 1 as filter,   
      MOD(rownum, 40) + 1 as join2,   
      lpad(rownum, 10) v2,   
      rpad('x', 100) padding   
   FROM dual   
  CONNECT BY LEVEL <= 10000   
  ;  
   
 DECLARE   
  l sys.odcivarchar2list := sys.odcivarchar2list('T1', 'T2');   
 BEGIN   
  FOR j IN l.first .. l.last   
  LOOP   
   dbms_stats.gather_table_stats(ownname => USER, tabname => l(j),   
                  cascade => TRUE,   
                  method_opt => 'for all columns size 1');   
  END LOOP;   
 END;   
 /  

Now set autotrace and run the following query:

 set autotrace traceonly explain  
   
 SELECT t1.v1,   
     t2.v2   
  FROM t1,   
     t2   
  WHERE t1.join1 = t2.join2   
   AND t1.filter = 1   
   AND t2.filter = 1;  

As a result you will see a plan like the one below:

 Execution Plan   
 ----------------------------------------------------------   
 Plan hash value: 2959412835  
   
 ---------------------------------------------------------------------------   
 | Id | Operation         | Name | Rows | Bytes | Cost (%CPU)| Time        |   
 ---------------------------------------------------------------------------   
 |  0 | SELECT STATEMENT  |      | 2000 | 68000 |  61  (2)   | 00:00:01    |   
 |* 1 | HASH JOIN         |      | 2000 | 68000|  61  (2)   | 00:00:01    |   
 |* 2 |  TABLE ACCESS FULL| T2   |  200 | 3400  |  30  (0)   | 00:00:01    |   
 |* 3 |  TABLE ACCESS FULL| T1   |  400 | 6800  |  30  (0)   | 00:00:01    |   
 ---------------------------------------------------------------------------  
   
 Predicate Information (identified by operation id):   
 ---------------------------------------------------  
   
   1 - access("T1"."JOIN1"="T2"."JOIN2")   
   2 - filter("T2"."FILTER"=1)   
   3 - filter("T1"."FILTER"=1)  
   

As you can see the cardinality of the join is 2000 (highlighted). Now let's do the math:

JoinCard = Card(T2) * Card(T1) / Greatest(NDV(T1.Join1), NDV(T2.Join2))
JoinCard = 200 * 400 / Greatest(30, 40) = 2000

Which matches what we can see in the execution plan above.

Now let's re-set up our demo (or tweak statistics) in the way that the Number of Distinct Values on the join column in either of tables is 100 rather then 30 or 40 as it is in the initial example, and run the query again. In my case the plan looks like this:

   
 ---------------------------------------------------------------------------   
 | Id | Operation         | Name | Rows | Bytes | Cost (%CPU)| Time        |   
 ---------------------------------------------------------------------------   
 |  0 | SELECT STATEMENT  |      |  808 | 27472 |  61  (2)   | 00:00:01    |   
 |* 1 | HASH JOIN         |      |  808 | 27472 |  61  (2)   | 00:00:01    |   
 |* 2 |  TABLE ACCESS FULL| T2   |  200 | 3400  |  30  (0)   | 00:00:01    |   
 |* 3 |  TABLE ACCESS FULL| T1   |  400 | 6800  |  30  (0)   | 00:00:01    |   
 ---------------------------------------------------------------------------

As you can see the cardinality has changed and now is 808 (I'm using 10.2.0.4 on Linux by the way). Now let's again do the math:

JoinCard = 200 * 400 / Greatest(40, 100) = 800

Obviously the original formula doesn't really work anymore.

I've done the same for NDV 1000 and 10000 and got cardinality of 236 and 200 respectively where expected to get 80 and 8.

Apparently Oracle uses some mechanism to reduce the selectivity (that 1 divided by Greatest of NDVs thing) when it "approaches" the least of two filtered cardinalities, and most importantly never lets it ge less then the least of two cardinalities (in our case 200). Effectively Oracle assumes that the data is aligned the way even after the filtering you would still have a match for each key from the table with lest cardinality.

So how it can be harmful? In order to answer this question let's set up another demo:

 create table t1  
 as  
 SELECT mod(rownum, 100) AS filter,  
        rownum AS join1,  
        lpad(rownum, 10) v1,  
        rpad('x', 100) padding  
  FROM dual  
 CONNECT BY LEVEL <= 1000000;  
   
 create table t2  
 as  
 SELECT rownum as join2,  
        lpad(rownum, 10) v2,  
        rpad('x', 100) padding  
  FROM dual  
 CONNECT BY LEVEL <= 30000;  
   
 create table t3  
 pctfree 90 pctused 10  
 as  
 SELECT rownum as join3,  
        lpad(rownum, 10) v3,  
        rpad('x', 100) padding  
  FROM dual  
 CONNECT BY LEVEL <= 100000;  
   
 create index t3_idx on t3(join3);  
   
 <<< Gather the stats without histogram >>>  

I've created three tables and an index on the third one. The last table is also created with pctfree 90 to make the actual table quite big. Also I have clustered the join columns in tables T1 and T2 the way that after the filtering of T1, there is little overlap of the join columns:

 SQL> SELECT t1.v1,  
  1        t2.v2  
  2   FROM t1,  
  3        t2  
  4  WHERE t1.join1 = t2.join2  
  5    AND t1.filter = 1;  
   
 300 rows selected.  

Now let's run the following query with autotrace on:

 SELECT /* leading(t1 t2 t3) use_hash(t1 t2) use_nl(t3) index(t3)*/  
       t1.v1,  
       t2.v2,  
       t3.v3  
  FROM t1,  
       t2,  
       t3  
 WHERE t1.join1 = t2.join2  
   AND t1.filter = 1  
   AND t3.join3 = t2.join2;  

The result would be as follows (I've removed irrelevant bits of output):

 ----------------------------------------------------------------------------  
 | Id | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time      |  
 ----------------------------------------------------------------------------  
 |  0 | SELECT STATEMENT    |      | 9874  |  491K | 5532  (1)  | 00:01:18  |  
 |* 1 | HASH JOIN           |      | 9874  |  491K | 5532  (1)  | 00:01:18  |  
 |* 2 |  HASH JOIN          |      | 9874  |  337K | 2975  (2)  | 00:00:42  |  
 |* 3 |   TABLE ACCESS FULL | T1   | 9874  |  183K | 2888  (2)  | 00:00:41  |  
 |  4 |   TABLE ACCESS FULL | T2   | 30011 |  468K |  86  (2)   | 00:00:02  |  
 |  5 |  TABLE ACCESS FULL  | T3   |  100K | 1562K | 2555  (1)  | 00:00:36  |  
 ----------------------------------------------------------------------------  
   
 Statistics  
 ----------------------------------------------------------  
      1 recursive calls  
      0 db block gets  
    16617 consistent gets  
    14640 physical reads  
      0 redo size  
    12373 bytes sent via SQL*Net to client  
     559 bytes received via SQL*Net from client  
      21 SQL*Net roundtrips to/from client  
      0 sorts (memory)  
      0 sorts (disk)  
     300 rows processed  

As you can see, Oracle thinks it would be almost a ten thousand rows as a result of the joining T1 to T2 when in fact there are only 300. If we do the math the we'll see that JoinCard = 30k * 10k / Greatest(30k, 1m) = 300 which way smaller then the deemed 10k. As a result Oracle chooses Full Scan on the T3 table and hash join which amounts to 16.5k of consistent gets.

Now let's run the query with the hint (put plus after /*) and see the results:

 --------------------------------------------------------------------------------------  
 | Id | Operation                  | Name   | Rows  | Bytes | Cost (%CPU)| Time       |  
 --------------------------------------------------------------------------------------  
 |  0 | SELECT STATEMENT           |        | 9874  |  491K | 22750  (1)| 00:05:19    |  
 |  1 | TABLE ACCESS BY INDEX ROWID| T3     |   1   |  16   |   2  (0)  | 00:00:01    |  
 |  2 |  NESTED LOOPS              |        | 9874  |  491K | 22750  (1)| 00:05:19    |  
 |* 3 |  HASH JOIN                 |        | 9874  |  337K | 2975  (2) | 00:00:42    |  
 |* 4 |   TABLE ACCESS FULL        | T1     | 9874  |  183K | 2888  (2) | 00:00:41    |  
 |  5 |   TABLE ACCESS FULL        | T2     | 30011 |  468K |  86  (2)  | 00:00:02    |  
 |* 6 |  INDEX RANGE SCAN          | T3_IDX |   1   |       |   1  (0)  | 00:00:01    |  
 --------------------------------------------------------------------------------------  
   
 Statistics  
 ----------------------------------------------------------  
      1 recursive calls  
      0 db block gets  
     9549 consistent gets  
     8881 physical reads  
      0 redo size  
    12373 bytes sent via SQL*Net to client  
     559 bytes received via SQL*Net from client  
      21 SQL*Net roundtrips to/from client  
      0 sorts (memory)  
      0 sorts (disk)  
     300 rows processed  

We've got 9.5k of consistent gets which in this particular example is nearly two fold less.

The things can get even worse (and in case with my production system got) if you have more tables and Oracle starts doing full scans on all of them. Therefore, watch out.