Wednesday, December 12, 2012

11g Group By Selectivity Improvements

Oracle 11g (at least 11.2.0.3, where I'm testing it) has brought us an interesting improvement to the selectivity of a GROUP BY. Here's a little demo:
SQL> CREATE TABLE t1 AS 
  2    SELECT LEVEL AS id1, 
  3           MOD(LEVEL, 20) fil1, 
  4           rpad('x', 1000) padding 
  5      FROM dual 
  6    CONNECT BY LEVEL < 10000 
  7  ;

Table created.

SQL> CREATE TABLE t2 AS 
  2    SELECT LEVEL AS id2, 
  3           MOD(LEVEL, 5) fil2, 
  4           rpad('x', 1000) padding 
  5      FROM dual 
  6    CONNECT BY LEVEL < 10000 
  7  ;

Table created.

REM <<< Gather statistics without histograms >>>
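The exact call isn't reproduced here; something along the lines of the following would do (a sketch mirroring the call used later in this blog, with SIZE 1 preventing histograms):

BEGIN
  dbms_stats.gather_table_stats(ownname => USER, tabname => 'T1',
                                cascade => TRUE,
                                method_opt => 'for all columns size 1');
  dbms_stats.gather_table_stats(ownname => USER, tabname => 'T2',
                                cascade => TRUE,
                                method_opt => 'for all columns size 1');
END;
/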
At this point we've got two tables with about 10K rows each. The columns of interest are fil1 and fil2, with selectivities of 1/20 and 1/5 respectively. Now let's run the following query:
SQL> set autotrace traceonly 
SQL> SELECT 
  2   t1.id1 
  3    FROM t1, 
  4         t2 
  5   WHERE t2.id2 = t1.id1 
  6     AND t1.fil1 = 1 
  7     AND t2.fil2 = 1 
  8   GROUP BY t1.id1;

500 rows selected.


Execution Plan 
---------------------------------------------------------- 
Plan hash value: 3226881135

---------------------------------------------------------------------------- 
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     | 
---------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT    |      |   125 |  1750 |   478   (1)| 00:00:07 | 
|   1 |  HASH GROUP BY      |      |   125 |  1750 |   478   (1)| 00:00:07 | 
|*  2 |   HASH JOIN         |      |   500 |  7000 |   477   (1)| 00:00:07 | 
|*  3 |    TABLE ACCESS FULL| T1   |   500 |  3500 |   238   (0)| 00:00:04 | 
|*  4 |    TABLE ACCESS FULL| T2   |  2000 | 14000 |   238   (0)| 00:00:04 | 
----------------------------------------------------------------------------
As you can see above, Oracle has estimated the HASH GROUP BY as returning 125 rows, even though in reality it returns 500 (which is also the estimate you would get in, say, 10g).
What I think is happening in this particular case(1) (and I verified it empirically by playing around with the selectivities of the fil1 and fil2 columns) is that Oracle applies the selectivities of the fil1 and fil2 columns to the result of the HASH JOIN (Id=2) in the following way:
GROUP BY CARD = JOIN CARD * SEL(t1.fil1) / SEL(t2.fil2) 
and if we put numbers in place it would be:
GROUP BY CARD = 500 * 1/20 / (1/5) = 500 / 20 * 5 = 125 
If that's the case, I'm not sure what the rationale behind it is. What I do know is that this optimisation can be disabled by setting _optimizer_improve_selectivity to false:
SQL> SELECT 
  2  --+ OPT_PARAM('_optimizer_improve_selectivity' 'false') 
  3   t1.id1 
  4    FROM t1, 
  5         t2 
  6   WHERE t2.id2 = t1.id1 
  7     AND t1.fil1 = 1 
  8     AND t2.fil2 = 1 
  9   GROUP BY t1.id1;

500 rows selected.


Execution Plan 
---------------------------------------------------------- 
Plan hash value: 3226881135

---------------------------------------------------------------------------- 
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     | 
---------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT    |      |   500 |  7000 |   478   (1)| 00:00:07 | 
|   1 |  HASH GROUP BY      |      |   500 |  7000 |   478   (1)| 00:00:07 | 
|*  2 |   HASH JOIN         |      |   500 |  7000 |   477   (1)| 00:00:07 | 
|*  3 |    TABLE ACCESS FULL| T1   |   500 |  3500 |   238   (0)| 00:00:04 | 
|*  4 |    TABLE ACCESS FULL| T2   |  2000 | 14000 |   238   (0)| 00:00:04 | 
----------------------------------------------------------------------------
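The OPT_PARAM hint scopes the change to a single statement; presumably the same could be done at session level (a sketch, output not shown here):

ALTER SESSION SET "_optimizer_improve_selectivity" = FALSE;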
I'm not sure, though, what else gets switched off in this case. Also, if you've got primary keys on your id columns, the optimisation does not kick in:
SQL> alter table t1 add constraint pk_t1 primary key (id1);

Table altered.

SQL> alter table t2 add constraint pk_t2 primary key (id2);

Table altered.

SQL> SELECT t1.id1 
  2    FROM t1, 
  3         t2 
  4   WHERE t2.id2 = t1.id1 
  5     AND t1.fil1 = 1 
  6     AND t2.fil2 = 1 
  7   GROUP BY t1.id1;

500 rows selected.


Execution Plan 
---------------------------------------------------------- 
Plan hash value: 3226881135

---------------------------------------------------------------------------- 
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     | 
---------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT    |      |   500 |  7000 |   478   (1)| 00:00:07 | 
|   1 |  HASH GROUP BY      |      |   500 |  7000 |   478   (1)| 00:00:07 | 
|*  2 |   HASH JOIN         |      |   500 |  7000 |   477   (1)| 00:00:07 | 
|*  3 |    TABLE ACCESS FULL| T1   |   500 |  3500 |   238   (0)| 00:00:04 | 
|*  4 |    TABLE ACCESS FULL| T2   |  2000 | 14000 |   238   (0)| 00:00:04 | 
----------------------------------------------------------------------------
Obviously this feature can be (and in the case of one of my queries was) a problem when you've got a complex query with one or more massive GROUP BY clauses: the cardinality estimate drops significantly, and as a result a suboptimal join method may be picked, especially when you are grouping by more than one column. The existence of filters is a mandatory condition for the optimisation to kick in (a quick check of that is sketched below).
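To see the dependency on the filters for yourself (a sketch against the same two tables; output not reproduced here), drop the single-column filters and compare the GROUP BY estimate with the join cardinality:

SELECT t1.id1
  FROM t1,
       t2
 WHERE t2.id2 = t1.id1
 GROUP BY t1.id1;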
Finally, I want to mention bug #12591379 on MOS, which says the cardinality of a GROUP BY may drop severely if it follows a HASH JOIN with an OR condition. The bug is said to be fixed in version 12.1.

(1) I believe my assumption is oversimplified and the actual algorithm is more sophisticated: at the very least, if you filter t2.fil2 on a value outside its recorded min-max bounds, Oracle produces a different estimate.
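If you want to experiment with that yourself, the recorded boundaries can be checked in the dictionary (a quick sketch; note that LOW_VALUE and HIGH_VALUE are stored in Oracle's internal RAW format):

SELECT column_name, num_distinct, low_value, high_value
  FROM user_tab_col_statistics
 WHERE table_name = 'T2'
   AND column_name = 'FIL2';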

Wednesday, October 10, 2012

SQL Loader's Append

I've recently bumped into a peculiarity of SQL*Loader, specifically its APPEND mode. SQL*Loader's LOAD APPEND works in the same way as a DML INSERT /*+ APPEND */, that is, it makes Oracle allocate new blocks above the HWM, right? Well, not quite.
Here's a little demo on 10g R2 on Solaris:
SQL> create table t1
  2  (
  3    id  number(12),
  4    padding varchar2(255)
  5  );

Table created.

SQL> INSERT /*+ APPEND*/
  2  INTO t1
  3    SELECT rownum,
  4           rpad('x', 255)
  5      FROM dual
  6    CONNECT BY LEVEL <= 1000;

1000 rows created.

SQL> commit;

Commit complete.

SQL> delete from t1;

1000 rows deleted.

SQL> commit;

Commit complete.

SQL> set serveroutput on
SQL> DECLARE
  2    free_blks NUMBER;
  3  BEGIN
  4    dbms_space.free_blocks(USER, 'T1', 'TABLE', 0, free_blks);
  5    dbms_output.put_line(free_blks);
  6  END;
  7  /
19                                                                            

PL/SQL procedure successfully completed.
So far I've created a small table, populated it with a thousand rows, then deleted all the data and committed the transaction. Then I checked the number of free blocks in the freelist - it is 19.

Now let's load some data via SQL Loader using a control file as follows:
OPTIONS(readsize=20000000, bindsize=20000000, rows=8000)
LOAD DATA
INFILE *
APPEND INTO TABLE T1
FIELDS TERMINATED BY "|" optionally enclosed by '"' TRAILING NULLCOLS
(
ID,
PADDING
)
BEGINDATA
1|"x     <<254 whitespaces="whitespaces">>"
...
I've removed the data section to save space, but you get the idea. Now let's run it:
SQL> host sqlldr userid=user/pwd@db control=ldr.ctl log=ldr.log

SQL> DECLARE
  2    free_blks NUMBER;
  3  BEGIN
  4    dbms_space.free_blocks(USER, 'T1', 'TABLE', 0, free_blks);
  5    dbms_output.put_line(free_blks);
  6  END;
  7  /
1                                                                              

PL/SQL procedure successfully completed.

SQL> BEGIN
  2    dbms_stats.gather_table_stats(ownname => USER, tabname => 'T1',
  3                                  cascade => TRUE,
  4                                  method_opt => 'for all columns size 1');
  5  END;
  6  /

PL/SQL procedure successfully completed.

SQL> select blocks from user_tab_statistics where table_name = 'T1';

    BLOCKS                                                                    
----------                                                                     
        19                                                                     
As you can see, the table still consists of the same 19 blocks, and only 1 of them is now listed as free on the free list: the load has reused the blocks freed by the delete instead of allocating new blocks above the HWM.
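For comparison, re-running the same load in direct path mode (a sketch; I haven't reproduced its output here) should behave like INSERT /*+ APPEND */ and load above the HWM instead of reusing the free blocks:

SQL> host sqlldr userid=user/pwd@db control=ldr.ctl log=ldr.log direct=true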

Saturday, October 6, 2012

XPath Query Rewrite

In this entry I want to share my experience of tracking down an interesting problem arising from an upgrade to Oracle 11g.
A few days after migrating a database to Oracle 11.2.0.3, I happened to check the alert log and found the following:

DDE: Problem Key 'ORA 4030' was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes

ORA-4030 happens when a server process fails to allocate a portion of PGA memory.
I checked the application and the ETL jobs but nothing seemed to be failing. Therefore I decided to check the PGA memory used by different sessions:

SELECT *
  FROM v$process p,
       v$session s
 WHERE p.addr = s.paddr
 ORDER BY p.pga_alloc_mem DESC;


The first line I saw showed 4293618712 bytes in the PGA_ALLOC_MEM column. That process had allocated 4GB of PGA memory, which is the maximum it could get out of Linux. Using the session information I isolated the code causing the issue. It turned out to be a massive (500+ line) SQL statement heavily utilising the XMLElement and XMLForest functions.
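With the SID and SQL_ID from V$SESSION at hand, the statement text itself can be pulled out of the shared pool (a sketch):

SELECT sql_fulltext
  FROM v$sql
 WHERE sql_id = :sql_id;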

A quick check on Metalink revealed bug #13477790, which matched our situation (and was later confirmed by Oracle Support to be the culprit). It said the problem was due precisely to the usage of XML functions. What I couldn't understand from the description is how Oracle had managed to break code that was working perfectly on 10g.

So I decided to reproduce the problem. I isolated the code, and when I ran it I noticed that Oracle did not even start returning rows from the failing query. Ergo, I decided to trace the CBO.

What I found in the 10053 trace file was lots of attempts to transform the query, after which the process died with ORA-4030.
In parallel I was googling what had changed in 11g in terms of XDB, and general information about XDB (as I hadn't worked with it before), and came across the hint NO_XML_QUERY_REWRITE. It relates to a feature called XPath rewrite, whereby Oracle rewrites a query to optimise XML query execution. In my case no rewriting could be useful, as we don't use XDB per se: we just construct XML from plain tables and upload it as a CLOB. So I tried the hint, and it helped.
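For illustration, this is roughly how the hint is applied (a sketch against a hypothetical EMP table, not the actual production statement):

SELECT /*+ NO_XML_QUERY_REWRITE */
       XMLElement("employee",
                  XMLForest(e.empno AS "id",
                            e.ename AS "name")) AS employee_xml
  FROM emp e;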

Eventually we applied a patch to fix the problem permanently.



Friday, September 14, 2012

ANSI SQL Bug

I've never been a fan of using ANSI SQL syntax in Oracle, for various reasons. Here is one more to add to the list.
Have a look at the following query:
 SELECT dummy 
  FROM dual t1, 
       dual t2, 
       dual t3 
 WHERE t2.dummy = t1.dummy 
   AND t3.dummy = t1.dummy; 
If you run it you will get ORA-00918: column ambiguously defined, because indeed, which table's DUMMY is Oracle supposed to return? Let's rewrite the query using ANSI syntax:
 SELECT dummy 
  FROM dual t1 
  JOIN dual t2 
    ON (t1.dummy = t2.dummy) 
  JOIN dual t3 
    ON (t1.dummy = t3.dummy);
If you run it on Oracle 10.2.0.4 (or older), you will see the following output:
 SQL> SELECT dummy 
  2    FROM dual t1 
  3    JOIN dual t2 
  4      ON (t1.dummy = t2.dummy) 
  5    JOIN dual t3 
  6      ON (t1.dummy = t3.dummy);

D 
- 
X

SQL>
Oracle executes it without any errors. So which DUMMY has it actually taken? Let's have a look at the Unparsed Query section of a 10053 trace to find the answer (after some formatting):

SQL:******* UNPARSED QUERY IS ******* 
SELECT "T1"."DUMMY" "QCSJ_C000000000300000",
       "T2"."DUMMY" "QCSJ_C000000000300001" 
  FROM "SYS"."DUAL" "T1",
       "SYS"."DUAL" "T2" 
 WHERE "T1"."DUMMY"="T2"."DUMMY"

Query block (0x2aaaac888538) unchanged 
Query block (0x2aaaac887918) before join elimination: 
SQL:******* UNPARSED QUERY IS ******* 
SELECT "from$_subquery$_003"."QCSJ_C000000000300000_0" "QCSJ_C000000000300000",
       "from$_subquery$_003"."QCSJ_C000000000300001_1" "QCSJ_C000000000300001",
       "T3"."DUMMY" "DUMMY" 
  FROM  (SELECT "T1"."DUMMY" "QCSJ_C000000000300000_0",
                "T2"."DUMMY" "QCSJ_C000000000300001_1"
           FROM "SYS"."DUAL" "T1",
                "SYS"."DUAL" "T2" 
          WHERE "T1"."DUMMY"="T2"."DUMMY") "from$_subquery$_003",
        "SYS"."DUAL" "T3" 
 WHERE "from$_subquery$_003"."QCSJ_C000000000300000_0"="T3"."DUMMY"
Apparently Oracle joins the first two tables and combines them into an inline view (the first query block above). At this point it must take all the columns from both tables, because it doesn't know which of them will be used later on for further joins or in the select list. And since some columns might be (and in my case are) named identically, Oracle gives them unique aliases.
Then it joins the third table (by the way, that's why the error doesn't happen with only two tables); see the second query block in the snippet. Because this is the last table to be joined and all the previous tables' columns have already got unique aliases, Oracle doesn't give unique aliases to the t3 table's columns. Finally, it restricts the column list to what the user requested, and at that point there is only one DUMMY column left, the one from table t3.
This behavior looks like a really nasty bug, as it conceals errors during development. I bumped into it (like many others) when migrating to Oracle 11g, where the bug is fixed and an ANSI query like this starts failing. A quick search on Metalink revealed note ID 835701.1, which says it has been fixed in 10.2.0.5 onwards.
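Regardless of the version, the robust fix is simply to qualify the column with the table you actually mean, so the statement is unambiguous everywhere:

SELECT t1.dummy
  FROM dual t1
  JOIN dual t2
    ON (t1.dummy = t2.dummy)
  JOIN dual t3
    ON (t1.dummy = t3.dummy);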

Friday, June 1, 2012

Join Cardinality Changes?

I recently stumbled upon a curious behaviour of the Oracle optimizer while trying to tune a poorly performing query. It looks like Oracle has changed the way the CBO estimates the cardinality of a join, which according to all the sources I've seen should be calculated as the cardinality of the first table after filtering, multiplied by the cardinality of the second table after filtering, divided by the greater of the numbers of distinct values of the columns used in the join condition. Quite a mouthful, so let's set up a demo (the code is taken from the invaluable Cost Based Oracle Fundamentals by Jonathan Lewis):

 drop table t1 purge;   
 drop table t2 purge;  
   
 create table t1   
  as   
  SELECT MOD(rownum, 25) + 1 AS filter,   
      MOD(rownum, 30) + 1 AS join1,   
      lpad(rownum, 10) v1,   
      rpad('x', 100) padding   
   FROM dual   
  CONNECT BY LEVEL <= 10000   
  ;  
   
 create table t2   
  as   
  SELECT MOD(rownum, 50) + 1 as filter,   
      MOD(rownum, 40) + 1 as join2,   
      lpad(rownum, 10) v2,   
      rpad('x', 100) padding   
   FROM dual   
  CONNECT BY LEVEL <= 10000   
  ;  
   
 DECLARE   
  l sys.odcivarchar2list := sys.odcivarchar2list('T1', 'T2');   
 BEGIN   
  FOR j IN l.first .. l.last   
  LOOP   
   dbms_stats.gather_table_stats(ownname => USER, tabname => l(j),   
                  cascade => TRUE,   
                  method_opt => 'for all columns size 1');   
  END LOOP;   
 END;   
 /  

Now set autotrace and run the following query:

 set autotrace traceonly explain  
   
 SELECT t1.v1,   
     t2.v2   
  FROM t1,   
     t2   
  WHERE t1.join1 = t2.join2   
   AND t1.filter = 1   
   AND t2.filter = 1;  

As a result you will see a plan like the one below:

 Execution Plan   
 ----------------------------------------------------------   
 Plan hash value: 2959412835  
   
 ---------------------------------------------------------------------------   
 | Id | Operation         | Name | Rows | Bytes | Cost (%CPU)| Time        |   
 ---------------------------------------------------------------------------   
 |  0 | SELECT STATEMENT  |      | 2000 | 68000 |  61  (2)   | 00:00:01    |   
 |* 1 | HASH JOIN         |      | 2000 | 68000 |  61  (2)   | 00:00:01    |   
 |* 2 |  TABLE ACCESS FULL| T2   |  200 | 3400  |  30  (0)   | 00:00:01    |   
 |* 3 |  TABLE ACCESS FULL| T1   |  400 | 6800  |  30  (0)   | 00:00:01    |   
 ---------------------------------------------------------------------------  
   
 Predicate Information (identified by operation id):   
 ---------------------------------------------------  
   
   1 - access("T1"."JOIN1"="T2"."JOIN2")   
   2 - filter("T2"."FILTER"=1)   
   3 - filter("T1"."FILTER"=1)  
   

As you can see, the estimated cardinality of the join (the HASH JOIN line) is 2000. Now let's do the math:

JoinCard = Card(T2) * Card(T1) / Greatest(NDV(T1.Join1), NDV(T2.Join2))
JoinCard = 200 * 400 / Greatest(30, 40) = 2000

Which matches what we can see in the execution plan above.

Now let's re-create the demo (or simply tweak the statistics) so that the number of distinct values of the join column in one of the tables is 100, rather than the 30 or 40 of the initial example, and run the query again.
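One way to get there without rebuilding the table (a sketch; overriding T1.JOIN1 here is consistent with the Greatest(40, 100) used in the calculation below) is to set the column statistics directly:

BEGIN
  dbms_stats.set_column_stats(ownname => USER, tabname => 'T1',
                              colname => 'JOIN1',
                              distcnt => 100, density => 1/100);
END;
/

With the number of distinct values adjusted, the plan in my case looks like this: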

   
 ---------------------------------------------------------------------------   
 | Id | Operation         | Name | Rows | Bytes | Cost (%CPU)| Time        |   
 ---------------------------------------------------------------------------   
 |  0 | SELECT STATEMENT  |      |  808 | 27472 |  61  (2)   | 00:00:01    |   
 |* 1 | HASH JOIN         |      |  808 | 27472 |  61  (2)   | 00:00:01    |   
 |* 2 |  TABLE ACCESS FULL| T2   |  200 | 3400  |  30  (0)   | 00:00:01    |   
 |* 3 |  TABLE ACCESS FULL| T1   |  400 | 6800  |  30  (0)   | 00:00:01    |   
 ---------------------------------------------------------------------------

As you can see, the cardinality has changed and is now 808 (I'm using 10.2.0.4 on Linux, by the way). Now let's do the math again:

JoinCard = 200 * 400 / Greatest(40, 100) = 800

Obviously the original formula doesn't really work anymore.

I've done the same for NDVs of 1000 and 10000 and got cardinalities of 236 and 200 respectively, where I expected to get 80 and 8.

Apparently Oracle uses some mechanism to adjust the join selectivity (that 1 divided by the greatest NDV thing) as the join cardinality "approaches" the lower of the two filtered cardinalities, and most importantly it never lets the estimate drop below that lower cardinality (in our case 200). Effectively, Oracle assumes the data is aligned in such a way that, even after the filtering, you would still have a match for every key from the table with the lower cardinality.
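Putting the raw formula next to the observed estimates makes that floor easier to see:

NDV =   100: 200 * 400 / 100   = 800   -- plan shows 808
NDV =  1000: 200 * 400 / 1000  =  80   -- plan shows 236
NDV = 10000: 200 * 400 / 10000 =   8   -- plan shows 200 (the lower filtered cardinality)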

So how can this be harmful? To answer this question let's set up another demo:

 create table t1  
 as  
 SELECT mod(rownum, 100) AS filter,  
        rownum AS join1,  
        lpad(rownum, 10) v1,  
        rpad('x', 100) padding  
  FROM dual  
 CONNECT BY LEVEL <= 1000000;  
   
 create table t2  
 as  
 SELECT rownum as join2,  
        lpad(rownum, 10) v2,  
        rpad('x', 100) padding  
  FROM dual  
 CONNECT BY LEVEL <= 30000;  
   
 create table t3  
 pctfree 90 pctused 10  
 as  
 SELECT rownum as join3,  
        lpad(rownum, 10) v3,  
        rpad('x', 100) padding  
  FROM dual  
 CONNECT BY LEVEL <= 100000;  
   
 create index t3_idx on t3(join3);  
   
 <<< Gather the stats without histogram >>>  

I've created three tables and an index on the third one. The last table is also created with pctfree 90 to make the table itself quite big. I have also set up the join columns in tables T1 and T2 in such a way that, after filtering T1, there is little overlap between the join columns:

 SQL> SELECT t1.v1,
   2         t2.v2
   3    FROM t1,
   4         t2
   5   WHERE t1.join1 = t2.join2
   6     AND t1.filter = 1;
   
 300 rows selected.  

Now let's run the following query with autotrace on:

 SELECT /* leading(t1 t2 t3) use_hash(t1 t2) use_nl(t3) index(t3)*/  
       t1.v1,  
       t2.v2,  
       t3.v3  
  FROM t1,  
       t2,  
       t3  
 WHERE t1.join1 = t2.join2  
   AND t1.filter = 1  
   AND t3.join3 = t2.join2;  

The result would be as follows (I've removed irrelevant bits of output):

 ----------------------------------------------------------------------------  
 | Id | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time      |  
 ----------------------------------------------------------------------------  
 |  0 | SELECT STATEMENT    |      | 9874  |  491K | 5532  (1)  | 00:01:18  |  
 |* 1 | HASH JOIN           |      | 9874  |  491K | 5532  (1)  | 00:01:18  |  
 |* 2 |  HASH JOIN          |      | 9874  |  337K | 2975  (2)  | 00:00:42  |  
 |* 3 |   TABLE ACCESS FULL | T1   | 9874  |  183K | 2888  (2)  | 00:00:41  |  
 |  4 |   TABLE ACCESS FULL | T2   | 30011 |  468K |  86  (2)   | 00:00:02  |  
 |  5 |  TABLE ACCESS FULL  | T3   |  100K | 1562K | 2555  (1)  | 00:00:36  |  
 ----------------------------------------------------------------------------  
   
 Statistics  
 ----------------------------------------------------------  
      1 recursive calls  
      0 db block gets  
    16617 consistent gets  
    14640 physical reads  
      0 redo size  
    12373 bytes sent via SQL*Net to client  
     559 bytes received via SQL*Net from client  
      21 SQL*Net roundtrips to/from client  
      0 sorts (memory)  
      0 sorts (disk)  
     300 rows processed  

As you can see, Oracle thinks joining T1 to T2 will produce almost ten thousand rows, when in fact there are only 300. If we do the math we'll see that JoinCard = 30K * 10K / Greatest(30K, 1M) = 300, which is way smaller than the estimated 10K. As a result, Oracle chooses a full scan of the T3 table and a hash join, which amounts to 16.5K consistent gets.

Now let's run the query with the hint enabled (put a plus sign after /*) and see the results:

 --------------------------------------------------------------------------------------  
 | Id | Operation                  | Name   | Rows  | Bytes | Cost (%CPU)| Time       |  
 --------------------------------------------------------------------------------------  
 |  0 | SELECT STATEMENT           |        | 9874  |  491K | 22750  (1)| 00:05:19    |  
 |  1 | TABLE ACCESS BY INDEX ROWID| T3     |   1   |  16   |   2  (0)  | 00:00:01    |  
 |  2 |  NESTED LOOPS              |        | 9874  |  491K | 22750  (1)| 00:05:19    |  
 |* 3 |  HASH JOIN                 |        | 9874  |  337K | 2975  (2) | 00:00:42    |  
 |* 4 |   TABLE ACCESS FULL        | T1     | 9874  |  183K | 2888  (2) | 00:00:41    |  
 |  5 |   TABLE ACCESS FULL        | T2     | 30011 |  468K |  86  (2)  | 00:00:02    |  
 |* 6 |  INDEX RANGE SCAN          | T3_IDX |   1   |       |   1  (0)  | 00:00:01    |  
 --------------------------------------------------------------------------------------  
   
 Statistics  
 ----------------------------------------------------------  
      1 recursive calls  
      0 db block gets  
     9549 consistent gets  
     8881 physical reads  
      0 redo size  
    12373 bytes sent via SQL*Net to client  
     559 bytes received via SQL*Net from client  
      21 SQL*Net roundtrips to/from client  
      0 sorts (memory)  
      0 sorts (disk)  
     300 rows processed  

We've got 9.5K consistent gets, which in this particular example is nearly a twofold reduction.

Things can get even worse (and in the case of my production system did) if you have more tables and Oracle starts doing full scans on all of them. Therefore, watch out.