UNDERSTANDING CORRELATED SUBQUERIES
© Copyright Oriole Corporation, 2001
There
must be literally thousands of statements in use today which would run faster
(in some cases incredibly so) if a correlated subquery had been used. It's all
too easy to write code which is syntactically correct, works perfectly well
and runs quickly on test data. When those data volumes grow as they are apt
to do in real life, jobs can take longer and longer to complete until sometimes
they become totally unacceptable.
We
have seen for example a sixteen hour job, which after a five minute rewrite,
thereafter regularly completed in ten minutes. Even more dramatically (although
at first it doesn't sound like it) a four minute job was reduced to under two
seconds.
The
problem is of course is that if you've got a job which takes sixteen hours you
have to do something about it, whereas most people will live with a job that
takes four minutes when it should take two seconds. The danger is of course
that if that piece of code is run 15 times per day, it's taking up one hour
when it should only be taking thirty seconds. If it's run 10 times an hour then
it's taking 16 hours per day instead of 8 minutes.
What
do you do if you have these performance problems?
One
popular solution is to invest $000s in more hardware.
Well,
invest isn't really the right word for squandering money unnecessarily is it?
The
other solution is to seek and destroy the offending code (well, rewrite it anyway).
How
do you identify which statements need to be examined?
Oriole's
OriSnoop will identify the statements which use the most resources. It doesn't
necessarily mean that the statement which uses the most resources is inefficient
but it's the best place to start.
Having
identified the candidate statements how do we know if a correlated subquery
is appropriate?
The
dramatic speed advantage of a correlated subquery primarily comes when the correlated
subquery is run in conjunction with a NOT EXISTS used in preference to an un-correlated
query with a NOT IN. While it's true that this construct isn't always faster,
in the situations where it is faster, as has been said already, it sometimes
seems almost impossibly so.
The
secret of why a correlated subquery and NOT EXISTS can be so much
faster lies in the indexes. A NOT IN will never use an index.
If
an un-correlated subquery is used with a NOT IN then, as normal, the subquery
is run first and every row returned by the main query will examine every row
returned by the subquery to see if the sought value is there. So if the subquery
returns 500,000 rows and the main query returns 200,000 rows and you only want
the values from the main query which aren't in the subquery, then there will
be 100bn comparisons. If you use a correlated subquery with a 'not exists' and
the sought value is indexed, then for each of the 200,000 rows returned by the
subquery the index will be interrogated and it will be quickly established whether
the value is present or not. 200,000 vs. 100,000,000,000.
A
correlated subquery is one which changes the order in which queries are run.
Normally an inner or subquery runs to completion and the result of the subquery
is 'plugged into' the outer or main query which is then run.
With
a correlated subquery however, the order in which the queries are run is reversed.
The outer query is fired first and as it returns each row, the subquery is run
using information supplied by the result
of the main query.
It
can be a very difficult concept to get across without examples so here's one
I hope will help.
If
you imagine that you are a large telecom company. By the nature of your business,
in many cases your suppliers will also be your customers. Imagine that you want
to want to write to those suppliers
who aren't also customers, to ask them why you should continue to do business
with them if they don't use your services (I know that this would never happen
but for the sake of argument :-)
The
code which almost writes itself is:
select s.code#
from supplier s
where s.code# not in (select c.code#
from customer c);
Remember that the 'not in' will result in the index on customer.code being ignored,
so the customer table will be searched, all rows will be returned and then the
code in every row on the supplier table will be returned and compared with EVERY
value returned by the subquery. This statement could be rewritten very quickly
as:
select s.code#
from supplier main
where not exists (select 'anything you like'
from customer c
where c.code = main.code);
NB
The table alias of "main" is only used to show how the correlation is achieved
but while sometimes it's not even necessary to have an alias at all it clarifies
things if aliases are used. In fact it's good programming practice to use aliases
in every situation where more than one table is referred to in a statement,
since it saves time when parsing. It also prevents you from unintentionally
"correlating" a subquery, which might happen if in your subquery, you inadvertently
refer to a column which doesn't exist in that subquery's table but is present
in the outer query's table.
Anyway,
back to the explanation of the correlated subquery above. In this case the outer
query will run first and the code returned by the first row will be passed to
the subquery and the subquery will run using that value. Using the index the
value will either be found to
be in the customer table index or not. If it isn't then the 'not exists' will
be found to be true and the main query row will be output. If the value was
found in the index then the 'not exists' will be false ('exists' would be true)
and that main query row will be discarded.
As
I indicated above it's not as straightforward as saying (to paraphrase George
Orwell) "'not in' bad 'not exists' good" but it is true to say that when not
in's good it's very very good and when it's bad it's horrid :-)