Index usage with slow query

Discussion:

Rebecca Clarke

2014-07-23 09:45:56 UTC

Hi all,

Looking for some advice regarding a slow query I have and indexing.

I'm using postgresql 9.1 and this is my table that has around 6800000 rows:

CREATE TABLE mytable
(
class character varying,
floor character varying,
source_id integer,
the_geom geometry
)
WITH (
OIDS=TRUE
);

INDEX idx_source_id
USING btree
(source_id);

INDEX idx_the_geom_gist
USING gist
(the_geom);

This table is constantly hit with the below query (not always the same
values in the where). The only difference between queries are the values in
the where clause:

SELECT the_geom,oid from mytable
WHERE
the_geom && ST_GeomFromText('POLYGON((529342.334095833
180696.221733333,529342.334095833 181533.44595,530964.336820833
181533.44595,530964.336820833 180696.221733333,529342.334095833
180696.221733333))',find_srid('','mytable','the_geom'))
AND
(floor = 'gf' AND source_id = '689' AND class = 'General')

As the table has increased in size, this query has become slower, so I made
this index:

INDEX idx_floor_sourceid_class
USING btree
(floor, source_id, class);

When I run an EXPLAIN and EXPLAIN ANALYZE the query isn't using this new
index.

Sometimes it uses just idx_the_geom_gist

other times it uses idx_the_geom_gist and idx_source_id

I don't understand why it's inconsistent in its use of indexes when the
query is always the same structure, just different where clause values, and
I don't understand why it's not using the new index either.

Would love some help with this. I'm not sure where I'm going wrong.

Thanks in advance.

Bill Moran

2014-07-23 14:57:01 UTC

Permalink

On Wed, 23 Jul 2014 10:45:56 +0100

Post by Rebecca Clarke
Hi all,
Looking for some advice regarding a slow query I have and indexing.
CREATE TABLE mytable
(
class character varying,
floor character varying,
source_id integer,
the_geom geometry
)
WITH (
OIDS=TRUE
);
INDEX idx_source_id
USING btree
(source_id);
INDEX idx_the_geom_gist
USING gist
(the_geom);
This table is constantly hit with the below query (not always the same
values in the where). The only difference between queries are the values in
SELECT the_geom,oid from mytable
WHERE
the_geom && ST_GeomFromText('POLYGON((529342.334095833
180696.221733333,529342.334095833 181533.44595,530964.336820833
181533.44595,530964.336820833 180696.221733333,529342.334095833
180696.221733333))',find_srid('','mytable','the_geom'))
AND
(floor = 'gf' AND source_id = '689' AND class = 'General')
As the table has increased in size, this query has become slower, so I made
INDEX idx_floor_sourceid_class
USING btree
(floor, source_id, class);
When I run an EXPLAIN and EXPLAIN ANALYZE the query isn't using this new
index.
Sometimes it uses just idx_the_geom_gist
other times it uses idx_the_geom_gist and idx_source_id
I don't understand why it's inconsistent in its use of indexes when the
query is always the same structure, just different where clause values, and
I don't understand why it's not using the new index either.

It depends on the data. The planner will make estimates on what the fastest
way to execute will be based on a lot of things, one of which is how helpful
an index is really expected to be. Since your indexes aren't arranged to
allow an index-only scan (although I don't remember if 9.1 had index-only
scans yet ...) it will have to use the index to narrow down the rows, then
load up the rows and filter them further (you didn't provide explain output,
but I'll bet a dozen nickels that's what it says). So if the values in
source_id are unique enough that the planner doesn't think that
idx_floor_sourceid_class will narrow the results any better than
idx_source_id, it will use the former because it's a smaller index and will
require less disk fetches to load it.

Of course, without explain output, I'm assuming a lot. But the basic operation
still stands, indexes aren't always guaranteed to be faster than other types of
access. And depending on the distribution of the data, some indexes might be
faster with some fetches than with others.

The key is not whether it's using the index or not, it's whether it's getting the
fastest plan or not. The first step in ensuring that is to make sure the table
is getting analyzed frequently enough, otherwise the stats that the planner uses
to predict will be off and it will often choose poor plans. The next step would
be to isolate specific instances that you're suspicious of and test to see if the
planner really is getting the best plan. Hopefully you have a test database where
you can copy the data and add/remove indexes at will. That type of easter egg
hunt may not be necessary, though. EXPLAIN ANALYZE can often tell you if the plan
is bad by showing you where estimated times vary wildly from actual times.

Hope this helps, but before you worry too much about it, I'd suggest asking
yourself 1 question: is the performance at an acceptable level, even if you don't
understand the rational behind the planner's choice? Of course, that may not be
important if you're asking the question just to understand better.

--
Bill Moran <***@potentialtech.com>
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Rebecca Clarke

2014-07-25 16:20:57 UTC

Permalink

Hi Bill,

Thanks for the reply. Here's the EXPLAIN output of a couple of the queries:

SELECT the_geom,oid from mytable
where the_geom && ST_GeomFromText('POLYGON((529342.334095833
180696.221733333,529342.334095833 181533.44595,530964.336820833
181533.44595,530964.336820833 180696.221733333,529342.334095833
180696.221733333))',find_srid('','mytable','the_geom'))
and (floor = 'gf' AND source_id = '16701' AND class = 'General')

Bitmap Heap Scan on mytable (cost=1212.62..1580.71 rows=177 width=612)
Recheck Cond: ((source_id = 16701) AND (the_geom &&
'0103000020346C00000100000005000000'::geometry))
Filter: (((floor)::text = 'gf'::text) AND ((class)::text =
'General'::text))
-> BitmapAnd (cost=1212.62..1212.62 rows=184 width=0)
-> Bitmap Index Scan on idx_source_id (cost=0.00..433.25
rows=33149 width=0)
Index Cond: (source_id = 16701)
-> Bitmap Index Scan on idx_the_geom_gist (cost=0.00..779.03
rows=38654 width=0)
Index Cond: (the_geom &&
'0103000020346C00000100000005000000'::geometry)

SELECT the_geom,oid from mytable
where the_geom && ST_GeomFromText('POLYGON((415995.148624997
433101.445479165,415995.148624997 433326.320145835,416466.572625003
433326.320145835,416466.572625003 433101.445479165,415995.148624997
433101.445479165))',find_srid('','mytable','the_geom'))
and (floor = 'gf' AND source_id = '20427' AND class = 'General')

Bitmap Heap Scan on mytable (cost=21.41..23.44 rows=1 width=612)
Recheck Cond: ((the_geom &&
'0103000020346C00000100000005000000'::geometry) AND (source_id = 20427))
Filter: (((floor)::text = 'gf'::text) AND ((class)::text =
'General'::text))
-> BitmapAnd (cost=21.41..21.41 rows=1 width=0)
-> Bitmap Index Scan on idx_the_geom_gist (cost=0.00..4.18
rows=141 width=0)
Index Cond: (the_geom &&
'0103000020346C00000100000005000000'::geometry)
-> Bitmap Index Scan on idx_source_id (cost=0.00..16.97
rows=1112 width=0)
Index Cond: (source_id = 20427)

There is no unique field in the table. And unfortunately the performance is
unacceptable. The logs show that the first query when it was execute took
70466.757 ms where as the second one took 11032.459 ms.

I've begun to create a duplicate environment to play with so hopefully I'm
able to weed out a solution.

Post by Bill Moran
On Wed, 23 Jul 2014 10:45:56 +0100

Post by Rebecca Clarke
Hi all,
Looking for some advice regarding a slow query I have and indexing.
I'm using postgresql 9.1 and this is my table that has around 6800000
CREATE TABLE mytable
(
class character varying,
floor character varying,
source_id integer,
the_geom geometry
)
WITH (
OIDS=TRUE
);
INDEX idx_source_id
USING btree
(source_id);
INDEX idx_the_geom_gist
USING gist
(the_geom);
This table is constantly hit with the below query (not always the same
values in the where). The only difference between queries are the values

Post by Rebecca Clarke
SELECT the_geom,oid from mytable
WHERE
the_geom && ST_GeomFromText('POLYGON((529342.334095833
180696.221733333,529342.334095833 181533.44595,530964.336820833
181533.44595,530964.336820833 180696.221733333,529342.334095833
180696.221733333))',find_srid('','mytable','the_geom'))
AND
(floor = 'gf' AND source_id = '689' AND class = 'General')
As the table has increased in size, this query has become slower, so I

made

Post by Rebecca Clarke
INDEX idx_floor_sourceid_class
USING btree
(floor, source_id, class);
When I run an EXPLAIN and EXPLAIN ANALYZE the query isn't using this new
index.
Sometimes it uses just idx_the_geom_gist
other times it uses idx_the_geom_gist and idx_source_id
I don't understand why it's inconsistent in its use of indexes when the
query is always the same structure, just different where clause values,

and

Post by Rebecca Clarke
I don't understand why it's not using the new index either.

It depends on the data. The planner will make estimates on what the fastest
way to execute will be based on a lot of things, one of which is how helpful
an index is really expected to be. Since your indexes aren't arranged to
allow an index-only scan (although I don't remember if 9.1 had index-only
scans yet ...) it will have to use the index to narrow down the rows, then
load up the rows and filter them further (you didn't provide explain output,
but I'll bet a dozen nickels that's what it says). So if the values in
source_id are unique enough that the planner doesn't think that
idx_floor_sourceid_class will narrow the results any better than
idx_source_id, it will use the former because it's a smaller index and will
require less disk fetches to load it.
Of course, without explain output, I'm assuming a lot. But the basic operation
still stands, indexes aren't always guaranteed to be faster than other types of
access. And depending on the distribution of the data, some indexes might be
faster with some fetches than with others.
The key is not whether it's using the index or not, it's whether it's getting the
fastest plan or not. The first step in ensuring that is to make sure the table
is getting analyzed frequently enough, otherwise the stats that the planner uses
to predict will be off and it will often choose poor plans. The next step would
be to isolate specific instances that you're suspicious of and test to see if the
planner really is getting the best plan. Hopefully you have a test database where
you can copy the data and add/remove indexes at will. That type of easter egg
hunt may not be necessary, though. EXPLAIN ANALYZE can often tell you if the plan
is bad by showing you where estimated times vary wildly from actual times.
Hope this helps, but before you worry too much about it, I'd suggest asking
yourself 1 question: is the performance at an acceptable level, even if you don't
understand the rational behind the planner's choice? Of course, that may not be
important if you're asking the question just to understand better.
--

David G Johnston

2014-07-25 16:58:04 UTC

Permalink

Typically you want to provide EXPLAIN ANALYZE output so that comparisons
between planner estimates and reality can be made.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Index-usage-with-slow-query-tp5812503p5812851.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Bill Moran

2014-07-25 18:03:51 UTC

Permalink

On Fri, 25 Jul 2014 17:20:57 +0100

Post by Rebecca Clarke
Hi Bill,

A few suggestions in addition to David's comment about doing
EXPLAIN ANALYZE:
1) When experimenting, one thing to try might be making a single
GiST index across the_geom, floor, source_id, and class. Not
sure if it will work/help, but worth a try.
2) Based on the info below, are you sure that table is getting
ANALYZEd enough? I'm asking because the first query was
estimated to be 50x slower than the second, but actual run
time was only 7x slower. It makes me wonder if the table
stats are way off.

Also, how selective are floor and class? The following queries
should give you an idea:
SELECT count(distinct floor), count(*) from mytable;
SELECT count(distinct class), count(*) from mytable;
If the number of distinct values is very small compared to the
total number of entries, the planner might be ignoring the
index simply because it's not helpful.

I'm also wondering if the planner is avoiding using indexes for
those columns because they're unbounded varchars and the limit on
and index is far less (I believe it's less than 3000 chars, but
I can't find a definitive number right now). If the combination
of those exceeded the max index length, then the index would
truncate the values, and the executor would have to recheck each
row against the actual row data, which the planner might perceive
as slower. Note that this is speculation on my part, but the
point being that if those columns are usually as narrow as your
examples you might want to try changing them to VARCHAR(50) or
something and see if the planner is more willing to use the
indexes at that point.

A lot of this is pure speculation on my part ... hope that it's
helpful and doesn't lead you in the wrong direction.

Post by Rebecca Clarke
SELECT the_geom,oid from mytable
where the_geom && ST_GeomFromText('POLYGON((529342.334095833
180696.221733333,529342.334095833 181533.44595,530964.336820833
181533.44595,530964.336820833 180696.221733333,529342.334095833
180696.221733333))',find_srid('','mytable','the_geom'))
and (floor = 'gf' AND source_id = '16701' AND class = 'General')
Bitmap Heap Scan on mytable (cost=1212.62..1580.71 rows=177 width=612)
Recheck Cond: ((source_id = 16701) AND (the_geom &&
'0103000020346C00000100000005000000'::geometry))
Filter: (((floor)::text = 'gf'::text) AND ((class)::text =
'General'::text))
-> BitmapAnd (cost=1212.62..1212.62 rows=184 width=0)
-> Bitmap Index Scan on idx_source_id (cost=0.00..433.25
rows=33149 width=0)
Index Cond: (source_id = 16701)
-> Bitmap Index Scan on idx_the_geom_gist (cost=0.00..779.03
rows=38654 width=0)
Index Cond: (the_geom &&
'0103000020346C00000100000005000000'::geometry)
SELECT the_geom,oid from mytable
where the_geom && ST_GeomFromText('POLYGON((415995.148624997
433101.445479165,415995.148624997 433326.320145835,416466.572625003
433326.320145835,416466.572625003 433101.445479165,415995.148624997
433101.445479165))',find_srid('','mytable','the_geom'))
and (floor = 'gf' AND source_id = '20427' AND class = 'General')
Bitmap Heap Scan on mytable (cost=21.41..23.44 rows=1 width=612)
Recheck Cond: ((the_geom &&
'0103000020346C00000100000005000000'::geometry) AND (source_id = 20427))
Filter: (((floor)::text = 'gf'::text) AND ((class)::text =
'General'::text))
-> BitmapAnd (cost=21.41..21.41 rows=1 width=0)
-> Bitmap Index Scan on idx_the_geom_gist (cost=0.00..4.18
rows=141 width=0)
Index Cond: (the_geom &&
'0103000020346C00000100000005000000'::geometry)
-> Bitmap Index Scan on idx_source_id (cost=0.00..16.97
rows=1112 width=0)
Index Cond: (source_id = 20427)
There is no unique field in the table. And unfortunately the performance is
unacceptable. The logs show that the first query when it was execute took
70466.757 ms where as the second one took 11032.459 ms.
I've begun to create a duplicate environment to play with so hopefully I'm
able to weed out a solution.

Post by Bill Moran
On Wed, 23 Jul 2014 10:45:56 +0100

made

and

Post by Rebecca Clarke
I don't understand why it's not using the new index either.

It depends on the data. The planner will make estimates on what the fastest
way to execute will be based on a lot of things, one of which is how helpful
an index is really expected to be. Since your indexes aren't arranged to
allow an index-only scan (although I don't remember if 9.1 had index-only
scans yet ...) it will have to use the index to narrow down the rows, then
load up the rows and filter them further (you didn't provide explain output,
but I'll bet a dozen nickels that's what it says). So if the values in
source_id are unique enough that the planner doesn't think that
idx_floor_sourceid_class will narrow the results any better than
idx_source_id, it will use the former because it's a smaller index and will
require less disk fetches to load it.
Of course, without explain output, I'm assuming a lot. But the basic operation
still stands, indexes aren't always guaranteed to be faster than other types of
access. And depending on the distribution of the data, some indexes might be
faster with some fetches than with others.
The key is not whether it's using the index or not, it's whether it's getting the
fastest plan or not. The first step in ensuring that is to make sure the table
is getting analyzed frequently enough, otherwise the stats that the planner uses
to predict will be off and it will often choose poor plans. The next step would
be to isolate specific instances that you're suspicious of and test to see if the
planner really is getting the best plan. Hopefully you have a test database where
you can copy the data and add/remove indexes at will. That type of easter egg
hunt may not be necessary, though. EXPLAIN ANALYZE can often tell you if the plan
is bad by showing you where estimated times vary wildly from actual times.
Hope this helps, but before you worry too much about it, I'd suggest asking
yourself 1 question: is the performance at an acceptable level, even if you don't
understand the rational behind the planner's choice? Of course, that may not be
important if you're asking the question just to understand better.
--

--
Bill Moran <***@potentialtech.com>
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

David G Johnston

2014-07-25 18:14:43 UTC

Permalink

Post by Bill Moran
On Fri, 25 Jul 2014 17:20:57 +0100
Rebecca Clarke <
Note that this is speculation on my part, but the
point being that if those columns are usually as narrow as your
examples you might want to try changing them to VARCHAR(50) or
something and see if the planner is more willing to use the
indexes at that point.
A lot of this is pure speculation on my part ... hope that it's
helpful and doesn't lead you in the wrong direction.

The presence of absence of the length limiter on a varchar will not impact
the query plan. And I'm pretty sure you cannot even store a too long
varchar in an index. It will error on the attempt (as opposed to
truncating).

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Index-usage-with-slow-query-tp5812503p5812862.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Marc Mamin

2014-07-25 19:14:06 UTC

Permalink

Post by David G Johnston
The presence of absence of the length limiter on a varchar will not impact
the query plan. And I'm pretty sure you cannot even store a too long
varchar in an index. It will error on the attempt (as opposed to
truncating).

The max size is almost one block. After that you get an error:
FEHLER: Indexzeile benötigt 9184 Bytes, Maximalgröße ist 8191

Post by David G Johnston
Looking for some advice regarding a slow query I have and indexing.

It's worth a try to compare the planner choice in Postgres 9.3 ...

Post by David G Johnston
CREATE TABLE mytable
(
class character varying,
floor character varying,
source_id integer,
the_geom geometry

You could slightly improve the table definition while placing the integer column in front of the varchar.

More interesting would be to move the varchar in separate reference column and only have integers except for the geometry type.
This will make the new index smaller and faster. I bet that the planner would take that change in account.

Post by David G Johnston
INDEX idx_source_id
USING btree
(source_id);
INDEX idx_the_geom_gist
USING gist
(the_geom);
SELECT the_geom,oid from mytable
WHERE
the_geom && ST_GeomFromText('POLYGON((529342.334095833 180696.221733333,529342.334095833 181533.44595,530964.336820833 181533.44595,530964.336820833 180696.221733333,529342.334095833 180696.221733333))',find_srid('','mytable','the_geom'))
AND
(floor = 'gf' AND source_id = '689' AND class = 'General')
INDEX idx_floor_sourceid_class
USING btree
(floor, source_id, class);

if your query always uses these 3 columns, you should put the one with the highest cardinality first.

Post by David G Johnston
When I run an EXPLAIN and EXPLAIN ANALYZE the query isn't using this new index.
Sometimes it uses just idx_the_geom_gist
other times it uses idx_the_geom_gist and idx_source_id
I don't understand why it's inconsistent in its use of indexes when the query is always the same structure, just different where clause values, and I don't understand why it's not using the new index either.

The planner uses statistics on the different columns content to weight the possible query plans.
And it is good at that :)
You can help him while raising the target statistics on these 3 columns.

More difficult for the planner is to compare the advantage of the GIN index to the other one.
If some columns get toasted then the cost of detoasting seems to often be underestimated.

For the case when one of your 3 first columns has a very low cardinality, you may consider adding some partial indexes.
e.g.:
create INDEX idx_the_geom_gist_general USING gist (the_geom) where class ='general';
create INDEX idx_the_geom_gist_special USING gist (the_geom) where class ='special';

They can of course only get used when your query contains exactly the same clause.

regards,

Marc Mamin

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general