visualizing B-tree index coverage

Ask a Question related to PostgreSQL / PGSQL, Design and Development.

  1. #1

    Default visualizing B-tree index coverage

    Does anyone know of a tools that allows one to visualize
    the tree created by a multi-column B-tree index?
    A picture of a tree with branches, showing how "branchy" the
    tree is would be great.
    I'm wondering how well I've "clustered" the data in my table
    using the multi-column index. In other words, do my
    multi-columns sufficiently but not overly discriminate rows from each other?
    Do I have too many with the same index? (not enough branches)
    Do I have a unique index for each row? (way too many branches)

    Thanks,
    TJ



    ---------------------------(end of broadcast)---------------------------
    TIP 7: don't forget to increase your free space map settings

    TJ O'Donnell Guest

  2. Similar Questions and Discussions

    1. visualizing image
      Hello, I'm writing a ASP.net webservice wich will visualize an image, generated by another application. The generated image is a char*. I can...
    2. Measure Ink Coverage
      I would like to know if Illustrator or Photoshop can calculate the amount of ink coverage a particular image or letter will use as compared to...
    3. Visualizing Online Data
      hi everyone, I was trying to come up with a nice example of how to maybe use Lingo to visualize some simple online trend, just to show some...
    4. Determining Coverage
      A co-worker was finally able to dig it up in some notes from a class. The equation is 255 - (histogram mean value) / 255 = % coverage. Hopefully...
    5. Devel::Coverage
      Hi all, I am new to using Perl and am trying to do some coverage analysis. When I do perl -d:Coverage temp.pl I get Cant locate...
  3. #2

    Default Re: visualizing B-tree index coverage

    Useful explanation of PostgreSQL index format:
    [url]http://www.faqs.org/docs/ppbook/c13329.htm[/url]

    I think you are aiming for the wrong thing.
    The worst possible index is one with every value the same.
    The second worst (still basically useless) is one with only two values.
    The greater the differentiation of the data, the more workload is
    reduced on a search.

    Since it isn't a straight binary tree, I don't think that having highly
    dissimilar data in the index should be a problem.

    Do you have data or experience that shows otherwise?

    -----Original Message-----
    From: [email]pgsql-general-owner@postgresql.org[/email]
    [mailto:pgsql-general-owner@postgresql.org] On Behalf Of TJ O'Donnell
    Sent: Tuesday, January 25, 2005 2:19 PM
    To: [email]pgsql-general@postgresql.org[/email]
    Cc: [email]tjo@acm.org[/email]
    Subject: [GENERAL] visualizing B-tree index coverage

    Does anyone know of a tools that allows one to visualize
    the tree created by a multi-column B-tree index?
    A picture of a tree with branches, showing how "branchy" the
    tree is would be great.
    I'm wondering how well I've "clustered" the data in my table
    using the multi-column index. In other words, do my
    multi-columns sufficiently but not overly discriminate rows from each
    other?
    Do I have too many with the same index? (not enough branches)
    Do I have a unique index for each row? (way too many branches)

    Thanks,
    TJ



    ---------------------------(end of broadcast)---------------------------
    TIP 7: don't forget to increase your free space map settings

    ---------------------------(end of broadcast)---------------------------
    TIP 6: Have you searched our list archives?

    [url]http://archives.postgresql.org[/url]

    Dann Corbit Guest

  4. #3

    Default Re: visualizing B-tree index coverage

    Since I'm using a multi-column index, I can greatly influence
    the nature of the index created, depending on which columns I use
    and how many. I'm searching for an optimal set
    of columns that creates an index that, for sure does not have
    every value the same, nor only two values. Instead, I want to see
    how well I've spread the index out over the data (if that phrasing makes sense).

    More specifically, I have character data representing molecular structures.
    I've written (rather slow) search functions. I can create any number of
    columns that "fingerprint" each structure, e.g. # Carbon atoms, # N atoms,
    # single bonds, etc. I expect my fingerprints will not be unique (fingerprint may
    be a poor analogy), but rather will classify similar structures together.
    I create a multi-column index on these counts and
    get about 2-3 times speedup using 13 columns right now.
    For example:

    select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC') about 15 sec.

    select count(smiles) from structure where
    (_c, _n, _o, _s, _p, _halo,
    _arom_c, _arom_n, _arom_o, _arom_s,
    _atoms, _single_bonds, _other_bonds) >=
    ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
    and oe_matches(smiles,'c1ccccc1CC(=O)NC') about 6 seconds
    when the (_c, etc.) is a multi-column index.

    The data isn't inherently structured in any way that invites some particular number of columns
    for indexing. I don't want to use too many, nor too few columns. I also
    want to optimize the nature(which atom types, bond types, etc.)
    of the count columns. While I could do this
    and use the speedup as the measure of success, I think
    that if my B-tree were "covering" the data well, I would get the best results.
    Covering means finding that optimal situation where there is not one index for all rows
    and also not a unique index for every row - something inbetween would be ideal,
    or is that basically a wrong idea?

    TJ


    > Useful explanation of PostgreSQL index format:
    > [url]http://www.faqs.org/docs/ppbook/c13329.htm[/url]
    >
    > I think you are aiming for the wrong thing.
    > The worst possible index is one with every value the same.
    > The second worst (still basically useless) is one with only two values. The greater the
    > differentiation of the data, the more workload is
    > reduced on a search.
    >
    > Since it isn't a straight binary tree, I don't think that having highly dissimilar data in the
    > index should be a problem.
    >
    > Do you have data or experience that shows otherwise?
    >
    > -----Original Message-----
    > From: [email]pgsql-general-owner@postgresql.org[/email]
    > [mailto:pgsql-general-owner@postgresql.org] On Behalf Of TJ O'Donnell Sent: Tuesday, January 25,
    > 2005 2:19 PM
    > To: [email]pgsql-general@postgresql.org[/email]
    > Cc: [email]tjo@acm.org[/email]
    > Subject: [GENERAL] visualizing B-tree index coverage
    >
    > Does anyone know of a tools that allows one to visualize
    > the tree created by a multi-column B-tree index?
    > A picture of a tree with branches, showing how "branchy" the
    > tree is would be great.
    > I'm wondering how well I've "clustered" the data in my table
    > using the multi-column index. In other words, do my
    > multi-columns sufficiently but not overly discriminate rows from each other?
    > Do I have too many with the same index? (not enough branches)
    > Do I have a unique index for each row? (way too many branches)
    >
    > Thanks,
    > TJ
    >
    >
    >
    > ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to
    > increase your free space map settings



    ---------------------------(end of broadcast)---------------------------
    TIP 6: Have you searched our list archives?

    [url]http://archives.postgresql.org[/url]

    TJ O'Donnell Guest

  5. #4

    Default Re: visualizing B-tree index coverage

    Normally, a unique, clustered index is the silver bullet to the best
    performance (but my experience with unique clustered indexes is largely
    in non-PostgreSQL database systems -- so take it with a grain of salt).

    I do not see any extra expense for a unique index verses one that is
    mostly unique. Further, if an index is unique, that should be an
    excellent optimizer hint for query acceleration.

    If you know what queries you run most frequently, I would tailor the
    index for optimal query execution via the join columns and columns often
    involved in where clause filtering.

    If it is easily possible to make a unique index I would definitely time
    the queries with a unique index as well. I suspect that the unique
    index will fare better unless there is something odd about your data.

    -----Original Message-----
    From: TJ O'Donnell [mailto:tjo@acm.org]
    Sent: Tuesday, January 25, 2005 3:50 PM
    To: [email]pgsql-general@postgresql.org[/email]
    Cc: Dann Corbit
    Subject: RE: [GENERAL] visualizing B-tree index coverage

    Since I'm using a multi-column index, I can greatly influence
    the nature of the index created, depending on which columns I use
    and how many. I'm searching for an optimal set
    of columns that creates an index that, for sure does not have
    every value the same, nor only two values. Instead, I want to see
    how well I've spread the index out over the data (if that phrasing makes
    sense).

    More specifically, I have character data representing molecular
    structures.
    I've written (rather slow) search functions. I can create any number of
    columns that "fingerprint" each structure, e.g. # Carbon atoms, # N
    atoms,
    # single bonds, etc. I expect my fingerprints will not be unique
    (fingerprint may
    be a poor analogy), but rather will classify similar structures
    together.
    I create a multi-column index on these counts and
    get about 2-3 times speedup using 13 columns right now.
    For example:

    select count(smiles) from structure where
    oe_matches(smiles,'c1ccccc1CC(=O)NC') about 15 sec.

    select count(smiles) from structure where
    (_c, _n, _o, _s, _p, _halo,
    _arom_c, _arom_n, _arom_o, _arom_s,
    _atoms, _single_bonds, _other_bonds) >=
    ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
    and oe_matches(smiles,'c1ccccc1CC(=O)NC') about 6 seconds
    when the (_c, etc.) is a multi-column index.

    The data isn't inherently structured in any way that invites some
    particular number of columns
    for indexing. I don't want to use too many, nor too few columns. I
    also
    want to optimize the nature(which atom types, bond types, etc.)
    of the count columns. While I could do this
    and use the speedup as the measure of success, I think
    that if my B-tree were "covering" the data well, I would get the best
    results.
    Covering means finding that optimal situation where there is not one
    index for all rows
    and also not a unique index for every row - something inbetween would be
    ideal,
    or is that basically a wrong idea?

    TJ


    > Useful explanation of PostgreSQL index format:
    > [url]http://www.faqs.org/docs/ppbook/c13329.htm[/url]
    >
    > I think you are aiming for the wrong thing.
    > The worst possible index is one with every value the same.
    > The second worst (still basically useless) is one with only two
    values. The greater the
    > differentiation of the data, the more workload is
    > reduced on a search.
    >
    > Since it isn't a straight binary tree, I don't think that having
    highly dissimilar data in the
    > index should be a problem.
    >
    > Do you have data or experience that shows otherwise?
    >
    > -----Original Message-----
    > From: [email]pgsql-general-owner@postgresql.org[/email]
    > [mailto:pgsql-general-owner@postgresql.org] On Behalf Of TJ O'Donnell
    Sent: Tuesday, January 25,
    > 2005 2:19 PM
    > To: [email]pgsql-general@postgresql.org[/email]
    > Cc: [email]tjo@acm.org[/email]
    > Subject: [GENERAL] visualizing B-tree index coverage
    >
    > Does anyone know of a tools that allows one to visualize
    > the tree created by a multi-column B-tree index?
    > A picture of a tree with branches, showing how "branchy" the
    > tree is would be great.
    > I'm wondering how well I've "clustered" the data in my table
    > using the multi-column index. In other words, do my
    > multi-columns sufficiently but not overly discriminate rows from each
    other?
    > Do I have too many with the same index? (not enough branches)
    > Do I have a unique index for each row? (way too many branches)
    >
    > Thanks,
    > TJ
    >
    >
    >
    > ---------------------------(end of
    broadcast)--------------------------- TIP 7: don't forget to
    > increase your free space map settings



    ---------------------------(end of broadcast)---------------------------
    TIP 6: Have you searched our list archives?

    [url]http://archives.postgresql.org[/url]

    Dann Corbit Guest

  6. #5

    Default Re: visualizing B-tree index coverage

    Some other things that are important:
    How much is the data in transition (updates/deletes/inserts)? If the
    data is mostly static or static you can add many special case indexes
    with little penalty. The biggest cost of indexes (besides disk space
    consumed) is in the slowdown of inserts, updates, and deletes. If the
    data hardly changes, you can throw on index after index with little
    cost. But if the data is in huge flux, you will have to be careful
    about performance targets for each index you add.

    This stuff may prove to be of great value:
    [url]http://www.postgresql.org/docs/8.0/interactive/monitoring-stats.html[/url]

    I would also run EXPLAIN against every distinct sort of query you plan
    to execute (unless it is for ad-hoc reporting in which case such a
    requirement cannot be met).


    ---------------------------(end of broadcast)---------------------------
    TIP 5: Have you checked our extensive FAQ?

    [url]http://www.postgresql.org/docs/faq[/url]

    Dann Corbit Guest

  7. #6

    Default Re: visualizing B-tree index coverage

    Excuse me for bothering but what kind of search engine you
    developed. Does it looks like sets comparing ?

    Oleg

    On Tue, 25 Jan 2005, TJ O'Donnell wrote:
    > Since I'm using a multi-column index, I can greatly influence
    > the nature of the index created, depending on which columns I use
    > and how many. I'm searching for an optimal set
    > of columns that creates an index that, for sure does not have
    > every value the same, nor only two values. Instead, I want to see
    > how well I've spread the index out over the data (if that phrasing makes sense).
    >
    > More specifically, I have character data representing molecular structures.
    > I've written (rather slow) search functions. I can create any number of
    > columns that "fingerprint" each structure, e.g. # Carbon atoms, # N atoms,
    > # single bonds, etc. I expect my fingerprints will not be unique (fingerprint may
    > be a poor analogy), but rather will classify similar structures together.
    > I create a multi-column index on these counts and
    > get about 2-3 times speedup using 13 columns right now.
    > For example:
    >
    > select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC') about 15 sec.
    >
    > select count(smiles) from structure where
    > (_c, _n, _o, _s, _p, _halo,
    > _arom_c, _arom_n, _arom_o, _arom_s,
    > _atoms, _single_bonds, _other_bonds) >=
    > ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
    > and oe_matches(smiles,'c1ccccc1CC(=O)NC') about 6 seconds
    > when the (_c, etc.) is a multi-column index.
    >
    > The data isn't inherently structured in any way that invites some particular number of columns
    > for indexing. I don't want to use too many, nor too few columns. I also
    > want to optimize the nature(which atom types, bond types, etc.)
    > of the count columns. While I could do this
    > and use the speedup as the measure of success, I think
    > that if my B-tree were "covering" the data well, I would get the best results.
    > Covering means finding that optimal situation where there is not one index for all rows
    > and also not a unique index for every row - something inbetween would be ideal,
    > or is that basically a wrong idea?
    >
    > TJ
    >
    >
    >
    >> Useful explanation of PostgreSQL index format:
    >> [url]http://www.faqs.org/docs/ppbook/c13329.htm[/url]
    >>
    >> I think you are aiming for the wrong thing.
    >> The worst possible index is one with every value the same.
    >> The second worst (still basically useless) is one with only two values. The greater the
    >> differentiation of the data, the more workload is
    >> reduced on a search.
    >>
    >> Since it isn't a straight binary tree, I don't think that having highly dissimilar data in the
    >> index should be a problem.
    >>
    >> Do you have data or experience that shows otherwise?
    >>
    >> -----Original Message-----
    >> From: [email]pgsql-general-owner@postgresql.org[/email]
    >> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of TJ O'Donnell Sent: Tuesday, January 25,
    >> 2005 2:19 PM
    >> To: [email]pgsql-general@postgresql.org[/email]
    >> Cc: [email]tjo@acm.org[/email]
    >> Subject: [GENERAL] visualizing B-tree index coverage
    >>
    >> Does anyone know of a tools that allows one to visualize
    >> the tree created by a multi-column B-tree index?
    >> A picture of a tree with branches, showing how "branchy" the
    >> tree is would be great.
    >> I'm wondering how well I've "clustered" the data in my table
    >> using the multi-column index. In other words, do my
    >> multi-columns sufficiently but not overly discriminate rows from each other?
    >> Do I have too many with the same index? (not enough branches)
    >> Do I have a unique index for each row? (way too many branches)
    >>
    >> Thanks,
    >> TJ
    >>
    >>
    >>
    >> ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to
    >> increase your free space map settings
    >
    >
    >
    >
    > ---------------------------(end of broadcast)---------------------------
    > TIP 6: Have you searched our list archives?
    >
    > [url]http://archives.postgresql.org[/url]
    >
    Regards,
    Oleg
    __________________________________________________ ___________
    Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
    Sternberg Astronomical Institute, Moscow University (Russia)
    Internet: [email]oleg@sai.msu.su[/email], [url]http://www.sai.msu.su/~megera/[/url]
    phone: +007(095)939-16-83, +007(095)939-23-83

    ---------------------------(end of broadcast)---------------------------
    TIP 7: don't forget to increase your free space map settings

    Oleg Bartunov Guest

  8. #7

    Default Re: visualizing B-tree index coverage


    I think you missed an important "feature" of multicolumn indexes, that
    you better not use 'OR' in your expressions. You seem to want only to use
    '>=' so this should be OK.

    Suppose you have 3 columns a,z,e containing values linearly distributed
    between ...

    select min(a),max(a),min(z),max(z),min(e),max(e) from test;
    min | max | min | max | min | max
    -----+-----+-----+-----+-----+-----
    0 | 13 | 0 | 99 | 0 | 99


    For instance the following query is indexed :
    explain analyze select * from test where a>=0 and z>=90 and e>=0;
    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------
    Index Scan using testa on test (cost=0.00..1637.56 rows=11345 width=16)
    (actual time=0.085..51.441 rows=13000 loops=1)
    Index Cond: ((a >= 0) AND (z >= 90) AND (e >= 0))
    Total runtime: 56.307 ms

    The following is only partially indexed :

    explain analyze select * from test where (a=1 or a=2) and (z=1 or z=8) and
    e>=0;
    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------
    Index Scan using testa, testa on test (cost=0.00..3269.06 rows=346
    width=16) (actual time=0.328..52.961 rows=400 loops=1)
    Index Cond: ((a = 1) OR (a = 2))
    Filter: (((z = 1) OR (z = 8)) AND (e >= 0))
    Total runtime: 53.297 ms

    You see the 'index cond' field which is what determines the fetched rows,
    which are then fetched and filtered with the 'filter' expression. Having
    the most selective index cond is important because it will diminish the
    number of rows to be fetched. However, in your case the filter expression
    is also beneficial because any row eliminated by the filter will not need
    to go through your expensive matching function.

    In this case :

    SELECT count(*) FROM test;
    => 131072

    SELECT count(*) FROM test WHERE ((a = 1) OR (a = 2));
    => 20000

    SELECT count(*) FROM test WHERE (a=1 or a=2) and (z=1 or z=8) and e>=0;
    => 400

    In this case the index fetches 20k rows out of 131072 but only 400 are
    used...

    If you don't use OR, index use is more likely :

    explain analyze select * from test where (a,z,e) >= (0,50,80);
    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------
    Index Scan using testa on test (cost=0.00..1669.78 rows=12627 width=16)
    (actual time=0.087..58.316 rows=13000 loops=1)
    Index Cond: ((a >= 0) AND (z >= 50) AND (e >= 80))
    Total runtime: 63.049 ms

    Here you have a full index scan.

    To determine the efficiency of your indexes, you can thus use this method,
    and look at the 'index cond' and 'filter' expressions, and counting the
    rows matched by each.




















    > particular number of columns
    > for indexing. I don't want to use too many, nor too few columns. I also
    > want to optimize the nature(which atom types, bond types, etc.)
    > of the count columns. While I could do this
    > and use the speedup as the measure of success, I think
    > that if my B-tree were "covering" the data well, I would get the best
    > results.
    > Covering means finding that optimal situation where there is not one
    > index for all rows
    > and also not a unique index for every row - something inbetween would be
    > ideal,
    > or is that basically a wrong idea?
    ---------------------------(end of broadcast)---------------------------
    TIP 8: explain analyze is your friend

    PFC Guest

  9. #8

    Default Re: visualizing B-tree index coverage

    TJ O'Donnell wrote:
    > More specifically, I have character data representing molecular structures.
    > I've written (rather slow) search functions. I can create any number of
    > columns that "fingerprint" each structure, e.g. # Carbon atoms, # N atoms,
    > # single bonds, etc. I expect my fingerprints will not be unique (fingerprint may
    > be a poor analogy), but rather will classify similar structures together.
    > I create a multi-column index on these counts and
    > get about 2-3 times speedup using 13 columns right now.
    > For example:
    > select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC') about 15 sec.
    >
    > select count(smiles) from structure where
    > (_c, _n, _o, _s, _p, _halo,
    > _arom_c, _arom_n, _arom_o, _arom_s,
    > _atoms, _single_bonds, _other_bonds) >=
    > ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
    > and oe_matches(smiles,'c1ccccc1CC(=O)NC') about 6 seconds
    > when the (_c, etc.) is a multi-column index.
    Is 3 really only a lower bound for the number of carbon atoms (which, I
    guess is what _c represents), or will there be exactly 3 carbon atoms in
    the molecules found? If not, can you estimate an upper bound, as well as
    an lower bound? If you can, than you cound further optimize this by
    using a range query (i.e. select ... from ... where (...) >= (...) and
    (...) <= (...( and oe_matches(...)).

    If you can only determine bounds (upper _and_ lower) for some of the
    "fingerprint" columns, that I would suggest you choose those for the
    index, and use them as the "first" columns (because otherwise a
    range-scan won't work).

    "analyze" gathers some statistics about the distribution of values in a
    table - I guess you could extract at least an approximation of "index
    coverage" from there.

    Since there cost of index inserts/updates/deletes is afaik largly
    independent from the number of columns the index consists of, I'd
    suggest that you add columns until the stats show that it's nearly an
    unique index - but prefer columns for which you can compute strong upper
    & lower bounds, and put those "in front".
    > The data isn't inherently structured in any way that invites some particular number of columns
    > for indexing. I don't want to use too many, nor too few columns. I also
    > want to optimize the nature(which atom types, bond types, etc.)
    > of the count columns. While I could do this
    > and use the speedup as the measure of success, I think
    > that if my B-tree were "covering" the data well, I would get the best results.
    > Covering means finding that optimal situation where there is not one index for all rows
    > and also not a unique index for every row - something inbetween would be ideal,
    > or is that basically a wrong idea?
    I believe it depends on the costs of executing one oe_matches call -
    according to you, those costs are quite heavy, so I every call you can
    omit by adding more columns to the index is worth the performance penalty.

    If you need fast adding of data, as well as fast searches, you could add
    the data into a "pending" table that is not indexed, and union the
    searches over both tables. At low-traffic times (say, at night), you
    could move the data from the "pending" table to the "structure" table.
    This of course only works, if the number of tuples insertes between two
    move-runs is much smaller than the number of tuples in "structure".

    greetings, Florian Pflug


    ---------------------------(end of broadcast)---------------------------
    TIP 8: explain analyze is your friend

    Florian G. Pflug Guest

  10. #9

    Default Re: visualizing B-tree index coverage

    I realize that using OR will not result in an index scan.
    I will never be interested in a OR condition for the kinds
    of searches I use. In my Select statements, I always name
    every column of the multi-column index in same order that
    they were named when creating the index. I always use
    the >= condition, and very rarely, the = condition.

    However, I am concerned that I must place
    the most selective column first in my index. I cannot tell,
    a priori, which column will be most selective. That depends on the
    nature of search, which can vary widely each time.
    Are you saying that if my first column is not selective, even though the remaining
    columns are, the planner may choose not to use the index after
    seeing that the first column is not very selective?
    That seems like an oversight, IMHO. Shouldn't the overall effect of
    using all the columns be considered before choosing not to use an
    index scan?

    Since I'm using every column of my multi-column index for every search,
    and I always use >=, Explain Analyze always shows that every column
    is considered in the index scan. However, that is only when the
    index scan is used. Sometimes, Explain Analyze shows it is not used.
    That appears to happen when my search condition is very general.
    This it to be expected, so I am not worried. Most of my searches will
    be intermediate, namely not VERY selective, but also not VERY general.
    So the idea of the multi-column index is to "characterize" each row
    sufficiently, even when it is a perfectly ordinary row with no ONE
    feature being distinctive, but rather several features together giving
    it it's distinctive character. That is my interpretation of the
    multi-column index.

    TJ


    PFC wrote:
    >
    > I think you missed an important "feature" of multicolumn indexes,
    > that you better not use 'OR' in your expressions. You seem to want only
    > to use '>=' so this should be OK.
    >
    > Suppose you have 3 columns a,z,e containing values linearly
    > distributed between ...
    >
    > select min(a),max(a),min(z),max(z),min(e),max(e) from test;
    > min | max | min | max | min | max
    > -----+-----+-----+-----+-----+-----
    > 0 | 13 | 0 | 99 | 0 | 99
    >
    >
    > For instance the following query is indexed :
    > explain analyze select * from test where a>=0 and z>=90 and e>=0;
    > QUERY PLAN
    > -------------------------------------------------------------------------------------------------------------------------
    >
    > Index Scan using testa on test (cost=0.00..1637.56 rows=11345
    > width=16) (actual time=0.085..51.441 rows=13000 loops=1)
    > Index Cond: ((a >= 0) AND (z >= 90) AND (e >= 0))
    > Total runtime: 56.307 ms
    >
    > The following is only partially indexed :
    >
    > explain analyze select * from test where (a=1 or a=2) and (z=1 or z=8)
    > and e>=0;
    > QUERY PLAN
    > ----------------------------------------------------------------------------------------------------------------------------
    >
    > Index Scan using testa, testa on test (cost=0.00..3269.06 rows=346
    > width=16) (actual time=0.328..52.961 rows=400 loops=1)
    > Index Cond: ((a = 1) OR (a = 2))
    > Filter: (((z = 1) OR (z = 8)) AND (e >= 0))
    > Total runtime: 53.297 ms
    >
    > You see the 'index cond' field which is what determines the fetched
    > rows, which are then fetched and filtered with the 'filter' expression.
    > Having the most selective index cond is important because it will
    > diminish the number of rows to be fetched. However, in your case the
    > filter expression is also beneficial because any row eliminated by the
    > filter will not need to go through your expensive matching function.
    >
    > In this case :
    >
    > SELECT count(*) FROM test;
    > => 131072
    >
    > SELECT count(*) FROM test WHERE ((a = 1) OR (a = 2));
    > => 20000
    >
    > SELECT count(*) FROM test WHERE (a=1 or a=2) and (z=1 or z=8) and e>=0;
    > => 400
    >
    > In this case the index fetches 20k rows out of 131072 but only 400 are
    > used...
    >
    > If you don't use OR, index use is more likely :
    >
    > explain analyze select * from test where (a,z,e) >= (0,50,80);
    > QUERY PLAN
    > -------------------------------------------------------------------------------------------------------------------------
    >
    > Index Scan using testa on test (cost=0.00..1669.78 rows=12627
    > width=16) (actual time=0.087..58.316 rows=13000 loops=1)
    > Index Cond: ((a >= 0) AND (z >= 50) AND (e >= 80))
    > Total runtime: 63.049 ms
    >
    > Here you have a full index scan.
    >
    > To determine the efficiency of your indexes, you can thus use this
    > method, and look at the 'index cond' and 'filter' expressions, and
    > counting the rows matched by each.
    >
    >
    >> particular number of columns
    >> for indexing. I don't want to use too many, nor too few columns. I also
    >> want to optimize the nature(which atom types, bond types, etc.)
    >> of the count columns. While I could do this
    >> and use the speedup as the measure of success, I think
    >> that if my B-tree were "covering" the data well, I would get the best
    >> results.
    >> Covering means finding that optimal situation where there is not one
    >> index for all rows
    >> and also not a unique index for every row - something inbetween would
    >> be ideal,
    >> or is that basically a wrong idea?
    ---------------------------(end of broadcast)---------------------------
    TIP 4: Don't 'kill -9' the postmaster

    TJ O'Donnell Guest

  11. #10

    Default Re: visualizing B-tree index coverage

    "TJ O'Donnell" <tjo@acm.org> writes:
    > However, I am concerned that I must place the most selective column first in
    > my index. I cannot tell, a priori, which column will be most selective. That
    > depends on the nature of search, which can vary widely each time.
    If you're always using > operators then perhaps you would be better off with
    three separate indexes on <x>, <y>, and <z> rather than one on <x,y,z>. It
    depends on your use case.

    It's also possible you would be better off with a GiST index on the array
    [x,y,z] and using the GiST indexable operators in contrib/intarray. However
    there are disadvantages to these addons as well.
    > Are you saying that if my first column is not selective, even though the
    > remaining columns are, the planner may choose not to use the index after
    > seeing that the first column is not very selective? That seems like an
    > oversight, IMHO. Shouldn't the overall effect of using all the columns be
    > considered before choosing not to use an index scan?
    If your index is on <x,y,z> and you say x=? and y=? and z=? then it's best to
    have x be the most selective column but it's only a question of degree. The
    index will be useful in basically the same situations, it'll just perform
    somewhat better.

    On the other hand if your index is on <x,y,z> and you say "x>? and y=? and
    z=?" then the remaining qualifications after x>? are useless. Postgres simply
    cannot use them in the index lookup. If your index was on <y,z,x> then
    Postgres can use it to go straight to the start of the records that match and
    read them from the index until they stop matching.

    Consider the analogous situation if you were using an index of a book. I tell
    you to look up everything in the encylopedia that starts with a letter after
    "p". And by the way the second letter has to be an "x" and the third letter a
    "y". You will more or less have to skim through everything after "p" even
    though you know the second and third letter because the matching entries will
    be scattered throughout the remaining pages.

    Actually that's not entirely true, you could check each letter after "p" and
    look up specifically "qxy...", "rxy...", "sxy...", "txy...", etc right up to
    "xzy...". But unfortunately that's just not something Postgres currently knows
    how to do. Even then you can see that it's much less efficient than just
    reading a contiguous section of records out in order.

    --
    greg


    ---------------------------(end of broadcast)---------------------------
    TIP 8: explain analyze is your friend

    Greg Stark Guest

  12. #11

    Default Re: visualizing B-tree index coverage

    > I realize that using OR will not result in an index scan.
    > I will never be interested in a OR condition for the kinds
    > of searches I use. In my Select statements, I always name
    > every column of the multi-column index in same order that
    > they were named when creating the index. I always use
    > the >= condition, and very rarely, the = condition.
    All the leftmost index column must be named, but the order is unimportant.
    You can use (a BETWEEN x AND y) instead of (a>=x AND a<=y), it is cleaner.
    > However, I am concerned that I must place
    > the most selective column first in my index. I cannot tell,
    > a priori, which column will be most selective. That depends on the
    > nature of search, which can vary widely each time.
    > Are you saying that if my first column is not selective, even though the
    > remaining
    > columns are, the planner may choose not to use the index after
    > seeing that the first column is not very selective?
    I thought this was true but made some tests and the index scanner is
    smart.

    Try this :
    CREATE TABLE test (id serial primary key, a INTEGER, z INTEGER, e INTEGER,
    r INTEGER, t INTEGER, y INTEGER ) WITHOUT OIDS;
    INSERT 1M rows into table using a plpgsql function, with a,z,e,r,t,y being
    floor(random()*10) for instance.

    Then you can try various selects. a,z,e,r,t,y are a linear distribution
    between 0 and 9 included, so :
    a>=A AND z>=Z ... y>=Y gives a result set of about
    (10-A)*(10-Z)*...*(10-Y) results. You'll see the planner will use an index
    scan when needed. You can try the easiest case (a>=9) which just explores
    one part of the tree, and the worst case which explores a part of all
    leafs (y>=9). Both should yield about the same number of results, but the
    first should be faster. To know how much, just try ;)
    > That seems like an oversight, IMHO. Shouldn't the overall effect of
    > using all the columns be considered before choosing not to use an
    > index scan?
    I think it is. There are no cross column correlation stats though.
    > Since I'm using every column of my multi-column index for every search,
    > and I always use >=, Explain Analyze always shows that every column
    > is considered in the index scan. However, that is only when the
    > index scan is used. Sometimes, Explain Analyze shows it is not used.
    > That appears to happen when my search condition is very general.
    > This it to be expected, so I am not worried. Most of my searches will
    > be intermediate, namely not VERY selective, but also not VERY general.
    > So the idea of the multi-column index is to "characterize" each row
    > sufficiently, even when it is a perfectly ordinary row with no ONE
    > feature being distinctive, but rather several features together giving
    > it it's distinctive character. That is my interpretation of the
    > multi-column index.
    If you have some features which are highly selective, you can create a
    single column index on them. It won't be used often, but when it will, it
    will really work.




    ---------------------------(end of broadcast)---------------------------
    TIP 3: if posting/reading through Usenet, please send an appropriate
    subscribe-nomail command to [email]majordomo@postgresql.org[/email] so that your
    message can get through to the mailing list cleanly

    PFC Guest

  13. #12

    Default Re: visualizing B-tree index coverage

    >>
    Rather than having the most selective item first, put the most
    frequently used item first.

    So (for instance) if your database is an organic chemistry database and
    the columns were elements, then carbon might be a logical first choice,
    hydrogen second, etc. for your index.

    What you want in creating the index is to choose an index where the
    items requested for filtering (in either WHERE clause or join columns)
    are very sure to exist somewhere in the early part of the index.

    A big problem will arise if (for all indexes) there is no instance where
    the first column of the index is referenced in the where clause.

    That means you will do a complete table scan.

    So create an index so for queries the most probable things to ask for
    are earliest in the index. More important than selectivity is the
    probability of the column being included in the filter part of the
    request.
    <<
    -----Original Message-----
    From: [email]pgsql-general-owner@postgresql.org[/email]
    [mailto:pgsql-general-owner@postgresql.org] On Behalf Of TJ O'Donnell
    Sent: Thursday, January 27, 2005 7:38 AM
    To: PFC
    Cc: [email]pgsql-general@postgresql.org[/email]
    Subject: Re: [GENERAL] visualizing B-tree index coverage

    I realize that using OR will not result in an index scan.
    I will never be interested in a OR condition for the kinds
    of searches I use. In my Select statements, I always name
    every column of the multi-column index in same order that
    they were named when creating the index. I always use
    the >= condition, and very rarely, the = condition.

    However, I am concerned that I must place
    the most selective column first in my index. I cannot tell,
    a priori, which column will be most selective. That depends on the
    nature of search, which can vary widely each time.
    Are you saying that if my first column is not selective, even though the
    remaining
    columns are, the planner may choose not to use the index after
    seeing that the first column is not very selective?
    That seems like an oversight, IMHO. Shouldn't the overall effect of
    using all the columns be considered before choosing not to use an
    index scan?

    Since I'm using every column of my multi-column index for every search,
    and I always use >=, Explain Analyze always shows that every column
    is considered in the index scan. However, that is only when the
    index scan is used. Sometimes, Explain Analyze shows it is not used.
    That appears to happen when my search condition is very general.
    This it to be expected, so I am not worried. Most of my searches will
    be intermediate, namely not VERY selective, but also not VERY general.
    So the idea of the multi-column index is to "characterize" each row
    sufficiently, even when it is a perfectly ordinary row with no ONE
    feature being distinctive, but rather several features together giving
    it it's distinctive character. That is my interpretation of the
    multi-column index.

    TJ


    PFC wrote:
    >
    > I think you missed an important "feature" of multicolumn indexes,
    > that you better not use 'OR' in your expressions. You seem to want
    only
    > to use '>=' so this should be OK.
    >
    > Suppose you have 3 columns a,z,e containing values linearly
    > distributed between ...
    >
    > select min(a),max(a),min(z),max(z),min(e),max(e) from test;
    > min | max | min | max | min | max
    > -----+-----+-----+-----+-----+-----
    > 0 | 13 | 0 | 99 | 0 | 99
    >
    >
    > For instance the following query is indexed :
    > explain analyze select * from test where a>=0 and z>=90 and e>=0;
    > QUERY PLAN
    >
    ------------------------------------------------------------------------
    -------------------------------------------------
    >
    > Index Scan using testa on test (cost=0.00..1637.56 rows=11345
    > width=16) (actual time=0.085..51.441 rows=13000 loops=1)
    > Index Cond: ((a >= 0) AND (z >= 90) AND (e >= 0))
    > Total runtime: 56.307 ms
    >
    > The following is only partially indexed :
    >
    > explain analyze select * from test where (a=1 or a=2) and (z=1 or z=8)
    > and e>=0;
    > QUERY PLAN
    >
    ------------------------------------------------------------------------
    ----------------------------------------------------
    >
    > Index Scan using testa, testa on test (cost=0.00..3269.06 rows=346
    > width=16) (actual time=0.328..52.961 rows=400 loops=1)
    > Index Cond: ((a = 1) OR (a = 2))
    > Filter: (((z = 1) OR (z = 8)) AND (e >= 0))
    > Total runtime: 53.297 ms
    >
    > You see the 'index cond' field which is what determines the
    fetched
    > rows, which are then fetched and filtered with the 'filter'
    expression.
    > Having the most selective index cond is important because it will
    > diminish the number of rows to be fetched. However, in your case the
    > filter expression is also beneficial because any row eliminated by
    the
    > filter will not need to go through your expensive matching function.
    >
    > In this case :
    >
    > SELECT count(*) FROM test;
    > => 131072
    >
    > SELECT count(*) FROM test WHERE ((a = 1) OR (a = 2));
    > => 20000
    >
    > SELECT count(*) FROM test WHERE (a=1 or a=2) and (z=1 or z=8) and
    e>=0;
    > => 400
    >
    > In this case the index fetches 20k rows out of 131072 but only 400 are
    > used...
    >
    > If you don't use OR, index use is more likely :
    >
    > explain analyze select * from test where (a,z,e) >= (0,50,80);
    > QUERY PLAN
    >
    ------------------------------------------------------------------------
    -------------------------------------------------
    >
    > Index Scan using testa on test (cost=0.00..1669.78 rows=12627
    > width=16) (actual time=0.087..58.316 rows=13000 loops=1)
    > Index Cond: ((a >= 0) AND (z >= 50) AND (e >= 80))
    > Total runtime: 63.049 ms
    >
    > Here you have a full index scan.
    >
    > To determine the efficiency of your indexes, you can thus use this
    > method, and look at the 'index cond' and 'filter' expressions, and
    > counting the rows matched by each.
    >
    >
    >> particular number of columns
    >> for indexing. I don't want to use too many, nor too few columns. I
    also
    >> want to optimize the nature(which atom types, bond types, etc.)
    >> of the count columns. While I could do this
    >> and use the speedup as the measure of success, I think
    >> that if my B-tree were "covering" the data well, I would get the best
    >> results.
    >> Covering means finding that optimal situation where there is not one
    >> index for all rows
    >> and also not a unique index for every row - something inbetween would
    >> be ideal,
    >> or is that basically a wrong idea?
    ---------------------------(end of broadcast)---------------------------
    TIP 4: Don't 'kill -9' the postmaster

    ---------------------------(end of broadcast)---------------------------
    TIP 3: if posting/reading through Usenet, please send an appropriate
    subscribe-nomail command to [email]majordomo@postgresql.org[/email] so that your
    message can get through to the mailing list cleanly

    Dann Corbit Guest

  14. #13

    Default Does indexing help >= as well as = for integer columns?

    I have a table of about 5 million rows, 24 columns.
    Integer column _c is BTREE indexed (as is _n, _o and 3 others).

    This I understand and like:
    Explain Analyze Select count(smiles) from structure where _c = 30
    Aggregate (cost=105595.11..105595.11 rows=1 width=32) (actual time=17.722..17.724 rows=1 loops=1)
    -> Index Scan using "Nc" on structure (cost=0.00..105528.89 rows=26486 width=32) (actual
    time=0.098..16.095 rows=734 loops=1)
    Index Cond: (_c = 30)
    Total runtime: 18.019 ms

    This I don't get. Why is an index scan not used? Isn't an index supposed
    to help when using > < >= <= too?
    Explain Analyze Select count(smiles) from structure where _c >= 30
    Aggregate (cost=196033.74..196033.74 rows=1 width=32) (actual time=42133.432..42133.434 rows=1
    loops=1)
    -> Seq Scan on structure (cost=0.00..191619.56 rows=1765669 width=32) (actual
    time=8050.437..42117.062 rows=1569 loops=1)
    Filter: (_c >= 30)
    Total runtime: 42133.746 ms


    TJ



    ---------------------------(end of broadcast)---------------------------
    TIP 9: the planner will ignore your desire to choose an index scan if your
    joining column's datatypes do not match

    TJ O'Donnell Guest

  15. #14

    Default Re: Does indexing help >= as well as = for integer columns?

    > This I don't get. Why is an index scan not used? Isn't an index
    > supposed
    > to help when using > < >= <= too?
    It should !
    > Explain Analyze Select count(smiles) from structure where _c >= 30
    > Aggregate (cost=196033.74..196033.74 rows=1 width=32) (actual
    > time=42133.432..42133.434 rows=1
    > loops=1)
    > -> Seq Scan on structure (cost=0.00..191619.56 rows=1765669
    > width=32) (actual
    > time=8050.437..42117.062 rows=1569 loops=1)
    > Filter: (_c >= 30)
    > Total runtime: 42133.746 ms

    See these :

    -> Index Scan using "Nc" on structure (cost=0.00..105528.89 rows=26486
    width=32) (actualtime=0.098..16.095 rows=734 loops=1)
    -> Seq Scan on structure (cost=0.00..191619.56 rows=1765669 width=32)
    (actual time=8050.437..42117.062 rows=1569 loops=1)

    In the index scan case, Planner thinks it'll get "rows=26486" but in
    reality only gets 734 rows.
    In the seq scan case, Planner thinks it'll get "rows=1765669" but in
    reality only gets 1569 rows.

    The two are way off-mark. 26486 still makes it choose an index scan
    because it's a small fraction of the table, but 1765669 is not.

    Analyze, use more precise statistics (alter table set statistics),
    whatever... but you gotta get the planner correctly estimating these
    rowcounts.

    ---------------------------(end of broadcast)---------------------------
    TIP 1: subscribe and unsubscribe commands go to [email]majordomo@postgresql.org[/email]

    PFC Guest

  16. #15

    Default Re: Does indexing help >= as well as = for integer columns?

    "TJ O'Donnell" <tjo@acm.org> writes:
    > This I don't get. Why is an index scan not used? Isn't an index supposed
    > to help when using > < >= <= too?
    > Explain Analyze Select count(smiles) from structure where _c >= 30
    > Aggregate (cost=196033.74..196033.74 rows=1 width=32) (actual time=42133.432..42133.434 rows=1
    > loops=1)
    > -> Seq Scan on structure (cost=0.00..191619.56 rows=1765669 width=32) (actual
    > time=8050.437..42117.062 rows=1569 loops=1)
    > Filter: (_c >= 30)
    Have you ANALYZEd the table lately? That rowcount estimate is off by
    about three orders of magnitude :-(

    regards, tom lane

    ---------------------------(end of broadcast)---------------------------
    TIP 6: Have you searched our list archives?

    [url]http://archives.postgresql.org[/url]

    Tom Lane Guest

  17. #16

    Default Re: Does indexing help >= as well as = for integer columns?


    > "TJ O'Donnell" <tjo@acm.org> writes:
    >
    > > -> Seq Scan on structure (cost=0.00..191619.56 rows=1765669 width=32) (actual
    > > time=8050.437..42117.062 rows=1569 loops=1)
    ^^^^^^^^

    And do you vacuum regularly? Have you done batch updates or deletes and then
    never inserted enough records to reuse that dead space? You might have to
    vacuum full (or alternatively use the cluster command) to reclaim this dead
    space.

    I don't recall what the original motivation to rewrite the analyze sampling
    was, did having lots of dead tuples cause bad estimates in 7.4?

    --
    greg


    ---------------------------(end of broadcast)---------------------------
    TIP 3: if posting/reading through Usenet, please send an appropriate
    subscribe-nomail command to [email]majordomo@postgresql.org[/email] so that your
    message can get through to the mailing list cleanly

    Greg Stark Guest

Posting Permissions

  • You may not post new threads
  • You may post replies
  • You may not post attachments
  • You may not edit your posts

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139