
JOINed ORDERed LIMITed query not optimized enough - need help - MySQL


  1. #1

    JOINed ORDERed LIMITed query not optimized enough - need help

    Hi all,

    I need to retrieve 13 rows from a 100,000-row JOIN, starting with a row
    with given values.
    All required indexes are available AFAICT, yet EXPLAIN tells me mysql is
    "Using index; Using temporary; Using filesort" on the first table it's
    touching.

    I don't know where I'm going wrong, or whether I'm hitting a limitation
    of mysql's query optimizer. Any help appreciated.

    DETAILS

    mysql version is 5.0.15.

    Tables are set up like this:

    -- Main categories. An ID and a name.
    CREATE TABLE main (
    id int(9),
    name varchar(40),
    PRIMARY KEY (id),
    KEY name (name)
    ) TYPE = InnoDB;
    -- Sample categories. Choose random numbers so that id order and name
    -- order are different.
    INSERT INTO main VALUES
    (0, CONCAT(RAND(), ' main 0')),
    ...
    (9, CONCAT(RAND(), ' main 9'));
    -- Just not to leave this stone unturned:
    ANALYZE TABLE main;

    -- Subcategories. ID-plus-name.
    -- An additional foreign key that links up with the 'main' table.
    CREATE TABLE sub (
    id int(9),
    name varchar(40),
    main_id int(9),
    PRIMARY KEY (id),
    KEY name_main (name, main_id),
    KEY main_name (main_id, name)
    ) TYPE = InnoDB;
    INSERT INTO sub VALUES
    (000, CONCAT(RAND(), ' main 0 sub 000'), 0),
    ...
    (999, CONCAT(RAND(), ' main 9 sub 999'), 9);
    ANALYZE TABLE sub;

    -- Finally, detail records, all 100,000 of them.
    CREATE TABLE detail (
    id int(9),
    name varchar(40),
    sub_id int(9),
    PRIMARY KEY (id),
    KEY name_sub (name, sub_id),
    KEY sub_name (sub_id, name)
    ) TYPE = InnoDB;
    INSERT INTO detail VALUES
    (00000, CONCAT(RAND(), ' sub 000 detail 00000'), 000),
    ...
    (99999, CONCAT(RAND(), ' sub 999 detail 99999'), 999);
    ANALYZE TABLE detail;


    Now I want to present a detail-sub-main list to the end user:

    SELECT * FROM
    main
    JOIN sub ON main.id = sub.main_id
    JOIN detail ON sub.id = detail.sub_id

    I want this to be sorted by detail name, then sub name, finally main name:

    SELECT * FROM
    main
    JOIN sub ON main.id = sub.main_id
    JOIN detail ON sub.id = detail.sub_id
    ORDER by detail.name, sub.name, main.name

    The user isn't supposed to see the entire list; he's scrolling through
    it. Let's say he's at the record with detail name '0.5x' and needs to
    see the next 12 records, so I have

    SELECT * FROM
    main
    JOIN sub ON main.id = sub.main_id
    JOIN detail ON sub.id = detail.sub_id
    WHERE detail.name > '0.5'
    ORDER by detail.name, sub.name, main.name
    LIMIT 12

    This takes just 0.46-0.54 seconds to complete on my machine, but the
    real thing with 150,000 detail records and four additional JOINs to
    produce auxiliary information is unbearably slow: 25-40 seconds.

    Now EXPLAIN tells me mysql started with the 'main' table - that's just
    backwards, it should have started with 'detail' to find just the first
    twelve rows, then work up through 'sub' to 'main'!
    It's no surprise that it then needed to "use temporary; use filesort".

    Well, to speed up this particular case, I might be able to use a
    STRAIGHT_JOIN. However, that would require reordering the JOIN part of
    the query, and I'd like to avoid that if possible - I'd have to do some
    serious rewriting of the dynamic SQL part of the application.
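
    Something like this, I suppose (just a sketch - STRAIGHT_JOIN forces
    mysql to join the tables in the order they are listed, so the FROM part
    has to start with detail):

    SELECT STRAIGHT_JOIN * FROM
    detail
    JOIN sub ON sub.id = detail.sub_id
    JOIN main ON main.id = sub.main_id
    WHERE detail.name > '0.5'
    ORDER BY detail.name, sub.name, main.name
    LIMIT 12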

    Hints, tips, improvements very much appreciated.

    Regards,
    Jo
    Joachim Guest

  2. #2

    Re: JOINed ORDERed LIMITed query not optimized enough - need help

    > Well, to speed up this particular case, I might be able to use a STRAIGHT_JOIN.

    Update: I just tried STRAIGHT_JOIN.
    It doesn't help either.

    Strangely enough, the mysql manual says this on
    http://dev.mysql.com/doc/refman/5.0/en/limit-optimization.html :
    "If you use LIMIT row_count with ORDER BY, MySQL ends the sorting as
    soon as it has found the first row_count rows of the sorted result,
    rather than sorting the entire result. If ordering is done by using an
    index, this is very fast."

    That's exactly what I hoped for.
    However, the text goes on to say:

    "If a filesort must be done, all rows that match the query without the
    LIMIT clause must be selected, and most or all of them must be sorted,
    before it can be ascertained that the first row_count rows have been found."

    Now EXPLAIN indeed says that it's using a filesort:

    id: 1
    select_type: SIMPLE
    table: detail
    type: range
    possible_keys: name_sub, sub_name
    key: name_sub
    key_len: 43
    ref: NULL
    rows: 50231 (that's probably the number of rows that are >= '0.5')
    Extra: Using where; Using index; Using temporary; Using filesort


    .... but why?


    Regards,
    Jo
    Joachim Guest

  3. #3

    Re: JOINed ORDERed LIMITed query not optimized enough - need help

    Joachim Durchholz <org> wrote: 

    table main: 10 rows
    table sub: 1,000 rows
    table detail: 100,000 rows

    This is potentially a 1,000,000,000-row JOIN result.
     
     

    If you ORDER BY columns from different tables, then MySQL always uses
    external sorting of the result set:
    http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
     

    LIMIT doesn't help here. Before LIMIT can be applied, sorting has to be
    done. Before sorting can be done, the complete result set must be known.
     

    Nope. The first 12 rows in table detail may not have corresponding rows
    in the sub table and may thus not be part of the unLIMITed join result.
    If the first N rows from detail all have the same value in name, then
    the sort order is unknown unless MySQL looks at the other JOINed tables.

    LIMIT is done last, with one exception: if MySQL can retrieve rows in
    sort order, that is: if there is an index that can be used for finding
    rows in the specified result order - then MySQL will stop retrieving
    rows when the LIMIT is reached. This is not the case here.
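
    For comparison, a sketch of a case where that early stop can happen:
    here the ORDER BY uses only detail.name, the leading column of the
    name_sub index, so rows can be read in index order and reading stops
    after 12 rows.

    SELECT * FROM detail
    WHERE name > '0.5'
    ORDER BY name
    LIMIT 12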


    Also the decision of the optimizer depends on many things. The main
    goal of the optimizer is to keep the intermediate JOIN result small.
    If MySQL JOINs in order main -> sub -> detail there are [10], [at most
    10,000], [final] rows in the intermediate results table.

    For order detail -> sub -> main this is [100,000 * filter selectivity],
    [at most 100,000 * filter selectivity], [final].

    Now it all depends on how selective your filter on the detail table is
    (or rather: how selective the optimizer thinks it is). I guess
    detail.name > '0.5' is not very selective. But even if your filter
    yielded just 10,000 rows from the detail table, the overall query
    cost would still be higher than with the main -> sub -> detail plan.

    It would be interesting to see the query plan when the filter finds
    only very few (say 10) rows from the detail table.
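
    For example (only a sketch - the constant is an arbitrary, very
    selective value; pick one that matches your data):

    EXPLAIN SELECT * FROM
    main
    JOIN sub ON main.id = sub.main_id
    JOIN detail ON sub.id = detail.sub_id
    WHERE detail.name > '0.9999'
    ORDER BY detail.name, sub.name, main.name
    LIMIT 12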

     

    This is quite old. I recommend upgrading to the latest 5.0. There have
    been many bugfixes and probably optimizer improvements since then.


    XL
    --
    Axel Schwenke, Support Engineer, MySQL AB

    Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
    MySQL User Forums: http://forums.mysql.com/
    Axel Guest

  4. #4

    Re: JOINed ORDERed LIMITed query not optimized enough - need help

    Axel Schwenke wrote:
    >
    > table main: 10 rows
    > table sub: 1,000 rows
    > table detail: 100,000 rows
    >
    > This is potentially a 1,000,000,000-row JOIN result.

    I have
    ...
    JOIN detail ON sub.id = ...
    where sub.id is the primary key. So it's already known that there's at
    most 1 sub record per detail record (and, likewise, at most 1 main
    record per sub record).
    In summary, we can infer that the result has no more rows than were
    selected from the detail table.

    If I had added referential integrity constraints, we'd also be able to
    infer that there is always a sub record for each detail record and a
    main record for each sub record. (This is InnoDB, so that would have
    been a possibility.)
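
    That would have been something like this (sketch only; the constraint
    names are made up):

    ALTER TABLE sub
    ADD CONSTRAINT fk_sub_main FOREIGN KEY (main_id) REFERENCES main (id);
    ALTER TABLE detail
    ADD CONSTRAINT fk_detail_sub FOREIGN KEY (sub_id) REFERENCES sub (id);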

    The interesting question here is, of course, whether the optimizer is
    smart enough to do the same kind of inference that we can.
     
    >
    > LIMIT doesn't help here. Before LIMIT can be applied, sorting has to be
    > done. Before sorting can be done, the complete result set must be known.

    Well, my (admittedly slightly short-sighted) expectation would have been
    that mysql would read 12 detail rows and add the sub and main rows after
    that, finally sorting the result.
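
    Expressed as a query, that expectation would be roughly this (just a
    sketch - as pointed out below, it's not equivalent to the original
    query when a detail row has no matching sub row, or when detail names
    repeat):

    SELECT * FROM
    (SELECT * FROM detail
    WHERE detail.name > '0.5'
    ORDER BY detail.name
    LIMIT 12) AS d
    JOIN sub ON sub.id = d.sub_id
    JOIN main ON main.id = sub.main_id
    ORDER BY d.name, sub.name, main.name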
     
    >
    > Nope. The first 12 rows in table detail may not have corresponding rows
    > in the sub table and may thus not be part of the unLIMITed join result.

    Agreed.

    Would adding referential integrity constraints have made a difference?
     

    Right. That case could actually happen (not really with the example data,
    which has randomized 'name' data, but with production data, real names
    tend to have duplicates).
    Though I would have expected that in this case, mysql would read the
    first 12 rows via detail's name index, then continue reading on that
    index as long as there are rows with the same name. Since mysql can
    inspect the key distribution, it would even have at least a rough
    estimate of how many rows that would be, even before reading them
    (something that I cannot easily do from the application level). I.e. the
    selectivity would not only be far better than 50%, mysql would even have
    an estimate.
     

    The real problem seems to be that the ORDER BY clause happens on columns
    from multiple tables. At least that's what I gathered from the 5.0
    documentation (a day after posting, as usual).

    I have googled around a bit. Paging in huge lists seems to be something
    that few databases can do well - at least that's what people on a pgsql
    mailing list said.
     

    Actually EXPLAIN showed that it would start with detail.
    I guess the ORDER BY clause gave the incentive (without the ORDER BY, it
    indeed started with main). Possibly plus the knowledge that each JOIN
    has a fan-out of at most 1.
     

    Indeed. I chose '0.5' because the standard usage would be paging through
    the list with a browser, and the default selectivity would then be
    roughly 50%.

    I had hoped that the LIMIT clause would imply enough selectivity, which
    is essentially
    12 / 100,000 * (factor for non-equidistribution of the 'name' key),
    which would have ensured the expected behavior even if mysql's key
    statistics had been very coarse-grained.
     

    SELECT * FROM
    main
    JOIN sub ON main.id = sub.main_id
    JOIN detail ON sub.id = detail.sub_id
    WHERE detail.name LIKE '0.5000%'
    ORDER by detail.name, sub.name, main.name

    0.0041 sec, which is what I'd expect.

    The JOIN order is still detail-sub-main, with "Using where; Using index;
    Using temporary; Using filesort" on 'detail'.

    On second thought, I don't find the JOIN order very surprising: the
    selectivity is on detail anyway, so any other join order would have been
    suicidal.

    Since the LIKE clause happens to restrict the result to exactly 12 rows,
    it's no surprise that the filesort is blindingly fast in this case.

    The problem is that at the application level, the code doesn't know
    enough about the key distribution to decide how long the LIKE prefix
    should be.
    ... hmm... well, I *could* try starting with
    SELECT COUNT(*) FROM detail
    WHERE name LIKE '0.50001357076691 sub 826 detail 8265%'
    and shortening the part before the % until I get enough rows back, then
    selecting with the first LIKE expression that returned enough. A
    repeated-bisection approach using BETWEEN would work for numeric columns.
    Unfortunately, the communications overhead between client and server
    makes this all far more inefficient than the optimizer figuring this out
    by itself...
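
    Roughly, the probing would go like this (the prefixes below are just
    illustrative cut-off points; the application would shorten the string
    step by step):

    -- full name: matches a single row, too few
    SELECT COUNT(*) FROM detail
    WHERE name LIKE '0.50001357076691 sub 826 detail 8265%';
    -- shortened prefix: enough rows, so use it in the real query
    SELECT COUNT(*) FROM detail
    WHERE name LIKE '0.5000%';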
     
    >
    > This is quite old. I recommend upgrading to the latest 5.0. There have
    > been many bugfixes and probably optimizer improvements since then.

    It's a local XAMPP installation. A newer XAMPP with a newer mysql would
    come with a newer PHP, which in turn would require a newer DBG module,
    but DBG always lags a year or so behind PHP (unless you have money to
    spare, which I don't, unfortunately), so I'm essentially stuck with what
    I have right now.
    I could set up something with an old XAMPP and a newer mysql, of course,
    but I would be spending a lot of work, with questionable effect: I'm
    developing for web administrators, who are usually stuck with whatever
    mysql version the webhosting provider gives them, and that's usually a
    few versions behind what mysql.com has to offer. In other words, my code
    would run just fine in the testing environment, but not in production...

    Anyway, the mysql docs flatly state that a filesort is always needed if
    the ORDER BY uses fields from more than one JOINed table. This is
    unchanged between 5.0 and 5.1, so I wouldn't expect that a newer 5.0
    version would help. (Am I right about that?)

    Regards,
    Jo
    Joachim Guest
