Professional Web Applications Themes

Intersection of two lists - MySQL

I have a table that includes an INT CourseID and a TEXT BooksReferenced. The BooksReferenced is a comma separated list of ISBNs. I want to write a query that uses the intersection of two values of BooksReferenced. Let's say I have Course 10 with BooksReferenced of "40,45,50,55" and Course 11 with BooksReferenced of "45,55,65". I'd like to calculate the intersection of those two lists "40,45,50" intersection "45,55,65" = "45,55" without hitting the tables in my huge database first. Is there an SQL operation in MySQL to do this? Please note that I don't want to do SELECT (...whatever...) WHERE Book ...

  1. #1

    Default Intersection of two lists

    I have a table that includes an INT CourseID and a TEXT
    BooksReferenced. The BooksReferenced is a comma separated list of
    ISBNs. I want to write a query that uses the intersection of two values
    of BooksReferenced.

    Let's say I have Course 10 with BooksReferenced of "40,45,50,55" and
    Course 11 with BooksReferenced of "45,55,65". I'd like to calculate the
    intersection of those two lists "40,45,50" intersection "45,55,65" =
    "45,55" without hitting the tables in my huge database first. Is there
    an SQL operation in MySQL to do this?

    Please note that I don't want to do SELECT (...whatever...) WHERE Book
    IN (SELECT Book WHERE Course = 10) and (SELECT Book WHERE Course = 11)
    because the sub-queries would be too slow. I know in my program which
    books go with the two courses and I want to intersect them and then do
    "SELECT (...whatever...) WHERE Book IN (ListA INTERSECTION ListB)".
    That way I hit my huge database only once per query.

    Any help or direction with this will be appreciated. Thank you.

    ggle75@carpelibris.com Guest

  2. #2

    Default Re: Intersection of two lists


    com wrote: 

    A COMMA SEPARATED WHAT!?!

    strawberry Guest

  3. #3

    Default Re: Intersection of two lists

    com wrote: 

    Google for "database normalization". Then correct your schema to allow
    for a proper many-to-many relationship between course and booksreferenced.

    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    net
    ==================
    Jerry Guest

  4. #4

    Default Re: Intersection of two lists


    Jerry Stuckle wrote: 
    >
    > Google for "database normalization". Then correct your schema to allow
    > for a proper many-to-many relationship between course and booksreferenced.
    >
    > --
    > ==================
    > Remove the "x" from my email address
    > Jerry Stuckle
    > JDS Computer Training Corp.
    > net
    > ==================[/ref]

    It's not a normalization issue - it's an efficiency issue. If you
    google for "database normalization performance denormalization" you'll
    see why I prefer to not "correct" my schema. I can get what books are
    used in a course through an operation other than SELECT course where
    book = 10. Given that I have the list of books in course 10 and the
    books in course 11, I want to get the intersection set of those two
    sets BEFORE I hit the huge database with the SELECT I quoted (which is
    something like "SELECT stuff WHERE Book IN (intersection list goes
    here)".

    I don't need a many to many relationship - I believe my book <-> course
    tuples are a normalized structure. I just have to minimize the queries
    on this enormous table. If I didn't care about performance, I'd be done
    already. It's the creative shortcut I'm looking for.

    Any suggestions?

    ggle75@carpelibris.com Guest

  5. #5

    Default Re: Intersection of two lists

    com wrote: 
    >>
    >>Google for "database normalization". Then correct your schema to allow
    >>for a proper many-to-many relationship between course and booksreferenced.
    >>
    >>--
    >>==================
    >>Remove the "x" from my email address
    >>Jerry Stuckle
    >>JDS Computer Training Corp.
    >>net
    >>==================[/ref]
    >
    >
    > It's not a normalization issue - it's an efficiency issue. If you
    > google for "database normalization performance denormalization" you'll
    > see why I prefer to not "correct" my schema. I can get what books are
    > used in a course through an operation other than SELECT course where
    > book = 10. Given that I have the list of books in course 10 and the
    > books in course 11, I want to get the intersection set of those two
    > sets BEFORE I hit the huge database with the SELECT I quoted (which is
    > something like "SELECT stuff WHERE Book IN (intersection list goes
    > here)".
    >
    > I don't need a many to many relationship - I believe my book <-> course
    > tuples are a normalized structure. I just have to minimize the queries
    > on this enormous table. If I didn't care about performance, I'd be done
    > already. It's the creative shortcut I'm looking for.
    >
    > Any suggestions?
    >[/ref]

    No, they aren't. First Normal form includes: "Atomicity: Each attribute
    must contain a single value, not a set of values."

    You have a set of values (a comma separated list), which does not adhere
    to first normal form.

    And this is exactly the type of problem it will cause you. Rather, you
    should split this out into a third table containing two elements - a
    CourseID and a single ISBN (or perhaps an id value from BooksReferenced
    - it would be faster).

    Normalization is not always about performance.

    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    net
    ==================
    Jerry Guest

  6. #6

    Default Re: Intersection of two lists


    Jerry Stuckle wrote: 
    > >
    > >
    > > It's not a normalization issue - it's an efficiency issue. If you
    > > google for "database normalization performance denormalization" you'll
    > > see why I prefer to not "correct" my schema. I can get what books are
    > > used in a course through an operation other than SELECT course where
    > > book = 10. Given that I have the list of books in course 10 and the
    > > books in course 11, I want to get the intersection set of those two
    > > sets BEFORE I hit the huge database with the SELECT I quoted (which is
    > > something like "SELECT stuff WHERE Book IN (intersection list goes
    > > here)".
    > >
    > > I don't need a many to many relationship - I believe my book <-> course
    > > tuples are a normalized structure. I just have to minimize the queries
    > > on this enormous table. If I didn't care about performance, I'd be done
    > > already. It's the creative shortcut I'm looking for.
    > >
    > > Any suggestions?
    > >[/ref]
    >
    > No, they aren't. First Normal form includes: "Atomicity: Each attribute
    > must contain a single value, not a set of values."
    >
    > You have a set of values (a comma separated list), which does not adhere
    > to first normal form.
    >
    > And this is exactly the type of problem it will cause you. Rather, you
    > should split this out into a third table containing two elements - a
    > CourseID and a single ISBN (or perhaps an id value from BooksReferenced
    > - it would be faster).
    >
    > Normalization is not always about performance.
    >
    > --
    > ==================
    > Remove the "x" from my email address
    > Jerry Stuckle
    > JDS Computer Training Corp.
    > net
    > ==================[/ref]

    Okay. Let me be clearer. My object is to be able to quickly list the
    books that are shared by two courses.

    My ReadingList table has indices on both the BookID and the CourseID.
    The simplest query would be a self join ("SELECT BookID from
    ReadingList AS RL1 LEFT JOIN ReadingList as RL2 on RL1.bookID =
    RL2.bookID WHERE RL1.courseID =10 and RL2.courseID = 11"). That is
    clearly going to be too slow. I can program around it if the database
    doesn't have a faster way.

    The table with the lists is a derived table; I know that one isn't
    normalized. The original, normalized table has atomicity. It is the one
    that has the BookID and the CourseID pairs. I can find out what books
    are referenced for a given course by doing "SELECT BookID FROM
    ReadingList WHERE CourseID = 10".

    I can find the books that have courses referencing two other books by
    doing "SELECT BookID from ReadingList WHERE BookID IN (SELECT BookID
    from ReadingList where CourseID = 10) AND BookID IN (SELECT BookID from
    ReadingList where CourseID = 11)". That would take too long.

    I can preload the lists of Courses from my denomalized table quickly in
    another part of my program. That lets me do "SELECT BookID from
    ReadingList WHERE BookID IN (SELECT list FROM DeNormalizedTable WHERE
    CourseID = 10) AND BookID IN (SELECT list FROM DeNormalizedTable WHERE
    CourseID = 11))". Even though the queries are very quick to get the
    denormalized list of comma separated values from the table, that is
    still too slow. I am stuck with how to intersect the lists effectively
    before I go to the database for a lengthly query. Since it seems I
    can't do it in the database, I'll write the code to do it. I thought
    I'd check if the database could do it before I did that, that's all.

    By the way, thank you for your answers - I'm sorry I'm not explaining
    things clearly enough.

    ggle75@carpelibris.com Guest

  7. #7

    Default Re: Intersection of two lists


    com wrote: 
    > >
    > > No, they aren't. First Normal form includes: "Atomicity: Each attribute
    > > must contain a single value, not a set of values."
    > >
    > > You have a set of values (a comma separated list), which does not adhere
    > > to first normal form.
    > >
    > > And this is exactly the type of problem it will cause you. Rather, you
    > > should split this out into a third table containing two elements - a
    > > CourseID and a single ISBN (or perhaps an id value from BooksReferenced
    > > - it would be faster).
    > >
    > > Normalization is not always about performance.
    > >
    > > --
    > > ==================
    > > Remove the "x" from my email address
    > > Jerry Stuckle
    > > JDS Computer Training Corp.
    > > net
    > > ==================[/ref]
    >
    > Okay. Let me be clearer. My object is to be able to quickly list the
    > books that are shared by two courses.
    >
    > My ReadingList table has indices on both the BookID and the CourseID.
    > The simplest query would be a self join ("SELECT BookID from
    > ReadingList AS RL1 LEFT JOIN ReadingList as RL2 on RL1.bookID =
    > RL2.bookID WHERE RL1.courseID =10 and RL2.courseID = 11"). That is
    > clearly going to be too slow. I can program around it if the database
    > doesn't have a faster way.
    >
    > The table with the lists is a derived table; I know that one isn't
    > normalized. The original, normalized table has atomicity. It is the one
    > that has the BookID and the CourseID pairs. I can find out what books
    > are referenced for a given course by doing "SELECT BookID FROM
    > ReadingList WHERE CourseID = 10".
    >
    > I can find the books that have courses referencing two other books by
    > doing "SELECT BookID from ReadingList WHERE BookID IN (SELECT BookID
    > from ReadingList where CourseID = 10) AND BookID IN (SELECT BookID from
    > ReadingList where CourseID = 11)". That would take too long.
    >
    > I can preload the lists of Courses from my denomalized table quickly in
    > another part of my program. That lets me do "SELECT BookID from
    > ReadingList WHERE BookID IN (SELECT list FROM DeNormalizedTable WHERE
    > CourseID = 10) AND BookID IN (SELECT list FROM DeNormalizedTable WHERE
    > CourseID = 11))". Even though the queries are very quick to get the
    > denormalized list of comma separated values from the table, that is
    > still too slow. I am stuck with how to intersect the lists effectively
    > before I go to the database for a lengthly query. Since it seems I
    > can't do it in the database, I'll write the code to do it. I thought
    > I'd check if the database could do it before I did that, that's all.
    >
    > By the way, thank you for your answers - I'm sorry I'm not explaining
    > things clearly enough.[/ref]

    Maybe I'm missing something.

    Given that your stated objective is 'to be able to quickly list the
    books that are shared by two courses', I really think you should give
    further consideration to Jerry's solution.

    I don't know what the most efficient query would look like - but, in
    Jerry's schema, an example of a query to find ALL books shared by two
    courses could be:

    SELECT t1.book_id AS shared_books
    FROM courses_books t1
    LEFT JOIN courses_books t2 ON t1.book_id = t2.book_id
    WHERE t1.course_id =1
    AND t2.course_id =2
    ORDER BY shared_books;

    strawberry Guest

  8. #8

    Default Re: Intersection of two lists

    com wrote: 

    Why do you think it will be too slow? Have you measured the
    performance? Or are you just making a reasoned guess? Have you used
    EXPLAIN to make sure indexes are being used? Have you tuned cache buffers?

    Also, I'm not sure why you are using an outer join if you want the books
    that are shared by both courses. The result of the LEFT JOIN above is
    _all_ books used by course 10, regardless of whether it is used by
    course 11. In this case, to get books shared by both courses, use an
    inner join.

    An outer JOIN is often slower than an inner join over the same tables &
    columns. So use outer joins only when you need the result of an outer join.
     

    Try using a JOIN instead of a subquery here. For example:

    SELECT BookID
    FROM ReadingList AS r1 JOIN ReadingList AS r2
    ON (r1.BookID = r2.BookID AND r1.CourseID != r2.CourseID)
    WHERE r1.CourseID = 10 AND r2.CourseID = 11;

    Put a compound primary key on this table over (CourseID, BookID), as
    well as foreign keys on each column, referencing the primary keys of the
    courses table and the books table respectively.
     

    But that simply doesn't work. The following two tests are quite different:

    BookID IN (1,2,3,4,5,6,7,8,9,10)

    BookID IN ('1,2,3,4,5,6,7,8,9,10')

    Your subquery with the denormalized table performs a test equivalent to
    the second line above. That is, it compared BookID to a single string,
    which is bound to fail, because 10 is not equal to that string.

    Other problems with using the denormalized table design you describe
    include:
    - No referential integrity possible, to ensure that the entries in the
    list are valid entries. Nothing prevents the list from containing
    integers that don't correspond to any book in the database.
    - No data typing possible, to ensure that the entries in the list are
    integers at all. Nothing prevents the list from containing, '1,2,banana,4'.
    - Queries can't use an index to search for values. You have to resort
    to ugly and inefficient regular expressions like '[[:<>]]10[[:>:]]'.
    - It's expensive to insert, delete, or count values. Basically you have
    to fetch the lists into your app and do the work in application code.
    - The number of items in the list is subject to a hard limit, defined by
    the length of the string column.
    - It's less efficient to store integers in text format than in a natural
    integer representation.
    - Your solution requires periodically refreshing the denormalized table,
    so the denormalized data is at risk of getting out of sync. There's no
    way to know whether the denormalized table is in up to date, or requires
    refreshing. To test whether it's up to date is practically as expensive
    as repopulating the whole table. This is a ridiculous strategy if you
    are concerned about performance.

    Please listen to the consensus of the folks here who advise you to use a
    normalized table for both storing and querying data in many-to-many
    relationships. All of the above problems are solved easily by sticking
    with a normalized design.

    Solve performance issues another way, like with proper indexing and caching.

    Regards,
    Bill K.
    Bill Guest

  9. #9

    Default Re: Intersection of two lists

    "com" <com> wrote: [/ref][/ref]

    ....
     [/ref][/ref]

    I second that.
     [/ref][/ref]

    Can you please prove this? Did you measure performance of the
    normalized vs. the non-normalized schema *for your application*?

    Schema optimization is often a tradeoff because speeding up certain
    operations may slowdown others. In fact you experience this right now.
    A collapsed list of books (potentially) speeds up a lookup but will
    slowdown other operations like intersecting lists or removing a book
    from a list.
     

    Again: can you prove this?

    Besides that your query is wrong. To find shared books you should do
    an INNER JOIN using the book id.
     

    So you have a second problem here. How do you manage to keep both
    tables synchronized? How much costs that?
     

    I guess you can't prove this either.

    Looks like you're doing premature optimization.
    Please google for premature optimization and why it is bad.


    XL
    --
    Axel Schwenke, Senior Software Developer, MySQL AB

    Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
    MySQL User Forums: http://forums.mysql.com/
    Axel Guest

  10. #10

    Default Re: Intersection of two lists

    "com" <com> wrote: 
    >>
    >> Google for "database normalization". Then correct your schema to allow
    >> for a proper many-to-many relationship between course and booksreferenced.
    >>
    >> --
    >> ==================
    >> Remove the "x" from my email address
    >> Jerry Stuckle
    >> JDS Computer Training Corp.
    >> net
    >> ==================[/ref]
    >
    > It's not a normalization issue - it's an efficiency issue. If you
    > google for "database normalization performance denormalization" you'll
    > see why I prefer to not "correct" my schema. I can get what books are
    > used in a course through an operation other than SELECT course where
    > book = 10. Given that I have the list of books in course 10 and the
    > books in course 11, I want to get the intersection set of those two
    > sets BEFORE I hit the huge database with the SELECT I quoted (which is
    > something like "SELECT stuff WHERE Book IN (intersection list goes
    > here)".
    >
    > I don't need a many to many relationship - I believe my book <-> course
    > tuples are a normalized structure. I just have to minimize the queries
    > on this enormous table. If I didn't care about performance, I'd be done
    > already. It's the creative shortcut I'm looking for.
    >
    > Any suggestions?
    >[/ref]

    --
    Axel Schwenke, Senior Software Developer, MySQL AB

    Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
    MySQL User Forums: http://forums.mysql.com/
    Axel Guest

  11. #11

    Default Re: Intersection of two lists


    com wrote: 

    Well that last line certainly isn't true.

    Some very experienced SQL designers and developers have given you some
    very good advice and in every case, you have argued that they are wrong
    and you are right.

    I will add my weight to them and say that you are wrong and they are
    right. Take a look back in this newsgroup and you will see just how
    right Jerry's answers tend to be.

    Done that?

    Right, now accept the help and direction and change the design
    accordingly!

    Captain Guest

  12. #12

    Default Re: Intersection of two lists

    As you violate 1-st form, don't expect SQL can help you with your task.
    You will have to to it in the client application by means of ordinary
    programming languages.

    MasterZiv Guest

  13. #13

    Default Re: Intersection of two lists

    com wrote: 
    >>
    >>No, they aren't. First Normal form includes: "Atomicity: Each attribute
    >>must contain a single value, not a set of values."
    >>
    >>You have a set of values (a comma separated list), which does not adhere
    >>to first normal form.
    >>
    >>And this is exactly the type of problem it will cause you. Rather, you
    >>should split this out into a third table containing two elements - a
    >>CourseID and a single ISBN (or perhaps an id value from BooksReferenced
    >>- it would be faster).
    >>
    >>Normalization is not always about performance.
    >>
    >>--
    >>==================
    >>Remove the "x" from my email address
    >>Jerry Stuckle
    >>JDS Computer Training Corp.
    >>net
    >>==================[/ref]
    >
    >
    > Okay. Let me be clearer. My object is to be able to quickly list the
    > books that are shared by two courses.
    >
    > My ReadingList table has indices on both the BookID and the CourseID.
    > The simplest query would be a self join ("SELECT BookID from
    > ReadingList AS RL1 LEFT JOIN ReadingList as RL2 on RL1.bookID =
    > RL2.bookID WHERE RL1.courseID =10 and RL2.courseID = 11"). That is
    > clearly going to be too slow. I can program around it if the database
    > doesn't have a faster way.
    >
    > The table with the lists is a derived table; I know that one isn't
    > normalized. The original, normalized table has atomicity. It is the one
    > that has the BookID and the CourseID pairs. I can find out what books
    > are referenced for a given course by doing "SELECT BookID FROM
    > ReadingList WHERE CourseID = 10".
    >
    > I can find the books that have courses referencing two other books by
    > doing "SELECT BookID from ReadingList WHERE BookID IN (SELECT BookID
    > from ReadingList where CourseID = 10) AND BookID IN (SELECT BookID from
    > ReadingList where CourseID = 11)". That would take too long.
    >
    > I can preload the lists of Courses from my denomalized table quickly in
    > another part of my program. That lets me do "SELECT BookID from
    > ReadingList WHERE BookID IN (SELECT list FROM DeNormalizedTable WHERE
    > CourseID = 10) AND BookID IN (SELECT list FROM DeNormalizedTable WHERE
    > CourseID = 11))". Even though the queries are very quick to get the
    > denormalized list of comma separated values from the table, that is
    > still too slow. I am stuck with how to intersect the lists effectively
    > before I go to the database for a lengthly query. Since it seems I
    > can't do it in the database, I'll write the code to do it. I thought
    > I'd check if the database could do it before I did that, that's all.
    >
    > By the way, thank you for your answers - I'm sorry I'm not explaining
    > things clearly enough.
    >[/ref]

    You're explaining things enough. You're just not listening to the
    answers. You can't do it in the database because your design is incorrect.

    As I said - normalize your data. Let SQL do the job it's designed to do.

    Normalization does not necessary mean decreased performance. It can, in
    fact, lead to increased performance.

    This is a perfect example. You are trying to get SQL to p a comma
    separated list for two courses, then compare the results. Not something
    SQL was designed to do because you're violating one of the most basic
    rules of database design. Sure, you might be able to force SQL to do
    it, but it's not designed to do that.

    A link table with integer id's is the correct way to do it.

    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    net
    ==================
    Jerry Guest

Similar Threads

  1. FH: intersection points
    By Kaffee365 webforumsuser@macromedia.com in forum Macromedia Freehand
    Replies: 2
    Last Post: January 3rd, 10:41 PM
  2. Model within model transform.position, intersection, overlapping models
    By Zafada webforumsuser@macromedia.com in forum Macromedia Director 3D
    Replies: 0
    Last Post: August 30th, 12:30 AM
  3. lists, attaching behaviour dynamiclly, update lists
    By crazy big fun in forum Macromedia Director Basics
    Replies: 0
    Last Post: June 28th, 06:20 PM

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139