Professional Web Applications Themes

String best match - MySQL

Hi all, I've a program wich send us some text datas. I've a table field in wich I've text and I've to find the "best matching result" using the send data. Here are some examples: program sends: 'start a new projet' my table's field has those records on it: 'starting a new project' 'start your project' 'start a project now' 'your project start' Now since this is an automatic treatment, I've to find out the "best" result using the given sentence since I know in most cases I won't have a 100% matching text. So could someone point me on ...

  1. #1

    Default String best match

    Hi all,

    I've a program wich send us some text datas. I've a table field in wich I've
    text and I've to find the "best matching result" using the send data.

    Here are some examples:
    program sends: 'start a new projet'

    my table's field has those records on it:
    'starting a new project'
    'start your project'
    'start a project now'
    'your project start'

    Now since this is an automatic treatment, I've to find out the "best" result
    using the given sentence since I know in most cases I won't have a 100%
    matching text.
    So could someone point me on a way to do so ?

    Thanks for helping.

    Bob



    Bob Guest

  2. #2

    Default Re: String best match

    On Apr 10, 1:54 pm, "Bob Bedford" <com> wrote: 

    What's your definition of a best match?

    strawberry Guest

  3. #3

    Default Re: String best match

    Hi strawberry, thanks for your answer,
     

    Well I've to provide a result and not having a general idea of what can be
    done, the "best match" is the closer result.

    program sends: 'start a new projet' 

    in my mind, the "closest" result should be the first one: it has the same
    words "new" and "project" and "start" is close of "starting".

    I think I've to find the closer result having in order:
    1- the max number of matching words
    2- the same order for matching words
    3- an other order for matching words
    4- "close" words (start-starting).

    The query must be kept simple, I won't manage a dictionary of close words.
    Maybe somebody may see a "better" closer function, but for me the closest is
    the first one.

    Thanks for helping



    Bob Guest

  4. #4

    Default Re: String best match

    On 10.04.2007 15:43, Bob Bedford wrote: 
    >
    > Well I've to provide a result and not having a general idea of what can be
    > done, the "best match" is the closer result.
    >
    > program sends: 'start a new projet' 
    >
    > in my mind, the "closest" result should be the first one: it has the same
    > words "new" and "project" and "start" is close of "starting".
    >
    > I think I've to find the closer result having in order:
    > 1- the max number of matching words
    > 2- the same order for matching words
    > 3- an other order for matching words
    > 4- "close" words (start-starting).
    >
    > The query must be kept simple, I won't manage a dictionary of close words.
    > Maybe somebody may see a "better" closer function, but for me the closest is
    > the first one.[/ref]

    I think what the fruit (sorry for that, strawberry :-)) really wanted to
    hear is a more formal (ideally mathematical) definition of closeness.
    Without that you won't be able to build *any* computerized system that
    does what you require.

    Kind regards

    robert
    Robert Guest

  5. #5

    Default Re: String best match

    Hi Robert,
     

    Probably that's where I don't know how to start from...
    Where to start, what to write down to be used later ???

    I've no idea. My goal is to find the best match, wathever the matematical
    formula is used. That's probably where I need help
    Bob
    Bob Guest

  6. #6

    Default Re: String best match

    On Apr 10, 3:51 pm, "Bob Bedford" <com> wrote: 
    >
    > Probably that's where I don't know how to start from...
    > Where to start, what to write down to be used later ???
    >
    > I've no idea. My goal is to find the best match, wathever the matematical
    > formula is used. That's probably where I need help[/ref]

    I'm not suggesting that this IS how to do it, but, subject to the
    envisaged size of the database, here's how I might begin to approach
    it.
    On a large dataset, the maths really would become key - and the query
    I've provided doesn't forgive spelling errors!
    I *would* build a keywords table and and a keyword_aliases table
    because I think the efficiency of such a design would outweigh the
    management overhead.
    I've built some tables to illustrate the example...


    CREATE TABLE `actions` (
    `action_id` smallint(6) NOT NULL auto_increment,
    `action` varchar(100) collate latin1_general_ci NOT NULL,
    PRIMARY KEY (`action_id`)
    ) ENGINE=MyISAM DEFAULT CHT=latin1 COLLATE=latin1_general_ci
    AUTO_INCREMENT=5 ;


    INSERT INTO `actions` VALUES (1, 'start a new project');
    INSERT INTO `actions` VALUES (2, 'modify an existing project');
    INSERT INTO `actions` VALUES (3, 'delete an old project');
    INSERT INTO `actions` VALUES (4, 'recover everything I just lost');


    CREATE TABLE `keywords` (
    `id` smallint(6) NOT NULL auto_increment,
    `keyword` varchar(20) collate latin1_general_ci NOT NULL,
    PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHT=latin1 COLLATE=latin1_general_ci
    AUTO_INCREMENT=9 ;


    INSERT INTO `keywords` VALUES (1, 'start');
    INSERT INTO `keywords` VALUES (2, 'new');
    INSERT INTO `keywords` VALUES (3, 'project');
    INSERT INTO `keywords` VALUES (4, 'delete');
    INSERT INTO `keywords` VALUES (5, 'old');
    INSERT INTO `keywords` VALUES (6, 'modify');
    INSERT INTO `keywords` VALUES (7, 'existing');
    INSERT INTO `keywords` VALUES (8, 'recover');
    CREATE TABLE `keyword_aliases` (
    `keyword_id` smallint(6) NOT NULL auto_increment,
    `alias` varchar(20) collate latin1_general_ci NOT NULL,
    PRIMARY KEY (`keyword_id`,`alias`)
    ) ENGINE=MyISAM DEFAULT CHT=latin1 COLLATE=latin1_general_ci
    AUTO_INCREMENT=5 ;


    INSERT INTO `keyword_aliases` VALUES (1, 'begin');
    INSERT INTO `keyword_aliases` VALUES (1, 'starting');
    INSERT INTO `keyword_aliases` VALUES (1, 'create');
    INSERT INTO `keyword_aliases` VALUES (1, 'start');
    INSERT INTO `keyword_aliases` VALUES (2, 'fresh');
    INSERT INTO `keyword_aliases` VALUES (2, 'new');
    INSERT INTO `keyword_aliases` VALUES (3, 'task');
    INSERT INTO `keyword_aliases` VALUES (3, 'undertaking');
    INSERT INTO `keyword_aliases` VALUES (4, 'delete');
    INSERT INTO `keyword_aliases` VALUES (1, 'deleted');
    INSERT INTO `keyword_aliases` VALUES (4, 'erase');


    User submits "begin a fresh undertaking". This is exploded in php, and
    each of the significant terms is fed into a query:

    SELECT a.*,count(*) score FROM actions a
    JOIN
    (SELECT *
    FROM `keyword_aliases` ka
    LEFT JOIN keywords ON keywords.id = ka.keyword_id
    WHERE alias LIKE "begin"
    UNION ALL SELECT *
    FROM `keyword_aliases` ka
    LEFT JOIN keywords ON keywords.id = ka.keyword_id
    WHERE alias LIKE "fresh"
    UNION ALL SELECT *
    FROM `keyword_aliases` ka
    LEFT JOIN keywords ON keywords.id = ka.keyword_id
    WHERE alias LIKE "undertaking")b
    ON a.action LIKE CONCAT("%",b.keyword,"%")
    GROUP BY a.action_id
    ORDER BY score DESC;

    1 start a new project 3
    3 delete an old project 1
    2 modify an existing project 1

    Obviously, this is a very contrived example that provides no weight to
    the individual words, just to their ulative impact.

    strawberry Guest

  7. #7

    Default Re: String best match


    "strawberry" <com> a écrit dans le message de news:
    googlegroups.com... 
    >>
    >> Probably that's where I don't know how to start from...
    >> Where to start, what to write down to be used later ???
    >>
    >> I've no idea. My goal is to find the best match, wathever the matematical
    >> formula is used. That's probably where I need help[/ref]
    >
    > I'm not suggesting that this IS how to do it, but, subject to the
    > envisaged size of the database, here's how I might begin to approach
    > it.
    > On a large dataset, the maths really would become key - and the query
    > I've provided doesn't forgive spelling errors!
    > I *would* build a keywords table and and a keyword_aliases table
    > because I think the efficiency of such a design would outweigh the
    > management overhead.
    > I've built some tables to illustrate the example...
    >
    >
    > CREATE TABLE `actions` (
    > `action_id` smallint(6) NOT NULL auto_increment,
    > `action` varchar(100) collate latin1_general_ci NOT NULL,
    > PRIMARY KEY (`action_id`)
    > ) ENGINE=MyISAM DEFAULT CHT=latin1 COLLATE=latin1_general_ci
    > AUTO_INCREMENT=5 ;
    >
    >
    > INSERT INTO `actions` VALUES (1, 'start a new project');
    > INSERT INTO `actions` VALUES (2, 'modify an existing project');
    > INSERT INTO `actions` VALUES (3, 'delete an old project');
    > INSERT INTO `actions` VALUES (4, 'recover everything I just lost');
    >
    >
    > CREATE TABLE `keywords` (
    > `id` smallint(6) NOT NULL auto_increment,
    > `keyword` varchar(20) collate latin1_general_ci NOT NULL,
    > PRIMARY KEY (`id`)
    > ) ENGINE=MyISAM DEFAULT CHT=latin1 COLLATE=latin1_general_ci
    > AUTO_INCREMENT=9 ;
    >
    >
    > INSERT INTO `keywords` VALUES (1, 'start');
    > INSERT INTO `keywords` VALUES (2, 'new');
    > INSERT INTO `keywords` VALUES (3, 'project');
    > INSERT INTO `keywords` VALUES (4, 'delete');
    > INSERT INTO `keywords` VALUES (5, 'old');
    > INSERT INTO `keywords` VALUES (6, 'modify');
    > INSERT INTO `keywords` VALUES (7, 'existing');
    > INSERT INTO `keywords` VALUES (8, 'recover');
    > CREATE TABLE `keyword_aliases` (
    > `keyword_id` smallint(6) NOT NULL auto_increment,
    > `alias` varchar(20) collate latin1_general_ci NOT NULL,
    > PRIMARY KEY (`keyword_id`,`alias`)
    > ) ENGINE=MyISAM DEFAULT CHT=latin1 COLLATE=latin1_general_ci
    > AUTO_INCREMENT=5 ;
    >
    >
    > INSERT INTO `keyword_aliases` VALUES (1, 'begin');
    > INSERT INTO `keyword_aliases` VALUES (1, 'starting');
    > INSERT INTO `keyword_aliases` VALUES (1, 'create');
    > INSERT INTO `keyword_aliases` VALUES (1, 'start');
    > INSERT INTO `keyword_aliases` VALUES (2, 'fresh');
    > INSERT INTO `keyword_aliases` VALUES (2, 'new');
    > INSERT INTO `keyword_aliases` VALUES (3, 'task');
    > INSERT INTO `keyword_aliases` VALUES (3, 'undertaking');
    > INSERT INTO `keyword_aliases` VALUES (4, 'delete');
    > INSERT INTO `keyword_aliases` VALUES (1, 'deleted');
    > INSERT INTO `keyword_aliases` VALUES (4, 'erase');
    >
    >
    > User submits "begin a fresh undertaking". This is exploded in php, and
    > each of the significant terms is fed into a query:
    >
    > SELECT a.*,count(*) score FROM actions a
    > JOIN
    > (SELECT *
    > FROM `keyword_aliases` ka
    > LEFT JOIN keywords ON keywords.id = ka.keyword_id
    > WHERE alias LIKE "begin"
    > UNION ALL SELECT *
    > FROM `keyword_aliases` ka
    > LEFT JOIN keywords ON keywords.id = ka.keyword_id
    > WHERE alias LIKE "fresh"
    > UNION ALL SELECT *
    > FROM `keyword_aliases` ka
    > LEFT JOIN keywords ON keywords.id = ka.keyword_id
    > WHERE alias LIKE "undertaking")b
    > ON a.action LIKE CONCAT("%",b.keyword,"%")
    > GROUP BY a.action_id
    > ORDER BY score DESC;
    >
    > 1 start a new project 3
    > 3 delete an old project 1
    > 2 modify an existing project 1
    >
    > Obviously, this is a very contrived example that provides no weight to
    > the individual words, just to their ulative impact.[/ref]
    Impossible....think at a big book store with thausand of existing books. Now
    the program has book titles from different sources and some are not so
    accurate (see example). I should create the same structure as yours for so
    many terms, titles and possibilities...impossible...
    Sorry for the time you spent creating all the structure, but this can't be
    done.
    I really have to match the closest using the best matching words. For now
    the best would be to get the title with the maximum exact terms in the
    title.

    Bob



    Bob Guest

  8. #8

    Default Re: String best match

    On Apr 10, 4:43 pm, "Bob Bedford" <com> wrote: [/ref]
    > [/ref]
    > [/ref]
    > [/ref]










    >
    > Impossible....think at a big book store with thausand of existing books. Now
    > the program has book titles from different sources and some are not so
    > accurate (see example). I should create the same structure as yours for so
    > many terms, titles and possibilities...impossible...
    > Sorry for the time you spent creating all the structure, but this can't be
    > done.
    > I really have to match the closest using the best matching words. For now
    > the best would be to get the title with the maximum exact terms in the
    > title.
    >
    > Bob[/ref]

    Well, the query's still the same:

    SELECT book_id,title , count( book_id ) score
    FROM (

    SELECT book_id
    FROM `books`
    WHERE title LIKE "%start%"
    UNION ALL
    SELECT book_id
    FROM `books`
    WHERE title LIKE "%new%"
    UNION ALL
    SELECT book_id
    FROM `books`
    WHERE title LIKE "%project%"
    )b
    GROUP BY book_id
    ORDER BY score DESC

    strawberry Guest

  9. #9

    Default Re: String best match

    Well, the query's still the same:

    SELECT book_id,title , count( book_id ) score
    FROM (

    SELECT book_id
    FROM `books`
    WHERE title LIKE "%start%"
    UNION ALL
    SELECT book_id
    FROM `books`
    WHERE title LIKE "%new%"
    UNION ALL
    SELECT book_id
    FROM `books`
    WHERE title LIKE "%project%"
    )b
    GROUP BY book_id
    ORDER BY score DESC

    Thanks for the tip, will try to get it to work.

    Bob


    Bob Guest

  10. #10

    Default Re: String best match

    "Bob Bedford" <com> wrote: 
    wich I've text and I've to find the "best matching result" using the send
    data.
    ===

    It seems people have implemented the levenhstein distance to be used within
    mysql.

    Have a look there, they give a way to install the thing. Never tried these
    kinds of thing, but it might be a good way for you. But i guess that if you
    have a huge table it might be a bit slow. If you try this, please post some
    details about the performance, it might interest me in the future.

    http://dev.mysql.com/doc/refman/5.0/en/udf-compiling.html
    (and auto-search in the comments for levenshtein)

    ---
    You got to realize you can kill somebody with kindness too.
    - Bob Dylan
    (There is no z in my email address)

    subtenante Guest

  11. #11

    Default Re: String best match

    On Apr 10, 9:54 pm, "Bob Bedford" <com> wrote: 

    It sounds to me like you might want to use full text searching...

    http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

    Tigger Guest

Similar Threads

  1. Replies: 0
    Last Post: November 17th, 11:26 PM
  2. Replies: 2
    Last Post: November 12th, 06:07 PM
  3. #26171 [NEW]: using /(a)+/ to match a long string makes php crash
    By xuefer at 21cn dot com in forum PHP Development
    Replies: 0
    Last Post: November 8th, 03:55 AM
  4. exact string match?
    By Rkl in forum PERL Beginners
    Replies: 8
    Last Post: September 3rd, 05:28 AM
  5. failed /g match on empty string does not reset pos()
    By Steve Grazzini in forum PERL Miscellaneous
    Replies: 2
    Last Post: July 31st, 08:46 AM

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139