Professional Web Applications Themes

How to group Typo entries in a database - MySQL

Hi, I have a big database of about 14000 records. The entries have a lot of typos like punctuation. ie. same name has punctuation in one entry and no punctuation in another. These are showing up as unique records when I do a search. Is there a way to do it using PERL/Mysql where I can check for a percentage match of the strings and then do a replace? I would like to have all these records to show up as a group instead of unique records. Manual correction is not too easy. Thanks....

  1. #1

    Default How to group Typo entries in a database

    Hi,
    I have a big database of about 14000 records. The entries have a lot
    of typos like punctuation. ie. same name has punctuation in one entry
    and no punctuation in another. These are showing up as unique records
    when I do a search. Is there a way to do it using PERL/Mysql where I
    can check for a percentage match of the strings and then do a replace?
    I would like to have all these records to show up as a group instead of
    unique records.

    Manual correction is not too easy.

    Thanks.

    premgrps@gmail.com Guest

  2. #2

    Default Re: How to group Typo entries in a database

    [email]premgrps[/email] wrote:
    > I have a big database of about 14000 records. The entries have a lot
    > of typos like punctuation. ie. same name has punctuation in one entry
    > and no punctuation in another. These are showing up as unique records
    > when I do a search.
    This is a very common problem. Database workers are frequently called
    upon to perform laborious "data cleanup" projects that take months, and
    never quite resolve 100% of cases.
    > Is there a way to do it using PERL/Mysql where I
    > can check for a percentage match of the strings and then do a replace?
    Look into the SOUNDEX() function. It should help with certain types of
    string comparisons. It won't help find cases like "Bob Smith" should be
    treated the same as "Robert C. Smith", but it'll help in some types of
    cases like punctuation differences.

    [url]http://dev.mysql.com/doc/refman/5.0/en/string-functions.html[/url]

    Regards,
    Bill K.
    Bill Karwin Guest

  3. #3

    Default Re: How to group Typo entries in a database

    Without knowing a bit more about the kind of entries I can't really say
    - but I think some manual correction is likely to be unavoidable!

    That aside, have a look at the soundex() function. Depending on the
    size of the entries it may be offer a pracitcal solution - at least for
    grouping similar entries - something like;

    SELECT DISTINCT(SOUNDEX(t1.entry)),t2.entry
    FROM table t1
    LEFT JOIN table t2
    ON SOUNDEX(t2.entry) = SOUNDEX(t1.entry)
    LIMIT 10;

    The LIMIT clause may be significant here because this kind of
    comparison (without an index) could take a very long time! You could
    also select the reuslts into a temporary table first - and then create
    the indexes necessary to speed up the comparison.

    [email]premgrps[/email] wrote:
    > Hi,
    > I have a big database of about 14000 records. The entries have a lot
    > of typos like punctuation. ie. same name has punctuation in one entry
    > and no punctuation in another. These are showing up as unique records
    > when I do a search. Is there a way to do it using PERL/Mysql where I
    > can check for a percentage match of the strings and then do a replace?
    > I would like to have all these records to show up as a group instead of
    > unique records.
    >
    > Manual correction is not too easy.
    >
    > Thanks.
    strawberry Guest

Similar Threads

  1. Replies: 1
    Last Post: July 10th, 11:08 AM
  2. Replies: 1
    Last Post: April 3rd, 05:37 PM
  3. BLANK ENTRIES IN ACCESS DATABASE
    By Black People Online in forum Coldfusion Database Access
    Replies: 1
    Last Post: May 30th, 06:21 AM
  4. Flash forms and blank database entries
    By Black People Online in forum Coldfusion - Getting Started
    Replies: 1
    Last Post: May 25th, 10:41 PM
  5. Duplicate Database Entries
    By kabalah in forum Dreamweaver AppDev
    Replies: 1
    Last Post: April 7th, 01:28 PM

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139