
Best practice to optimise big updates (repost, rewritten with example) - Microsoft SQL / MS SQL Server


  1. #1

    Re: Best practice to optimise big updates (repost, rewritten with example)

    George wrote:
    > The old thread is irrelevant to my question. The old
    > thread branched off to whether a left join is appropriate
    > or not.
    >
    > My question is about how to update 1 million rows, 20,000
    > at a time.
    >
    Right! And we're still discussing it over there.


    Bob Barrows Guest

  2. #2

    Re: Best practice to optimise big updates (repost, rewritten with example)


    Of course, you're right. I would much rather have clean
    data in the first place. I'm just starting in this place,
    and there are so many things I want to change...

    Are you saying that SQL is not appropriate to perform
    several updates on all the rows of a big table?




    >-----Original Message-----
    > >> have an Orders table with 1 million rows, and I need
    > >> to validate the values in every column <<
    >
    > Considering that we have no idea what the validation rules are
    > for all the columns and *all combinations* of columns, I would
    > not do this in SQL. Find a data scrubbing tool and use it.
    >
    > Your example of having an invalid order_type means that nobody
    > put a CHECK() or REFERENCES clause on that column. Somebody did
    > not do their job when the schema was designed. First thing, find
    > and shoot the guy who allowed the database to accept the bad data
    > in the first place. This will help prevent future problems. Scrub
    > the database with a tool. And start adding constraints, checks
    > and defaults to the schema to prevent having a repeat of this in
    > the future.
    >
    > --CELKO--
    > ===========================
    > Please post DDL, so that people do not have to guess what the
    > keys, constraints, Declarative Referential Integrity, datatypes,
    > etc. in your schema are.
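
    For illustration only, the kind of constraints, checks and defaults
    being suggested might look roughly like the following. The Orders
    table, the order_type column and the OrderTypes lookup table are
    assumed names, and the existing bad data would have to be scrubbed
    first (or the constraints added WITH NOCHECK) for this to succeed:

    [code]
    -- Hypothetical DDL: stop bad order_type values at the door.
    -- In practice you would pick either the CHECK or the FOREIGN KEY,
    -- depending on whether the valid values live in a lookup table.
    ALTER TABLE Orders
        ADD CONSTRAINT CK_Orders_order_type
        CHECK (order_type IN ('STD', 'RUSH', 'BACK'));

    ALTER TABLE Orders
        ADD CONSTRAINT FK_Orders_OrderTypes
        FOREIGN KEY (order_type) REFERENCES OrderTypes (order_type);

    ALTER TABLE Orders
        ADD CONSTRAINT DF_Orders_order_type DEFAULT 'STD' FOR order_type;
    [/code]
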
    George Guest

  3. #3

    Re: Best practice to optimise big updates (repost, rewritten with example)


    OK, and how do you get those value ranges? In my case,
    there is an indexed int column that I could use for that
    purpose.

    How do I come up with ranges that will contain up to
    20,000 rows each?

    >-----Original Message-----
    > I use one of two approaches for what I consider "massive" updates:
    >
    > 1: Partition the updates by value range of a reasonably well
    > distributed index. This can be a name, an identity column, etc.
    > I do this for relatively simple updates where a limited number of
    > partitions will cover the entire range and each partition has a
    > limited number of updates.
    >
    > 2: Use a firehose cursor to process the rows, and use separate
    > (and independent) UPDATE (or INSERT/DELETE) statements as needed.
    >
    > In your case, it sounds as if approach 2 could be useful - it
    > would allow you to make all the changes in one pass of the data.
    >
    > I find that massive updates otherwise tend to trigger other
    > severe side effects, such as exploding log files, blocking
    > situations, etc. Using approach #2 keeps the performance within
    > a manageable and predictable limit.
    >
    > Also, I tend to do a preliminary count and measure the
    > running/elapsed time while parsing the firehose cursor. Every few
    > thousand rows I will print a message with the current time,
    > elapsed time, the number of rows processed, total rows to
    > process, and an estimated completion time. It adds a little bit
    > of overhead, but seeing the messages every now and then shows
    > that it is indeed progressing, and lets me know approximately how
    > long I should expect it to run. Otherwise, you'll ask yourself
    > "is it hanging?".
    >
    > HTH,
    > Tore.
    >"George" <georgehotmail.com> wrote in message
    >news:31d101c33fe9$6ad1f3a0$a601280aphx.gbl...
    >>
    >> Of course, you're right. I would much rather have clean
    >> data in the first place. I'm just starting in this
    place,
    >> and there are so many things I want to change...
    >>
    >> Are you saying that SQL is not appropriate to perform
    >> several updates on all the rows of a big table?
    >>
    >>
    >>
    >>
    >>
    >> >-----Original Message-----
    >> >>> have an Orders table with 1 million rows, and I
    need
    >> to validate the
    >> >values in every column <<
    >> >
    >> >Considering that we have no idea waht the validation
    >> rules are for all
    >> >the columns and *all combinations* of columns, I would
    >> not do this in
    >> >SQL. Find a data scrubbing tool and use it.
    >> >
    >> >Your example of having an invalid order_type means that
    >> nobody put a
    >> >CHECK() or REFERENCES clause on that column. Somebody
    did
    >> not do their
    >> >job when the schema was designed. First thing, find
    and
    >> shoot guy who
    >> >allowed the database to accept the bad data in the
    first
    >> place. This
    >> >will help prevent future problems. Scrub the database
    >> with a tool. And
    >> >start adding constraints, checks and defaults to the
    >> schema to prevent
    >> >having a repeat of this in the future.
    >> >
    >> >--CELKO--
    >> > ===========================
    >> > Please post DDL, so that people do not have to guess
    >> what the keys,
    >> >constraints, Declarative Referential Integrity,
    >> datatypes, etc. in your
    >> >schema are.
    >> >
    >> >*** Sent via Developersdex [url]http://www.developersdex.com[/url]
    >> ***
    >> >Don't just participate in USENET...get rewarded for it!
    >> >.
    >> >
    >
    >
    >.
    >
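
    The closest T-SQL analogue of that "firehose" (forward-only,
    read-only) pass, with the periodic progress message, might look
    something like the sketch below. The Orders table, its id key and
    the order_type fix-up rule are only placeholders:

    [code]
    -- Sketch: walk the table once with a fast-forward cursor and fire
    -- independent UPDATEs as needed, printing progress as we go.
    DECLARE @id int, @order_type varchar(10),
            @rows int, @total int, @start datetime

    SELECT @rows = 0, @start = GETDATE()
    SELECT @total = COUNT(*) FROM Orders

    DECLARE fixer CURSOR FAST_FORWARD FOR
        SELECT id, order_type FROM Orders

    OPEN fixer
    FETCH NEXT FROM fixer INTO @id, @order_type

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Placeholder validation rule: repair unknown order types.
        IF @order_type NOT IN ('STD', 'RUSH', 'BACK')
            UPDATE Orders SET order_type = 'STD' WHERE id = @id

        SET @rows = @rows + 1
        IF @rows % 5000 = 0
            PRINT CONVERT(varchar, GETDATE(), 120) + ': '
                + CONVERT(varchar, @rows) + ' of '
                + CONVERT(varchar, @total) + ' rows processed, '
                + CONVERT(varchar, DATEDIFF(ss, @start, GETDATE()))
                + ' seconds elapsed'

        FETCH NEXT FROM fixer INTO @id, @order_type
    END

    CLOSE fixer
    DEALLOCATE fixer
    [/code]

    Each UPDATE here is independent, so a failure part-way through does
    not undo the rows already fixed, and RAISERROR ... WITH NOWAIT can
    be used instead of PRINT if the progress messages need to appear
    immediately rather than when the output buffer flushes.
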
    George Guest

  4. #4

    Re: Best practice to optimise big updates (repost, rewritten with example)

    "Tore Bostrup" <newspost_at_bostrup.us> wrote in message
    news:#lWWCz#PDHA.2432TK2MSFTNGP10.phx.gbl...
    ..
    > I find that massive updates otherwise tend to trigger other severe side
    > effects, such as exploding log files, blocking situations, etc. >
    >
    > "George" <georgehotmail.com> wrote in message
    > news:31d101c33fe9$6ad1f3a0$a601280aphx.gbl...
    > >
    > > Of course, you're right. I would much rather have clean
    > > data in the first place. I'm just starting in this place,
    > > and there are so many things I want to change...
    > >
    > > Are you saying that SQL is not appropriate to perform
    > > several updates on all the rows of a big table?
    Well...you could think out of the 'box' so to speak.
    Download MySql and transfer your data.In MySql you have
    more choices of how the update can be done than server.
    If you use a MySql myisam table you can avoid the overhead
    of a transaction log and the locking issues in server.
    And it's free.
    Just a thought:~)

    RAC v2.2 and QALite released for MS SQL Server.
    www.rac4sql.net



    Groucho Guest

  5. #5

    Re: Best practice to optimise big updates (repost, rewritten with example)

    Trial and error and some simple calculations.

    If I have an identity column, for instance, I will assume the range
    from 1 (or maybe I have purged up to some other starting value) up to
    its MAX value is reasonably well distributed, and run a few sample
    COUNT(*) queries with the WHERE clause that will identify rows that
    need to be updated plus "...AND <Column> BETWEEN <From> AND <To>".
    If the results vary widely for the same range size in different parts
    of the column, I would make the range equal to the maximum number of
    rows I want updated in one operation. I wouldn't worry if one range
    wound up actually updating 12,000 rows and another 19,000.
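
    As a sketch, the range-partitioned version might then look something
    like this (Orders, its int identity column id, and the fix-up WHERE
    clause are again just placeholders):

    [code]
    -- Sketch: update in id ranges of at most @BatchSize rows, so each
    -- statement (and its log usage) stays small.
    DECLARE @From int, @To int, @MaxId int, @BatchSize int

    SET @BatchSize = 20000
    SELECT @From = MIN(id), @MaxId = MAX(id) FROM Orders

    WHILE @From <= @MaxId
    BEGIN
        SET @To = @From + @BatchSize - 1

        UPDATE Orders
        SET    order_type = 'STD'                    -- placeholder fix
        WHERE  order_type NOT IN ('STD', 'RUSH', 'BACK')
          AND  id BETWEEN @From AND @To

        SET @From = @To + 1
    END
    [/code]

    With the default autocommit behaviour each UPDATE is its own
    transaction, so the log never has to hold more than one batch at a
    time.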

    So you didn't like the firehose cursor script approach? I have come to love
    it. You spend a couple of hours up front when you implement your first such
    script. The next time, you can use the old one as a template. I'm not sure
    if I have an old one lying around...

    HTH,
    Tore.


    "George" <georgehotmail.com> wrote in message
    news:017101c34060$4a2a2560$a001280aphx.gbl...
    >
    > OK, and how do you get those value ranges? In my case,
    > there is an indexed int column that I could you for that
    > purpose.
    >
    > How do I come up with ranges that will contain up to
    > 20,000 rows each?
    >
    >
    > >-----Original Message-----
    > >I use one of two approaches for what I consider "massive"
    > updates:
    > >
    > >1: Partition the updates by value range of a reasonably
    > well distributed
    > >index. This can be a name, an identity column, etc. I
    > do this for
    > >relatively simple updates where a limited number of
    > partitions will cover
    > >the entire range and each partition has a limited number
    > of updates.
    > >
    > >2: Use a firehose cursor to p the rows, and use
    > separate (and
    > >independent) UPDATE (or INSERT/DELETE) statements as
    > needed.
    > >
    > >In your case, it sounds as if approach 2 could be useful -
    > it would allow
    > >you to make all the changes in one pass of the data.
    > >
    > >I find that massive updates otherwise tend to trigger
    > other severe side
    > >effects, such as exploding log files, blocking
    > situations, etc. Using
    > >approach #2 keeps the performance within a manageable and
    > predicatble limit.
    > >
    > >Also, I tend to do a preliminary count and measure the
    > running/elapsed time
    > >while parsing the firehose cursor. Every few thousand
    > rows I will print a
    > >message with the current time, elapsed time, the number
    > of rows processed,
    > >total rows to process, and an estimated completion time.
    > It adds a little
    > >bit of overhead, but seeing the messages every now an
    > then shows that it is
    > >indeed progressing, and lets me know approximately how
    > long I should expect
    > >for it to run. Otherwise, you'll ask yourself "is it
    > hanging?".
    > >
    > >HTH,
    > >Tore.
    > >
    > >
    > >"George" <georgehotmail.com> wrote in message
    > >news:31d101c33fe9$6ad1f3a0$a601280aphx.gbl...
    > >>
    > >> Of course, you're right. I would much rather have clean
    > >> data in the first place. I'm just starting in this
    > place,
    > >> and there are so many things I want to change...
    > >>
    > >> Are you saying that SQL is not appropriate to perform
    > >> several updates on all the rows of a big table?
    > >>
    > >>
    > >>
    > >>
    > >>
    > >> >-----Original Message-----
    > >> >>> have an Orders table with 1 million rows, and I
    > need
    > >> to validate the
    > >> >values in every column <<
    > >> >
    > >> >Considering that we have no idea waht the validation
    > >> rules are for all
    > >> >the columns and *all combinations* of columns, I would
    > >> not do this in
    > >> >SQL. Find a data scrubbing tool and use it.
    > >> >
    > >> >Your example of having an invalid order_type means that
    > >> nobody put a
    > >> >CHECK() or REFERENCES clause on that column. Somebody
    > did
    > >> not do their
    > >> >job when the schema was designed. First thing, find
    > and
    > >> shoot guy who
    > >> >allowed the database to accept the bad data in the
    > first
    > >> place. This
    > >> >will help prevent future problems. Scrub the database
    > >> with a tool. And
    > >> >start adding constraints, checks and defaults to the
    > >> schema to prevent
    > >> >having a repeat of this in the future.
    > >> >
    > >> >--CELKO--
    > >> > ===========================
    > >> > Please post DDL, so that people do not have to guess
    > >> what the keys,
    > >> >constraints, Declarative Referential Integrity,
    > >> datatypes, etc. in your
    > >> >schema are.
    > >> >
    > >> >*** Sent via Developersdex [url]http://www.developersdex.com[/url]
    > >> ***
    > >> >Don't just participate in USENET...get rewarded for it!
    > >> >.
    > >> >
    > >
    > >
    > >.
    > >

    Tore Bostrup Guest

