
random record from small set - PostgreSQL / PGSQL


  1. #1

    random record from small set

    I am trying to retrieve a random record (according to a chance
    attribute) from a small set of records, each with a "chance" attribute.
    This may eventually be somewhat of a performance concern, so I'd like to
    make sure I'm doing this right.

    Here's what I have so far:

    create table r1 (
        i int,
        chance numeric
    );

    create or replace function randrec() returns int as $$
        my $res = spi_exec_query('select i, chance from r1');
        my $r = rand;   # uniform random value in [0, 1)
        my $ac = 0;     # running total of the chance values
        my $i = 0;
        while ($ac < $r) {
            $ac += $res->{rows}[$i++]->{chance};
        }
        return $res->{rows}[$i - 1]->{i};
    $$ language plperl;

    test=# select * from r1;
    i | chance
    ---+--------
    1 | 0.25
    2 | 0.20
    3 | 0.15
    4 | 0.10
    5 | 0.30


    That seems to work: out of 10,000 runs, I got the following counts
    for each value:
    1 2479
    2 1959
    3 1522
    4 950
    5 3090
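
    A tally of this shape can be reproduced with the same kind of query
    Michael uses in his reply below:

        SELECT i, count(*)
        FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
        GROUP BY i ORDER BY i;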

    But I have a few questions:
    * Am I right to use NUMERIC for the chance attribute?
    * Does perl's arithmetic leave me with the chance that those numeric
    values don't add up to 1.00 (and in this case that could mean an
    infinite loop)?
    * In my design I'll need a constraint trigger making sure that the
    numbers add up to 1.00. Will that be a performance problem for
    operations on the table that don't modify the chance attribute?
    * Is there a better way?
    * Does spi_exec_query pull the entire result set into memory at once? Is
    there a point at which performance could be a serious problem if there
    are a large number of items to select among?

    Regards,
    Jeff Davis




    Jeff Guest

  2. #2

    Re: random record from small set

    On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:

    > * Am I right to use NUMERIC for the chance attribute?
    I ran tests with numeric, real, and double precision; double precision
    was consistently about 10% faster than the others. I used the
    sample data you posted and the PL/pgSQL function shown later in
    this message.

    > * Does perl's arithmetic leave me with the chance that those numeric
    > values don't add up to 1.00 (and in this case that could mean an
    > infinite loop)?
    I'd suggest looping through the records so you can't possibly end
    up in an infinite loop.

    > * In my design I'll need a constraint trigger making sure that the
    > numbers add up to 1.00.
    If the sum must be exactly 1.00 then be careful if you use double
    precision -- if you test with the equality operator then the check
    might fail because the sum is 0.9999999987.
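
    For example, a tolerance test of this shape avoids the equality trap
    (the 1e-9 epsilon is an arbitrary illustration, not a value from this
    thread):

        SELECT abs(sum(chance) - 1.0) < 1e-9 AS sums_to_one FROM r1;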
    > Will that be a performance problem for operations on the table that
    > don't modify the chance attribute?
    Any trigger that you didn't otherwise need will cause a performance
    hit. I'd expect a statement-level AFTER trigger to have the lowest
    impact since it would run only once per statement, whereas a row-level
    trigger might run multiple times per statement. On the other hand,
    if you make a lot of updates that don't modify the chance attribute,
    then you might want to try a row-level trigger that skips the check
    when NEW.chance = OLD.chance. I'd suggest testing different
    methods under expected conditions and seeing which has the lowest impact.
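
    A hedged sketch of that row-level variant (the function and trigger
    names and the 1e-9 tolerance are invented here for illustration):

        CREATE OR REPLACE FUNCTION check_chance_sum() RETURNS trigger AS $$
        BEGIN
            -- chance untouched: skip the expensive sum check
            IF TG_OP = 'UPDATE' AND NEW.chance = OLD.chance THEN
                RETURN NULL;
            END IF;
            -- tolerance check; see the equality caveat above
            IF abs((SELECT sum(chance) FROM r1) - 1.0) > 1e-9 THEN
                RAISE EXCEPTION 'chance values must sum to 1.00';
            END IF;
            RETURN NULL;  -- return value is ignored for AFTER triggers
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER r1_check_sum AFTER INSERT OR UPDATE ON r1
            FOR EACH ROW EXECUTE PROCEDURE check_chance_sum();

    Note that a row-level check like this fires once per modified row, so a
    rebalancing spread over several statements would trip it midway; that is
    one reason the statement-level (or deferred) variant may fit better.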
    > * Does spi_exec_query pull the entire result set into memory at once? Is
    > there a point at which performance could be a serious problem if there
    > are a large number of items to select among?
    I think it does. I ran some tests with the following PL/pgSQL
    function and got significantly faster times than with PL/Perl,
    especially as the data set grew:

    CREATE FUNCTION randrec() RETURNS integer AS $$
    DECLARE
        r double precision := random();
        ac double precision := 0.0;
        row record;
    BEGIN
        FOR row IN SELECT * FROM r1 LOOP
            ac := ac + row.chance;
            IF ac >= r THEN
                EXIT;
            END IF;
        END LOOP;

        RETURN row.i;
    END;
    $$ LANGUAGE plpgsql VOLATILE;

    SELECT * FROM r1;
    i | chance
    ---+--------
    1 | 0.25
    2 | 0.20
    3 | 0.15
    4 | 0.10
    5 | 0.30

    SELECT i, count(*)
    FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
    GROUP BY i
    ORDER by i;
    i | count
    ---+-------
    1 | 2467
    2 | 1939
    3 | 1536
    4 | 1016
    5 | 3042
    (5 rows)
    Time: 3300.710 ms

    Here are the results using the PL/Perl function you posted:

    SELECT i, count(*)
    FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
    GROUP BY i
    ORDER by i;
    i | count
    ---+-------
    1 | 2501
    2 | 2040
    3 | 1463
    4 | 994
    5 | 3002
    (5 rows)
    Time: 8765.584 ms

    I ran each query several times and those times were typical of both.
    With a data set of 100 records, the PL/pgSQL function ran in about
    14 seconds, while the PL/Perl function took around 65 seconds.

    --
    Michael Fuhr
    http://www.fuhr.org/~mfuhr/


    Michael Guest

  3. #3

    Re: random record from small set

    Thanks very much for the information. I had very similar results on my
    machine. I will take your advice and use double precision values, since
    the change doesn't affect the probabilities significantly anyway. As for
    the constraint trigger, I will see whether it becomes a problem before I
    worry about its performance. As for whether those values add up to 1.0,
    I'll just check that the sum is fairly close to 1.0 :)

    The one real difference I saw: performance changed little whether the
    underlying table's chance attribute was double precision or numeric. I
    used your generate_series()-based query.

    Perhaps that was because I was using ALTER TABLE to modify r1's "chance"
    attribute. One thing I noticed was that I had to CREATE OR REPLACE the
    plpgsql function for that to work. Perhaps it was a bug?
    Here's a test case:

    test=# create table t1(i int);
    CREATE TABLE
    test=# insert into t1 values(1);
    INSERT 17775 1
    test=# create or replace function err() returns int as $$
        DECLARE ac double precision := 0.0; row record;
        BEGIN
            FOR row IN SELECT i FROM t1 LOOP ac := ac + row.i; END LOOP;
            RETURN row.i;
        END;
    $$ language plpgsql;
    CREATE FUNCTION
    test=# insert into t1 values(2);
    INSERT 17778 1
    test=# select err();
    err
    -----
    2
    (1 row)
    test=# alter table t1 alter column i type numeric;
    ALTER TABLE
    test=# select err();
    err
    -----
    10
    (1 row)

    And if you keep playing around with the type and values you can get
    other errors like:
    ERROR: type "double precision" value out of range: underflow
    CONTEXT: PL/pgSQL function "err" line 1 at assignment
    Or:
    ERROR: invalid memory alloc request size 4294967290
    CONTEXT: PL/pgSQL function "randrec" line 7 at assignment

    If any more information would be helpful, let me know. It looks a
    little like a bug; perhaps we should throw an error when dependent
    functions are called after the underlying types have changed? Or, if
    we can recognize that case, we might as well recompile the function
    and proceed without error.

    Regards,
    Jeff Davis




    Jeff Guest

  4. #4

    Re: random record from small set

    What about another data representation, like:

        create table r1 (
            i int,
            chance_from numeric,
            chance_to numeric
        );

    With this you can select one random row in a single SELECT, for instance:

        select * from r1 where chance_from <= $rnd and chance_to > $rnd;

    I see these advantages:
    - Only one select.
    - Indices can improve performance if r1 has many rows.

    and this disadvantage:
    - Tricky update (one way to rebuild the ranges is sketched below).
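
    A hedged sketch of that update, assuming the raw weights live in a
    separate table r0(i, chance) (a name invented here): correlated
    subqueries compute the running sums, so no window functions are needed.

        DELETE FROM r1;
        INSERT INTO r1 (i, chance_from, chance_to)
        SELECT a.i,
               (SELECT coalesce(sum(b.chance), 0) FROM r0 b WHERE b.i < a.i),
               (SELECT sum(b.chance) FROM r0 b WHERE b.i <= a.i)
        FROM r0 a;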




    Jan Guest

  5. #5

    Re: random record from small set

    Or:

        create table r1 (
            i int,
            chance_from numeric
        );

    and:

        select * from r1 where chance_from <= $rnd
        order by chance_from desc limit 1;

    which can be updated more easily...
    Just ideas, I have never tested them...
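
    One caveat worth adding here: random() is volatile and would be
    re-evaluated for every row if written directly in the WHERE clause, so
    $rnd should be generated once. A sketch (not from the thread) that does
    this in plain SQL:

        SELECT r1.*
        FROM (SELECT random() AS rnd) q, r1
        WHERE r1.chance_from <= q.rnd
        ORDER BY r1.chance_from DESC
        LIMIT 1;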



    Jan Guest

  6. #6

    Re: random record from small set


    If you go that route, make sure you check for edge cases, such
    as reaching the end of the rows without hitting your number:

    while ($ac < $r) {
        die qq{Ran out of rows!\n} if ! defined $res->{rows}[$i];
        $ac += $res->{rows}[$i++]->{chance};
    }

    Also, your query should be "select i,chance from r1 ORDER BY random()",
    or else you get the rows back in the same order each time (until a row
    is changed), which certainly reduces the randomness.

    Anyway, here's another solution, which shifts as much work as possible
    off the actual random-row call and uses a trigger to keep things in
    sync. I switched 'chance' from 0.25 to 25 (numeric to int) to make
    things easier to read.

    UPDATE r1 SET chance = chance*100;
    ALTER TABLE r1 ALTER COLUMN chance TYPE INTEGER;

    CREATE TABLE r2(id integer);

    CREATE OR REPLACE FUNCTION r1_cleanup() RETURNS trigger LANGUAGE plpgsql AS
    $$
    DECLARE
        mychance integer;
    BEGIN
        IF TG_OP = 'DELETE' THEN
            DELETE FROM r2 WHERE id = OLD.i;
        ELSE
            IF TG_OP = 'UPDATE' THEN
                DELETE FROM r2 WHERE id = OLD.i OR id = NEW.i;
            END IF;
            SELECT chance FROM r1 WHERE i = NEW.i INTO mychance;
            -- insert one row into r2 per point of chance
            LOOP
                mychance := mychance - 1;
                EXIT WHEN mychance < 0;
                INSERT INTO r2 VALUES (NEW.i);
            END LOOP;
        END IF;
        RETURN NULL;
    END;
    $$;

    CREATE TRIGGER r1_trigger AFTER INSERT or UPDATE or DELETE ON r1
    FOR EACH ROW EXECUTE PROCEDURE r1_cleanup();

    UPDATE r1 SET i=i; -- To initially populate r2

    SELECT id FROM r2 ORDER BY random() LIMIT 1; -- repeat as needed
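
    If the chance values are kept summing to exactly 100, r2 always holds
    exactly 100 rows, so a random OFFSET can stand in for the full sort
    (a sketch under that assumption, not something posted in the thread):

        -- assumes the chances always sum to exactly 100
        SELECT id FROM r2 OFFSET floor(random() * 100)::int LIMIT 1;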


    --
    Greg Sabino Mullane com
    PGP Key: 0x14964AC8 200502152252
    http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8





    Greg Guest
