Professional Web Applications Themes

Database Representation of Undirected Graph (Peer Nodes) - Microsoft SQL / MS SQL Server

I'm trying to figure out the best way to represent a network of nodes in an undirected graph. Each node can have n peers. I'd like to be able to derive the following: 1) Count all nodes in a given network 2) Retrieve all nodes closer than X steps away from a given node 3) Retrieve all nodes further than X steps away from a given node What's the best way to represent this?...

  1. #1

    Default Database Representation of Undirected Graph (Peer Nodes)

    I'm trying to figure out the best way to represent a
    network of nodes in an undirected graph.
    Each node can have n peers.

    I'd like to be able to derive the following:

    1) Count all nodes in a given network
    2) Retrieve all nodes closer than X steps away from a
    given node
    3) Retrieve all nodes further than X steps away from a
    given node

    What's the best way to represent this?

    Dave Guest

  2. #2

    Default Re: Database Representation of Undirected Graph (Peer Nodes)

    If your graph is a straightforward hierarchy then see these articles:

    http://www.intelligententerprise.com/001020/celko1_1.shtml
    http://www.dbazine.com/tropashko4.html

    If it's a reconvergent graph there is a discussion in this thread which
    gives some pointers:

    http://groups.google.com/groups?selm=O4d55UwzCHA.2368%40TK2MSFTNGP12

    --
    David Portas
    ------------
    Please reply only to the newsgroup
    --



    David Guest

  3. #3

    Default Re: Database Representation of Undirected Graph (Peer Nodes)

    >> .. represent a network of nodes in an undirected graph. <<

    The best way I have found in SQL is an adjacency list.

    CREATE TABLE DirectedGraph
    (source CHAR(3) NOT NULL,
    destination CHAR(3) NOT NULL,
    cost INTEGER NOT NULL,
    CHECK(source < node_2),
    PRIMARY KEY (source, destination));

    then use

    CREATE VIEW Graph (node_1, node_2, cost)
    AS SELECT source, destination, cost FROM DirectedGraph
    UNION ALL
    SELECT destination, source, cost FROM DirectedGraph;

    1) Count all nodes in a given network

    SELECT COUNT(DISTINCT node_1)
    FROM Graph;

    2) Retrieve all nodes closer than X steps away from a
    given node.

    This is very ugly, but it is the best I have. The best way is probably
    to do the Floyd-Warshall or Johnson algorithm in a procedural language
    and load a table with the results. But I want to do this in pure SQL as
    an exercise. Let's start with a simple graph and represent it as an
    adjacency list with weights on the edges.

    CREATE TABLE Graph
    (source CHAR(2) NOT NULL,
    destination CHAR(2) NOT NULL,
    cost INTEGER NOT NULL,
    PRIMARY KEY (source, destination));

    I got data for this table from the book Introduction to Algorithms by
    Cormen, Leiserson and Rivest (ISBN 0-262-03141-8), page 518. This book
    is very popular in college courses in the United States. I made one
    decision that will be important later; I added self-traversal edges; the
    node is both the source and the destination so the cost of those paths
    is zero.

    INSERT INTO Graph
    VALUES ('s', 's', 0),
    ('s', 'u', 3),
    ('s', 'x', 5),
    ('u', 'u', 0),
    ('u', 'v', 6),
    ('u', 'x', 2),
    ('v', 'v', 0),
    ('v', 'y', 2),
    ('x', 'u', 1),
    ('x', 'v', 4),
    ('x', 'x', 0),
    ('x', 'y', 6),
    ('y', 's', 3),
    ('y', 'v', 7),
    ('y', 'y', 0);

    I am not happy about this approach, because I have to decide the maximum
    number of edges in path before I start looking for an answer. But this
    will work and I know that a path will have no more than the total number
    of nodes in the graph. Let's create a table to hold the paths:

    CREATE TABLE Paths
    (step_1 CHAR(2) NOT NULL,
    step_2 CHAR(2) NOT NULL,
    step_3 CHAR(2) NOT NULL,
    step_4 CHAR(2) NOT NULL,
    step_5 CHAR(2) NOT NULL,
    total_cost INTEGER NOT NULL,
    path_length INTEGER NOT NULL,
    PRIMARY KEY (step_1, step_2, step_3, step_4, step_5));


    The step_1 node is where I begin the path. The other columns are the
    second step, third step, fourth step, and so forth. The last step
    column is the end of the journey. The total_cost column is the total
    cost, based on the sum of the weights of the edges, on this path. The
    path length column is harder to explain, but for now, let's just say
    that it is a count of the nodes visited in the path.

    To keep things easier, let's look at all the paths from "s" to "y" in
    the graph. The INSERT INTO statement for construction that set looks
    like this:

    INSERT INTO Paths
    SELECT G1.source, -- it is 's' in this example
    G2.source,
    G3.source,
    G4.source,
    G4.destination, -- it is 'y' in this example
    (G1.cost + G2.cost + G3.cost + G4.cost),
    (CASE WHEN G1.source
    NOT IN (G2.source, G3.source, G4.source)
    THEN 1 ELSE 0 END
    + CASE WHEN G2.source
    NOT IN (G1.source, G3.source, G4.source)
    THEN 1 ELSE 0 END
    + CASE WHEN G3.source
    NOT IN (G1.source, G2.source, G4.source)
    THEN 1 ELSE 0 END
    + CASE WHEN G4.source
    NOT IN (G1.source, G2.source, G3.source)
    THEN 1 ELSE 0 END)
    FROM Graph AS G1, Graph AS G2, Graph AS G3, Graph AS G4
    WHERE G1.source = 's'
    AND G1.destination = G2.source
    AND G2.destination = G3.source
    AND G3.destination = G4.source
    AND G4.destination = 'y';

    I put in "s" and "y" as the source and destination of the path, and made
    sure that the destination of one step in the path was the source of the
    next step in the path. This is a combinatorial explosion, but it is
    easy to read and understand.

    The sum of the weights is the cost of the path, which is easy to
    understand. The path_length calculation is a bit harder. This sum of
    CASE expressions looks at each node in the path. If it is unique within
    the row, it is assigned a value of one, if it is not unique within the
    row, it is assigned a value of zero.

    All paths will have five steps in them because that is the way to table
    is declared. But what if a path exists between the two nodes which is
    shorter than five steps? That is where the self-traversal rows are used!
    Consecutive pairs of steps in the same row can be repetitions of the
    same node.

    Here is what the rows of the Paths table look like after this INSERT
    INTO statement, ordered by descending path_length, and then by ascending
    cost.

    Paths
    step_1 step_2 step_3 step_4 step_5 total_cost path_length
    ================================================== ====
    s s x x y 11 0
    s s s x y 11 1
    s x x x y 11 1
    s x u x y 14 2
    s s u v y 11 2
    s s u x y 11 2
    s s x v y 11 2
    s s x y y 11 2
    s u u v y 11 2
    s u u x y 11 2
    s u v v y 11 2
    s u x x y 11 2
    s x v v y 11 2
    s x x v y 11 2
    s x x y y 11 2
    s x y y y 11 2
    s x y v y 20 4
    s x u v y 14 4
    s u v y y 11 4
    s u x v y 11 4
    s u x y y 11 4
    s x v y y 11 4

    Many of these rows are equivalent to each other. For example, the paths
    ('s', 'x', 'v', 'v', 'y', 11, 2) and ('s', 'x', 'x', 'v', 'y', 11, 2)
    are both really the same path as ('s', 'x', 'v', 'y').

    In this example, the total_cost column defines the cost of a path, so we
    can eliminate some of the paths from the table with this statement, if
    we want the lowest cost.

    DELETE FROM Paths
    WHERE total_cost 
    FROM Paths);

    In this example, it got rid of 3 out of 22 possible paths. Let's
    consider another cost factor; the number of paths. People do not like
    to change airplanes or trains. If they can go from Amsterdam to New
    York City on one plane without changing planes for the same cost, they
    are happy. This is where that path_length column comes in. It is a
    quick way to remove the paths that have more edges than they need to get
    the job done.

    DELETE FROM Paths
    WHERE path_length 

    In this case, that last DELETE FROM statement will reduce the table to
    one row: ('s', 's', 'x', 'x', 'y', 11, 0) which reduces to ('s', 'x',
    'y'). This single remaining row is very convenient for my
    demonstration, but if you look at the table, you will see that there was
    also a subset of equivalent rows that had higher path_length numbers.

    ('s', 's', 's', 'x', 'y', 11, 1)
    ('s', 'x', 'x', 'x', 'y', 11, 1)
    ('s', 'x', 'x', 'y', 'y', 11, 2)
    ('s', 'x', 'y', 'y', 'y', 11, 2)

    Your task is to write code to handle equivalent rows. Hint: the
    duplicate nodes will always be contiguous across the row.



    3) Retrieve all nodes further than X steps away from a
    given node.

    Do NOT IN () predicate on the query in (2).


    --CELKO--
    ===========================
    Please post DDL, so that people do not have to guess what the keys,
    constraints, Declarative Referential Integrity, datatypes, etc. in your
    schema are.

    *** Sent via Developersdex http://www.developersdex.com ***
    Don't just participate in USENET...get rewarded for it!
    Joe Guest

Similar Threads

  1. Check for missing xml nodes or empty nodes
    By DuLaus in forum Macromedia ColdFusion
    Replies: 0
    Last Post: May 25th, 07:36 PM
  2. Realtime PHP graph from database
    By Pedro Ferreira in forum PHP Development
    Replies: 2
    Last Post: December 11th, 11:16 PM
  3. Replies: 1
    Last Post: June 30th, 09:40 PM

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139