looking for some hints on interesting HACMP event


  1. #1

    looking for some hints on interesting HACMP event

    We had fun on Monday afternoon, and I'm looking for some additional
    suggestions on what to look for -- the snap has already gone to IBM HACMP
    support.

    About 2:40 in the afternoon, we get some complaints that a couple of local
    users have been knocked out of their Access databases. This includes our
    help desk people and THEIR database. While looking into this, we get some
    reports that SAP isn't responding.

    I check the computer room -- and BOTH nodes in the HA cluster are powered
    off -- as is the primary on my test cluster (HA wasn't running on the test
    cluster fallover at the time).

    The ONLY real info I can find so far is that topsvcs on all three systems
    complained about using 34000+ msec of cpu in 32000+ msec of wall-clock
    (the numbers differ from system to system, but all amount to roughly two
    seconds more cpu than wall clock) and is terminating. Then group services
    quit, and clustermgrES hit the big red switch on the way out.
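
    To put that complaint in numbers, here is a minimal Python sketch using
    the rounded figures quoted above. The function name is made up for
    illustration, and this is not how topsvcs itself performs the check. A
    ratio above 1.0 means the daemon was charged more CPU time than wall-clock
    time elapsed. A single thread on one CPU cannot do that; it takes either
    threads running on several CPUs at once or some disagreement between the
    two clocks.

```python
def cpu_utilisation(cpu_ms: float, wall_ms: float) -> float:
    """Return CPU time as a fraction of elapsed wall-clock time."""
    if wall_ms <= 0:
        raise ValueError("wall_ms must be positive")
    return cpu_ms / wall_ms


def exceeds_wall_clock(cpu_ms: float, wall_ms: float) -> bool:
    """True when more CPU time was charged than wall-clock time elapsed."""
    return cpu_utilisation(cpu_ms, wall_ms) > 1.0


if __name__ == "__main__":
    # Approximate figures from the post: 34000+ msec CPU in 32000+ msec wall.
    ratio = cpu_utilisation(34000, 32000)
    print(f"CPU / wall-clock ratio: {ratio:.4f}")
    print("exceeds wall clock:", exceeds_wall_clock(34000, 32000))
```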

    From start to finish, 5 seconds max per machine. But my fallover system
    went first; then, 15 seconds later, my test system. And then, a minute and
    18 seconds after that, my production SAP system. And so far, the ONLY
    notes I can find are the topsvcs logs and syslog entries -- with nothing
    before the excessive CPU usage note.
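
    The sequence above, laid out as offsets from the first exit. This is only
    a sketch built from the intervals given in this post; the node labels and
    function names are illustrative, and the 5 seconds is the stated upper
    bound per machine, not a measured value.

```python
from datetime import timedelta

# Intervals as described in the post; t = 0 is the first topsvcs exit.
PER_NODE_MAX = timedelta(seconds=5)  # "5 seconds max per machine"
STARTS = {
    "fallover": timedelta(0),
    "test": timedelta(seconds=15),  # 15 seconds after the fallover node
    "production": timedelta(seconds=15) + timedelta(minutes=1, seconds=18),
}


def shutdown_windows(starts=STARTS, per_node=PER_NODE_MAX):
    """Map each node to its (start, latest possible end) offsets."""
    return {node: (start, start + per_node) for node, start in starts.items()}


def total_span(starts=STARTS, per_node=PER_NODE_MAX):
    """Time from the first exit to the latest possible end of the last one."""
    return max(starts.values()) + per_node - min(starts.values())


if __name__ == "__main__":
    for node, (start, end) in shutdown_windows().items():
        print(f"{node:<11} +{start.total_seconds():.0f}s .. +{end.total_seconds():.0f}s")
    print("total:", total_span().total_seconds(), "seconds")
```

    On those figures, the three nodes went down within about 98 seconds of
    each other.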

    All three systems are tied to a pair of CISCO switches by multiple 100
    Mbit interfaces, as are another 14 RS/6000 systems (mostly 7026-6C1s);
    nothing in ANY of their logs. The primary HA environment is a pair of
    7028-6M1s. Everything is AIX 5.1 ML 04 at 64-bit, and the HA code is (old)
    at 4.4.1 first fix level.

    All we've got to go on is the network staff plugging a CISCO switch into a
    gigabit fiber feed that ties to a CISCO 6009; the two switches mentioned
    above also connect (gig fiber) to the 6009. And the Access database issue
    (also being fed from the two computer room switches).

    From start to finish, the event lasted just over two minutes wall-clock,
    as best we can tell.

    Nothing out of the ordinary in any of the switch logs, except that several
    ports indicate that the mac address changed rapidly for a few seconds --
    like HA was trying an adapter swap -- but there is no HA log message of
    adapter failure and takeover.

    Any thoughts on why topsvcs would go berserk? Suggestions on other items
    to look at?

    TIA

    Tom
    TomK Guest

  3. #2

    Re: looking for some hints on interesting HACMP event

    TomK <namffuak@NO.skyenet.SPAM.net> wrote in message news:<gt8knvs2o8s99ghi9v1a7jrbfgndfd3eaa@4ax.com>...
    > [snip]
    Tom

    Very interesting issue. From the above, it looks as though you may
    have had multiple issues occur at the same time:

    1- The secondary server sounds like it lost contact with the primary;
    in that case it will have attempted to take over the resource group.
    When it then saw that the primary was up, it would have performed a
    halt -q (instant stop!). I have no idea why the primary shut down;
    maybe only IBM can check this out.

    I guess you've checked all the log files:

    /tmp/hacmp.out
    /tmp/cm.log
    /var/adm/cluster.log
    errpt
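
    One quick way to see which of those files were written around the time of
    the event is sketched below in Python. The paths are the ones listed
    above; the function name and the one-hour window are assumptions for
    illustration. errpt is a command rather than a file, so it is not covered
    here.

```python
import os
from datetime import datetime, timedelta

# Log files named in this reply (errpt is a command, so it is left out).
LOG_FILES = ["/tmp/hacmp.out", "/tmp/cm.log", "/var/adm/cluster.log"]


def logs_touched_since(paths, since):
    """Return (path, mtime) pairs for files modified at or after `since`.

    Files that are missing or unreadable are skipped silently.
    """
    hits = []
    for path in paths:
        try:
            mtime = datetime.fromtimestamp(os.path.getmtime(path))
        except OSError:
            continue
        if mtime >= since:
            hits.append((path, mtime))
    return hits


if __name__ == "__main__":
    # Hypothetical window: anything written in the last hour.
    cutoff = datetime.now() - timedelta(hours=1)
    for path, mtime in logs_touched_since(LOG_FILES, cutoff):
        print(f"{mtime:%Y-%m-%d %H:%M:%S}  {path}")
```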

    This may give you some more info. Have you checked that you've set the
    high and low water marks on sys0? They should be 33 and 24.
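
    A sketch for checking those values from saved lsattr -El sys0 output. It
    assumes the attribute name and value are the first two
    whitespace-separated columns of each line, that the water marks are the
    sys0 attributes maxpout (high) and minpout (low), and that the target is
    high 33 / low 24. The sample text is illustrative, not captured from a
    real system.

```python
def parse_lsattr(text):
    """Parse `lsattr -El <device>` style output into {attribute: value}.

    Assumes the first two whitespace-separated columns are name and value.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            attrs[fields[0]] = fields[1]
    return attrs


def water_marks_ok(attrs, high=33, low=24):
    """True when maxpout/minpout match the expected high/low water marks."""
    try:
        return int(attrs["maxpout"]) == high and int(attrs["minpout"]) == low
    except (KeyError, ValueError):
        return False


if __name__ == "__main__":
    # Illustrative sample only; real output carries more attributes.
    sample = (
        "maxpout 33 HIGH water mark for pending write I/Os per file True\n"
        "minpout 24 LOW water mark for pending write I/Os per file True\n"
    )
    print(water_marks_ok(parse_lsattr(sample)))
```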

    The other thing could be that your servers simply lost all power; the
    error log should be able to confirm this.

    Hope you solve this one.

    Regards

    Peter
    peter.glover@dsl.pipex.com Guest

  4. #3

    Re: looking for some hints on interesting HACMP event

    On 1 Oct 2003 13:11:44 -0700, peter.glover@dsl.pipex.com wrote:
    >
    >Tom
    >
    >Very interesting issue. From the above, it looks as though you may
    >have had multiple issues occur at the same time:
    >
    >1- The secondary server sounds like it lost contact with the primary;
    >in that case it will have attempted to take over the resource group.
    >When it then saw that the primary was up, it would have performed a
    >halt -q (instant stop!). I have no idea why the primary shut down;
    >maybe only IBM can check this out.
    The primary's /tmp/hacmp.out log has the complete debug for the 'remote
    down' event.

    As I say, the secondary has, both in syslog and topsvcs.log.<whatever>
    (I'm posting from home), the three lines about topsvcs exiting, then group
    services, and then the cluster manager. That's where I'm pulling the times.
    And the errlog has a few more entries for the above.
    >
    >I guess you've checked all the log files
    >
    >/tmp/hacmp.out
    >/tmp/cm.log
    >/var/adm/cluster.log
    >errpt
    >
    >This may give you some more info. Have you checked that you've set the
    >high and low water marks on sys0? They should be 33 and 24.
    Yup. Nada.

    Fascinating.

    Level 3 will be looking at it tomorrow, they tell me.
    >
    >The other thing could be that your servers simply lost all power; the
    >error log should be able to confirm this.
    No, we did that two weeks ago :-)

    This was definitely a case of clustermgrES deliberately hitting the power
    switch on his way out -- he even logged it (last log entry in the topsvcs
    log).
    >
    >Hope you solve this one.
    >
    The most I'm hoping for is level 3 saying "it looks like this, and we put
    a fix in for that kind of event a few months ago . . ."

    The thirty-nine cent question is "Why did topsvcs get cpu bound, and why
    did he decide to call it a day?" All else follows from there.

    Thanks --
    Tom

    TomK Guest
