TomK #1
looking for some hints on interesting HACMP event
We had fun on Monday afternoon, and I'm looking for some additional
suggestions on what to look for -- the snap has already gone to IBM HACMP
support.
About 2:40 in the afternoon, we get some complaints that a couple of local
users have been knocked out of their access databases. This includes our
help desk people and THEIR database. While looking into this, we get some
reports that SAP isn't responding.
I check the computer room -- and BOTH nodes in the HA cluster are powered
off -- as is the primary on my test cluster (HA wasn't running on the test
cluster fallover at the time).
The ONLY real info I can find so far is that topsvcs on all three systems
complained about using 34000+ msec of cpu in 32000+ msec of wall-clock
(the numbers differ from system to system, but all amount to roughly two
seconds more cpu than wall clock) and is terminating. Then group services
quit, and clustermgrES hit the big red switch on the way out.
From start to finish, 5 seconds max per machine. But my fallover system
went first; then, 15 seconds later, my test system. And then, a minute and
18 seconds after that, my production SAP system. And so far, the ONLY
notes I can find are the topsvcs logs and syslog entries -- with nothing
before the excessive CPU usage note.
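The timings above can be laid out as simple arithmetic. This is only a sketch: the offsets and the 5-second limit are the figures quoted in this post, the CPU and wall-clock values are rounded down from the "34000+" and "32000+" in the topsvcs message, and the names `GAPS`, `shutdown_timeline` and `cpu_ratio` are made up for illustration.

```python
from datetime import timedelta

# Gap between the start of each shutdown and the one before it,
# as described in the post (fallover node first, then test, then production).
GAPS = [
    ("fallover node", timedelta(0)),
    ("test node", timedelta(seconds=15)),
    ("production SAP node", timedelta(minutes=1, seconds=18)),
]
MAX_SHUTDOWN = timedelta(seconds=5)  # "5 seconds max per machine"


def shutdown_timeline(gaps, max_duration):
    """Return (name, start, latest_end) offsets relative to the first shutdown."""
    timeline = []
    start = timedelta(0)
    for name, gap in gaps:
        start += gap
        timeline.append((name, start, start + max_duration))
    return timeline


def cpu_ratio(cpu_ms, wall_ms):
    """CPU time consumed per unit of wall-clock time."""
    return cpu_ms / wall_ms


timeline = shutdown_timeline(GAPS, MAX_SHUTDOWN)
for name, start, end in timeline:
    print(f"{name}: starts at +{start.total_seconds():.0f}s, "
          f"latest finish +{end.total_seconds():.0f}s")

# Rounded figures from the topsvcs complaint (assumed: 34000 ms CPU, 32000 ms wall).
print(f"CPU / wall-clock ratio: {cpu_ratio(34000, 32000):.4f}")
```

On these figures the three shutdowns span roughly a minute and a half, and topsvcs reported slightly more than 100% of one CPU over its measurement window.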
All three systems are tied to a pair of CISCO switches by multiple 100
Mbit interfaces. As are another 14 RS/6000 systems (mostly 7026-6C1s);
nothing in ANY of their logs. The primary HA environment is a pair of
7028-6M1s. Everything is AIX 5.1 ML 04 at 64-bit, and the HA code is (old)
at 4.4.1 first fix level.
All we've got to go on is the network staff plugging a CISCO switch into a
gigabit fiber feed that ties to a CISCO 6009; the two switches mentioned
above also connect (gig fiber) to the 6009. And the access database issue
(also being fed from the two computer room switches).
From start to finish, the event lasted just over two minutes wall-clock,
as best we can tell.
Nothing out of the ordinary in any of the switch logs, except that several
ports indicate that the mac address changed rapidly for a few seconds --
like HA was trying an adapter swap -- but there is no HA log message of
adapter failure and takeover.
Any thoughts on why topsvcs would go berserk? Suggestions on other items
to look at?
TIA
Tom
peter.glover@dsl.pipex.com #2
Re: looking for some hints on interesting HACMP event
TomK <namffuak@NO.skyenet.SPAM.net> wrote in message news:<gt8knvs2o8s99ghi9v1a7jrbfgndfd3eaa@4ax.com>...
> Any thoughts on why topsvcs would go berserk? Suggestions on other items
> to look at?
Tom
Very interesting issue. From the above, it would appear that you may
have had multiple issues occur at the same time:
1- The secondary server sounds like it lost contact with the primary.
In that case it will have attempted to take over the resource group.
When it then saw that the primary was up, it would have performed a
halt -q (instant stop!). I have no idea why the primary shut down;
maybe only IBM can check this out.
I guess you've checked all the log files:
/tmp/hacmp.out
/tmp/cm.log
/var/adm/cluster.log
errpt
These may give you some more info. Have you checked that you've set the
high and low water marks on sys0? They should be 33 and 24.
The other possibility is that your servers simply lost power; the
error log should be able to confirm this.
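One way to go through the log files listed above is to search each of them for a keyword such as "topsvcs". The sketch below is only an illustration: the paths are the ones named in this post, `find_mentions` is a made-up helper name, and whether "topsvcs" actually appears in any of these files is an assumption. errpt is a command rather than a file, so it is not covered here.

```python
from pathlib import Path

# Log files named in the post; errpt output would have to be captured separately.
LOG_FILES = ["/tmp/hacmp.out", "/tmp/cm.log", "/var/adm/cluster.log"]


def find_mentions(paths, keyword):
    """Map each path to a list of (line_number, line) pairs containing keyword.

    A file that is missing or unreadable maps to None, so gaps stay visible.
    """
    results = {}
    for path in paths:
        try:
            with Path(path).open(errors="replace") as handle:
                results[path] = [
                    (number, line.rstrip("\n"))
                    for number, line in enumerate(handle, start=1)
                    if keyword in line
                ]
        except OSError:
            results[path] = None
    return results


if __name__ == "__main__":
    for path, hits in find_mentions(LOG_FILES, "topsvcs").items():
        if hits is None:
            print(f"{path}: not readable")
            continue
        print(f"{path}: {len(hits)} matching line(s)")
        for number, line in hits:
            print(f"  {number}: {line}")
```

Running it on each node and comparing the line numbers and timestamps side by side may make it easier to see which log went quiet first.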
Hope you solve this one.
Regards
Peter
TomK #3
Re: looking for some hints on interesting HACMP event
On 1 Oct 2003 13:11:44 -0700, peter.glover@dsl.pipex.com wrote:
>Tom
>
>Very interesting issue. From the above, it would appear that you may
>have had multiple issues occur at the same time:
>
>1- The secondary server sounds like it lost contact with the primary.
>In that case it will have attempted to take over the resource group.
>When it then saw that the primary was up, it would have performed a
>halt -q (instant stop!). I have no idea why the primary shut down;
>maybe only IBM can check this out.
The primary's /tmp/hacmp.out log has the complete debug for the 'remote
down' event.
As I say, the secondary has, both in syslog and topsvcs.log.<whatever>
(I'm posting from home) the three lines about topsvcs exiting, then group
services, and then cluster manager. That's where I'm pulling the times. And
the errlog has a few more entries for the above.
>I guess you've checked all the log files
>
>/tmp/hacmp.out
>/tmp/cm.log
>/var/adm/cluster.log
>errpt
>
>These may give you some more info. Have you checked that you've set the
>high and low water marks on sys0? They should be 33 and 24.
Yup. Nada.
Fascinating.
Level 3 will be looking at it tomorrow, they tell me.
>The other possibility is that your servers simply lost power; the
>error log should be able to confirm this.
No, we did that two weeks ago :-)
This was definitely a case of clustermgrES deliberately hitting the power
switch on his way out -- he even logged it (last log entry in the topsvcs
log).
>Hope you solve this one.
The most I'm hoping for is level 3 saying "it looks like this, and we put
a fix in for that kind of event a few months ago . . ."
The thirty-nine cent question is "Why did topsvcs get cpu bound, and why
did he decide to call it a day?" All else follows from there.
Thanks --
Tom