I am posting this to the group because belalsco.com bounces...
>> Bela said...
>> >Well, we now have four suggestions competing to be tried:
>> >
>> > - install various supplements
>> > - increase the pit_mincnt floor
>> > - turn off short timers entirely (disable_short_timers=1)
>> > - provide an external heartbeat
>> >...
>> >I guess I would recommend:
>> >
>> > 1. install oss648a & oss651a; wait until you can tell whether the
>> > problems are fixed.
>> >
>> > 2. If not fixed, turn off short timers (disable_short_timers=1 in
>> > /etc/conf/pack.d/clock/space.c, relink kernel). Wait.
>> >
>> >If not fixed after those two steps, the problem is being misdiagnosed
>> >and you need to go back to initial discovery steps.
>> Bela,
>> Thank you for your continued assistance.
>> The updates were installed as at Step 1 above on Thursday 10th July.
>> We had no further problems on Thursday or Friday, and the overnight
>> scheduled backups worked fine, however - on Saturday I came in to the
>> office to check on the backup and found that all of the Specialix
>> connected serial terminals were dead, and console activity would not
>> resurrect them.
>> I rebooted the machine, and all has been well so far today (Monday
>> 2.35pm).
>> It appears that the timer problem may have been fixed (should I give it
>> a few more days and then reduce the pit_mincnt value to standard?), but
>> that we may still have a problem with the Specialix adapter. (Driver is
>> up to date).
>Neither of those supplements (oss648a & oss651a) fixes the timer
>problem, to my knowledge. However, oss651a contains code that is
>"opaque" to us -- Intel provides microcode updates in binary form and
>will not answer questions about what errata are corrected.
>So, that is: as far as I know, you must continue to use a high
>pit_mincnt, or risk periodic hangs which can be restarted by hitting a
>key on the console. It hadn't occurred to me until now that this
>behavior might be rooted in an erratum that the microcode updates would
>fix. If you are willing, I _would_ like to know what happens now if you
>set it back to normal. BUT, since I don't know of any such erratum, I
>suspect what will happen is you'll start getting those hangs again.
>So if you're willing to risk another hang or two (before you raise
>pit_mincnt again), I would definitely like to know. I just don't have
>much hope of hearing that the microcode actually helped.
>Meanwhile, yes, you may still have a problem with the Specialix driver.
>After you're done fiddling with the idea of microcode having helped the
>hangs [and by "after you're done" I mean, either after you try lowering
>pit_mincnt again; or immediately, after you decide it isn't worth the
>hassle] -- after that detour is complete, I think your next step should
>be to fully disable short timers, like I previously suggested:
> 2. If not fixed, turn off short timers (disable_short_timers=1 in
> /etc/conf/pack.d/clock/space.c, relink kernel). Wait.
>I say this because (1) the remaining hangs _could_ still be a timer
>spin, though I doubt it; and (2) they could also be caused by the
>Specialix driver simply not expecting us to do short timers. These are
>both low probability ideas. #2 is low because Specialix has had since
>we shipped 506, in mid-2000, to deal with the change. #1 is low
>because the remaining hangs are different (console keyboard doesn't wake
>the system). Low, but still high enough to be worth a simple probe.
The system has behaved itself for over two weeks.

We disabled the short timers as you suggested - unfortunately I haven't
had any opportunity to experiment with the level of pit_mincnt.

We do seem to have another inexplicable problem, however, in that the
'hardware' system time and the UNIX clock seem to get confused.

E.g. - the command 'w' shows a different up-time to what one would
expect based on reality, and compared to 'who -b' ...
$ w
10:19am  up 34 mins,  12 users,  load average: 0.00, 0.00, 0.00
User     Tty      Login    Idle  JCPU  PCPU  What

andrew   ttyp1    10:19am  -     -     -     w
$ who -b
.        system boot  Jul 30 07:13
If we reboot and hit return when asked for the date, the times remain
out of synch, but entering the correct time at reboot sorts it out.
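
To put a number on the mismatch, the arithmetic from the output above can
be sketched in Python. This is only a rough check: it assumes 'w' was run
on the same day as the boot reported by 'who -b' (Jul 30), and the function
name is made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%H:%M"  # 24-hour clock, hours and minutes only


def elapsed(boot_hhmm, now_hhmm):
    """Wall-clock time between boot and now, assuming the same calendar day."""
    return datetime.strptime(now_hhmm, FMT) - datetime.strptime(boot_hhmm, FMT)


boot_time = "07:13"               # from 'who -b'
current_time = "10:19"            # from the 'w' header (10:19am)
reported = timedelta(minutes=34)  # 'up 34 mins' from the 'w' header

expected = elapsed(boot_time, current_time)
discrepancy = expected - reported

print("expected uptime:", expected)
print("reported uptime:", reported)
print("discrepancy:    ", discrepancy)
```

On those figures 'w' should be reporting a little over three hours of
uptime, so the two clocks disagree by roughly two and a half hours.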

Thank you for your help.

Andrew Barnett - SGS Limited