I am posting this to the group because belalsco.com bounces...
>> Bela said...
>>
>> >Well, we now have four suggestions competing to be tried:
>> >
>> > - install various supplements
>> > - increase the pit_mincnt floor
>> > - turn off short timers entirely (disable_short_timers=1)
>> > - provide an external heartbeat
>> >...
>>
>> >I guess I would recommend:
>> >
>> > 1. install oss648a & oss651a; wait until you can tell whether the
>> > problems are fixed.
>> >
>> > 2. If not fixed, turn off short timers (disable_short_timers=1 in
>> > /etc/conf/pack.d/clock/space.c, relink kernel). Wait.
>> >
>> >If not fixed after those two steps, the problem is being misdiagnosed
>> >and you need to go back to initial discovery steps.
>>
>> Bela,
>>
>> Thank you for your continued assistance.
>>
>> The updates were installed as at Step 1 above on Thursday 10th July.
>>
>> We had no further problems on Thursday or Friday, and the overnight
>> scheduled backups worked fine. However, when I came in to the office
>> on Saturday to check on the backup, I found that all of the
>> Specialix-connected serial terminals were dead, and console activity
>> would not resurrect them.
>>
>> I rebooted the machine, and all has been well so far today (Monday
>> 2.35pm).
>>
>> It appears that the timer problem may have been fixed (should I give it
>> a few more days and then reduce the pit_mincnt value to standard?), but
>> that we may still have a problem with the Specialix adapter. (Driver is
>> up to date).
>
>Neither of those supplements (oss648a & oss651a) fixes the timer
>problem, to my knowledge. However, oss651a contains code that is
>"opaque" to us -- Intel provides microcode updates in binary form and
>will not answer questions about what errata are corrected.
>
>So, that is: as far as I know, you must continue to use a high
>pit_mincnt, or risk periodic hangs which can be restarted by hitting a
>key on the console. It hadn't occurred to me until now that this
>behavior might be rooted in an erratum that the microcode updates would
>fix. If you are willing, I _would_ like to know what happens now if you
>set it back to normal. BUT, since I don't know of any such erratum, I
>suspect what will happen is you'll start getting those hangs again.
>
>So if you're willing to risk another hang or two (before you raise
>pit_mincnt again), I would definitely like to know. I just don't have
>much hope of hearing that the microcode actually helped.
>
>Meanwhile, yes, you may still have a problem with the Specialix driver.
>After you're done fiddling with the idea of microcode having helped the
>hangs [and by "after you're done" I mean, either after you try lowering
>pit_mincnt again; or immediately, after you decide it isn't worth the
>hassle] -- after that detour is complete, I think your next step should
>be to fully disable short timers, like I previously suggested:
>
> 2. If not fixed, turn off short timers (disable_short_timers=1 in
> /etc/conf/pack.d/clock/space.c, relink kernel). Wait.
>
>I say this because (1) the remaining hangs _could_ still be a timer
>spin, though I doubt it; and (2) they could also be caused by the
>Specialix driver simply not expecting us to do short timers. These are
>both low-probability ideas. #2 is low because Specialix has had since
>we shipped 506, in mid-2000, to deal with the change. #1 is low
>because the remaining hangs are different (console keyboard doesn't wake
>the system). Low, but still high enough to be worth a simple probe.
>
>>Bela<
The system has behaved itself for over two weeks.

We disabled the short timers as you suggested - unfortunately I haven't
had any opportunity to experiment with the level of pit_mincnt.

We do seem to have another inexplicable problem, however, in that the
'hardware' system time and the UNIX clock seem to get confused.

E.g. the command 'w' reports an uptime that matches neither the real
elapsed time nor the boot time shown by 'who -b' ...
>
$$ w
10:19am up 34 mins, 12 users, load average: 0.00, 0.00, 0.00
User Tty Login Idle JCPU PCPU What

andrew ttyp1 10:19am - - - w
$$ who -b
. system boot Jul 30 07:13
$$
>
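To put a number on the mismatch in the output above, here is a minimal
sketch in Python. It is purely illustrative: the function names are
made up for this example, and it assumes that both commands were run on
the same day (Jul 30) and that the 07:13 boot time from 'who -b' is the
correct one.

```python
def minutes_since_midnight(hhmm: str) -> int:
    """Convert a 24-hour 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def uptime_gap(boot: str, now: str, reported_uptime_min: int) -> int:
    """Minutes by which wall-clock uptime exceeds the reported uptime.

    Assumption: 'boot' and 'now' fall on the same calendar day.
    """
    expected = minutes_since_midnight(now) - minutes_since_midnight(boot)
    return expected - reported_uptime_min


if __name__ == "__main__":
    # Values taken from the 'w' and 'who -b' output above.
    expected = minutes_since_midnight("10:19") - minutes_since_midnight("07:13")
    gap = uptime_gap("07:13", "10:19", 34)
    print(f"expected {expected} min, 'w' reports 34 min, gap {gap} min")
```

Under those assumptions, 'w' should have shown roughly 3 hours 6 minutes
rather than 34 minutes, i.e. the two views of the clock differ by about
two and a half hours.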
If we reboot and just hit return when asked for the date, the times
remain out of sync; entering the correct time at reboot fixes it, but
only temporarily.

Thank you for your help.
--
Regards

Andrew Barnett - SGS Limited