ok, will do. first I have to find and restore the KLH10 instance from
backup, thanks to an unexpectedly violent storm that triggered tornado
warnings and consecutive brown-outs. thanks.
On 9/3/20 7:39 PM, Thomas DeBellis wrote:
Shortly after sending this, I wedged my development machine by
mistakenly beating on the file system; this time by running SPEAR to
pull out events around the DTEKPA BUGCHK. There was too much activity
(I have a very large ERROR.SYS, thanks to DECnet) and I got a DTEKPA.
Once this happens, the machine hangs shortly afterwards. This finally
caused me to have a look at DTESRV.
KPALIV is a variable that is incremented by Tops-20 in a number of
circumstances by SCHED, APRSRV and (oddly) CFSSRV. It's a keep-alive
counter that both the front end and Tops-20 pay attention to. An
examination of the live monitor shows that it is monotonically increasing:
1,,COMBAS+5[   417,,424521
1,,COMBAS+5[   417,,426524
1,,COMBAS+5[   417,,510532
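As a rough cross-check of the three samples above, here is a minimal Python sketch that treats each value as a 36-bit word written as two octal 18-bit halves (left,,right) and confirms that the sequence only goes up. The helper name `halfwords_to_int` is made up for illustration; it is not part of DDT or the monitor.

```python
def halfwords_to_int(text):
    """Convert an octal 'left,,right' halfword pair into one 36-bit integer."""
    left, right = text.split(",,")
    return (int(left, 8) << 18) | int(right, 8)

# The three KPALIV samples quoted above, in the order they were taken.
samples = ["417,,424521", "417,,426524", "417,,510532"]
values = [halfwords_to_int(s) for s in samples]

# Differences between consecutive samples; all positive means increasing.
deltas = [later - earlier for earlier, later in zip(values, values[1:])]
increasing = all(d > 0 for d in deltas)
```

The deltas only say the counter moved forward between examines; the samples carry no timestamps, so they say nothing about the rate.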
It is updated approximately every 500 milliseconds; let's call that a
keep-alive tick. If it isn't updated in two ticks, the front end is
declared down and reload action is initiated. A number of things are
done and it appears that KLH10 is not properly handling them. Since
the KLH10 DTE service is not running in a separate process (there are
vestigial hooks to do this), it does not handle a ten triggered reload.
Tops-20 waits for the reload to complete, KLH10 does nothing and
you're hung.
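The two-tick rule described above can be sketched roughly as follows. This is an illustration in Python, not the DTESRV code: the class name, method name and constants are assumptions, and only the "about 500 milliseconds" and "two ticks" figures come from the description.

```python
TICK_MS = 500           # approximate keep-alive interval, per the description
MISSED_TICKS_LIMIT = 2  # peer is declared down after two ticks with no update

class KeepAliveWatcher:
    """Hypothetical watcher: counts consecutive ticks with an unchanged counter."""

    def __init__(self, initial_count):
        self.last_count = initial_count
        self.missed = 0

    def tick(self, current_count):
        """Call once per tick; returns True when the peer should be declared down."""
        if current_count != self.last_count:
            # Counter moved, so the other side is alive; reset the miss count.
            self.last_count = current_count
            self.missed = 0
            return False
        self.missed += 1
        return self.missed >= MISSED_TICKS_LIMIT
```

With a limit of two ticks, a stall of roughly one second is enough to trip the check, which is why a burst of disk activity that delays the update can look like a dead front end.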
Fortunately, there is some code for the master DTE which checks a
variable called FEDBSW, Front End Debugging Switch. If this is
non-zero, then the keep-alive count is incremented, but it's never
checked. So I set it to -1 (it was zero) and then proceeded to beat
on the file system with wild abandon.
For periods of intense disk activity, the machine appeared to hang.
After about 10 to 20 seconds, it came right back as if nothing had
ever happened. Interesting...
Right now, my working assumption is that the PI system is getting
saturated so that the clock interrupt somehow isn't making it
through. For now, I'm thinking of rewriting the service routine so
that instead of checking for two ticks, it checks elapsed time which
can then be set to some 'reasonable' value.
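One way the elapsed-time idea could look, again as a hedged Python sketch rather than anything from the monitor sources: the class name is invented, and the 30 second timeout is only an illustrative 'reasonable' value chosen to sit above the 10 to 20 second stalls mentioned above.

```python
class ElapsedKeepAlive:
    """Hypothetical check: declare the peer down only after a time threshold."""

    def __init__(self, initial_count, now_ms, timeout_ms=30000):
        self.last_count = initial_count
        self.last_change_ms = now_ms   # when the counter was last seen to move
        self.timeout_ms = timeout_ms   # assumed tunable, not a value from the source

    def check(self, current_count, now_ms):
        """Returns True when the counter has been stuck for timeout_ms or longer."""
        if current_count != self.last_count:
            # Counter advanced; remember when, and report the peer as alive.
            self.last_count = current_count
            self.last_change_ms = now_ms
            return False
        return (now_ms - self.last_change_ms) >= self.timeout_ms
```

The difference from a tick count is that a delayed or missed check does not by itself move the peer closer to being declared down; only real elapsed time without an update does.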
If you think this may be what is hanging you, then you can try it.
For me, FEDBSW is at octal 1,,304544. Thus far, I'm up 42:44:57 (1
Day, 18 Hours, 44 Minutes, 57 Seconds and 615 Milliseconds).
> ------------------------------------------------------------------------
>
> On 8/31/20 9:03 PM, Thomas DeBellis wrote:
>
> Do you know what program is displaying those three lines?
>
> I'm unaware of a PANDA distribution that didn't announce itself as a
> PANDA distribution in the system banner. The date and time display
> is odd. Tops-20 native time output has been Y2K compliant since
> forever. It's the Tops-10 programs (MACRO, CREF, Etc.), plus
> Tops-10'ish programs (GLXLIB, Quasar, Etc.) that needed Y2K patches.
>
> Tops-20 DAP needed a small modification to handle Y2K and to not
> break RSX.
>
> The Tops-10 system that I use has a number of non-Y2K times, which
> surprised me. While I have had the freedom to remediate, I simply
> don't have the time. But it's jarring.
>
> I also found it interesting that the banner says DEC10 Development;
> 20's were sometimes called DEC20's, but never DEC10's (well, 1031
> might have been an exception).
>
> I could have sworn you were showing us something off of a Tops-10 CTY...
>
>> ------------------------------------------------------------------------
>> On 8/31/20 7:13 PM, Supratim Sanyal wrote:
>>
>> I will keep digging - but it is possibly interesting this happens
>> between approx 52 and an indeterminate number of solid uptime
>>> ------------------------------------------------------------------------
>>>
>>> On Aug 31, 2020, at 5:00 PM, Thomas DeBellis
>>> <tommytimesharing at gmail.com> wrote:
>>>
>>> If you are running a standard PANDA distribution, then DDT is in
>>> the monitor and you may fail to it. Did it come up? Did you do an
>>> examine from the KLH10 micro-engine to see what instruction it was
>>> failing on? Did you see what module it is failing in?
>>>
>>> My monitor is modified from the base PANDA distribution to include
>>> several local enhancements, so when I looked at that address, it
>>> showed up as in the entry of CHKOPC, which is what is checking for
>>> deferred closes on virtual circuits. This is in PHYKLP which is
>>> the KLIPA driver (a.k.a. the CI). Since KLH10 (sadly) does not
>>> implement the CI, there is no way you should be executing in that
>>> module as there is nothing for it to talk to.
>>>
>>> Moreover, there is no JRST 4 there. So probably you have something
>>> else at that address.
>>>
>>> I have been running KLH10 for a /very/ long time; since late
>>> December 2002 and have made modifications there, too, to fix an
>>> issue with locking memory and to better support Linux (recent
>>> Ubuntu). It is remarkably robust; despite intensive development, I
>>> have stayed up well over a year at a time (I.E., hit UP2LNG BUGHLT's)
>>>
>>> I have found one problem; if you are running it on an _extremely_
>>> fast machine with SSD storage (in other words, you're basically
>>> never waiting for anything) and you seriously beat on the file
>>> system, then the keep-alive counter can get out of sync with the 20
>>> thinking the front end has died and the KLH10 DTE simulator
>>> apparently not understanding what to do.
>>>
>>> The 20 typed an initial BUGCHK and then in the middle of the second
>>> one, it hangs waiting for the front end.
>>>
>>> It's on my list of things to investigate.
>>>
>>>> ------------------------------------------------------------------------
>>>> On 8/31/20 4:15 PM, Supratim Sanyal wrote:
>>>>
>>>> hi all - my panda distribution instance is halting after a couple
>>>> of days with the following message. is this a known problem for
>>>> which there is some workaround?
>>>>
>>>> Monitor RF434E DEC10 Development
>>>> System uptime 52:10:47
>>>> Current date/time Wednesday 29-Jul-120 6:01:04
>>>>
>>>> [HALTED: Program Halt, PC = 22013]
>>>>
>>>> thanks
>>>>
>>>> Supratim
>>>>
--
Supratim Sanyal, W1XMT
39.19151 N, 77.23432 W
QCOCAL::SANYAL via HECnet