I discovered a pernicious boot race condition in FAL. Early in system
start up, only the boot structure (which may not be the login
structure) is recognized by Tops-20. MOUNTR has to start and identify
the other structures that should be made available, along with their
settings, which it does by reading them from where they are persisted,
the SYSTEM:DEVICE-STATUS.BIN file. Until MOUNTR has completed structure
identification, any RCDIR% might fail for a non-boot structure. This is
exactly what happened to FAL, which runs asynchronously and sometimes
crashed during start up.

One quick and dirty hack was to simply delay starting FAL, which I did,
but that has its own problems which I won't detail here. I tried some
other hacks, such as doing the mount from the EXEC, which had drawbacks
of their own.

The real thing to do is to issue a mount request for the structure from
FAL itself. Once Galaxy is up and running and has processed the request,
RCDIR% is safe to do (or safer, anyway). At any rate, there should be no
timing issue in this case. Having been Columbia's Galaxy programmer in
the 1980's, and having done some minor work in LCG's Galaxy group in the
late 1970's, I actually knew how to make such a request and had done so,
having written some subroutines to issue queue requests for our ID
system in the 1980's.

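The ordering argument above can be shown with a toy model. This is only a sketch in Python, not Tops-20 code: ToyMonitor, ToyMounter, lookup_directory and open_served_directory are all invented names, with lookup_directory standing in for RCDIR% and ToyMounter.mount standing in for a mount request that has been processed.

```python
class ToyMonitor:
    """Toy stand-in for the monitor's set of recognized structures (hypothetical)."""

    def __init__(self, boot_structure):
        # At start up only the boot structure is known.
        self._known = {boot_structure}

    def add_structure(self, name):
        self._known.add(name)

    def lookup_directory(self, structure, directory):
        # Stand-in for RCDIR%: fails if the structure is not yet recognized.
        if structure not in self._known:
            raise LookupError(f"structure {structure} not recognized yet")
        return f"{structure}:<{directory}>"


class ToyMounter:
    """Toy stand-in for the mount service (hypothetical)."""

    def __init__(self, monitor):
        self._monitor = monitor

    def mount(self, structure):
        # Returns only after the structure is recognized, so the caller
        # gets an ordering guarantee instead of a timing guess.
        self._monitor.add_structure(structure)


def open_served_directory(monitor, mounter, structure, directory):
    # Mount first, then look up: the lookup cannot run ahead of the mount.
    mounter.mount(structure)
    return monitor.lookup_directory(structure, directory)
```

In this model, a bare lookup_directory on a non-boot structure fails until something has mounted it, while open_served_directory succeeds no matter how early it is called.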
So, I decided to fix FAL to 'do the right thing' and wound up down
another rabbit hole. I should emphasize "/knew/ how to do", as doing
that kind of Galaxy IPCF request is kind of hairy, particularly if you
haven't done it in four decades. After I looked at it for a while, I
despaired of having to relearn it.

Incredibly, DEC defined a new JSYS called QUEUE% which seemed to do the
trick. Briefly, it performs certain Galaxy requests on behalf of the
user, converting its simpler calling format to an internal Galaxy
format, getting a PID, and returning the response. QUEUE% does a lot of
cool things, including mount requests. Or at least it's documented to...

Besides QUEUE% having its own bugs, there is no code in Galaxy to handle
a mount request from QUEUE%... The simpler format that QUEUE% is using is
actually something that Quasar internally calls a 'short' queue request,
which I had known about back in the day. What happens is that the
QSRQUE module converts it into the regular ('long') format and away you
go. This works fine for printing, plotting, punching, and other kinds
of spooling requests, including batch jobs.

There is no short format for mounting, and QSRQUE has no code for this
because mount requests are dispatched to MOUNTR... Nothing is persisted
in the failsoft file. So I wrote a conversion routine to turn a short
mount request into the long format before the request makes it to
QSRMDA and thence to MOUNTR.

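The shape of that fix, a table of short-to-long converters that was missing an entry for mounts, can be sketched as follows. This is a Python illustration only, not the Galaxy code: ShortRequest, CONVERTERS, to_long and every field name are invented here, and the real message layouts are not reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class ShortRequest:
    """Hypothetical 'short' request: a kind plus a few simple arguments."""
    kind: str
    args: dict = field(default_factory=dict)


def _long_print(args):
    # Illustrative long form for a spooling request.
    return {"type": "PRINT", "file": args["file"], "copies": args.get("copies", 1)}


def _long_mount(args):
    # The previously missing case: expand a short mount request.
    # Normalizing the structure name is an assumption made for this sketch.
    return {"type": "MOUNT", "structure": args["structure"].upper().rstrip(":")}


# One converter per request kind; without the "mount" entry, mount
# requests in short form are simply rejected.
CONVERTERS = {
    "print": _long_print,
    "mount": _long_mount,
}


def to_long(request):
    """Convert a short request to its long form, or raise ValueError."""
    try:
        convert = CONVERTERS[request.kind]
    except KeyError:
        raise ValueError(f"no short-format converter for {request.kind!r}") from None
    return convert(request.args)
```

The converted message would then be handed on to whatever normally receives long-format requests of that type; that hand-off is outside the scope of this sketch.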
Naturally, doing all that resulted in a hailstorm of other issues as I
pushed things a little farther than they had been designed for.

It happens. I'm working through everything, bit by bit.