I got annoyed at the thought of having to wait a few more months for the
error condition to show up and, instead of having the batch job run more
frequently (and thus beating on poor MIM::), I wrote another batch job
which took every single file that I have /ever/ downloaded from MIM::
and inserted it. So that's 75 files, and it failed on number 54.
15:36:51 USER   SETNOD>*T OLDS:NODE-DATA.TXT.54
15:36:52 USER
15:36:52 USER   SETNOD>*List Total
15:36:52 USER
15:36:52 USER
15:36:52 USER   TOTAL NODES FOUND: 869
15:36:52 USER
15:36:52 USER   SETNOD>*Insert
15:36:52 USER
15:36:52 USER   ?SETNOD: Failed at node RSX11M (2.298), Item 650 of 869, Error: -11
15:36:52 USER   SETNOD>
It is interesting that it is failing on node 2.298, but this is before
that number had been reassigned to REACH::. The error return of
negative 11 means "Component in Wrong State" (aka NF.CWS), which I
didn't find immediately informative. However, now I've got something
to look around for.
I still can't imagine why there would be anything particularly
diabolical about the number 2.298.
------------------------------------------------------------------------
On 5/5/21 12:38 AM, Thomas DeBellis wrote:
I finished the modifications to SCLINK to properly return error values
which are negative and to JNTMAN to return the error value in AC3 if
.NDINT doesn't succeed in inserting all the nodes. Then I modified
SETNOD to get this extended error information and print it. I put the
new monitor and SETNOD up, rebooted *AND*...
SETNOD>set nod 2.298 name REACH
SETNOD>ins
SETNOD>
It works perfectly because, of course it does.
So, as usual, Johnny's guess is pretty close to the mark, even if he
isn't a 36 bit'er. "Slightly broken"? Yeah, 'slightly' enough so
that it can't be easily reproduced.
The only thing I can think of is that the system had been up over 15
weeks when I saw this. I had looked at the storage space utilization
with SYSDPY and didn't notice anything maxing out. I restarted the
GETNOD batch job on VENTI2::. Maybe in another 15 weeks, it will
break again.
/Annoyed/.
> ------------------------------------------------------------------------
> On 5/4/21 10:31 PM, Thomas DeBellis wrote:
>
> Personally, I don't see how it could /possibly/ be anything to do
> with the REACH:: node definition, but I have been known to
> occasionally overlook the utterly obvious, particularly when it's
> near night-night. Maybe not this time.
>
> Right now, the way to figure it out is to get the minor error data
> and see where that takes things. So I'm making a change to JNTMAN to
> have .NDINT return the lower level code on an incomplete insert.
> SCLINK appears to have a problem in that it is mangling return values,
> which I'm currently investigating.
>
> You can't just blithely assume somebody got it wrong and 'fix'
> things; sometimes it's a certain way for a reason.
>
> On 5/4/21 8:46 PM, Johnny Billquist wrote:
>> On 2021-05-05 00:54, Mike Kostersitz wrote:
>>> Ouch that is one of my nodes. @Johnny Billquist
>>> <mailto:bqt at softjar.se> anything you could think of since we just
>>> renamed my old RSX11M node to REACH.
>>
>> Well. It is something slightly broken in Tops-20, so there isn't
>> really anything we can do about it.
>>
>> Except hope that Thomas can figure it out and fix it.
>>
>>   Johnny
>>
>>>
>>> Mike
>>>
>>> *From: *Thomas DeBellis <mailto:tommytimesharing at gmail.com>
>>> *Sent: *Tuesday, May 4, 2021 15:16
>>> *To: *HECnet <mailto:hecnet at update.uu.se>
>>> *Subject: *[HECnet] Re: Tops-20 SETNOD Failure
>>>
>>> I fixed a few things in SETNOD to get some more information about
>>> the error. In particular,
>>>
>>>   * Allow listing of AREA 1 (this was specifically disallowed, I don't
>>>     know why)
>>>   * More consistent error reporting (via ESOUT%)
>>>   * List more than one node when doing an area list (it would only
>>>     list a single node)
>>>   * List nodes with more than three digits in the node number when
>>>     doing columnar output
>>>
>>> So now you get the expected results:
>>>
>>>     SETNOD>lis a 1
>>>     [Area 1]
>>>     A1RTR   1023    ATHENA   620    ATLE     605    AURORA   606    BANAI    770
>>>     BANX25   771    BEA       19    BIZET    800    BJARNE     7    BLINKY   266
>>>     CATWZL   302    CLYDE    269    COOPER   263    CRISPS   201    CYGNUS   259
>>>     DAVROS   254    DBIT     351    DE1RSX   450    DE1RSY   452    DOCTOR   252
>>>     ELIN     616    ELMER    617    ERNIE      2    ERSATZ   350    FLETCH   100
>>>     FNATTE     3    FREJ     608    GAXP     730    GNAT      16    GNOME      6
>>>     GOBLIN     4    GVAX     731    HAGMAN   262    HARPER   261    HORSE    150
>>>     HUGIN    602    HYUNA    500    INKY     268    JIMIN    501    JOCKE     21
>>>     JOSSE     17    KLIO     451    KRILLE     8    LOKE     607    MACARO   303
>>>     MACRA    258    MAGICA     1    MASTER   251    MIM       13    MUNIN    603
>>>     NIPPER   202    NOMAD    610    NOXBIT   720    ORACLE   301    PACMAN   265
>>>     PAI      541    PALLAS   621    PAMINA    18    PIDP11   560    PINKY    267
>>>     PISTON   520    PLINTH   200    PMAVS2   510    PONDUS    15    PONY      12
>>>     PUFF      22    QEMUNT   151    REI      540    ROCKY     11    ROJIN    542
>>>     RSX124   306    RSX145   304    RSX170   305    RSX184   307    RUTAN    255
>>>     SHARPE   260    SIDRAT   253    SIGGE     10    SPEEDY    24    TARDIS   250
>>>     TEMPO      9    THOROS   257    TINA      14    TIPSY    604    TONGUE   264
>>>     TOPSY    601    VALAR    400    VAROS    256    WXP       20    WXP2      23
>>>     YMER     609    ZEKE       5
>>>     Total nodes in area 1: 92
>>>     SETNOD>exit
>>>
>>> Regarding the error, I have reproduced it with a single entry, viz:
>>>
>>>     !setnod
>>>     SETNOD>set nod 2.298 name REACH
>>>     SETNOD>insert
>>>     ?SETNOD: Failed at node REACH (2.298), Item 0 of 1
>>>     SETNOD>
>>>
>>> The high level code to do the entry is in JNTMAN. It loops through
>>> the table passed to it via .NDINT, calling a lower level routine
>>> called SCTAND in SCLINK. An error here is passed up to JNTMAN, but
>>> it is not passed back to the user. There are some other problems in
>>> SCLINK pertaining to negative return values, so some minor work is
>>> necessary there, also.
>>>
>>> I'll make some changes to these two modules, generate a new monitor
>>> for VENTI2 and see what happens in a few days.
>>>
>>> Right now, if any Tops-20 user is using SETNOD to update DECnet
>>> tables, this appears to fail. If anybody else is seeing it or can
>>> reproduce it, I'd like to hear about it.
>>>
>>>     On 5/4/21 11:15 AM, Thomas DeBellis wrote:
>>>
>>>     Has anybody ever seen SETNOD fail to insert the entire node
>>>     list? I just did.
>>>
>>>     Shortly after I put my 20's up on HECnet, I wrote a recurring
>>>     batch job that fires once a week on Sundays to pull the latest
>>>     node list (T20.FIX) from MIM::. I use the highly venerable FILCOM
>>>     program to do a difference of it with the previous week's list. I
>>>     don't do anything in particular with the output except save it in
>>>     case I feel like looking at it for some reason.
>>>
>>>     The batch job always inserts the entire list, rewriting whatever
>>>     might be in the monitor's data base. I have always been unsatisfied
>>>     with doing things that way because it seemed to me to be inefficient
>>>     as the node list grew. The HECnet node list count was 716 on
>>>     9-Jun-19 and it's now up to 884 as of the latest version that I've
>>>     pulled, 30-Apr-21. The other problem is the microscopic possibility
>>>     that a node is in Tops-20's monitor database (a hash table) that
>>>     isn't in the HECnet node list.
>>>
>>>     Nodes can get removed, although I think that is infrequent. Nodes
>>>     could get inserted outside of the batch job, but I think that most
>>>     unlikely in my situation. Nodes can get renamed, as evidenced by
>>>     2.299 below, which went from THEPIT to THEARK. None of this should
>>>     have broken anything, and until now none of it has.
>>>
>>>     However, it's been in the back of my mind to do two enhancements,
>>>     one to Tops-20 and one to SETNOD. The NODE% JSYS should have an
>>>     additional feature to return the current monitor data base. The
>>>     SETNOD program should be enhanced to take that and compute the set
>>>     difference with the new list. This would show additions, renames
>>>     and deletions. That would bring the update operation down from some
>>>     hundred items to less than ten, on average. This would obviously
>>>     make more of a difference on huge DECnets in the tens of thousands
>>>     of nodes. Another NODE% feature should probably be to whack the
>>>     entire monitor database except for the local node, which would be
>>>     useful for troubleshooting.
>>>
>>>     Last Sunday, the batch job failed with the following error:
>>>
>>>     18:33:40 USER   SETNOD>*TAKE SYSTEM:NODE-DATA.TXT.0
>>>     18:33:40 USER
>>>     18:33:40 USER   [Fork SETNOD opening <SYSTEM>NODE-DATA.TXT.1 for reading]
>>>     18:33:41 USER   SETNOD>*SAVE
>>>     18:33:41 USER
>>>     18:33:41 USER   [Fork SETNOD opening <SYSTEM>NODE-DATA.BIN.74 for reading, writing]
>>>     18:33:41 USER   SETNOD>*INSERT
>>>     18:33:41 USER
>>>     18:33:41 USER   ?SETNOD: Failed at node REACH
>>>     18:33:41 USER   SETNOD>
>>>
>>>     I had a look at the SETNOD source and the HECnet node list and have
>>>     discovered and concluded a few things. First, there doesn't seem to
>>>     be anything syntactically wrong with REACH::'s definition: "set nod
>>>     2.298 name REACH". Second, there don't appear to be any semantic
>>>     issues. 2.298 wasn't in use and it shouldn't matter if it was.
>>>
>>>     In the case of INSERT, there are two kinds of errors from NODE%, a
>>>     general failure of the JSYS and an incomplete insertion. The error
>>>     is from the second case. Unfortunately, SETNOD isn't reporting
>>>     enough information about the error, so I have to make some changes
>>>     there. It's also possible that SETNOD is building an inconsistent
>>>     database for the monitor to swallow; at least the LIST command is
>>>     giving me some odd results, viz:
>>>
>>>         SETNOD>list arEA 2
>>>
>>>         [AREA 2]
>>>         A2RTR
>>>
>>>         TOTAL NODES FOUND: 1
>>>
>>>         SETNOD>
>>>
>>>     That's clearly wrong, viz:
>>>
>>>         !i dec
>>>          Local DECNET node: VENTI2.  Nodes reachable: 7.
>>>          Accessible DECNET nodes are:    A2RTR    BOINGO    LEGATO
>>>         TOMMYT    VENTI2    VENTI    ZITI
>>>
>>>     The Exec output should probably be changed to say, "Nodes reachable
>>>     in local area" and "Online nodes in area are:"
>>>
>>>     Anybody have any ideas? Hunches? Clues?
>>>
>>> File 1) OLDF:[4,120]    created: 1241 15-Apr-21
>>> File 2) NEWF:[1,1]      created: 0102 30-Apr-21
>>>
>>> 1)1     set nod 44.9 name OSMIUM
>>> ****
>>> 2)1     set nod 2.292 name OSIRIS
>>> 2)      set nod 44.9 name OSMIUM
>>> **************
>>> 1)1     set nod 13.3 name RED
>>> ****
>>> 2)1     set nod 2.298 name REACH
>>> 2)      set nod 13.3 name RED
>>> **************
>>> 1)1     set nod 2.298 name RSX11M
>>> 1)      set nod 1.306 name RSX124
>>> ****
>>> 2)1     set nod 1.306 name RSX124
>>> **************
>>> 1)1     set nod 42.5 name SPARKY
>>> ****
>>> 2)1     set nod 2.291 name SPARK
>>> 2)      set nod 42.5 name SPARKY
>>> **************
>>> 1)1     set nod 2.299 name THEPIT
>>> 1)      set nod 35.70 name THOMAS
>>> ****
>>> 2)1     set nod 2.299 name THEARK
>>> 2)      set nod 35.70 name THOMAS
>>> **************
>>>
>>
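
The set-difference idea from the original message (work out additions,
renames and deletions instead of re-inserting the whole list) can be
sketched off-line in Python. This is only an illustration under stated
assumptions: NODE% has no "return the current data base" function
today, so both inputs here are node-list text in the "set nod A.N name
NAME" format; the names parse_node_list and diff_node_lists are
hypothetical; and the sample data is just the handful of lines visible
in the FILCOM output above, not the full HECnet list.

```python
import re

# Matches lines such as "set nod 2.298 name REACH" (the T20.FIX format
# quoted in this thread). Anything else is ignored.
NODE_LINE = re.compile(
    r"^\s*set\s+nod\w*\s+(\d+)\.(\d+)\s+name\s+(\S+)\s*$", re.IGNORECASE
)


def parse_node_list(text):
    """Return {(area, node): name} for every 'set nod' line in text."""
    nodes = {}
    for line in text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            area, number, name = match.groups()
            nodes[(int(area), int(number))] = name.upper()
    return nodes


def diff_node_lists(old, new):
    """Split the change from old to new into additions, deletions, renames."""
    added = {addr: new[addr] for addr in new.keys() - old.keys()}
    removed = {addr: old[addr] for addr in old.keys() - new.keys()}
    renamed = {
        addr: (old[addr], new[addr])
        for addr in old.keys() & new.keys()
        if old[addr] != new[addr]
    }
    return added, removed, renamed


# Sample data: only the lines that appear in the FILCOM comparison above.
OLD_LIST = """\
set nod 44.9 name OSMIUM
set nod 13.3 name RED
set nod 2.298 name RSX11M
set nod 1.306 name RSX124
set nod 42.5 name SPARKY
set nod 2.299 name THEPIT
set nod 35.70 name THOMAS
"""

NEW_LIST = """\
set nod 2.292 name OSIRIS
set nod 44.9 name OSMIUM
set nod 2.298 name REACH
set nod 13.3 name RED
set nod 1.306 name RSX124
set nod 2.291 name SPARK
set nod 42.5 name SPARKY
set nod 2.299 name THEARK
set nod 35.70 name THOMAS
"""

added, removed, renamed = diff_node_lists(
    parse_node_list(OLD_LIST), parse_node_list(NEW_LIST)
)

for (area, number), name in sorted(added.items()):
    print(f"add    {area}.{number} {name}")
for (area, number), name in sorted(removed.items()):
    print(f"remove {area}.{number} {name}")
for (area, number), (old_name, new_name) in sorted(renamed.items()):
    print(f"rename {area}.{number} {old_name} -> {new_name}")
```

Keying the tables on the (area, node) pair rather than on the name is
what lets a rename such as 2.298 or 2.299 show up as a single change
instead of a delete plus an add.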