On 2021-01-12 02:39, Paul Koning wrote:
I know some DECnet implementations make that distinction, but not all
do.  RSTS just has a datalink buffer size and derives the other numbers
from that.  PyDECnet does too, but there the size is (currently)
hardcoded at 576 for point to point and 591 for Ethernet.  The 576 was
taken from the VMS standard size as I remember it (on the DEC
engineering network).
591 is a very strange value. Where did you find that?
In RSX, by default one is derived from the other, but you can manipulate
them separately if you want to.
It's a bit unfortunate, but in general I would like to deal with them
separately in RSX.
The reason is that there is a microcode "bug" in the Qbus ethernet
controllers which can stop the controller from receiving packets when I
have IP running, as IP usually has an MTU of 1500 on ethernet. If the
system buffer size is 576, larger packets are split into parts, and this
is where the controller can get lost. But for DECnet, you obviously want
to just have 576 as the segment size.
So in general I recommend, especially for Qbus systems, that the system
buffer size be set to 1500. But if you are using multinet to a VMS machine
in the same area, you will then not get adjacency up, since RSX will be
using 1500-byte frames in the DDCMP communication.
The second alternative is then to set the ethernet MTU for IP in RSX to
576, which is obviously very detrimental to performance, and does not
fully solve the problem. Since TCP MSS is derived from the MTU, TCP
connections will usually limit themselves to an MSS that causes no
packets to be split into multiple buffers. But for other programs/protocols
which are not running over TCP, the risk is still there that you'll hang
the network.
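To put rough numbers on that (only a sketch of the arithmetic; the 20-byte
header sizes assume no IP or TCP options):

    # Why a 576-byte MTU keeps TCP traffic inside a single 576-byte buffer.
    # Assumes 20-byte IP and TCP headers (no options).
    mtu = 576
    mss = mtu - 20 - 20      # 536 bytes of TCP payload
    frame = mss + 20 + 20    # 576 bytes on the wire, so no buffer splitting
    print(mss, frame)        # -> 536 576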
As I said, I could add a circuit parameter to use a
different size.
Depending on the configuration that can create packet loss due to
oversized packets being discarded in transit.
If we're talking about the buffer size thing, then this only has an
effect on adjacent nodes, so the only type of packet loss you could
face would be if the directly connected nodes can't handle it. For actual
DECnet traffic, the segment buffer size is what is used, so even if the
link buffer is larger, you won't be seeing larger packets.
On the earlier question how this can be happening suddenly: it isn't
area routing messages that are too large, that's not possible because
they can only contain 63 entries (one per area), each entry being 2
bytes.  But the L1 messages will be large in the case of the periodic
messages, since those report every address in the area, so typically
there will be 1024 entries.  That's too much, of course, so the L1
routing message is sent in several parts, but each part is normally made
as big as the datalink buffer size permits.  So given the PyDECnet
setting of 591 for Ethernet, an L1 routing message can be up to that
size.  If the recipient has 576 as its buffer size, that's not going to work.
Yes. That is what I suspect is going on here. As mentioned in another
mail, I have only seen problems around this for links within one area,
so L1 messages.
L2 messages do not have problems.
(And I still think 591 is a very strange number...)
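For concreteness, the splitting arithmetic for those periodic L1 messages
looks roughly like this (just a sketch; the 2 bytes per entry comes from the
discussion above, but the fixed per-message overhead here is an assumption
for illustration, not the real figure and not PyDECnet's actual code):

    # Rough sketch: sizes of the periodic L1 routing messages for a full area.
    OVERHEAD = 17                 # assumed fixed per-message overhead (illustrative only)

    def l1_message_sizes(nodes, sender_bufsize):
        per_msg = (sender_bufsize - OVERHEAD) // 2   # 2-byte entries that fit per message
        count = -(-nodes // per_msg)                 # ceiling division: messages needed
        return count, OVERHEAD + per_msg * 2         # (message count, largest message size)

    print(l1_message_sizes(1024, 591))   # sender packs each part up to ~591 bytes...
    # ...which will not fit a neighbour whose receive buffer is only 576 bytes.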
Johnny
paul
On Jan 11, 2021, at 8:24 PM, Johnny Billquist <bqt at softjar.se> wrote:
Thomas, I wonder if you might be experiencing the effects of the ethernet
packet size being different from the DECnet segment buffer size.
This is a little hard to explain, as I don't have all the proper
DECnet terminology right.
But, based on RSX, there are two relevant sizes. One is the actual
buffer size the line is using. The other is the DECnet segment buffer
size.
The DECnet segment buffer size is the maximum size of packets you can
expect DECnet itself to ever use.
However, at least with RSX, when it comes to the exchange of
information at the line level, which includes things like hello
messages, RSX actually uses a system buffer size setting, which
might be very different from the DECnet segment buffer size.
I found out that VMS has a problem here in that if the hello packets
coming in are much larger than the DECnet segment buffer size, you
never even get adjacency up, while RSX can deal with this just fine.
It sounds like you might be seeing something similar in Tops-20. In
which case you would need to tell the other end to reduce the size of
these hello and routing information packets for Tops-20 to be happy,
or else find a way to accept larger packets.
After all, ethernet packets can carry up to 1500 bytes of payload.
And to explain it a bit more from an RSX point of view: RSX will use
the system buffer size when creating these hello messages. So, if that
is set to 1500, you will get hello packets up to 1500 bytes in size,
which contain routing vectors and so on.
But actual DECnet communication will be limited to what the DECnet
segment buffer size says, so once you have adjacency up, when a
connection is established between two programs, those packets will
never be larger than the DECnet segment buffer size, which is commonly
576 bytes.
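Schematically, the way I understand it (just a conceptual sketch of which
size applies where, not how the RSX code actually looks):

    # Which size governs which kind of packet, per the description above.
    SYSTEM_BUFFER_SIZE  = 1500   # line-level buffer: hello and routing messages
    SEGMENT_BUFFER_SIZE = 576    # DECnet segment size: NSP data between programs

    def max_packet_size(kind):
        # Hello/routing traffic is built against the system buffer size;
        # ordinary DECnet (NSP) traffic never exceeds the segment buffer size.
        if kind in ("hello", "routing"):
            return SYSTEM_BUFFER_SIZE
        return SEGMENT_BUFFER_SIZE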
  Johnny
On 2021-01-11 23:43, Thomas DeBellis wrote:
Paul,
Lots of good information.  For right now, I did an experiment and
went into MDDT and stubbed out the XWD UNLER%,^D5 entry in the
NIEVTB: table in the running monitor on VENTI2.  Since then (about an
hour or so ago), TOMMYT's ERROR.SYS file has been increasing as
usual (a couple of pages an hour) while VENTI2's hasn't changed at
all.  So that particular fire hose is plugged for the time being.
I don't believe I have seen this particular error before; however,
there are probably some great reasons for that.  In the 1980's, CCnet
may not have had Level-2 routers on it while Columbia's 20's were
online.  We did have a problem with the 20's complaining about long
Ethernet frames from an early version of BSD 4.2 that was being run on
some VAX 11/750's in the Computer Science department's research lab.
They got taught how not to do that and all was well.
Tops-20's multinet implementation was first done at BBN and then
later imported.  I am not sure that it will allow me to change the
frame size.  576 was what was used for the Internet, so I don't know
where that might be hardwired.  I'll check.
I think there are two forensics to perform here:
1. Investigate when the errors started happening; whether they predate
   Bob adopting PyDECnet
2. Investigate what the size difference is; I don't believe that is
   going into the error log, but I'll have to look more carefully with
   SPEAR.
A *warning* for anyone also looking to track this down: if you do the
retrieve in SPEAR on KLH10 and you don't have my timeout changes for
DTESRV, you will probably crash your 20.  This will happen both with a
standard DEC monitor and PANDA.
> ------------------------------------------------------------------------
> On 1/11/21 4:41 PM, Paul Koning wrote:
>
>> On Jan 11, 2021, at 4:22 PM, Thomas DeBellis
>> <tommytimesharing at gmail.com> wrote:
>>
>> OK, I guess that's probably a level 2 router broadcast coming over
>> the bridge.  There is no way Tops-10 or Tops-20 could currently be
>> generating that because there is no code to do so; they're level 1
>> only.
> Yes, unfortunately originally both multicasts used the same address.
> That was changed in Phase IV Plus, but that still sends to the old
> address for backwards compatibility and it isn't universally
> implemented.
>
>> I started looking at the error; it starts out in DNADLL when it is
>> detected on a frame that has come back from NISRV (the Ethernet
>> Interface driver).  The error is then handed off to NTMAN where the
>> actual logging is done.  So, there are two quick hacks to stop all
>> the errors:
>>
>> - I could stub out the length error entry (XWD UNLER%,^D5) in the
>>   NIEVTB: table in DNADLL.MAC.
>> - I could put in a filter ($NOFIL) for event class 5 in the NMXFIL:
>>   table in NTMAN.MAC.
>>
>> That will stop the deluge for the moment.  Meanwhile, I have to
>> understand what's actually being detected; even the full SPEAR
>> entry is short on details (like how long the frame was).
> The thing to look for is the buffer size (frame size) setting of the
> stations on the Ethernet.  It should match; if not someone may send
> a frame small enough by its settings but too large for someone else
> who has a smaller value.  Routing messages tend to cause that
> problem because they are variable length; the Phase IV rules have
> the routers send them (the periodic ones) as large as the line
> buffer size permits.
>
> Note that DECnet by convention doesn't use the full max Ethernet
> frame size, because DECnet has no fragmentation, so the
> normal settings are chosen to make for consistent NSP packet sizes
> throughout the network.  The router sending the problematic
> messages is 2.1023 (not 63.whatever, Rob, remember that addresses
> are little endian) which has its Ethernet buffer size set to 591.
> That matches the VMS conventional default of 576 when accounting
> for the "long header" used on Ethernet vs. the "short header" on
> point to point (DDCMP etc.) links.  But VENTI2 has its block size
> set to 576.  If you change it to 591 it should start working.
>
> Perhaps I should change PyDECnet to have a way to send shorter than
> max routing messages.
>
> paul
>
--
Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: bqt at softjar.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol