Hi. A couple of comments.
On 2022-09-08 16:04, Paul Koning wrote:
> I just had a conversation with one of the operators of
> a HECnet L2 router who has demoted it to L1 because of connectivity issues to other areas.
> That prompted some thinking about the DECnet requirements for fault tolerance.
> The basic principle of DECnet Phase IV is that L1 routing (within an area) involves only
> routers of that area, and L2 routing (across areas) involves only L2 routers. Phase V
> changes that to some extent, but HECnet is not Phase V and isn't likely to be. :-)
> In addition, L1 routers send out of area traffic to some L2 router in their area, but
> without any awareness of the L2 topology.
Agreed on all of the things mentioned.
> This has several consequences:
> 1. If some of the L2 routers in your area can't see the destination area but others
> can, you may not be able to communicate even though it would seem that there is a way to
> get there from here.
This is basically a broken network. *All* L2 routers should be able to
communicate with all other L2 routers without anything but L2 routers in
between. If that is broken, then the whole DECnet is split.
The solution to this is really to have multiple connections. You could
argue that a full connection mesh would be ideal, but that means way too
many links. But ideally every L2 router should have connections to at
least two other L2 routers. That way you should never have a single point
of failure. A few more links are probably good. And a suggestion is really
to have them to different areas in different places, to
distribute/disperse the whole connectivity graph.
This does not hurt in the slightest from a performance point of view.
DECnet will always route along the most efficient path, even if other
alternatives exist.
As a supplemental comment to that, "efficient path" is always a bit
subjective. But if people adhere to the cost suggestions I have written,
packets will be traveling along a good path.
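To make the point about redundancy and cost concrete, here is a small
sketch. The router names and costs are made up for illustration, and
Dijkstra's algorithm is only a stand-in: Phase IV routers exchange
routing vectors rather than running this computation, but the outcome
should be the same. The lowest-cost path is used, and the backup link
only matters once the primary path is gone.

```python
import heapq

def least_cost_path(links, src, dst):
    """Return (cost, path) of the cheapest route, or None if dst is unreachable.

    links: iterable of (node_a, node_b, cost) for bidirectional circuits.
    """
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, src, [src])]
    done = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in done:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None

# Hypothetical L2 routers and circuit costs, for illustration only.
links = [("A", "B", 4), ("B", "C", 4), ("A", "D", 10), ("D", "C", 10)]
print(least_cost_path(links, "A", "C"))      # (8, ['A', 'B', 'C'])

# Router B goes down: drop every circuit that touches it.
without_b = [link for link in links if "B" not in link[:2]]
print(least_cost_path(without_b, "A", "C"))  # (20, ['A', 'D', 'C'])
```

The expensive A-D-C circuits cost nothing while B is up. They only
carry traffic after B has failed.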
> 2. If your area is split, i.e., some of its L2 routers
> can see one subset of the nodes in the area and other L2 routers can see a different
> subset, then out of area traffic inbound to that area may not reach its destination -- if
> it enters at the "wrong" L2 entry point.
An area split actually has nothing to do with L2 routers, but what you
describe is a fallout from the area split. An area split happens if
the L1 routers inside one area do not have connectivity with all other
L1 routers within the area without passing through anything other than
L1 routers in the area.
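That definition amounts to a reachability check over the circuits
inside one area. A minimal sketch, assuming a hypothetical list of
router names and intra-area circuits (nothing here reads real DECnet
data):

```python
def l1_partitions(routers, links):
    """Group the routers of one area into sets that can reach each other.

    routers: names of the routers in the area.
    links: (a, b) pairs for circuits between routers listed in `routers`.
    More than one group in the result means the area is split.
    """
    neighbours = {r: set() for r in routers}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    groups, seen = [], set()
    for start in routers:
        if start in seen:
            continue
        # Walk everything reachable from this router inside the area.
        group, todo = set(), [start]
        while todo:
            node = todo.pop()
            if node in group:
                continue
            group.add(node)
            todo.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical area with four routers and the R2-R3 circuit missing.
routers = ["R1", "R2", "R3", "R4"]
split_links = [("R1", "R2"), ("R3", "R4")]
print(len(l1_partitions(routers, split_links)))                   # 2
print(len(l1_partitions(routers, split_links + [("R2", "R3")])))  # 1
```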
It really is the same thing as with point 1 above, but within one area
instead of inter-area. And again, the best solution is really to have
multiple L1 routers, and have links to several other L1 routers within
the area, so that you avoid a single point of failure. More links are
better, but again, it has to be kept to reasonable numbers.
This will minimize the risks of ever getting an area split.
And there are no penalties with this. As long as costs are set properly,
packets will travel along a good path.
However, what people also need to understand is that when packets
travel to another area, the risk that they take a slightly less than
optimal path is unavoidable. This has to do with how DECnet routes
packets, and it is not something that can be "improved". It also means
that packets might travel very different ways in the two directions.
If someone really wants more explanations about that topic, I'm happy to
explain why it is this way. I'm sure Paul already knows as well.
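For the curious, a rough sketch of the mechanism under a simplified
three-stage model: the source area hands the packet to its nearest L2
router (by L1 cost only), the L2 network carries it to the nearest L2
router of the destination area (by L2 cost only), and L1 routing in the
destination area finishes the job. The topology, names and costs are
invented, and Dijkstra again stands in for the real routing-vector
exchange.

```python
import heapq

def routes_from(links, src):
    """Least-cost (cost, path) from src to every node reachable over links."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    best, queue = {}, [(0, [src])]
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node in best:
            continue
        best[node] = (cost, path)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in best:
                heapq.heappush(queue, (cost + link_cost, path + [neighbour]))
    return best

def phase4_style_route(src, dst, l1_links, l2_links, l2_routers):
    """src and dst are (area, node); l1_links and l2_routers are keyed by area."""
    (src_area, src_node), (dst_area, dst_node) = src, dst
    # Stage 1: nearest L2 router in the source area, judged by L1 cost only.
    l1 = routes_from(l1_links[src_area], src_node)
    cost1, path1 = min(l1[r] for r in l2_routers[src_area] if r in l1)
    # Stage 2: nearest L2 router of the destination area, judged by L2 cost only.
    l2 = routes_from(l2_links, path1[-1])
    cost2, path2 = min(l2[r] for r in l2_routers[dst_area] if r in l2)
    # Stage 3: L1 routing inside the destination area.
    cost3, path3 = routes_from(l1_links[dst_area], path2[-1])[dst_node]
    return cost1 + cost2 + cost3, path1 + path2[1:] + path3[1:]

# Hypothetical two-area network: X lives in area 1, Y in area 2.
l1_links = {
    1: [("X", "A1", 1), ("X", "A2", 2), ("A1", "A2", 2)],
    2: [("Y", "B1", 6), ("Y", "B2", 1)],
}
l2_links = [("A1", "A2", 2), ("A1", "B1", 3), ("A2", "B2", 3)]
l2_routers = {1: ["A1", "A2"], 2: ["B1", "B2"]}

print(phase4_style_route((1, "X"), (2, "Y"), l1_links, l2_links, l2_routers))
# (10, ['X', 'A1', 'B1', 'Y'])
print(phase4_style_route((2, "Y"), (1, "X"), l1_links, l2_links, l2_routers))
# (6, ['Y', 'B2', 'A2', 'X'])
```

X picks A1 because it is the closest L2 router, without knowing that
the path via A2 and B2 would cost only 6 in total. Y happens to pick
the good exit, so the two directions use different paths.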
> I believe the issue I mentioned at the top was #1: one
> of the L2 routers went down and the remaining L2 routers of that area ended up at two
> sides of a partitioned L2 network.
If the L2 network is partitioned, it is basically a broken DECnet in
general, and for every node, just a subset of the DECnet is reachable,
depending on which side of the partition it ends up on.
> Obviously HECnet isn't a production network, but
> still it would be nice for it to be tolerant of outages. Especially since we can insert
> additional routers easily with PyDECnet or Robert Jarratt's C router. The HECnet map
> can be set to show just the L2 network (using the layers menu, accessible via the layers
> icon in the top right corner of the map). It's easy to see a number of L2 routers
> that have only one connection to the rest of HECnet. It's also clear that a large
> fraction of the connectivity is via Sweden, which certainly is a fine option but it's
> a bit odd for a node in, say, western Canada to have only that one connection and none to
> nodes much closer to it.
I would agree that my preference is that we should improve the
resilience of HECnet. In addition, I would in general recommend that L2
routers should have multiple connections, and they should have some to
nearby L2 routers in other areas, for multiple reasons.
It doesn't make sense to route through Sweden unless it's actually on the way
towards the destination.
> The map display doesn't give a visual clue about
> singly-connected area routers for which there is no location information in the database
> (the ones plotted at Inaccessible Island). The data is there in the map data table; it
> wouldn't be too hard to do some post-processing on that data to find cases of no
> redundancy.
I think a bigger problem is that the Cisco router links are not visible
at all on the map right now, making the picture rather skewed. Not sure
how well the bridge links are shown either...
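As for the post-processing itself, the first-order check is simple once
the links are available as pairs. A minimal sketch, assuming the L2
circuits have already been extracted into a list of (router, router)
tuples. The real layout of the map data table is not something I am
describing here, and the names below are placeholders.

```python
from collections import defaultdict

def singly_connected(links, minimum=2):
    """Return routers with fewer than `minimum` distinct neighbours, sorted."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return sorted(r for r, n in neighbours.items() if len(n) < minimum)

# Hypothetical L2 circuits; R4 hangs off the network by a single link.
links = [("R1", "R2"), ("R2", "R3"), ("R3", "R1"), ("R3", "R4")]
print(singly_connected(links))  # ['R4']
```

This only counts neighbours. A router with two links can still sit
behind a single cut point further away, so it is a starting point
rather than a full redundancy check. It is also only as good as the
link data, which is the problem with the links missing from the map.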
> I'm curious if people would be interested in
> trying to make HECnet more fault tolerant. My router (PYTHON) can definitely help,
> especially for North American nodes, and I'm sure there are a number of others that
> feel the same.
Like I said, I'm all for it.
Johnny
--
Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: bqt(a)softjar.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol