I just had a conversation with one of the operators of a HECnet L2 router who has demoted
it to L1 because of connectivity issues to other areas. That prompted some thinking about
the DECnet requirements for fault tolerance.
The basic principle of DECnet Phase IV is that L1 routing (within an area) involves only
routers of that area, and L2 routing (across areas) involves only L2 routers. Phase V
changes that to some extent, but HECnet is not Phase V and isn't likely to be. :-)
In addition, L1 routers send out-of-area traffic to some L2 router in their area, but
without any awareness of the L2 topology.
This has several consequences:
1. If some of the L2 routers in your area can't see the destination area but others
can, you may not be able to communicate even though it would seem that there is a way to
get there from here.
2. If your area is split, i.e., some of its L2 routers can see one subset of the nodes in
the area and other L2 routers can see a different subset, then out-of-area traffic inbound
to that area may not reach its destination if it enters at the "wrong" L2
entry point.
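To make consequence #1 concrete, here is a minimal Python sketch. It is a toy model, not the actual DECnet routing algorithm: the router names, area numbers, and link table are all made up, and it only checks whether a given L2 exit router can reach the destination area over L2-to-L2 links.

```python
from collections import deque

def reachable_areas(l2_links, area_of, start):
    """Return the set of areas reachable from L2 router `start`,
    following only L2-to-L2 links (L1 paths are never used)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in l2_links.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return {area_of[n] for n in seen}

def can_deliver(l2_links, area_of, exit_router, dest_area):
    """True if traffic handed to `exit_router` can reach `dest_area`."""
    return dest_area in reachable_areas(l2_links, area_of, exit_router)

# Hypothetical topology: area 1 has two L2 routers, A and B, that have
# ended up on opposite sides of a partitioned L2 network.
area_of = {"A": 1, "B": 1, "X": 2, "Y": 3}
l2_links = {"A": ["X"], "X": ["A"], "B": ["Y"], "Y": ["B"]}

for router in ("A", "B"):
    ok = can_deliver(l2_links, area_of, router, 2)
    print(f"exit via {router}: area 2 reachable = {ok}")
```

In this toy network an L1 node in area 1 that happens to hand its traffic to B cannot reach area 2, even though A has a working path there.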
I believe the issue I mentioned at the top was #1: one of the L2 routers went down and the
remaining L2 routers of that area ended up on two sides of a partitioned L2 network.
Obviously HECnet isn't a production network, but it would still be nice for it to be
tolerant of outages, especially since we can insert additional routers easily with
PyDECnet or Robert Jarratt's C router. The HECnet map can be set to show just the L2
network (using the layers menu, accessible via the layers icon in the top right corner of
the map). It's easy to see a number of L2 routers that have only one connection to
the rest of HECnet. It's also clear that a large fraction of the connectivity is via
Sweden, which certainly is a fine option but it's a bit odd for a node in, say,
western Canada to have only that one connection and none to nodes much closer to it.
The map display doesn't give a visual clue about singly-connected area routers for
which there is no location information in the database (the ones plotted at Inaccessible
Island). The data is there in the map data table; it wouldn't be too hard to do some
post-processing on that data to find cases of no redundancy.
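As a rough idea of what that post-processing could look like, here is a sketch in Python. It assumes the L2 links have already been extracted into a plain list of (router, router) pairs; that format and the router names are hypothetical, not the actual layout of the map data table. It reports routers with a single L2 neighbor and links whose loss would cut the two endpoints off from each other.

```python
from collections import defaultdict

def build_adjacency(links):
    """Turn a list of (a, b) link pairs into an undirected adjacency map."""
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def singly_connected(links):
    """Routers that have exactly one L2 neighbor."""
    adj = build_adjacency(links)
    return sorted(n for n, peers in adj.items() if len(peers) == 1)

def reachable(adj, start):
    """All nodes reachable from `start` by depth-first search."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for peer in adj.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return seen

def critical_links(links):
    """Links whose removal disconnects their two endpoints.
    Brute force: fine for a network of HECnet's size."""
    unique = sorted({tuple(sorted(link)) for link in links})
    critical = []
    for link in unique:
        rest = [other for other in unique if other != link]
        a, b = link
        if b not in reachable(build_adjacency(rest), a):
            critical.append(link)
    return critical

# Hypothetical L2 links: RTR1..RTR3 form a triangle, RTR4 hangs off RTR3.
links = [("RTR1", "RTR2"), ("RTR2", "RTR3"), ("RTR3", "RTR1"), ("RTR3", "RTR4")]
print(singly_connected(links))
print(critical_links(links))
```

The first check needs no location information, so it would also catch the routers plotted at Inaccessible Island.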
I'm curious whether people would be interested in trying to make HECnet more
fault-tolerant. My router (PYTHON) can definitely help, especially for North American nodes,
and I'm sure there are a number of others that feel the same.
paul