On Monday, September 28, 2020 at 10:23 AM, Vladimir Machulsky wrote:
I'm trying to boot the latest SIMH (4.0-current) as a LAVC satellite node over a
link with about 100 ms of latency.
The boot node is VAX/VMS 5.5.2 with the latest PEDRIVER ECO.
Result: the MOP part of the boot sequence works without a hitch, but the SCS part
fails miserably.
The most frequent result: the SIMH console fills with ten "%VAXcluster-W-NOCONN,
No connection to disk server" messages, then halts with "%VAXcluster-F-CTRLERR,
boot driver virtual circuit to SCSSYSTEM 0000 Failed".
Sometimes it goes a little further:
...
%VAXcluster-W-RETRY, Attempting to reconnect to a disk server
%VAXcluster-W-NOCONN, No connection to disk server
%VAXcluster-W-RETRY, Attempting to reconnect to a disk server
%VAXcluster-W-NOCONN, No connection to disk server VULCAN
%VAXcluster-W-RETRY, Attempting to reconnect to a disk server
%VAXcluster-W-NOCONN, No connection to disk server
%VAXcluster-W-RETRY, Attempting to reconnect to a disk server
%VAXcluster-I-CONN, Connected to disk server VULCAN
%VAXcluster-W-NOCONN, No connection to disk server VULCAN
%VAXcluster-W-RETRY, Attempting to reconnect to a disk server
...
It halts after a minute or so of filling the console with those messages.
Whenever I set SIMH throttling to 2500K ops/s, the node boots successfully, joins
the cluster, and works flawlessly, but slowly; the boot process takes about half
an hour. After boot, raising the throttle to 3500K ops/s still works.
Increasing the throttle value further breaks the system, with the same disk-server
messages.
Throttled SIMH performance is about 5 VUPS.
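For reference, the throttle settings above are just the simulator's throttle
command; something like this at the sim> prompt or in the configuration file
(the comments are mine):

    SET THROTTLE 2500K    ; ~2500K instructions/sec: boots and joins the cluster, slowly
    SET THROTTLE 3500K    ; still works once the node is up
    SET NOTHROTTLE        ; full host speed: the SCS boot fails as described above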
The only information I found about maximum channel latency restrictions is in the
"Guidelines for OpenVMS Cluster Configurations" manual:
"When an FDDI is used for OpenVMS Cluster communications, the ring
latency when the FDDI ring is idle should not exceed 400 ms."
So I suppose a 100 ms latency link should be good enough for booting
satellite nodes over it.
My understanding of the situation is that the combination of PEDRIVER/[PEBTDRIVER
within NISCS_LOAD] with fast hardware and slow links is the primary reason for
this behavior. Please correct me if I'm wrong.
Does anyone have experience booting VMS clusters over slow links? Any OS
version recommendations?
Perhaps some VMS tunable parameters exist for making PEDRIVER happy
on fast hardware?
Having the PEDRIVER listings could shed light on this buggy PEDRIVER behavior.
Link details:
Two Cisco 1861 routers, connected to the Internet via ADSL on one side and 3G
HSDPA on the other.
TCP/IP between the sites is routed over an IPsec site-to-site VPN; ping between
the sites is about 100 ms.
Over that VPN, a DECnet-family bridge (eth.type = 0x6000..0x600F) is built with
an L2TPv3 VPN.
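The bridging piece is an L2TPv3 cross-connect on each router, roughly along these
lines (the peer address, VC ID, and interface names here are illustrative, and
the filtering down to the DECnet ethertype range is not shown):

    pseudowire-class DECNET-BRIDGE
     encapsulation l2tpv3
     ip local interface Loopback0
    !
    interface FastEthernet0/1
     no ip address
     xconnect 192.0.2.2 100 pw-class DECNET-BRIDGE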
Fantastic detective work, and it is certainly amazing that you got this far.
If throttling affects the results, then it seems to me that the SCS capabilities
(which PEDRIVER leverages) aren't using anything that tracks the wall
clock to measure delays. I suspect that where the SCS layer is "thinking about"
timing, it is using the TIMEDWAIT macro in the kernel/driver code.
As I recall, TIMEDWAIT effectively uses a processor-model-specific value,
spinning until the presumably appropriate amount of time has elapsed.
Since most simh host systems actually run significantly faster than the
original hardware did, these TIMEDWAIT macro invocations don't track
wall clock time very well. The VAST majority of TIMEDWAIT uses
in VMS drivers relate to the CPU interacting with a device that is internal
to the simulator. SIMH interactions with devices are measured in instructions
(which aligns well with the internal implementation of the TIMEDWAIT
macro).
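To illustrate what I mean, here is a sketch in C of a calibrated busy-wait of the
general kind I'm describing; it is not the real macro, and the constant name is
made up:

    #include <stdint.h>

    /* Hypothetical calibration: loop iterations this CPU model is
     * assumed to execute per microsecond. */
    #define MODEL_LOOPS_PER_USEC 10

    /* Instruction-counted delay: roughly correct on the real CPU,
     * but it finishes in far less wall-clock time on a fast host. */
    static void timed_wait_usec(uint32_t usec)
    {
        volatile uint32_t n = usec * MODEL_LOOPS_PER_USEC;
        while (n--)
            ;   /* each pass is assumed to cost a fixed amount of real time */
    }

For a simulated device that assumption still more or less holds, because SIMH
schedules device activity in instructions too; but a 100 ms network round trip is
real wall-clock time, so on a fast unthrottled host a spin of this sort would
presumably expire long before a reply could arrive.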
I vaguely recall that DEC had conceptual support for clusters located at
physically different sites, but connected by relatively high-speed,
low-latency network connections. I believe that the "supported"
configurations were much faster and had much lower latency
than your setup is seeing.
Good Luck,
- Mark