My apologies for the confusion, Peter. But you're right to assume I meant the
cluster group number (cl.gr.nr.).
And yes, there is obviously something wrong here. If an AUTOGEN was done I'd
say have a look at MODPARAMS.DAT
- alloclass must be unique for each node
- same for tapealloclass
The alloclass is used in forming device names and lock resource names. If it
is not set correctly, there will be problems in these areas but they should not
prevent a node from joining a cluster. I would tend to leave it at zero unless
there were good reasons for changing it. There are myriads of little rules about
how various sysgen parameters should be set but most of them can be ignored
for the moment, partly because VMS is not so fragile that having any one of
a vast number of parameters wrong will prevent a system from booting and partly
because the system picks sensible defaults for sysgen parameters that will work
in typical cases.
More important sysgen parameters to check are SCSNODE and particularly
SCSSYSTEMID. If SCSSYSTEMID was inadvertently changed, this may well cause
difficulties. SCSSYSTEMID must be the same as the DECnet area number * 1024 plus
the DECnet node number. This number is used to calculate the Ethernet address
used for cluster communications. It is also used to uniquely identify cluster
nodes to other cluster nodes.
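
As an illustration of the arithmetic above, here is a minimal Python sketch (not anything VMS itself runs). The function names are made up for this example. The AA-00-04-00 prefix and low-byte-first ordering are the usual DECnet Phase IV convention for deriving an Ethernet address from a node address; that detail is not stated in this thread and is included as an assumption. The value 4345 is the system ID that SHOW CLUSTER reports for SLAVE elsewhere in this thread; if SLAVE follows the rule, that would correspond to DECnet address 4.249.

```python
def scssystemid(area, node):
    """SCSSYSTEMID for a DECnet Phase IV address written as area.node."""
    # Phase IV limits: areas 1-63, nodes 1-1023.
    if not (1 <= area <= 63 and 1 <= node <= 1023):
        raise ValueError("expected area 1-63 and node 1-1023")
    return area * 1024 + node


def decnet_address(system_id):
    """Split a SCSSYSTEMID back into (area, node)."""
    return divmod(system_id, 1024)


def decnet_mac(system_id):
    """Ethernet address derived from the 16-bit value (assumed convention)."""
    low, high = system_id & 0xFF, system_id >> 8
    return f"AA-00-04-00-{low:02X}-{high:02X}"


print(scssystemid(4, 249))   # 4345
print(decnet_address(4345))  # (4, 249)
print(decnet_mac(4345))      # AA-00-04-00-F9-10
```

Comparing the result against the SCSSYSTEMID shown by SYSGEN on each node is a quick way to spot a value that was changed by accident.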
If SCSNODE is changed without changing SCSSYSTEMID or vice versa, that node
will have difficulties joining a cluster as the other nodes in the cluster
will remember the previous values and complain that the new ones are not
consistent. The solution here is to shut down all nodes in the cluster so that
they are all down at the same time and then reboot each.
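
The consistency rule described above can be modelled roughly as follows. This is only a toy sketch of the idea, not the actual connection manager logic; the function name, the dictionary of remembered nodes and the system ID 4346 for ALEPH are all hypothetical values chosen for illustration.

```python
def join_conflict(known, scsnode, scssystemid):
    """Return a description of the mismatch, or None if the pair is consistent.

    known maps SCSNODE names to the SCSSYSTEMID the cluster remembers for them.
    """
    for name, sysid in known.items():
        if name == scsnode and sysid != scssystemid:
            return f"{scsnode} was previously seen with SCSSYSTEMID {sysid}"
        if sysid == scssystemid and name != scsnode:
            return f"SCSSYSTEMID {scssystemid} was previously used by {name}"
    return None


# Hypothetical remembered state; 4346 is an invented value for ALEPH.
remembered = {"SLAVE": 4345, "ALEPH": 4346}

print(join_conflict(remembered, "ALEPH", 4346))  # None: same pair as before
print(join_conflict(remembered, "ALEPH", 4400))  # name kept, ID changed
print(join_conflict(remembered, "BETH", 4346))   # ID kept, name changed
```

In this model, shutting down every node at once corresponds to starting again with an empty `remembered` dictionary, which is why a full cluster shutdown clears the complaint.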
And check the cluster license. AFAIK the cluster license must be unique for
each node. Or use one license with 0 units, and in that case make sure all
node names are mentioned in the /INCLUDE list, again on all nodes where the
license was loaded.
I am not 100% sure on this but as far as I know, cluster licenses are not
checked when a node is attempting to join a cluster because the system is
operating at a very low level and may not be in a position to access the disk
yet where its licenses are held. I think the response to lack of cluster
license is whinges about it in the operator log rather than disallowing a
node from joining. If it were to prevent a node from joining, I would expect
to see a prominent error message mentioning a license problem.
Regards,
Peter Coghlan.
Hi,
So I just had my DS10 perfectly configured but on its side, and it fell over.
Now it just gives me 3 beeps.
What do I do, disassemble the whole thing and reseat the RAM or what?
My apologies for the confusion, Peter. But you're right to assume I meant the cluster group number (cl.gr.nr.).
And yes, there is obviously something wrong here. If an AUTOGEN was done I'd say have a look at MODPARAMS.DAT
- alloclass must be unique for each node
- same for tapealloclass
And check the cluster license. AFAIK the cluster license must be unique for each node. Or use one license with 0 units, and in that case make sure all node names are mentioned in the /INCLUDE list, again on all nodes where the license was loaded.
PS
I'm travelling now so can't check my mail too often
Hans
-----Original Message-----
From: Peter Coghlan <HECNET at beyondthepale.ie>
Sender: owner-hecnet at Update.UU.SE
Date: Fri, 16 Sep 2011 12:52:13
To: <hecnet at Update.UU.SE>
Reply-To: hecnet at Update.UU.SE
Subject: Re: Fwd: Re: [HECnet] More clustering fun
2) The cluster id and cluster password are different on both nodes.
What's the cluster id?
The cluster id is 4.
Sorry - I should have said "What do you mean by the cluster id?"
I guess you mean the cluster group number but I wanted to be
sure we weren't referring to two different things.
I've confirmed that the authorize file is the same on both the alpha and
vax nodes:
$ diff dsa0:[sys0.syscommon.sysexe]cluster_authorize.dat -
_$ $4$dkb400:[sys0.syscommon.sysexe]cluster_authorize.dat
Number of difference sections found: 0
Number of difference records found: 0
DIFFERENCES /IGNORE=()/MERGED=1-
DSA0:[SYS0.SYSCOMMON.SYSEXE]CLUSTER_AUTHORIZE.DAT;1-
$4$DKB400:[SYS0.SYSCOMMON.SYSEXE]CLUSTER_AUTHORIZE.DAT;1
[snip]
It's weird that by changing the VOTES to 0 on the satellite and running an
AUTOGEN that this behaviour is experienced. Before that the satellite
boots off the served disk no problem.
I doubt that changing VOTES to 0 was responsible. To check this, you could
set VOTES back to 1 (probably manually using SYSGEN) and test again.
(A problem with running AUTOGEN for the first time is that it finds anything
that you didn't know was lurking in MODPARAMS.DAT and acts on it.)
Indeed, the 'master' disk which is a local drive in the VAXstation boots
into the cluster no problem, but again, with VOTES = 1.
Check that there are no files called SYS$SPECIFIC:[SYSEXE]CLUSTER_AUTHORIZE.DAT
on either node. If you find any, delete them.
You may have to reboot both nodes in order to get them to read the correct
CLUSTER_AUTHORIZE.DAT files after copying them or deleting errant ones.
Regards,
Peter Coghlan.
On Fri, 16 Sep 2011, Peter Coghlan wrote:
2) The cluster id and cluster password are different on both nodes.
What's the cluster id?
The cluster id is 4.
I've confirmed that the authorize file is the same on both the alpha and vax nodes:
$ diff dsa0:[sys0.syscommon.sysexe]cluster_authorize.dat -
_$ $4$dkb400:[sys0.syscommon.sysexe]cluster_authorize.dat
Number of difference sections found: 0
Number of difference records found: 0
DIFFERENCES /IGNORE=()/MERGED=1-
DSA0:[SYS0.SYSCOMMON.SYSEXE]CLUSTER_AUTHORIZE.DAT;1-
$4$DKB400:[SYS0.SYSCOMMON.SYSEXE]CLUSTER_AUTHORIZE.DAT;1
The cluster group number and the cluster password must be the same
on all nodes in the cluster. The easiest way to achieve this is
to copy sys$common:[sysexe]cluster_authorize.dat from one to the
other.
I set the right values when I ran CLUSTER_CONFIG_LAN.COM on the VAX.
If the cluster group numbers are different, you will end up
trying to form two different clusters.
If the passwords are different, the results will probably be
something like you are seeing, nodes failing to completely
join the cluster. There may also be some messages in the
operator.log
It's weird that by changing the VOTES to 0 on the satellite and running an AUTOGEN that this behaviour is experienced. Before that the satellite boots off the served disk no problem.
Indeed, the 'master' disk which is a local drive in the VAXstation boots into the cluster no problem, but again, with VOTES = 1.
Regards, Mark
2) The cluster id and cluster password are different on both nodes.
What's the cluster id?
The cluster group number and the cluster password must be the same
on all nodes in the cluster. The easiest way to achieve this is
to copy sys$common:[sysexe]cluster_authorize.dat from one to the
other.
If the cluster group numbers are different, you will end up
trying to form two different clusters.
If the passwords are different, the results will probably be
something like you are seeing, nodes failing to completely
join the cluster. There may also be some messages in the
operator.log
Regards,
Peter.
On 16/09/11 11:22, hvlems at zonnet.nl wrote:
Did you adjust EXPECTED_VOTES on the alpha server?
From: Mark Wickens <mark at wickensonline.co.uk>
Sender: owner-hecnet at Update.UU.SE
Date: Fri, 16 Sep 2011 11:09:04 +0100
To: <hecnet at Update.UU.SE>
ReplyTo: hecnet at Update.UU.SE
Subject: Fwd: Re: [HECnet] More clustering fun
Just to add to the picture - if I reduce VOTES on the satellite to 0 I get this happening:
-------- Original Message --------
Subject: Re: [HECnet] More clustering fun
Date: Fri, 16 Sep 2011 09:33:47 +0000
From: hvlems at zonnet.nl
Reply-To: hvlems at zonnet.nl
To: Mark Wickens <mark at wickensonline.co.uk>
All I can think of is this:
1) SLAVE and ALEPH both use the same VMSCLUSTER license
2) The cluster id and cluster password are different on both nodes.
On an Alpha you can modify this in SYSMAN (use HELP in SYSMAN to find the correct command). On a VAX the command is buried in SYSGEN.
Hans
-----Original Message-----
From: Mark Wickens <mark at wickensonline.co.uk>
Date: Fri, 16 Sep 2011 10:11:41
Cc: <hvlems at zonnet.nl>
Subject: Re: [HECnet] More clustering fun
Hi Hans,
I didn't want to pre-empt what I thought happened last time as I wasn't
sure I'd got it right, but it has happened again so it's definitely an
issue.
I've updated VOTES in the MODPARAMS.DAT for ALEPH (the satellite), run
AUTOGEN and rebooted the cluster.
Now when ALEPH attempts to join the cluster I get these messages repeatedly:
%CNXMAN, sending VAXcluster membership request to system SLAVE
%CNXMAN, sending VAXcluster membership request to system SLAVE
%CNXMAN, sending VAXcluster membership request to system SLAVE
%CNXMAN, sending VAXcluster membership request to system SLAVE
%CNXMAN, sending VAXcluster membership request to system SLAVE
and I see this on SLAVE:
$$
%CNXMAN, Received VMScluster membership request from system ALEPH
%CNXMAN, Proposing addition of system ALEPH
%CNXMAN, Completing VMScluster state transition
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:35.79 %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) received VMScluster membership
request from node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:35.79 %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) proposed addition of node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:35.79 %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) completed VMScluster state transition
$$
%CNXMAN, Received VMScluster membership request from system ALEPH
%CNXMAN, Proposing addition of system ALEPH
%CNXMAN, Completing VMScluster state transition
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:39.04 %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) received VMScluster membership
request from node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:39.04 %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) proposed addition of node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:39.04 %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) completed VMScluster state transition
$$
%CNXMAN, Received VMScluster membership request from system ALEPH
%CNXMAN, Proposing addition of system ALEPH
%CNXMAN, Completing VMScluster state transition
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:43.02 %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) received VMScluster membership
request from node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:43.02 %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) proposed addition of node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:43.02 %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) completed VMScluster state transition
This just repeats forever.
Some more information from SLAVE (the ALPHA server):
SHOW CLUSTER:
View of Cluster from system ID 4345 node: SLAVE 16-SEP-2011 10:06:13
+--------------------------------------------------------+---------+
|                        SYSTEMS                         | MEMBERS |
+--------+--------------------------------+--------------+---------+
|  NODE  |            HW_TYPE             |   SOFTWARE   | STATUS  |
+--------+--------------------------------+--------------+---------+
| SLAVE  | AlphaServer 1000A 5/300        | VMS V8.3     | MEMBER  |
| ALEPH  | VAXstation 4000-VLC            | VMS V7.3     | NEW     |
+--------+--------------------------------+--------------+---------+
+------------------------------------------------------------------------------------+
|                                      CLUSTER                                       |
+--------+-----------+----------+------------+-------------------+-------------------+
| CL_EXP | CL_QUORUM | CL_VOTES | CL_MEMBERS |      FORMED       |  LAST_TRANSITION  |
+--------+-----------+----------+------------+-------------------+-------------------+
|      1 |         1 |        1 |          1 | 16-SEP-2011 09:56 | 16-SEP-2011 09:56 |
+--------+-----------+----------+------------+-------------------+-------------------+
SYSGEN> SHOW EXPECTED_VOTES
%CNXMAN, Completing VMScluster state transition
Parameter Name            Current    Default     Min.      Max.     Unit  Dynamic
--------------            -------    -------    -------   -------   ----  -------
EXPECTED_VOTES                  1          1         1       127 Votes
Any ideas why this is going wrong?
Thanks for the help, much appreciated,
Mark.
On 16/09/11 09:40, hvlems at zonnet.nl wrote:
> Regarding the AlphaServer: check the value of EXPECTED_VOTES in SYSGEN.
> In a cluster whose only other members are non-voting satellites, its value must be less than VOTES+1, i.e. no more than the server's own VOTES.
> Hans
> ------Original message------
> From: Mark Wickens
> Sender: owner-hecnet at Update.UU.SE
> To: hecnet at Update.UU.SE
> Reply-To: hecnet at Update.UU.SE
> Subject: [HECnet] More clustering fun
> Sent: 16 September 2011 10:14
>
> I've now refreshed the VAX satellite's system drive and installed it in the
> ALPHA server. The one problem I have remaining is that the VOTES the
> satellite is contributing to the cluster is 1. I believe for a proper
> satellite this should be 0.
>
> Is this a case of updating the MODPARAMS.DAT on the satellite, then running
> AUTOGEN and rebooting? Do I need to do anything with the ALPHA server's configuration?
>
> Presumably I will need to reboot the ALPHA server as well.
>
> Thanks for the help,
>
> Kind regards, Mark.
>
EXPECTED_VOTES on the ALPHA is set to 1, which I assume is the value we would expect to work?
Even when both the server and the satellite have VOTES set to 1 the expected votes remains at 1 on the server, although if you examine the cluster via SHOW CLUSTER it shows that the quorum is set to 2:
View of Cluster from system ID 4345 node: SLAVE 16-SEP-2011 11:21:15
+--------------------------------------------------------+---------+
|                        SYSTEMS                         | MEMBERS |
+--------+--------------------------------+--------------+---------+
|  NODE  |            HW_TYPE             |   SOFTWARE   | STATUS  |
+--------+--------------------------------+--------------+---------+
| SLAVE  | AlphaServer 1000A 5/300        | VMS V8.3     | MEMBER  |
| ALEPH  | VAXstation 4000-VLC            | VMS V7.3     | MEMBER  |
+--------+--------------------------------+--------------+---------+
+-------------------------------------------------------------------------------
|                                      CLUSTER
+--------+-----------+----------+------------+-------------------+--------------
| CL_EXP | CL_QUORUM | CL_VOTES | CL_MEMBERS |      FORMED       |  LAST_TRANSIT
+--------+-----------+----------+------------+-------------------+--------------
|      2 |         2 |        2 |          2 | 16-SEP-2011 11:04 | 16-SEP-2011 1
+--------+-----------+----------+------------+-------------------+--------------
With the satellite's VOTES set to 0 (which causes the endless %CNXMAN messages), if I turn off the satellite at that point I get the following:
$$
%CNXMAN, Quorum lost, blocking activity
$$
%CNXMAN, Timed-out lost connection to system ALEPH
%CNXMAN, Proposing reconfiguration of the VMScluster
%CNXMAN, Discovered system ALEPH
%CNXMAN, Removed from VMScluster system ALEPH
%CNXMAN, Completing VMScluster state transition
%CNXMAN, Established connection to system ALEPH
and the server hangs. I end up having to reboot the server, because the satellite never joins the cluster successfully.
It's all fun!
Regards, Mark.
I've now refreshed the VAX satellite's system drive and installed it in the
ALPHA server. The one problem I have remaining is that the VOTES the
satellite is contributing to the cluster is 1. I believe for a proper
satellite this should be 0.
The number of votes a node has determines what happens to it when it
loses contact with other members of the cluster. If each node in a two
node cluster has one vote, then the cluster quorum is two votes. If
something happens to either node, the other notices that quorum has been
lost and will hang until quorum is reestablished. The reason for the hang
is that all each node knows is that it can't see the other node. It doesn't
know whether the other has shut down or is still running and might become
visible again shortly, in which case, everything can resume.
If your diskless satellite should die unexpectedly, it is somewhat
irritating if it also ends up hanging the other node, for no good reason.
If the satellite has lost contact with the node with the disks, then it can
do nothing anyway. Hence the reason for recommending zero votes for diskless
satellites. No other ill effects will result from leaving votes set to one.
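
To make the vote arithmetic concrete, here is a small Python sketch. It assumes the commonly documented OpenVMS rule that quorum is (EXPECTED_VOTES + 2) / 2 with the fraction discarded, and it assumes EXPECTED_VOTES is set to the sum of all nodes' VOTES; neither is spelled out in this thread, though both agree with the CL_EXP and CL_QUORUM figures in the SHOW CLUSTER displays posted earlier. The function names are invented for the example.

```python
def quorum(expected_votes):
    """Quorum derived from EXPECTED_VOTES (assumed rule, integer division)."""
    return (expected_votes + 2) // 2


def survives(votes_by_node, lost_node):
    """True if the remaining nodes still hold quorum after lost_node vanishes.

    Assumes EXPECTED_VOTES equals the sum of every node's VOTES and that the
    node disappears without quorum being recomputed (e.g. a crash).
    """
    expected = sum(votes_by_node.values())
    remaining = expected - votes_by_node[lost_node]
    return remaining >= quorum(expected)


print(quorum(2), quorum(1))                       # 2 1
print(survives({"SLAVE": 1, "ALEPH": 1}, "ALEPH"))  # False: SLAVE hangs
print(survives({"SLAVE": 1, "ALEPH": 0}, "ALEPH"))  # True: SLAVE carries on
print(survives({"SLAVE": 1, "ALEPH": 0}, "SLAVE"))  # False: satellite is stuck anyway
```

The second and third cases are the argument for zero votes on a diskless satellite: losing the satellite no longer costs the server its quorum, and the satellite could do nothing without the server in any case.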
Whatever the number of votes each node has, when shutting down a voting
member of the cluster, REMOVE_NODE should be specified in order to avoid
hanging the rest of the cluster after the shutdown completes. This specifies
that the remaining cluster nodes should recompute quorum taking into account
the loss of the node being shut down.
Regards.
Peter Coghlan.