My apologies for the confusion, Peter. You're right to assume I meant the cluster group number.
And yes, there is obviously something wrong here. If an AUTOGEN was done, I'd say have a look at MODPARAMS.DAT:
- ALLOCLASS must be unique for each node
- same for TAPE_ALLOCLASS
And check the cluster license. AFAIK the cluster license must be unique for each node. Alternatively, use one license with 0 units, and in that case make sure all node names are mentioned in the /INCLUDE list, again on all nodes where the license was loaded.
PS
I'm travelling now so can't check my mail too often
Hans
-----Original Message-----
From: Peter Coghlan <HECNET at beyondthepale.ie>
Sender: owner-hecnet at Update.UU.SE
Date: Fri, 16 Sep 2011 12:52:13
To: <hecnet at Update.UU.SE>
Reply-To: hecnet at Update.UU.SE
Subject: Re: Fwd: Re: [HECnet] More clustering fun
>>> 2) The cluster id and cluster password are different on both nodes.
>> What's the cluster id?
> The cluster id is 4.
Sorry - I should have said "What do you mean by the cluster id?"
I guess you mean the cluster group number but I wanted to be
sure we weren't referring to two different things.
> I've confirmed that the authorize file is the same on both the alpha and
> vax nodes:
> $$ diff dsa0:[sys0.syscommon.sysexe]cluster_authorize.dat -
> _$$ $4$dkb400:[sys0.syscommon.sysexe]cluster_authorize.dat
> Number of difference sections found: 0
> Number of difference records found: 0
> DIFFERENCES /IGNORE=()/MERGED=1-
> DSA0:[SYS0.SYSCOMMON.SYSEXE]CLUSTER_AUTHORIZE.DAT;1-
> $4$DKB400:[SYS0.SYSCOMMON.SYSEXE]CLUSTER_AUTHORIZE.DAT;1
[snip]
> It's weird that changing VOTES to 0 on the satellite and running
> AUTOGEN causes this behaviour. Before that, the satellite booted off
> the served disk with no problem.
I doubt that changing VOTES to 0 was responsible. To check this, you could
set VOTES back to 1 (probably manually using SYSGEN) and test again.
(A problem with running AUTOGEN for the first time is that it finds anything
that you didn't know was lurking in MODPARAMS.DAT and acts on it.)
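One way to see what AUTOGEN would pick up is to list the cluster-related assignments in the file before running it. The following is only a rough sketch: it assumes the usual `NAME = value` layout with `!` comments, the function name `cluster_settings` and the sample text are made up for illustration, and it ignores the `MIN_`, `MAX_` and `ADD_` prefixed forms that MODPARAMS.DAT also accepts.

```python
import re

# Cluster-related parameters discussed in this thread.
CLUSTER_PARAMS = {"VOTES", "EXPECTED_VOTES", "ALLOCLASS", "TAPE_ALLOCLASS"}


def cluster_settings(modparams_text):
    """Return [(line_number, NAME, value)] for cluster parameters found."""
    found = []
    for number, line in enumerate(modparams_text.splitlines(), start=1):
        code = line.split("!", 1)[0]  # drop '!' comments
        match = re.match(r"\s*([A-Za-z0-9_$]+)\s*=\s*(\S+)", code)
        if match and match.group(1).upper() in CLUSTER_PARAMS:
            found.append((number, match.group(1).upper(), match.group(2)))
    return found


# Hypothetical file contents, not taken from either node in this thread.
sample = """! cluster settings
VOTES = 0
expected_votes=1 ! set by hand
SCSNODE = "ALEPH"
ALLOCLASS = 4
"""

for number, name, value in cluster_settings(sample):
    print(f"line {number}: {name} = {value}")
```

A parameter that shows up more than once, or with a value nobody remembers setting, is a good candidate for the kind of surprise described above.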
> Indeed, the 'master' disk, which is a local drive in the VAXstation, boots
> into the cluster with no problem, but again, with VOTES = 1.
Check that there are no files called SYS$SPECIFIC:[SYSEXE]CLUSTER_AUTHORIZE.DAT
on either node. If you find any, delete them.
You may have to reboot both nodes in order to get them to read the correct
CLUSTER_AUTHORIZE.DAT files after copying them or deleting errant ones.
Regards,
Peter Coghlan.
On Fri, 16 Sep 2011, Peter Coghlan wrote:
>> 2) The cluster id and cluster password are different on both nodes.
> What's the cluster id?
The cluster id is 4.
I've confirmed that the authorize file is the same on both the alpha and vax nodes:
$$ diff dsa0:[sys0.syscommon.sysexe]cluster_authorize.dat -
_$$ $4$dkb400:[sys0.syscommon.sysexe]cluster_authorize.dat
Number of difference sections found: 0
Number of difference records found: 0
DIFFERENCES /IGNORE=()/MERGED=1-
DSA0:[SYS0.SYSCOMMON.SYSEXE]CLUSTER_AUTHORIZE.DAT;1-
$4$DKB400:[SYS0.SYSCOMMON.SYSEXE]CLUSTER_AUTHORIZE.DAT;1
> The cluster group number and the cluster password must be the same
> on all nodes in the cluster. The easiest way to achieve this is
> to copy sys$common:[sysexe]cluster_authorize.dat from one to the
> other.
I set the right values when I ran CLUSTER_CONFIG_LAN.COM on the VAX.
> If the cluster group numbers are different, you will end up
> trying to form two different clusters.
> If the passwords are different, the results will probably be
> something like you are seeing, nodes failing to completely
> join the cluster. There may also be some messages in the
> operator.log.
It's weird that changing VOTES to 0 on the satellite and running AUTOGEN causes this behaviour. Before that, the satellite booted off the served disk with no problem.
Indeed, the 'master' disk, which is a local drive in the VAXstation, boots into the cluster with no problem, but again, with VOTES = 1.
Regards, Mark
On 16/09/11 11:22, hvlems at zonnet.nl wrote:
Did you adjust EXPECTED_VOTES on the alpha server?
From: Mark Wickens <mark at wickensonline.co.uk>
Sender: owner-hecnet at Update.UU.SE
Date: Fri, 16 Sep 2011 11:09:04 +0100
To: <hecnet at Update.UU.SE>
ReplyTo: hecnet at Update.UU.SE
Subject: Fwd: Re: [HECnet] More clustering fun
Just to add to the picture - if I reduce VOTES on the satellite to 0 I get this happening:
-------- Original Message --------
Subject: Re: [HECnet] More clustering fun
Date: Fri, 16 Sep 2011 09:33:47 +0000
From: hvlems at zonnet.nl
Reply-To: hvlems at zonnet.nl
To: Mark Wickens <mark at wickensonline.co.uk>
All I can think of is this:
1) SLAVE and ALEPH both use the same VMSCLUSTER license.
2) The cluster id and cluster password are different on both nodes.
On an Alpha you can modify this in SYSMAN (use HELP in SYSMAN to find the correct command). On a VAX the command is buried in SYSGEN.
Hans
-----Original Message-----
From: Mark Wickens <mark at wickensonline.co.uk>
Date: Fri, 16 Sep 2011 10:11:41
Cc: <hvlems at zonnet.nl>
Subject: Re: [HECnet] More clustering fun
Hi Hans,
I didn't want to pre-empt what I thought happened last time as I wasn't
sure I'd got it right, but it has happened again so it's definitely an
issue.
I've updated VOTES in the MODPARAMS.DAT on ALEPH (the satellite), run
AUTOGEN and rebooted the cluster.
Now when ALEPH attempts to join the cluster I get these messages repeatedly:
%CNXMAN, sending VAXcluster membership request to system SLAVE
%CNXMAN, sending VAXcluster membership request to system SLAVE
%CNXMAN, sending VAXcluster membership request to system SLAVE
%CNXMAN, sending VAXcluster membership request to system SLAVE
%CNXMAN, sending VAXcluster membership request to system SLAVE
and I see this on SLAVE:
$$
%CNXMAN, Received VMScluster membership request from system ALEPH
%CNXMAN, Proposing addition of system ALEPH
%CNXMAN, Completing VMScluster state transition
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:35.79 %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) received VMScluster membership request from node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:35.79 %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) proposed addition of node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:35.79 %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) completed VMScluster state transition
$$
%CNXMAN, Received VMScluster membership request from system ALEPH
%CNXMAN, Proposing addition of system ALEPH
%CNXMAN, Completing VMScluster state transition
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:39.04 %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) received VMScluster membership request from node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:39.04 %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) proposed addition of node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:39.04 %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) completed VMScluster state transition
$$
%CNXMAN, Received VMScluster membership request from system ALEPH
%CNXMAN, Proposing addition of system ALEPH
%CNXMAN, Completing VMScluster state transition
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:43.02 %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) received VMScluster membership request from node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:43.02 %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) proposed addition of node ALEPH
%%%%%%%%%%% OPCOM 16-SEP-2011 10:05:43.02 %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) completed VMScluster state transition
This just repeats forever.
Some more information from SLAVE (the ALPHA server):
SHOW CLUSTER:
View of Cluster from system ID 4345  node: SLAVE            16-SEP-2011 10:06:13
+--------------------------------------------------------+---------+
|                        SYSTEMS                         | MEMBERS |
+--------+--------------------------------+--------------+---------+
|  NODE  |            HW_TYPE             |   SOFTWARE   | STATUS  |
+--------+--------------------------------+--------------+---------+
| SLAVE  | AlphaServer 1000A 5/300        | VMS V8.3     | MEMBER  |
| ALEPH  | VAXstation 4000-VLC            | VMS V7.3     | NEW     |
+--------+--------------------------------+--------------+---------+
+------------------------------------------------------------------------------------+
|                                      CLUSTER                                       |
+--------+-----------+----------+------------+-------------------+-------------------+
| CL_EXP | CL_QUORUM | CL_VOTES | CL_MEMBERS |      FORMED       |  LAST_TRANSITION  |
+--------+-----------+----------+------------+-------------------+-------------------+
|      1 |         1 |        1 |          1 | 16-SEP-2011 09:56 | 16-SEP-2011 09:56 |
+--------+-----------+----------+------------+-------------------+-------------------+
SYSGEN> SHOW EXPECTED_VOTES
%CNXMAN, Completing VMScluster state transition
Parameter Name            Current    Default     Min.       Max.    Unit  Dynamic
--------------            -------    -------    -------    -------  ----  -------
EXPECTED_VOTES                  1          1         1        127 Votes
Any ideas why this is going wrong?
Thanks for the help, much appreciated,
Mark.
On 16/09/11 09:40, hvlems at zonnet.nl wrote:
> Regarding the AlphaServer: check the value of EXPECTED_VOTES in SYSGEN.
> In a cluster with non-voting satellites only, its value must be less than Votes+1.
> Hans
> ------Original message------
> From: Mark Wickens
> Sender: owner-hecnet at Update.UU.SE
> To: hecnet at Update.UU.SE
> Reply-To: hecnet at Update.UU.SE
> Subject: [HECnet] More clustering fun
> Sent: 16 September 2011 10:14
>
> I've now refreshed the VAX satellite's system drive and installed it in the
> ALPHA server. The one problem I have remaining is that the VOTES the
> satellite is contributing to the cluster is 1. I believe for a proper
> satellite this should be 0.
>
> Is this a case of updating the MODPARAMS.DAT on the satellite and autogen
> and reboot? Do I need to do anything with the ALPHA server's configuration?
>
> Presumably I will need to reboot the ALPHA server as well.
>
> Thanks for the help,
>
> Kind regards, Mark.
>
EXPECTED_VOTES on the ALPHA is set to 1, which I assume is the value we would expect to work?
Even when both the server and the satellite have VOTES set to 1, EXPECTED_VOTES remains at 1 on the server, although SHOW CLUSTER shows that the quorum is set to 2:
View of Cluster from system ID 4345 node: SLAVE 16-SEP-2011 11:21:15
+--------------------------------------------------------+---------+
|                        SYSTEMS                         | MEMBERS |
+--------+--------------------------------+--------------+---------+
|  NODE  |            HW_TYPE             |   SOFTWARE   | STATUS  |
+--------+--------------------------------+--------------+---------+
| SLAVE  | AlphaServer 1000A 5/300        | VMS V8.3     | MEMBER  |
| ALEPH  | VAXstation 4000-VLC            | VMS V7.3     | MEMBER  |
+--------+--------------------------------+--------------+---------+
+-------------------------------------------------------------------------------
|                                      CLUSTER
+--------+-----------+----------+------------+-------------------+--------------
| CL_EXP | CL_QUORUM | CL_VOTES | CL_MEMBERS |      FORMED       |  LAST_TRANSIT
+--------+-----------+----------+------------+-------------------+--------------
|      2 |         2 |        2 |          2 | 16-SEP-2011 11:04 | 16-SEP-2011 1
+--------+-----------+----------+------------+-------------------+--------------
With the satellite's VOTES set to 0 (which causes the endless %CNXMAN messages), if I turn off the satellite at that point I get the following:
$$
%CNXMAN, Quorum lost, blocking activity
$$
%CNXMAN, Timed-out lost connection to system ALEPH
%CNXMAN, Proposing reconfiguration of the VMScluster
%CNXMAN, Discovered system ALEPH
%CNXMAN, Removed from VMScluster system ALEPH
%CNXMAN, Completing VMScluster state transition
%CNXMAN, Established connection to system ALEPH
and the server hangs. I end up having to reboot the server, because the satellite never joins the cluster successfully.
It's all fun!
Regards, Mark.
> I've now refreshed the VAX satellite's system drive and installed it in the
> ALPHA server. The one problem I have remaining is that the VOTES the
> satellite is contributing to the cluster is 1. I believe for a proper
> satellite this should be 0.
The number of votes a node has determines what happens to it when it
loses contact with other members of the cluster. If each node in a two
node cluster has one vote, then the cluster quorum is two votes. If
something happens to either node, the other notices that quorum has been
lost and will hang until quorum is reestablished. The reason for the hang
is that all each node knows is that it can't see the other node. It doesn't
know whether the other has shut down or is still running and might become
visible again shortly, in which case, everything can resume.
If your diskless satellite should die unexpectedly, it is somewhat
irritating if it also ends up hanging the other node, for no good reason.
If the satellite has lost contact with the node with the disks, then it can
do nothing anyway. Hence the reason for recommending zero votes for diskless
satellites. No other ill effects will result from leaving votes set to one.
Whatever the number of votes each node has, when shutting down a voting
member of the cluster, REMOVE_NODE should be specified in order to avoid
hanging the rest of the cluster after the shutdown completes. This specifies
that the remaining cluster nodes should recompute quorum taking into account
the loss of the node being shut down.
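The arithmetic behind this can be sketched in a few lines. OpenVMS derives quorum as (EXPECTED_VOTES + 2) / 2, rounded down. The helper names `quorum` and `survives` below are made up for illustration, and the sketch assumes EXPECTED_VOTES equals the sum of all VOTES and that quorum is not recomputed when a node disappears (that is, no REMOVE_NODE).

```python
def quorum(expected_votes):
    """Quorum as OpenVMS derives it: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def survives(votes_by_node, lost_node):
    """True if the remaining nodes still hold quorum after lost_node vanishes.

    Assumption: EXPECTED_VOTES is the sum of all VOTES, and quorum is not
    recomputed when the node is lost.
    """
    needed = quorum(sum(votes_by_node.values()))
    remaining = sum(v for node, v in votes_by_node.items() if node != lost_node)
    return remaining >= needed


# Both nodes voting: quorum is 2, so losing ALEPH leaves SLAVE below quorum.
print(quorum(2), survives({"SLAVE": 1, "ALEPH": 1}, "ALEPH"))  # 2 False
# Satellite with zero votes: quorum is 1, so SLAVE carries on alone.
print(quorum(1), survives({"SLAVE": 1, "ALEPH": 0}, "ALEPH"))  # 1 True
```

These two cases match the CL_EXP and CL_QUORUM values in the SHOW CLUSTER displays earlier in the thread (2 and 2 with both nodes voting, 1 and 1 with only SLAVE counted).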
Regards.
Peter Coghlan.
Correct. VOTES is not dynamic, so you have two options:
1) Modify MODPARAMS.DAT and run AUTOGEN with the reboot option.
2) Modify MODPARAMS.DAT, run SYSGEN to modify VOTES and WRITE CURRENT, then reboot.
There is a lazy third option: the same as 2), but skipping the change to MODPARAMS.DAT. That is sloppy work, to put it mildly.
------Original message------
From: Mark Wickens
Sender: owner-hecnet at Update.UU.SE
To: hecnet at Update.UU.SE
Reply-To: hecnet at Update.UU.SE
Subject: [HECnet] More clustering fun
Sent: 16 September 2011 10:14
I've now refreshed the VAX satellite's system drive and installed it in the
ALPHA server. The one problem I have remaining is that the VOTES the
satellite is contributing to the cluster is 1. I believe for a proper
satellite this should be 0.
Is this a case of updating the MODPARAMS.DAT on the satellite and autogen
and reboot? Do I need to do anything with the ALPHA server's configuration?
Presumably I will need to reboot the ALPHA server as well.
Thanks for the help,
Kind regards, Mark.