Highly available NFS server

Set up an Active/Passive topology with common storage for shared data.
By Kostas Koutsogiannopoulos

Environment

Our topology consists of 2 CentOS 7 servers (Active-Passive).

Both nodes are using one common (shared) disk for data:

+------+-------------------------+---LAN+
       ^                         ^
       |                         |
 +-----+-----+  heartbeat  +-----+-----+
 |nfs-server1|<----------->|nfs-server2|
 +-----+-----+             +-----+-----+
       |         COMMON          |
       |        /------\         |
       |        | XFS  |         |
       |        | LVM  |         |
       +--------+  VG  +---------+
                |  PV  |
                \------/
                  DISK

The active node must be able to serve the data to network clients as an NFS share.

We will begin by setting up the NFS server on a single node, "nfs-server1":

$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

Install packages

Starting from a minimal CentOS installation, we will need the following packages for the NFS and cluster setup:

$ sudo yum update

$ sudo yum install nfs-utils

$ sudo yum groupinstall 'High Availability'

Create shared directory

For some initial testing we will create the /nfsdata directory on the root filesystem. Later, in the cluster setup, it will be the mount point for the shared filesystem, which is mounted only on the active node.

$ sudo mkdir /nfsdata

$ sudo chmod -R 755 /nfsdata

$ sudo chown nfsnobody:nfsnobody /nfsdata

Manage firewall

These firewall exceptions allow communication between the cluster nodes and let the NFS service work correctly. Note that we may need extra rules for fencing functionality later:

$ sudo firewall-cmd --permanent --zone=public --add-service=nfs
success
$ sudo firewall-cmd --permanent --zone=public --add-service=mountd
success
$ sudo firewall-cmd --permanent --zone=public --add-service=rpc-bind
success
$ sudo firewall-cmd --permanent --add-service=high-availability
success
$ sudo firewall-cmd --reload
success
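Since the permanent rules only become active after the reload, it can be worth confirming that all four services are really enabled. The sketch below compares a list of active services with the ones used in this article. `missing_services` is a hypothetical helper name, and the `public` zone in the usage line is an assumption based on the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print every required firewalld service that is
# absent from the space-separated list passed as the first argument.
REQUIRED_SERVICES="nfs mountd rpc-bind high-availability"

missing_services() {
    active=" $1 "
    for svc in $REQUIRED_SERVICES; do
        case "$active" in
            *" $svc "*) ;;               # service is present
            *) printf '%s\n' "$svc" ;;   # service is missing
        esac
    done
}

# Usage on a node (assumes the rules live in the "public" zone):
#   missing_services "$(sudo firewall-cmd --zone=public --list-services)"
```

No output means that nothing is missing.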

Test nfs service on one server alone

On the NFS server, start the nfs-server service. Do not enable it at boot; the cluster will manage it later:

$ sudo systemctl start nfs-server

Add the following line to the /etc/exports file (192.168.16.14 is the IP of a client):

/nfsdata 192.168.16.14(rw,sync,no_root_squash,no_all_squash)

Restart nfs-server service:

$ sudo systemctl restart nfs-server

Now go to the client (in our case 192.168.16.14), make sure the mount point /mnt/nfsdata exists, and try to mount the remote NFS share:

$ sudo mount -t nfs nfs-server1:/nfsdata /mnt/nfsdata
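To confirm from the client that the directory is really backed by the NFS share, the mount table can be inspected. This is a minimal sketch: `is_nfs_mount` is a hypothetical helper, and it assumes a /proc/mounts-style table (the default argument is the real /proc/mounts file).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given mount point is listed with an
# NFS filesystem type (nfs or nfs4) in a /proc/mounts-style table.
is_nfs_mount() {
    mountpoint=$1
    table=${2:-/proc/mounts}
    # Fields: device, mount point, filesystem type, options, dump, pass
    awk -v mp="$mountpoint" '
        $2 == mp && ($3 == "nfs" || $3 == "nfs4") { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$table"
}

# Usage on the client:
#   is_nfs_mount /mnt/nfsdata && echo "NFS share is mounted"
```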

After a successful mount, you can remove everything from the /etc/exports file and stop the NFS service:

$ sudo systemctl stop nfs-server

We always try to manage clustered resources through cluster resource agents, which are configured globally as part of the cluster configuration. Anything configured separately on one node would need a mechanism to replicate it to every other node, which makes things more complex and is beyond the scope of this article.

Set password for the user hacluster

The user "hacluster" is created when the "High Availability" packages are installed and is used for internal communication between the nodes. Let's set a password for that user:

$ sudo passwd hacluster
Changing password for user hacluster.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

Cloning the node

Now it is a good time to clone nfs-server1 creating nfs-server2.

After cloning and before boot, we need to:

Attach a new disk on both nodes (common disk). This disk will host our data that will be nfs-shared over the network.
(optional) Attach a new network card on both nodes, dedicated to communication between the cluster nodes (heartbeat).

Then we can boot nfs-server2 and change its hostname by running:

$ sudo hostnamectl set-hostname nfs-server2

...and if you have static network setup, configure network interfaces.

Let's set up our shared disk now, while only nfs-server2 is up.

We will create an xfs filesystem over LVM:

# pvcreate /dev/vdb
Physical volume "/dev/vdb" successfully created.

# vgcreate NFS_SHARED /dev/vdb
Volume group "NFS_SHARED" successfully created.

# lvcreate -l 100%FREE -n NFS_LVM NFS_SHARED
Logical volume "NFS_LVM" created.

# mkfs.xfs /dev/mapper/NFS_SHARED-NFS_LVM
meta-data=/dev/mapper/NFS_SHARED-NFS_LVM isize=512    agcount=4, agsize=196352 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=785408, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Finally, we can boot nfs-server1 and run some checks:

Make sure that the hostnames of both nodes are resolvable on each node.
Make sure that our nodes can communicate with each other (via heartbeat network interfaces).
Make sure that every node can display the logical volume in the "NFS_SHARED" volume group: # lvdisplay NFS_SHARED
Make sure that every node can display the filesystem: # blkid /dev/mapper/NFS_SHARED-NFS_LVM
Do not try to mount the shared filesystem on more than one node at the same time, because you may end up with a corrupted filesystem.
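The name resolution check from the list above can be scripted. A minimal sketch, assuming `getent hosts` is available on the nodes; `unresolved_hosts` and the `RESOLVE_CMD` override are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print each host name that cannot be resolved.
# RESOLVE_CMD defaults to "getent hosts" and can be overridden for testing.
unresolved_hosts() {
    for host in "$@"; do
        if ! ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
            printf '%s\n' "$host"
        fi
    done
}

# Usage on each node; no output means both names resolve:
#   unresolved_hosts nfs-server1 nfs-server2
```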

Now we start and enable the pcsd service on both nodes:

[root@nfs-server1 ~]# systemctl start pcsd.service

[root@nfs-server1 ~]# systemctl enable pcsd.service

[root@nfs-server2 ~]# systemctl start pcsd.service

[root@nfs-server2 ~]# systemctl enable pcsd.service

Creating cluster

Both of our nodes are up and running, with a shared disk attached to both of them, and are able to communicate with each other.

From now on we can apply cluster configuration globally. This means that we can run the commands below on any of the cluster nodes, and the cluster takes care of configuring every member.

So we can monitor configuration on any node with a command like:

# watch pcs status

... and execute everything on another node:

Cluster authorization, using the "hacluster" username and the password we set before cloning:

# pcs cluster auth nfs-server1 nfs-server2
Username: hacluster
Password:
nfs-server1: Authorized
nfs-server2: Authorized

Cluster creation:

# pcs cluster setup --start --name nfs-server-cluster nfs-server1 nfs-server2
Destroying cluster on nodes: nfs-server1, nfs-server2...
nfs-server1: Stopping Cluster (pacemaker)...
nfs-server2: Stopping Cluster (pacemaker)...
nfs-server1: Successfully destroyed cluster
nfs-server2: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'nfs-server1', 'nfs-server2'
nfs-server1: successful distribution of the file 'pacemaker_remote authkey'
nfs-server2: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
nfs-server1: Succeeded
nfs-server2: Succeeded

Starting cluster on nodes: nfs-server1, nfs-server2...
nfs-server1: Starting Cluster...
nfs-server2: Starting Cluster...

Synchronizing pcsd certificates on nodes nfs-server1, nfs-server2...
nfs-server1: Success
nfs-server2: Success
Restarting pcsd on the nodes in order to reload the certificates...
nfs-server1: Success
nfs-server2: Success

Enable cluster services to auto-run on boot:

# pcs cluster enable --all
nfs-server1: Cluster Enabled
nfs-server2: Cluster Enabled

Creating clustered resources

For a complete NFS server, we need at least the following resources:

Filesystem on common disk
NFS service
NFS export configuration
Virtual (cluster) IP address for the active node to listen on

Let's configure them one by one:

Filesystem resource:

# pcs resource create SharedFS Filesystem device=/dev/mapper/NFS_SHARED-NFS_LVM directory=/nfsdata fstype=xfs --group nfsresourcegroup
Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')

NFS service resource:

# pcs resource create NFSService nfsserver nfs_shared_infodir=/nfsdata/nfsinfo --group nfsresourcegroup
Assumed agent name 'ocf:heartbeat:nfsserver' (deduced from 'nfsserver')

NFS export resource:

# pcs resource create NFSExport exportfs clientspec="192.168.16.0/24" options=rw,sync,no_root_squash,no_all_squash directory=/nfsdata fsid=0 --group nfsresourcegroup
Assumed agent name 'ocf:heartbeat:exportfs' (deduced from 'exportfs')

Virtual IP resource (192.168.16.100):

# pcs resource create VIP IPaddr2 ip=192.168.16.100 cidr_netmask=24 --group nfsresourcegroup
Assumed agent name 'ocf:heartbeat:IPaddr2' (deduced from 'IPaddr2')

To be more thorough, we will configure one extra resource. When resources move, it sends NFSv3 reboot notifications to every NFS client, right after the NFS service is initialized on the new active node.

NFS notify resource:

# pcs resource create NFSNotify nfsnotify source_host=192.168.16.100 --group nfsresourcegroup
Assumed agent name 'ocf:heartbeat:nfsnotify' (deduced from 'nfsnotify')

Ordering Constraints

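Last but not least, we need some order constraints that force our resources to start in a specific order (and stop in reverse). Members of a resource group are already started in the order in which they were added, so these constraints mostly make that order explicit: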

# pcs constraint order SharedFS then NFSService

Adding SharedFS NFSService (kind: Mandatory) (Options: first-action=start then-action=start)

# pcs constraint order NFSService then NFSExport

Adding NFSService NFSExport (kind: Mandatory) (Options: first-action=start then-action=start)

# pcs constraint order NFSExport then VIP

Adding NFSExport VIP (kind: Mandatory) (Options: first-action=start then-action=start)

# pcs constraint order set NFSNotify require-all=true
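The first three constraints above follow a simple pattern: each resource is ordered after its predecessor. The sketch below prints the equivalent commands for review and does not run them; `order_chain` is a hypothetical helper, not part of pcs.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one "pcs constraint order" command
# for each adjacent pair in the given list of resource names.
order_chain() {
    prev=""
    for res in "$@"; do
        if [ -n "$prev" ]; then
            printf 'pcs constraint order %s then %s\n' "$prev" "$res"
        fi
        prev=$res
    done
}

order_chain SharedFS NFSService NFSExport VIP
# pcs constraint order SharedFS then NFSService
# pcs constraint order NFSService then NFSExport
# pcs constraint order NFSExport then VIP
```

Printing the commands instead of executing them keeps the helper safe to try on any machine; the output can be reviewed and then pasted on a cluster node.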

After all these we have the following situation:

# pcs status
Cluster name: nfs-server-cluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: nfs-server2 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Sat Jun 2 12:54:13 2018
Last change: Sat Jun 2 12:51:51 2018 by root via cibadmin on nfs-server1

2 nodes configured
5 resources configured

Online: [ nfs-server1 nfs-server2 ]

Full list of resources:

 Resource Group: nfsresourcegroup
     SharedFS (ocf::heartbeat:Filesystem): Stopped
     NFSService (ocf::heartbeat:nfsserver): Stopped
     NFSExport (ocf::heartbeat:exportfs): Stopped
     VIP (ocf::heartbeat:IPaddr2): Stopped
     NFSNotify (ocf::heartbeat:nfsnotify): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Notice that the cluster keeps all the configured resources stopped. This happens because we have not configured any fencing resource yet. The output above shows a clear warning about that:

WARNING: no stonith devices and stonith-enabled is not false

Using command:

# pcs property set stonith-enabled=false

... you can disable fencing -for now- and every resource will immediately start up:

# pcs status
Cluster name: nfs-server-cluster
Stack: corosync
Current DC: nfs-server2 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Sat Jun 2 13:06:15 2018
Last change: Sat Jun 2 13:04:32 2018 by root via cibadmin on nfs-server1

2 nodes configured
5 resources configured

Online: [ nfs-server1 nfs-server2 ]

Full list of resources:

 Resource Group: nfsresourcegroup
     SharedFS (ocf::heartbeat:Filesystem): Started nfs-server1
     NFSService (ocf::heartbeat:nfsserver): Started nfs-server1
     NFSExport (ocf::heartbeat:exportfs): Started nfs-server1
     VIP (ocf::heartbeat:IPaddr2): Started nfs-server1
     NFSNotify (ocf::heartbeat:nfsnotify): Started nfs-server1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
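Instead of reading the status output by eye, the stopped resources can be filtered out of it. This is a minimal sketch that assumes the resource line layout printed by this pcs version (name, agent in parentheses, state); `stopped_resources` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: read "pcs status" text on standard input and print
# the name of every OCF resource whose state is "Stopped".
stopped_resources() {
    awk '$2 ~ /^\(ocf::/ && $3 == "Stopped" { print $1 }'
}

# Usage on any cluster node; no output means nothing is stopped:
#   pcs status | stopped_resources
```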

Some testing

Now that we have a highly available NFS service, we can run some maintenance on our machines with -almost- no downtime.

For example, a possible scenario starts with the passive (inactive) node:

[root@nfs-server2 ~]# pcs node standby
... maintenance actions on nfs-server2...
[root@nfs-server2 ~]# pcs node unstandby
[root@nfs-server1 ~]# pcs node standby
... wait for resources to move and run maintenance actions on nfs-server1...
[root@nfs-server1 ~]# pcs node unstandby
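The "wait for resources to move" step can be made explicit by polling the cluster status. A minimal sketch, again assuming the resource line layout shown in the status output of this article; `all_started_on` is a hypothetical helper and the two-second interval is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: read "pcs status" text on standard input and succeed
# only if at least one OCF resource is listed and every listed OCF resource
# is "Started" on the node given as the first argument.
all_started_on() {
    awk -v node="$1" '
        $2 ~ /^\(ocf::/ {
            seen = 1
            if ($3 != "Started" || $4 != node) bad = 1
        }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Usage after "pcs node standby" on nfs-server1: block until the whole
# group runs on nfs-server2, then begin maintenance on nfs-server1.
#   until pcs status | all_started_on nfs-server2; do sleep 2; done
```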

If a client is connected to the NFS service on the virtual IP during maintenance (more specifically, while the resources are moving), you can run a test such as:

[user@TestPC ~]$ for i in {1..10}; do time touch /mnt/nfsdata/test; echo $?; sleep 1; done

real 0m0.033s user 0m0.000s sys 0m0.002s 0

real 0m0.016s user 0m0.001s sys 0m0.000s 0

real 0m0.006s user 0m0.000s sys 0m0.001s 0

real 0m0.008s user 0m0.000s sys 0m0.001s 0

real 0m0.031s user 0m0.000s sys 0m0.001s 0

real 0m7.724s user 0m0.000s sys 0m0.001s 0

real 0m0.007s user 0m0.000s sys 0m0.001s 0

real 0m0.008s user 0m0.001s sys 0m0.000s 0

real 0m0.007s user 0m0.000s sys 0m0.001s 0

real 0m0.032s user 0m0.000s sys 0m0.001s 0

As you can see, the sixth attempt to touch the /mnt/nfsdata/test file took longer than usual (about 7.7 seconds) because the resources were briefly unavailable, but it eventually completed successfully (return code 0). Of course you can try more I/O-intensive operations (like watching your favorite movie while patching your NAS servers). You will see only delays, as long as the NFS client and server are able to handle pending transactions after initialization on the new active node.
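A slightly more structured variant of this test records the result of each `touch` together with the time it took, which makes a stalled attempt easy to spot in a log. This is a sketch only: `probe_writes` is a hypothetical helper, and it measures whole seconds, so sub-second differences are not visible.

```shell
#!/bin/sh
# Hypothetical helper: touch a file in DIR COUNT times, pausing INTERVAL
# seconds between attempts, and print "<status> <elapsed-seconds>" per try.
probe_writes() {
    dir=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        start=$(date +%s)
        # Record the result of touch itself, not of a later command
        if touch "$dir/test"; then status=ok; else status=fail; fi
        end=$(date +%s)
        printf '%s %s\n' "$status" "$((end - start))"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Usage on the client while resources move between nodes:
#   probe_writes /mnt/nfsdata 10 1
```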

We built and tested our cluster running some controlled actions. We marked an active node as standby and watched every resource gracefully stop on this node and start on the other. But what happens when a failure occurs? Is automatic failover always possible?

Fencing

Fencing is the most important mechanism of a cluster during a failure. It guarantees that the node taking over starts clean and that the malfunctioning node is really isolated.

Fencing in Linux clusters is implemented by "stonith" (Shoot The Other Node In The Head) resources. The exact implementation depends entirely on your infrastructure. There are multiple "fence agents", each suited to a specific environment (physical or virtual machines, software or hardware fencing devices, etc.).

In another article we will add fencing functionality to our cluster, running the servers as virtual machines on KVM hypervisors.