Rescue GlusterFS Distributed Storage System

Keywords: Big Data sudo Kubernetes Lambda jupyter

After the GlusterFS distributed storage system was migrated some time ago, it frequently dropped connections and became intermittently unavailable, so for a long time it could not work properly. Observation showed that one of the nodes was constantly restarting. Further investigation revealed that only one of the four nodes in the replicated storage was alive; two other nodes were still running, but their brick directories were empty and gluster volume status showed them as not online. The procedure used to repair the GlusterFS distributed storage system is documented below.

1. View Status

Get the volume status and information for GlusterFS:

# "volume status" reports the actual state at runtime.
sudo gluster volume status gvzr00

# "volume info" reports the configured (defined) layout.
sudo gluster volume info gvzr00

Any difference between the defined configuration and the runtime state points to the problem.
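One way to spot such a difference quickly is to filter the status output for bricks that are not online. The following is a minimal sketch, not part of the original procedure: the helper name offline_bricks is hypothetical, and it assumes the column layout printed by gluster volume status (Online is the second-to-last column) with each brick on a single line.

```shell
#!/bin/sh
# Hypothetical helper: print every brick that is not reported as online.
# Reads `gluster volume status` text on stdin, so it also works on saved output.
# Assumption: each brick occupies one line and "Online" is the second-to-last column.
offline_bricks() {
    awk '$1 == "Brick" && $(NF-1) != "Y" { print $2 }'
}

# Intended use on a live cluster:
#   sudo gluster volume status gvzr00 | offline_bricks
```

An empty result means every brick listed in the status output is online; any brick printed here should be compared against the brick list from volume info.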

Get the connection status of the other nodes (peers):

sudo gluster peer status 

One of the nodes turned out to be disconnected. After several restarts and a system software update failed to bring it back, it was decided to remove the node temporarily.
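On a cluster with many peers, the disconnected ones can be picked out of the peer status output automatically. This is a small sketch under assumptions: the function name disconnected_peers is hypothetical, and it relies on the usual "Hostname:" and "State: ... (Disconnected)" lines that gluster peer status prints for each peer.

```shell
#!/bin/sh
# Hypothetical helper: print the hostname of every peer whose state is Disconnected.
# Reads `gluster peer status` text on stdin.
# Assumption: each peer block contains a "Hostname:" line followed by a "State:" line.
disconnected_peers() {
    awk '
        $1 == "Hostname:" { host = $2 }                  # remember the current peer
        $1 == "State:" && /Disconnected/ { print host }  # report it if disconnected
    '
}

# Intended use on a live cluster:
#   sudo gluster peer status | disconnected_peers
```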

2. Remove Nodes

Remove the node using the following command:

sudo gluster peer detach 

However, the command failed.

  • The error message says that bricks still exist on this node; the bricks located on this node must first be removed from all volumes.

First use volume info to view the corresponding bricks, then force the removal of bricks:

# Removing a brick from the replicated volume gvzr00 requires specifying the new replica count.
sudo gluster volume remove-brick gvzr00 replica 2 10.1.1.193:/zpool/gvzr00 force

# Remove the brick from the striped volume gvz00 (note: data loss may occur).
sudo gluster volume remove-brick gvz00 10.1.1.193:/zpool/gvz00 force

Then force-detach the peer (because it is offline, the force parameter is required):

sudo gluster peer detach 10.1.1.193 force
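Finding the bricks that live on the node to be removed can be scripted instead of read off by eye. The sketch below is an illustration only: bricks_on_host is a hypothetical name, and it assumes the "BrickN: host:/path" lines that gluster volume info prints, with a host name or IPv4 address before the first colon.

```shell
#!/bin/sh
# Hypothetical helper: list the bricks of a volume that are located on one host.
# Usage: bricks_on_host HOST   (reads `gluster volume info` text on stdin)
# Assumption: brick lines look like "Brick1: 10.1.1.193:/zpool/gvzr00".
bricks_on_host() {
    host=$1
    awk -v host="$host" '
        /^Brick[0-9]+:/ {
            split($2, parts, ":")          # parts[1] is the host, parts[2] the path
            if (parts[1] == host) print $2
        }
    '
}

# Intended use on a live cluster:
#   sudo gluster volume info gvzr00 | bricks_on_host 10.1.1.193
```

Each brick printed this way is a candidate for the remove-brick step, which has to be repeated for every volume that has a brick on the node.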

3. Recover Nodes

Re-add the peer (the gluster subcommand for adding a peer is probe):

sudo gluster peer probe 10.1.1.193

Get the status of the peer:

sudo gluster peer status

Re-add brick:

sudo gluster volume add-brick gvzr00 replica 2 10.1.1.193:/zpool/gvzr00

Get the status of the volume:

sudo gluster volume info gvzr00

sudo gluster volume status gvzr00

The output is as follows:

sudo gluster volume status gvzr00
Status of volume: gvzr00
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.1.1.205:/zpool/gvzr00              49153     0          Y       30347
Brick 10.1.1.193:/zpool/gvzr00              49152     0          Y       15144
Brick 10.1.1.150:/zpool/gvzr00              49152     0          Y       11586
NFS Server on localhost                     2049      0          Y       16425
Self-heal Daemon on localhost               N/A       N/A        Y       16499
NFS Server on 10.1.1.150                    N/A       N/A        N       N/A  
Self-heal Daemon on 10.1.1.150              N/A       N/A        Y       11661
NFS Server on 10.1.1.205                    N/A       N/A        N       N/A  
Self-heal Daemon on 10.1.1.205              N/A       N/A        Y       5848 
NFS Server on 10.1.1.203                    2049      0          Y       27732
Self-heal Daemon on 10.1.1.203              N/A       N/A        Y       27770
NFS Server on 10.1.1.167                    2049      0          Y       24585
Self-heal Daemon on 10.1.1.167              N/A       N/A        Y       24619
NFS Server on 10.1.1.202                    2049      0          Y       28924
Self-heal Daemon on 10.1.1.202              N/A       N/A        Y       28941
NFS Server on 10.1.1.234                    2049      0          Y       26891
Self-heal Daemon on 10.1.1.234              N/A       N/A        Y       26917
NFS Server on 10.1.1.193                    2049      0          Y       15689
Self-heal Daemon on 10.1.1.193              N/A       N/A        Y       15724
 
Task Status of Volume gvzr00
------------------------------------------------------------------------------
There are no active volume tasks

The basic storage service is back to normal.
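Because a re-added brick starts out empty, it may be worth checking that self-heal has caught up before relying on the volume. The following sketch is an addition, not something the original procedure ran: pending_heal_entries is a hypothetical helper, and it assumes that gluster volume heal gvzr00 info prints one "Number of entries: N" line per brick.

```shell
#!/bin/sh
# Hypothetical helper: sum the pending self-heal entries over all bricks.
# Reads `gluster volume heal <volume> info` text on stdin.
# Assumption: each brick section contains a line "Number of entries: N".
pending_heal_entries() {
    awk -F': *' '
        /^Number of entries:/ { total += $2 }   # non-numeric values count as 0
        END { print total + 0 }
    '
}

# Intended use on a live cluster:
#   sudo gluster volume heal gvzr00 info | pending_heal_entries
```

A result of 0 suggests that no files are waiting to be healed; a non-zero value means the replicas are still being synchronized.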

4. Restore JupyterHub Service

For use in Kubernetes, some further steps are needed.

First, the PVs and PVCs need to be created in Kubernetes.

Then JupyterHub reported a runtime error and the Notebook Server could not start. The pod log showed an error message mentioning "NoneType". Fix it as follows:

# Keep the JSON patch on a single line: inside single quotes a backslash-newline
# is passed through literally and would make the JSON invalid.
kubectl patch deploy -n jupyter hub --type json \
--patch '[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["bash", "-c", "\nmkdir -p ~/hotfix\ncp -r /usr/local/lib/python3.6/dist-packages/kubespawner ~/hotfix\nls -R ~/hotfix\npatch ~/hotfix/kubespawner/spawner.py << EOT\n72c72\n<             key=lambda x: x.last_timestamp,\n---\n>             key=lambda x: x.last_timestamp and x.last_timestamp.timestamp() or 0.,\nEOT\n\nPYTHONPATH=$HOME/hotfix jupyterhub --config /srv/jupyterhub_config.py --upgrade-db\n"]}]'

After that, the JupyterHub service returned to normal.

Posted by renesis on Mon, 06 Jan 2020 01:02:34 -0800