Detailed explanation of abnormal PG states and a summary of fault handling
Reference resources:
- https://www.jianshu.com/p/36c2d5682d87
- https://blog.csdn.net/wylfengyujiancheng/article/details/89235241?utm_medium=distribute.pc_relevant.none-task-blog-2defaultbaidujs_baidulandingword~default-1.no_search_link&spm=1001.2101.3001.4242
- https://github.com/lidaohang/ceph_study/blob/master/%E5%B8%B8%E8%A7%81%20PG%20%E6%95%85%E9%9A%9C%E5%A4%84%E7%90%86.md
- The book Ceph's RADOS design principle and implementation
1. Detailed explanation of PG abnormal state
1.1 PG status introduction
Here, PG state refers to the external state of a PG, that is, the state that users can see directly.
You can view the current status of PGs with the ceph pg stat command. The healthy status is "active+clean":
```
[root@node-1 ~]# ceph pg stat
464 pgs: 464 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail
```
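Besides ceph pg stat, a few other read-only commands are useful when investigating PG problems; a minimal sketch in which the PG id 1.1d and the pool name pool-1 are placeholders for your own values:

```
# Summarize PGs stuck in an abnormal state (inactive, unclean, stale, undersized, degraded)
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

# Show which OSDs a given PG maps to (up set and acting set)
ceph pg map 1.1d

# Dump the full peering/recovery details of a single PG as JSON
ceph pg 1.1d query

# List the PGs of one pool together with their state
ceph pg ls-by-pool pool-1
```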
Some common external PG states are listed below (see section 6.3 of Ceph's RADOS design principle and implementation).
| state | meaning |
|---|---|
| activating | Peering is about to complete; the PG is waiting for all replicas to synchronize and persist the peering result (Info, Log, etc.) |
| active | Active. The PG can normally handle read and write requests from clients |
| backfilling | Backfilling in the background. Backfill is a special case of recovery: after peering, if incremental synchronization based on the current authoritative log is not possible for some PG instances in the Up Set (for example, the OSDs hosting them have been offline for too long, or a newly added OSD causes the PG instance to be migrated as a whole), full synchronization is performed by copying all objects from the current Primary |
| backfill_toofull | The OSD holding a replica does not have enough space, and the backfill process is suspended |
| backfill_wait | Waiting for the backfill resource reservation to complete |
| clean | The PG currently has no degraded objects (objects to be repaired); the acting set and up set have the same content and their size equals the pool's replica count |
| creating | The PG is being created |
| deep | The PG is performing, or is about to perform, a deep scrub (object consistency scan) |
| degraded | The PG contains degraded objects (after Peering, the PG found that some PG instance has objects that need to be synchronized or repaired), or the size of the acting set is smaller than the pool's replica count (but not smaller than the pool's minimum replica count) |
| down | During Peering, the PG detected that the currently surviving replicas are not sufficient to complete data recovery for an Interval that cannot be skipped |
| incomplete | During Peering, no authoritative log could be selected, or the selected acting set is not sufficient to complete data repair |
| inconsistent | During Scrub, one or more objects were found to be inconsistent between replicas |
| peered | Peering has completed, but the PG's current acting set is smaller than the minimum number of replicas specified by the pool |
| peering | Peering is in progress |
| recovering | The PG is repairing degraded objects (objects needing repair) in the background according to the Peering result |
| recovery_wait | Waiting for the Recovery resource reservation to complete |
| remapped | The PG's acting set has changed, and data is being migrated from the old acting set to the new one. During the migration, client requests are still handled by the primary OSD of the old acting set; once migration completes, the primary OSD of the new acting set takes over |
| repair | The PG is repairing inconsistent objects |
| scrubbing | The PG is performing a Scrub |
| stale | The Monitor detected that the OSD acting as the PG's Primary is down without a subsequent handover, or the Primary failed to report PG statistics to the Monitor (for example, because of temporary network congestion) |
| undersized | The number of replicas in the current acting set is smaller than the pool's replica count (but not smaller than the pool's minimum replica count) |
| inactive | The PG cannot handle read/write requests |
| unclean | The PG cannot recover from a previous failure |
1.2 detailed explanation of PG abnormal state
Reference link: http://luqitao.github.io/2016/07/14/ceph-pg-states-introduction/
Some abnormal PG states (which may require manual intervention) are described below.
- Degraded
When a client writes data to the primary OSD, the primary OSD is responsible for writing the replicas to the replica OSDs. After the primary OSD has written the object to its store, the PG stays in the degraded state until the replica OSDs have created their copies of the object and acknowledged this to the primary. The reason a placement group can be active+degraded is that an OSD can be active even though it does not yet hold all of the objects. If an OSD goes down, Ceph marks every placement group assigned to that OSD as degraded; after the OSD comes back up, those placement groups must peer again. However, a client can still write new objects to a degraded placement group as long as it is also active.
If an OSD goes down and stays in the degraded state, Ceph will mark the down OSD as out of the cluster and remap its data to other OSDs. The interval between being marked down and being marked out is controlled by mon osd down out interval, which defaults to 300 seconds.
A placement group can also be degraded because Ceph cannot find one or more objects that should be in it. In that case the unfound objects cannot be read or written, but the other objects in the degraded placement group remain accessible.
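The mark-out interval mentioned above can be checked, and automatic out-marking can be temporarily disabled during planned maintenance; a minimal sketch assuming the ceph config interface of Nautilus and later:

```
# Check the current interval between marking a down OSD out (the interval described above)
ceph config get mon mon_osd_down_out_interval

# During planned maintenance, prevent down OSDs from being marked out and data from being remapped
ceph osd set noout
# ... do the maintenance, then re-enable automatic out-marking
ceph osd unset noout
```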
- Remapped
When the acting set of a placement group changes, the data must be migrated from the old acting set to the new one. It takes some time before the new primary OSD can serve requests, so the old primary OSD keeps serving them until the migration completes. Once the data migration finishes, the primary OSD of the new acting set is used.
- Stale
By default, an OSD daemon reports its placement group, up_thru, boot and failure statistics every half second (0.5 s), which is more frequent than the heartbeat threshold. If the primary OSD of a placement group's acting set fails to report to the monitors, or if other OSDs have reported that the primary OSD is down, the monitors mark the placement group as stale.
When a cluster starts up, you will often see the stale state until peering completes. If placement groups are still stale after the cluster has been running for a while, it means the primary OSDs of those placement groups are down or are not reporting statistics to the monitors.
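To find which placement groups are stale and which primary OSDs stopped reporting, the following read-only checks are usually enough (a sketch; `ceph osd tree down` requires a reasonably recent release):

```
# List PGs stuck in the stale state together with their last acting OSDs
ceph pg dump_stuck stale

# Cross-check which OSDs the monitors currently consider down
ceph health detail
ceph osd tree down
```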
- Inconsistent
A PG usually has several replicas, and the data of all replicas should be identical. Sometimes, however, the data on a replica becomes inconsistent because of OSD failures, network congestion or other factors; in that case the inconsistent PG needs to be repaired.
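Before repairing, it helps to identify exactly which PG and which objects are affected. rados can list the inconsistencies found by scrub; a sketch in which pool-1 and PG 10.0 are placeholders for your own pool and PG:

```
# List the PGs of a pool that scrub has flagged as inconsistent
rados list-inconsistent-pg pool-1

# For one of those PGs, show the inconsistent objects and which replica disagrees
rados list-inconsistent-obj 10.0 --format=json-pretty
```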
2. Common fault handling methods
2.1 The number of OSDs is less than the configured number of replicas
A storage pool is usually configured with 3 replicas, i.e. each PG is stored on 3 OSDs. Under normal conditions the PG status is "active+clean".
If your cluster has fewer OSDs than the configured number of replicas, for example only two OSDs, all OSDs may be up and in and yet the PGs never reach "active+clean". This is usually because osd pool size/min_size is set to a value greater than 2.
- If two OSDs are available, osd pool size is greater than 2 and osd pool min_size is less than or equal to 2, the PG status shows degraded, but reading and writing the pool is not affected.
- If two OSDs are available and osd pool min_size is greater than 2, the PGs show the peered status; the pool can then no longer serve read/write requests.
```
# min_size=4, size=5, but only 3 replicas can actually be placed
[root@node-1 ~]# ceph osd dump | grep pool-1
pool 1 'pool-1' replicated size 5 min_size 4 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 6359 flags hashpspool stripe_width 0 application rbd
[root@node-1 ~]# ceph pg stat
464 pgs: 70 undersized+degraded+peered, 58 undersized+peered, 336 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail; 192/162165 objects degraded (0.118%)
[root@node-1 ~]# rados -p pool-1 put file_bench_cephfs.f file_bench_cephfs.f
# The operation blocks and the write request cannot complete
```
As can be seen, osd pool min_size is the minimum number of replicas that must be available, while osd pool size is the desired number of replicas. The former is a hard requirement: if it cannot be satisfied the pool cannot be read or written. The latter may go unsatisfied, in which case the cluster only reports a warning. The problem above can be solved by setting reasonable values for osd pool size and osd pool min_size.
```
[root@node-1 ~]# ceph osd pool set pool-1 size 3
set pool 1 size to 3
[root@node-1 ~]# ceph osd pool set pool-1 min_size 2
set pool 1 min_size to 2
[root@node-1 ~]# ceph pg stat
464 pgs: 464 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail
```
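Before changing them, it is worth confirming the current values of size and min_size; a small sketch using the pool-1 pool from the example above:

```
# Query the current replica settings of a pool
ceph osd pool get pool-1 size
ceph osd pool get pool-1 min_size

# Or show these (and other details) for every pool at once
ceph osd pool ls detail
```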
CRUSH map error
Another possible reason why PGs cannot reach the clean state is an error in the cluster's CRUSH map, which prevents PGs from being mapped to the correct OSDs.
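To rule out a CRUSH problem, you can decompile the CRUSH map and check where a suspect PG actually maps; a sketch assuming crushtool is installed, with 1.62 as a placeholder PG id:

```
# Export and decompile the CRUSH map for inspection
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# Dump the CRUSH rules directly as JSON
ceph osd crush rule dump

# Check the up/acting OSDs currently computed for a PG
ceph pg map 1.62
```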
2.2 PG fault caused by OSD down
The most common PG faults are caused by one or more OSD processes going down. Usually health is restored after the OSDs are restarted.
You can check whether any OSD is down with ceph -s or ceph osd stat.
```
[root@node-1 ~]# ceph osd stat
4 osds: 4 up (since 4h), 4 in (since 6d); epoch: e6364
```
Try stopping one or more OSDs (a 3-replica cluster with 4 OSDs in total) and observe the cluster status.
```
# With one OSD stopped and three remaining, an active+undersized+degraded warning appears; the cluster can still read and write
[root@node-1 ~]# ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 52306/162054 objects degraded (32.277%), 197 pgs degraded
OSD_DOWN 1 osds down
    osd.0 (root=default,host=node-1) is down
PG_DEGRADED Degraded data redundancy: 52306/162054 objects degraded (32.277%), 197 pgs degraded
    pg 1.1d is active+undersized+degraded, acting [2,1]
    pg 1.60 is active+undersized+degraded, acting [1,2]
    pg 1.62 is active+undersized+degraded, acting [2,1]
    ...

# With two OSDs stopped and two remaining, min_size=2 is still satisfied and the cluster can still read and write
[root@node-1 ~]# ceph health detail
HEALTH_WARN 2 osds down; 1 host (2 osds) down; Degraded data redundancy: 54018/162054 objects degraded (33.333%), 208 pgs degraded, 441 pgs undersized
OSD_DOWN 2 osds down
    osd.0 (root=default,host=node-1) is down
    osd.3 (root=default,host=node-1) is down
OSD_HOST_DOWN 1 host (2 osds) down
    host node-1 (root=default) (2 osds) is down
PG_DEGRADED Degraded data redundancy: 54018/162054 objects degraded (33.333%), 208 pgs degraded, 441 pgs undersized
    pg 1.29 is stuck undersized for 222.261023, current state active+undersized, last acting [2,1]
    pg 1.2a is stuck undersized for 222.251868, current state active+undersized, last acting [2,1]
    pg 1.2b is stuck undersized for 222.246564, current state active+undersized, last acting [2,1]
    pg 1.2c is stuck undersized for 221.679774, current state active+undersized+degraded, last acting [1,2]

# With three OSDs stopped and one remaining, min_size=2 is no longer satisfied; the cluster loses the ability to read and write and an undersized+degraded+peered warning appears
[root@node-2 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            3 osds down
            2 hosts (3 osds) down
            Reduced data availability: 192 pgs inactive
            Degraded data redundancy: 107832/161748 objects degraded (66.667%), 208 pgs degraded

  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 5h)
    mgr: node-1(active, since 20h)
    mds: cephfs:2 {0=mds2=up:active,1=mds1=up:active} 1 up:standby
    osd: 4 osds: 1 up (since 47s), 4 in (since 6d)
    rgw: 1 daemon active (node-2)

  task status:

  data:
    pools:   9 pools, 464 pgs
    objects: 53.92k objects, 803 MiB
    usage:   16 GiB used, 24 GiB / 40 GiB avail
    pgs:     100.000% pgs not active
             107832/161748 objects degraded (66.667%)
             256 undersized+peered
             208 undersized+degraded+peered

# With all four OSDs stopped, ceph -s still reports one OSD as up even though its process has been stopped
# Checking the process with systemctl, however, shows that it is dead
# The PG status is stale+undersized+peered, and the cluster loses the ability to read and write
[root@node-1 ~]# systemctl status ceph-osd@0
● ceph-osd@0.service - Ceph object storage daemon osd.0
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: inactive (dead) since Thu 2021-10-14 15:36:14 CST; 1min 56s ago
  Process: 5528 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
  Process: 5524 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 5528 (code=exited, status=0/SUCCESS)
...
```
```
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       0.03918 root default
-3       0.01959     host node-1
 0   hdd 0.00980         osd.0       up  1.00000 1.00000
 3   hdd 0.00980         osd.3     down  0.09999 1.00000
-5       0.00980     host node-2
 1   hdd 0.00980         osd.1     down  1.00000 1.00000
-7       0.00980     host node-3
 2   hdd 0.00980         osd.2     down  1.00000 1.00000
[root@node-1 ~]# ceph pg stat
464 pgs: 440 down, 14 stale+undersized+peered, 10 stale+undersized+degraded+peered; 801 MiB data, 12 GiB used, 24 GiB / 40 GiB avail; 3426/161460 objects degraded (2.122%)
```
Restart all of the stopped OSDs and the cluster slowly recovers.
```
# All OSDs have been restarted; PGs are peering, but the cluster can already be read and written
[root@node-1 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive, 2 pgs peering
            Degraded data redundancy: 16715/162054 objects degraded (10.314%), 65 pgs degraded

  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 5h)
    mgr: node-1(active, since 20h)
    mds: cephfs:2 {0=mds2=up:active,1=mds1=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 5s), 4 in (since 5s)
    rgw: 1 daemon active (node-2)

  task status:

  data:
    pools:   9 pools, 464 pgs
    objects: 54.02k objects, 803 MiB
    usage:   11 GiB used, 19 GiB / 30 GiB avail
    pgs:     65.302% pgs not active
             16715/162054 objects degraded (10.314%)
             294 peering
             75  active+undersized
             62  active+undersized+degraded
             21  active+clean
             9   remapped+peering
             2   active+recovery_wait+degraded
             1   active+recovering+degraded

# After a while, ceph health shows the cluster rebalancing. It can still read and write, and the PG status returns to "active+clean"
[root@node-1 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_OK
  ...
  progress:
    Rebalancing after osd.0 marked in
      [==............................]

[root@node-1 ~]# rados -p pool-1 put file_bench_cephfs.f file_bench_cephfs.f
[root@node-1 ~]# rados -p pool-1 ls | grep file_bench
file_bench_cephfs.f
```
The PG states in which the cluster cannot serve reads or writes are:
- stale: all OSDs hosting the PG are down
- peered: the number of surviving replicas is below min_size
- down: the data on a surviving OSD is too old, and the other online OSDs are not sufficient to complete data recovery
The stale and peered states were demonstrated above by stopping the OSD services.
A classic scenario that leads to down, with replicas on OSDs A (primary), B and C:
a. first kill B
b. write new data to A and C
c. kill A and C
d. bring B back up
At this point the only surviving replica B holds stale data (it is missing the new writes), and there are no other OSDs in the cluster that could help it complete the data migration, so the PG is shown as down. Reference link: https://zhuanlan.zhihu.com/p/138778000#:~:text=3.8.3%20PG%E4%B8%BADown%E7%9A%84OSD%E4%B8%A2%E5%A4%B1%E6%88%96%E6%97%A0%E6%B3%95%E6%8B%89%E8%B5%B7
The fix for down is still to restart the failed OSDs.
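If a PG stays down after the OSDs are back, ceph pg query usually tells you which OSD it is still waiting for; a sketch in which 1.2c is a placeholder PG id (the exact JSON field names vary slightly between releases):

```
# Inspect the peering state of the down PG; look at the recovery_state section,
# e.g. fields such as "blocked", "down_osds_we_would_probe" or "peering_blocked_by"
ceph pg 1.2c query

# Narrow the output to the interesting part with jq, if it is installed
ceph pg 1.2c query | jq '.recovery_state'
```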
2.3 Data in a PG is damaged
Reference link: https://ceph.com/geen-categorie/ceph-manually-repair-object/
Generally, a damaged PG can be repaired manually with ceph pg repair {pgid}.
When a PG's status is inconsistent, there are inconsistent objects in the PG. This may be caused by a damaged OSD disk or by a silent error in the data on the disk.
Next, an example of PG data corruption is constructed manually and then repaired.
```
# 1. Stop the OSD service
$ systemctl stop ceph-osd@{id}

# 2. Use ceph-objectstore-tool to mount /var/lib/ceph/osd/ceph-0 at /mnt/ceph-osd@0
[root@node-1 ceph-objectstore-tool-test]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op fuse --mountpoint /mnt/ceph-osd@0/
mounting fuse at /mnt/ceph-osd@0/
...

# 3. Delete a directory (i.e. one object of PG 10.0) under /mnt/ceph-osd@0/10.0_head/all to corrupt an object of PG 10.0
[root@node-1 all]# rm -rf \#10\:01ec679f\:\:\:10000011eba.00000000\:head#/
rm: cannot remove "#10:01ec679f:::10000011eba.00000000:head#/bitwise_hash": Operation not permitted
rm: cannot remove "#10:01ec679f:::10000011eba.00000000:head#/omap": Operation not permitted
rm: cannot remove "#10:01ec679f:::10000011eba.00000000:head#/attr": Operation not permitted

# 4. Unmount /mnt/ceph-osd@0, restart the OSD service and wait for the cluster to return to normal

# 5. Manually scrub PG 10.0 with ceph pg scrub 10.0 and wait for the background scrub to complete
[root@node-1 ~]# ceph pg scrub 10.0
instructing pg 10.0 on osd.2 to scrub

# 6. The cluster reports an error: pg 10.0 is active+clean+inconsistent
[root@node-1 ~]# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 10.0 is active+clean+inconsistent, acting [2,1,0]

# 7. Run the repair. PG status: active+clean+scrubbing+deep+inconsistent+repair
[root@node-1 ~]# ceph pg repair 10.0
instructing pg 10.0 on osd.2 to repair
[root@node-1 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_ERR
            2 scrub errors
            Possible data damage: 1 pg inconsistent
  ...
  data:
    pools:   9 pools, 464 pgs
    objects: 53.99k objects, 802 MiB
    usage:   16 GiB used, 24 GiB / 40 GiB avail
    pgs:     463 active+clean
             1   active+clean+scrubbing+deep+inconsistent+repair

# 8. Wait for the cluster to recover
[root@node-1 ~]# ceph health detail
HEALTH_OK
```
If the ceph pg repair {pgid} command cannot repair the PG, you can use ceph-objectstore-tool to export and re-import the whole PG.
Reference link: https://www.jianshu.com/p/36c2d5682d87#:~:text=%E8%B5%B7%E5%A4%AF%E4%BD%8F%E3%80%82-,3.9%20Incomplete,-Peering%E8%BF%87%E7%A8%8B%E4%B8%AD
Constructing the fault
```
# Construct the fault environment: use ceph-objectstore-tool to delete the same object from two of the three replicas.
# Note: before using ceph-objectstore-tool you must stop the OSD service with systemctl stop ceph-osd@{id}
# PG 10.0 is chosen and the object 1000000d4dc.00000000 is deleted on both node-2 and node-3. The cluster is 3-replica and PG 10.0 is distributed across node-1, node-2 and node-3
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 1000000d4dc.00000000 remove
remove #10:03f57502:::1000000d4dc.00000000:head#
[root@node-3 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 1000000d4dc.00000000 remove
remove #10:03f57502:::1000000d4dc.00000000:head#
[root@node-1 ~]# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 10.0 is active+clean+inconsistent, acting [2,1,0]
```
Repairing with ceph-objectstore-tool
```
# Query and compare the data
# 1. Export the object list of the PG from each replica and gather the lists in the ~/export folder on node-1 for comparison
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list > ~/export/pg-10.0-osd0.txt
[root@node-1 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list > ~/export/pg-10.0-osd0.txt
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op list > ~/pg-10.0-osd1.txt
[root@node-3 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 --op list > ~/pg-10.0-osd2.txt
[root@node-1 export]# scp root@node-2:/root/pg-10.0-osd1.txt ./
pg-10.0-osd1.txt                              100%   97KB  19.5MB/s   00:00
pg-10.0-osd0.txt pg-10.0-osd1.txt
[root@node-1 export]# scp root@node-3:/root/pg-10.0-osd2.txt ./
pg-10.0-osd2.txt                              100%   97KB  35.0MB/s   00:00
[root@node-1 export]# ls
pg-10.0-osd0.txt  pg-10.0-osd1.txt  pg-10.0-osd2.txt

# 2. Count the objects in the PG on each replica: PG 10.0 on node-1 has the most objects, 833
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list | wc -l
[root@node-1 export]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list | wc -l
833
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op list | wc -l
832
[root@node-3 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 --op list | wc -l
832

# 3. Compare whether the object lists of all replicas are consistent. In this example node-2 and node-3 are consistent with each other, node-1 differs from them but has the most objects
# In this example the PG copy on node-1 is used and imported into the OSDs on node-2 and node-3
# - After the diff comparison, check whether the object lists of the replicas (primary and secondaries) are consistent, to avoid data inconsistency; use the copy with the largest number of objects after the comparison
# - If, after the diff comparison, the counts are inconsistent and the largest copy does not contain all objects, consider exporting first without overwriting, and finally import the complete set of objects. Note: import requires removing the PG first, so it amounts to an overwriting import
# - If the data is consistent after the diff comparison, use the copy with the largest number of objects and import it into the PGs with fewer objects, then mark-complete all replicas. Be sure to export a PG backup on every replica's OSD node first so the PG can be restored if something goes wrong
[root@node-1 export]# diff -u ./pg-10.0-osd0.txt ./pg-10.0-osd1.txt
[root@node-1 export]# diff -u ./pg-10.0-osd0.txt ./pg-10.0-osd2.txt
[root@node-1 export]# diff -u ./pg-10.0-osd2.txt ./pg-10.0-osd1.txt

# 4. Export the PG from node-1. The export file name can be chosen freely; copy the file to node-2 and node-3
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op export --file ~/export/pg-10.0.obj
[root@node-1 export]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op export --file ~/export/pg-10.0.obj
Read #10:03f57502:::1000000d4dc.00000000:head#
Read #10:03f6b1a4:::100000091a0.00000000:head#
Read #10:03f6dfc2:::10000010b31.00000000:head#
Read #10:03f913b2:::10000010740.00000000:head#
Read #10:03f99080:::10000010f0f.00000000:head#
Read #10:03fc19a4:::10000011c5e.00000000:head#
Read #10:03fe3b90:::10000010166.00000000:head#
Read #10:03fe60e1:::10000011c44.00000000:head#
........
Export successful
[root@node-1 export]# ls
pg-10.0.obj  pg-10.0-osd0.txt  pg-10.0-osd1.txt  pg-10.0-osd2.txt
[root@node-1 export]# scp pg-10.0.obj root@node-2:/root/
pg-10.0.obj                                   100% 4025KB  14.7MB/s   00:00
[root@node-1 export]# scp pg-10.0.obj root@node-3:/root/
pg-10.0.obj

# Note: all subsequent operations are identical on node-2 and node-3; for brevity only node-2 is shown
# 5. Import the backup PG on node-2 and node-3
# Before importing the backup, it is recommended to export the PG that is about to be replaced so that it can be restored if problems arise later
# Importing the specified PG metadata into the current PG requires removing the current PG first (export a backup of the PG data before removing); otherwise the import fails with a message that the PG already exists
# 5.1 Back up
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op export --file ~/pg-10.0-node-2.obj
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op export --file ~/pg-10.0-node-2.obj
Read #10:03f6b1a4:::100000091a0.00000000:head#
Read #10:03f6dfc2:::10000010b31.00000000:head#
Read #10:03f913b2:::10000010740.00000000:head#
Read #10:03f99080:::10000010f0f.00000000:head#
Read #10:03fc19a4:::10000011c5e.00000000:head#
Read #10:03fe3b90:::10000010166.00000000:head#
Read #10:03fe60e1:::10000011c44.00000000:head#
...
Export successful
# 5.2 Remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op remove --force
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op remove --force
 marking collection for removal
setting '_remove' omap key
finish_remove_pgs 10.0_head removing 10.0
Remove successful
# 5.3 Import
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op import --file ~/pg-10.0.obj
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op import --file ~/pg-10.0.obj
Write #10:03f6dfc2:::10000010b31.00000000:head#
snapset 1=[]:{}
Write #10:03f913b2:::10000010740.00000000:head#
snapset 1=[]:{}
Write #10:03f99080:::10000010f0f.00000000:head#
snapset 1=[]:{}
....
write_pg epoch 6727 info 10.0( v 5925'23733 (5924'20700,5925'23733] local-lis/les=6726/6727 n=814 ec=5833/5833 lis/c 6726/6726 les/c/f 6727/6727/0 6726/6726/6724)
Import successful

# 6. Check that the PG is no longer in the inconsistent state and the cluster is slowly recovering
[root@node-3 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            Degraded data redundancy: 37305/161973 objects degraded (23.032%), 153 pgs degraded, 327 pgs undersized

  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 23m)
    mgr: node-1(active, since 23m)
    mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 6s), 4 in (since 16h)
    rgw: 1 daemon active (node-2)

  task status:

  data:
    pools:   9 pools, 464 pgs
    objects: 53.99k objects, 802 MiB
    usage:   16 GiB used, 24 GiB / 40 GiB avail
    pgs:     37305/161973 objects degraded (23.032%)
             182 active+undersized
             145 active+undersized+degraded
             129 active+clean
             8   active+recovering+degraded

  io:
    recovery: 0 B/s, 2 objects/s
```
2.4 A down OSD cannot be restarted
The sections above resolved cluster faults by restarting OSDs, but sometimes a down OSD cannot be restarted.
```
# An OSD being down causes PG failures. Checking the OSDs shows that osd.1 on node-2 is down
[root@node-2 ~]# ceph health detail
HEALTH_WARN Degraded data redundancy: 53852/161556 objects degraded (33.333%), 209 pgs degraded, 464 pgs undersized
PG_DEGRADED Degraded data redundancy: 53852/161556 objects degraded (33.333%), 209 pgs degraded, 464 pgs undersized
...
[root@node-2 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       0.03918 root default
-3       0.01959     host node-1
 0   hdd 0.00980         osd.0       up  1.00000 1.00000
 3   hdd 0.00980         osd.3       up  0.09999 1.00000
-5       0.00980     host node-2
 1   hdd 0.00980         osd.1     down        0 1.00000
-7       0.00980     host node-3
 2   hdd 0.00980         osd.2       up  1.00000 1.00000

# Trying to restart the OSD fails
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again.
[root@node-2 ~]# systemctl stop ceph-osd@1
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again.
[root@node-2 ~]# systemctl stop ceph-osd@1
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again
```
There are three ways to handle this problem (a diagnostic sketch follows the list):
- If the down OSD does not affect cluster writes, i.e. the PG status is only degraded, you can wait for a while and then try restarting the OSD again.
- Alternatively, if no other important services run on the server, you can try rebooting the server.
- If neither of the above works, you can manually delete the OSD and then rebuild it.
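Before deleting and rebuilding the OSD, it is worth confirming why it will not start: systemd's start-rate limiting (as shown in the log above) can be cleared, and the OSD log or kernel log often reveals a failing disk. A sketch using osd.1 and the /dev/sdb disk from this example:

```
# Show why the last start attempt failed and clear systemd's "start attempted too often" state
systemctl status ceph-osd@1
systemctl reset-failed ceph-osd@1
systemctl start ceph-osd@1

# Check the OSD's own log and the kernel log for I/O errors that point to a dying disk
journalctl -u ceph-osd@1 -n 100 --no-pager
tail -n 100 /var/log/ceph/ceph-osd.1.log
dmesg | grep -i -E "sdb|error"
```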
The following is an example of manually deleting an OSD and then recreating it:
```
# In this example osd.1 on node-2 is deleted and recreated
# 1. Remove the OSD from the cluster
[root@node-1 ~]# ceph osd rm osd.1
removed osd.1
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       0.03918 root default
-3       0.01959     host node-1
 0   hdd 0.00980         osd.0       up  1.00000 1.00000
 3   hdd 0.00980         osd.3       up  0.09999 1.00000
-5       0.00980     host node-2
 1   hdd 0.00980         osd.1      DNE        0
-7       0.00980     host node-3
 2   hdd 0.00980         osd.2       up  1.00000 1.00000

# 2. Remove the OSD from the CRUSH map
[root@node-1 ~]# ceph osd crush rm osd.1
removed item id 1 name 'osd.1' from crush map
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       0.02939 root default
-3       0.01959     host node-1
 0   hdd 0.00980         osd.0       up  1.00000 1.00000
 3   hdd 0.00980         osd.3       up  0.09999 1.00000
-5             0     host node-2
-7       0.00980     host node-3
 2   hdd 0.00980         osd.2       up  1.00000 1.00000

# 3. Delete the OSD's authentication key
[root@node-1 ~]# ceph auth del osd.1
updated
[root@node-1 ~]# ceph auth ls | grep osd.1
installed auth entries:

# 4. Check the cluster status. Since the cluster is 3-replica and only one OSD is missing, all PGs are still active and cluster reads and writes are unaffected
[root@node-1 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            Degraded data redundancy: 53157/161724 objects degraded (32.869%), 207 pgs degraded, 461 pgs undersized

  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 21m)
    mgr: node-1(active, since 57m)
    mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
    osd: 3 osds: 3 up (since 2m), 3 in (since 24m); 4 remapped pgs
    rgw: 1 daemon active (node-2)

  task status:

  data:
    pools:   9 pools, 464 pgs
    objects: 53.91k objects, 802 MiB
    usage:   11 GiB used, 19 GiB / 30 GiB avail
    pgs:     53157/161724 objects degraded (32.869%)
             816/161724 objects misplaced (0.505%)
             254 active+undersized
             206 active+undersized+degraded
             3   active+clean+remapped
             1   active+undersized+degraded+remapped+backfilling

  io:
    recovery: 35 KiB/s, 7 objects/s

# 5. Unmount /var/lib/ceph/osd/ceph-1/ on node-2. This directory contains osd.1's metadata and a block file linked to the underlying disk
[root@node-1 ~]# ssh node-2
Last login: Mon Oct 18 09:48:25 2021 from node-1
[root@node-2 ~]# umount /var/lib/ceph/osd/ceph-1/
[root@node-2 ~]# rm -rf /var/lib/ceph/osd/ceph-1/

# 6. The disk mapped to osd.1 on node-2 is /dev/sdb, which is also the disk to be rebuilt later
[root@node-2 ~]# ceph-volume lvm list

====== osd.1 =======

  [block]       /dev/ceph-847ed937-dfbb-485e-af90-9cf27bf08c99/osd-block-66119fd9-226d-4665-b2cc-2b6564b7d715

      block device              /dev/ceph-847ed937-dfbb-485e-af90-9cf27bf08c99/osd-block-66119fd9-226d-4665-b2cc-2b6564b7d715
      block uuid                9owOrT-EMVD-c2kY-53Xj-2ECv-0Kji-euIRkX
      cephx lockbox secret
      cluster fsid              60e065f1-d992-4d1a-8f4e-f74419674f7e
      cluster name              ceph
      crush device class        None
      encrypted                 0
      osd fsid                  66119fd9-226d-4665-b2cc-2b6564b7d715
      osd id                    1
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdb

# 7. Format the disk: run dmsetup remove {device name} first, then format
[root@node-2 ~]# dmsetup ls
ceph--847ed937--dfbb--485e--af90--9cf27bf08c99-osd--block--66119fd9--226d--4665--b2cc--2b6564b7d715	(253:2)
centos-swap	(253:1)
centos-root	(253:0)
[root@node-2 ~]# dmsetup remove ceph--847ed937--dfbb--485e--af90--9cf27bf08c99-osd--block--66119fd9--226d--4665--b2cc--2b6564b7d715
[root@node-2 ~]# mkfs.xfs -f /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=4, agsize=655360 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=2621440, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# 8. Recreate the OSD
[root@node-2 ~]# ceph-volume lvm create --data /dev/sdb
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 1e8fb7ca-a870-414b-930d-1e22f32eb84b
Running command: /usr/sbin/vgcreate --force --yes ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254 /dev/sdb
 stdout: Wiping xfs signature on /dev/sdb.
 stdout: Physical volume "/dev/sdb" successfully created.
 stdout: Volume group "ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254" successfully created
Running command: /usr/sbin/lvcreate --yes -l 2559 -n osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254
 stdout: Logical volume "osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b" created.
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/chown -h ceph:ceph /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-2
Running command: /usr/bin/ln -s /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1/activate.monmap
 stderr: 2021-10-18 13:53:29.353 7f369a1e2700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-10-18 13:53:29.353 7f369a1e2700 -1 AuthRegistry(0x7f36940662f8) no keyring found at /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
 stderr: got monmap epoch 3
Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-1/keyring --create-keyring --name osd.1 --add-key AQDYC21hGa9KKRAAZoHogGn9ouEPpb9RY3/FXw==
 stdout: creating /var/lib/ceph/osd/ceph-1/keyring
added entity osd.1 auth(key=AQDYC21hGa9KKRAAZoHogGn9ouEPpb9RY3/FXw==)
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 1e8fb7ca-a870-414b-930d-1e22f32eb84b --setuser ceph --setgroup ceph
 stderr: 2021-10-18 13:53:29.899 7fa4e1df1a80 -1 bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
--> ceph-volume lvm prepare successful for: /dev/sdb
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b --path /var/lib/ceph/osd/ceph-1 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-2
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/systemctl enable ceph-volume@lvm-1-1e8fb7ca-a870-414b-930d-1e22f32eb84b
 stderr: Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-1-1e8fb7ca-a870-414b-930d-1e22f32eb84b.service to /usr/lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@1
Running command: /usr/bin/systemctl start ceph-osd@1
--> ceph-volume lvm activate successful for osd ID: 1
--> ceph-volume lvm create successful for: /dev/sdb

# 9. Check the cluster and wait for the data migration to finish
[root@node-2 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            Degraded data redundancy: 53237/161556 objects degraded (32.953%), 213 pgs degraded

  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 4h)
    mgr: node-1(active, since 5h)
    mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 38s), 4 in (since 4h); 111 remapped pgs
    rgw: 1 daemon active (node-2)

  task status:

  data:
    pools:   9 pools, 464 pgs
    objects: 53.85k objects, 802 MiB
    usage:   12 GiB used, 28 GiB / 40 GiB avail
    pgs:     53237/161556 objects degraded (32.953%)
             814/161556 objects misplaced (0.504%)
             241 active+clean
             103 active+recovery_wait+degraded
             63  active+undersized+degraded+remapped+backfill_wait
             46  active+recovery_wait+undersized+degraded+remapped
             9   active+recovery_wait
             1   active+recovering+undersized+degraded+remapped
             1   active+remapped+backfill_wait

  io:
    recovery: 0 B/s, 43 keys/s, 2 objects/s

[root@node-2 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.00980  1.00000 10 GiB 4.8 GiB 3.8 GiB  20 MiB 1004 MiB 5.2 GiB 47.59 1.57 440     up
 3   hdd 0.00980  0.09999 10 GiB 1.3 GiB 265 MiB 1.5 MiB 1022 MiB 8.7 GiB 12.59 0.42  25     up
 1   hdd 0.00980  1.00000 10 GiB 1.2 GiB 212 MiB     0 B    1 GiB 8.8 GiB 12.08 0.40 361     up
 2   hdd 0.00980  1.00000 10 GiB 4.9 GiB 3.9 GiB  22 MiB 1002 MiB 5.1 GiB 48.85 1.61 464     up
                    TOTAL 40 GiB  12 GiB 8.1 GiB  44 MiB  4.0 GiB  28 GiB 30.28
MIN/MAX VAR: 0.40/1.61  STDDEV: 18.02
```
When rebuilding an OSD, note that if the cluster uses a customized CRUSH map, you also need to check the CRUSH map.
While the OSD is recovering, the I/O service provided by the cluster may be affected. Some configuration options that can be adjusted are given below.
Reference link:
https://www.cnblogs.com/gzxbkk/p/7704464.html
http://strugglesquirrel.com/2019/02/02/ceph%E8%BF%90%E7%BB%B4%E5%A4%A7%E5%AE%9D%E5%89%91%E4%B9%8B%E9%9B%86%E7%BE%A4osd%E4%B8%BAfull/
To prevent OSDs from being marked down due to timeouts after PG migration starts, write the following options into the global section of the configuration file:
```
osd_op_thread_suicide_timeout = 900
osd_op_thread_timeout = 900
osd_recovery_thread_suicide_timeout = 900
osd_heartbeat_grace = 900
```
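If you manage these settings through ceph.conf rather than the config database, the options above go into the [global] section; a minimal sketch of such a snippet (the values are simply the ones suggested above, not tuned recommendations):

```
# /etc/ceph/ceph.conf (excerpt) — the OSDs must be restarted for these to take effect
[global]
osd_op_thread_suicide_timeout = 900
osd_op_thread_timeout = 900
osd_recovery_thread_suicide_timeout = 900
osd_heartbeat_grace = 900
```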
The recovery speed options already have default values. If you want to speed up the migration, you can try adjusting the following parameters:
```
osd_recovery_max_single_start  # The larger the value, the faster OSD recovery and the greater the impact on client services. Default: 1
osd_recovery_max_active        # The larger the value, the faster OSD recovery and the greater the impact on client services. Default: 3
osd_recovery_op_priority       # The larger the value, the faster OSD recovery and the greater the impact on client services. Default: 3
osd_max_backfills              # The larger the value, the faster OSD recovery and the greater the impact on client services. Default: 1
osd_recovery_sleep             # The smaller the value, the faster OSD recovery and the greater the impact on client services. Default: 0 seconds

[root@node-2 ~]# ceph config set osd osd_recovery_max_single_start 32
[root@node-2 ~]# ceph config set osd osd_max_backfills 32
[root@node-2 ~]# ceph config set osd osd_recovery_max_active 32

# After these are set, OSD recovery speeds up considerably. Note that the three settings should be raised together;
# increasing only one of them will not speed up recovery because the others remain the bottleneck
# ceph -s shows the recovery speed; before setting the parameters above it was only about 6 objects/s
[root@node-2 ~]# ceph -s
  io:
    recovery: 133 KiB/s, 35 objects/s
```
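Once recovery has caught up, it is usually wise to drop these overrides again so that client I/O is no longer penalized; a sketch using the same ceph config interface:

```
# Remove the temporary recovery-tuning overrides so the defaults apply again
ceph config rm osd osd_recovery_max_single_start
ceph config rm osd osd_recovery_max_active
ceph config rm osd osd_max_backfills

# Confirm nothing is left overridden for the OSDs
ceph config dump | grep -E "recovery|backfill"
```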
The related configuration control commands are attached below:
```
# List all configuration options
ceph config ls

# View the default information for an option
[root@node-2 ~]# ceph config help osd_recovery_sleep
osd_recovery_sleep - Time in seconds to sleep before next recovery or backfill op
  (float, advanced)
  Default: 0.000000
  Can update at runtime: true

# View customized configuration
[root@node-2 ~]# ceph config dump
WHO    MASK LEVEL    OPTION                                          VALUE RO
mon         advanced mon_warn_on_insecure_global_id_reclaim         false
mon         advanced mon_warn_on_insecure_global_id_reclaim_allowed false
mgr         advanced mgr/balancer/active                            false
mgr         advanced mgr/balancer/mode                               upmap
mgr         advanced mgr/balancer/sleep_interval                     60
mds         advanced mds_session_blacklist_on_evict                  true
mds         advanced mds_session_blacklist_on_timeout                true
client      advanced client_reconnect_stale                          true
client      advanced debug_client                                    20/20

# Modify a configuration option
ceph config set {mon/osd/client/mgr/..} [config_name] [value]
ceph config set client client_reconnect_stale true

# Query the customized configuration of a specific entity (written as osd.osd, mon.mon, mgr.mgr, and so on)
$ ceph config get client.client
WHO    MASK LEVEL    OPTION                 VALUE RO
client      advanced client_reconnect_stale true
client      advanced debug_client           20/20

# Delete a customized configuration option
$ ceph config rm [who] [name]
```
2.5 PG missing
Reference link: https://zhuanlan.zhihu.com/p/74323736
Generally speaking, losing a PG is very unlikely in a cluster with three replicas. If it does happen, it means the lost data cannot be recovered.
```
# 1. Find the missing PGs
root@storage01-ib:~# ceph pg dump_stuck unclean | grep unknown
20.37     unknown []          -1 []          -1
20.29     unknown []          -1 []          -1
20.16     unknown []          -1 []          -1

# 2. Recreate the PG
root@storage01-ib:~# ceph osd force-create-pg 20.37 --yes-i-really-mean-it
pg 20.37 now creating, ok
```
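The pool that a lost PG belongs to can be read from the PG id: the part before the dot is the pool id. A sketch for mapping pool id 20 from the example back to a pool name and checking the recreated (now empty) PG; the grep patterns are only illustrative:

```
# PG id 20.37 means pool id 20, PG 0x37 within that pool
ceph osd pool ls detail | grep "^pool 20 "

# Confirm that the recreated PG has become active
ceph pg 20.37 query | grep -m1 '"state"'
```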
Note: do not use single-replica clusters.
2.6 Too few or too many PGs
When the warning "1 pools have many more objects per pg than average" appears, some pool in the cluster has too few PGs, so each of its PGs carries more than 10 times the cluster-wide average number of objects per PG. The simplest fix is to increase the number of PGs in that pool.
```
# Warning: 1 pools have many more objects per pg than average
[root@lab8106 ~]# ceph -s
    cluster fa7ec1a1-662a-4ba3-b478-7cb570482b62
     health HEALTH_WARN
            1 pools have many more objects per pg than average

# Increase the number of PGs in this pool. Since the Nautilus release you only need to adjust pg_num; pgp_num is adjusted automatically
ceph osd pool set cephfs_metadata pg_num 64
```
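On Nautilus and later, the pg_autoscaler can manage pg_num instead of picking it by hand; a sketch (the mgr module may need to be enabled first, and cephfs_metadata is the pool from the example above):

```
# Enable the autoscaler module and check its recommendations for every pool
ceph mgr module enable pg_autoscaler
ceph osd pool autoscale-status

# Let the autoscaler adjust pg_num for the pool automatically
ceph osd pool set cephfs_metadata pg_autoscale_mode on
```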