Common PG faults and how to handle them

Keywords: Ceph

A detailed look at abnormal PG states and a summary of common PG faults.

References:
https://www.jianshu.com/p/36c2d5682d87

https://blog.csdn.net/wylfengyujiancheng/article/details/89235241?utm_medium=distribute.pc_relevant.none-task-blog-2defaultbaidujs_baidulandingword~default-1.no_search_link&spm=1001.2101.3001.4242

https://github.com/lidaohang/ceph_study/blob/master/%E5%B8%B8%E8%A7%81%20PG%20%E6%95%85%E9%9A%9C%E5%A4%84%E7%90%86.md

Ceph's Rados design principle and Implementation

1. Detailed explanation of abnormal PG states

1.1 Introduction to PG states

Here, PG state refers to the externally visible state of a PG, that is, the state a user can see directly.

You can view the current PG status with the ceph pg stat command. The healthy state is "active+clean":

[root@node-1 ~]#  ceph pg stat
464 pgs: 464 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail

Some common external PG states are listed below (see section 6.3 of Ceph's RADOS Design Principle and Implementation).

  • activating: peering is about to complete; the PG is waiting for all replicas to synchronize and persist the peering results (Info, log, etc.)
  • active: the PG can serve client read and write requests normally
  • backfilling: backfill is running in the background. Backfill is a special case of recovery: after peering completes, if some PG instances in the Up Set cannot be synchronized incrementally from the current authoritative log (for example, because the OSDs hosting them have been offline too long, or a newly added OSD causes the PG instance to be migrated as a whole), they are synchronized in full by copying every object currently held by the Primary
  • backfill-toofull: the OSD holding a replica does not have enough space, so the backfill process is suspended
  • backfill-wait: waiting for the backfill resource reservation to complete
  • clean: the PG currently has no degraded objects (objects that need repair); the acting set and up set have the same content, and their size equals the pool's replica count
  • creating: the PG is being created
  • deep: the PG is performing, or is about to perform, a deep scrub (object consistency scan)
  • degraded: the PG contains degraded objects (after peering, the PG found an inconsistency in one of its instances), or the acting set is smaller than the pool's replica count (but not smaller than the pool's minimum replica count)
  • down: during peering, the PG found that the surviving replicas are not sufficient to recover the data of an interval that cannot be skipped
  • incomplete: during peering, no authoritative log could be chosen, or the chosen acting set is not sufficient to complete data repair
  • inconsistent: during scrub, one or more objects were found to be inconsistent across replicas
  • peered: peering has completed, but the current acting set is smaller than the minimum number of replicas specified by the pool
  • peering: peering is in progress
  • recovering: the PG is repairing degraded (inconsistent) objects in the background according to the peering result
  • recovering-wait: waiting for the recovery resource reservation to complete
  • remapped: the PG's acting set has changed and data is being migrated from the old acting set to the new one. During the migration, client requests are still served by the primary OSD of the old acting set; once migration completes, the primary OSD of the new acting set takes over
  • repair: the PG is repairing inconsistent objects
  • scrubbing: the PG is running a scrub
  • stale: the Monitor has detected that the PG's primary OSD is down and there has been no subsequent handover, or the primary has failed to report PG statistics to the Monitor (for example, because of temporary network congestion)
  • undersized: the current acting set has fewer replicas than the pool's replica count (but not fewer than the pool's minimum replica count)
  • unactive: the PG cannot serve read/write requests
  • unclean: the PG cannot recover from a previous failure
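If a cluster reports one of the states above, the commands below are a quick way to see which PGs are affected and why. A minimal sketch; pg 1.1d is only an example ID, substitute one reported by your own cluster:

# summary of all PG states in the cluster
ceph pg stat

# list PGs stuck in a problematic state (inactive, unclean, stale, undersized or degraded)
ceph pg dump_stuck undersized

# detailed peering and recovery information for a single PG
ceph pg 1.1d query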

1.2 Detailed explanation of abnormal PG states

Reference link: http://luqitao.github.io/2016/07/14/ceph-pg-states-introduction/

Some PG abnormal states (requiring manual repair) are described below.

  • Degraded

    When a client writes data to the primary OSD, the primary is responsible for writing the replica copies to the replica OSDs. After the primary has written the object, the PG stays in the degraded state until the replica OSDs have created their copies of the object and acknowledged them to the primary. A placement group can be in the active+degraded state because an OSD can be active even though it does not yet hold all objects. If an OSD goes down, Ceph marks every PG assigned to that OSD as degraded; after the OSD comes back up, those PGs must peer again. However, clients can still write new objects to a degraded PG as long as it is active.

    If an OSD stays down, Ceph eventually marks it out of the cluster and remaps the data that was on it to other OSDs. The interval between being marked down and being marked out is controlled by mon_osd_down_out_interval, which defaults to 600 seconds.

    A placement group is also degraded when Ceph cannot find one or more objects that should be in it. You cannot read or write the missing objects, but you can still access all the other objects in the degraded PG.

  • Remapped
    When the acting set of a placement group changes, data has to migrate from the old acting set to the new one. It takes some time before the new primary OSD can serve requests, so the old primary continues to serve requests until the migration is complete. Once data migration finishes, the cluster uses the primary OSD of the new acting set.

  • Stale

    By default, an OSD daemon reports its placement group, up_thru, boot, and failure statistics every half second (0.5 s), which is more frequent than the heartbeat thresholds. If the primary OSD of a PG's acting set fails to report to the monitors, or if other OSDs have reported the primary as down, the monitors mark the placement group as stale.

    When you start a cluster you will often see the stale state until peering completes. If the cluster has been running for a while and you still see PGs in the stale state, it means the primary OSDs of those PGs are down or are not reporting statistics to the monitors.

  • Inconsistent

    A PG usually has multiple replicas, and the data on all replicas should be identical. Sometimes, however, replicas end up inconsistent because of OSD failures, network congestion, silent disk errors, and so on. In that case the inconsistent PG needs to be repaired.
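Before repairing an inconsistent PG, it is worth finding out which objects and which replicas differ. A minimal sketch, assuming a recent scrub has run; mypool and PG 10.0 are placeholders to replace with your own pool name and PG ID:

# which PGs in a pool currently have inconsistencies
rados list-inconsistent-pg mypool

# which objects inside the PG are inconsistent, and on which shards/OSDs
rados list-inconsistent-obj 10.0 --format=json-pretty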

2. Common fault handling methods

2.1 The number of OSDs is less than the configured replica count

A storage pool is usually configured with 3 replicas, which means each PG is stored on 3 OSDs. Under normal conditions the PG status is "active+clean".

If your cluster has fewer OSDs than the replica count, for example only two OSDs, all OSDs may be up and in, yet the PGs never reach "active+clean". This can happen when osd pool size/min_size is set to a value greater than 2:

  • If two OSDs are available, osd pool size is greater than 2, and osd pool min_size is less than or equal to 2, the PG status shows degraded, but reads and writes to the pool are not affected.
  • If two OSDs are available and osd pool min_size is greater than 2, the PGs show the peered status and the pool cannot serve read or write requests.
# min_size=4,size=5, actual osd copies = 3
[root@node-1 ~]# ceph osd dump | grep pool-1
pool 1 'pool-1' replicated size 5 min_size 4 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 6359 flags hashpspool stripe_width 0 application rbd
[root@node-1 ~]# ceph pg stat
464 pgs: 70 undersized+degraded+peered, 58 undersized+peered, 336 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail; 192/162165 objects degraded (0.118%)
[root@node-1 ~]# rados -p pool-1 put file_bench_cephfs.f file_bench_cephfs.f 
# The operation is blocked and the write request cannot be completed

In other words, osd pool min_size is the minimum number of replicas that must be available for the pool to serve I/O, while osd pool size is the desired number of replicas. If min_size cannot be met, the pool cannot be read or written; if only size cannot be met, the pool still works but the cluster reports a warning. The problem above is solved by setting reasonable values for osd pool size and min_size.

[root@node-1 ~]# ceph osd pool set pool-1 size 3
set pool 1 size to 3
[root@node-1 ~]# ceph osd pool set pool-1 min_size 2
set pool 1 min_size to 2
[root@node-1 ~]# ceph pg stat
464 pgs: 464 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail
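To verify the pool settings after the change, you can query them directly (assuming the pool is still named pool-1 as above):

# confirm the replica settings of the pool
ceph osd pool get pool-1 size
ceph osd pool get pool-1 min_size
# or list every pool with its full configuration
ceph osd pool ls detail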

CRUSH map errors

Another possible reason why PGs cannot reach the clean state is an error in the cluster's CRUSH map, which prevents PGs from being mapped to the right OSDs.
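To check this, you can dump the CRUSH map and test whether the rule used by the pool can actually place the required number of replicas. A minimal sketch; rule 0 and 3 replicas are assumptions, substitute the rule ID and size of your own pool:

# export and decompile the CRUSH map for review
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# test whether rule 0 can map 3 replicas without bad mappings
crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-bad-mappings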

2.2 PG fault caused by OSD down

The most common PG failures are caused by one or more OSD processes going down. Usually the cluster returns to health after the OSDs are restarted.

You can check whether any OSD is down with ceph -s or ceph osd stat.

[root@node-1 ~]# ceph osd stat
4 osds: 4 up (since 4h), 4 in (since 6d); epoch: e6364

Try stopping one or more OSDs (a 3-replica cluster with 4 OSDs in total) and observe the cluster status.

# When one OSD is stopped and three remain, an active+undersized+degraded warning appears; the cluster can still read and write
[root@node-1 ~]# ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 52306/162054 objects degraded (32.277%), 197 pgs degraded
OSD_DOWN 1 osds down
    osd.0 (root=default,host=node-1) is down
PG_DEGRADED Degraded data redundancy: 52306/162054 objects degraded (32.277%), 197 pgs degraded
    pg 1.1d is active+undersized+degraded, acting [2,1]
    pg 1.60 is active+undersized+degraded, acting [1,2]
    pg 1.62 is active+undersized+degraded, acting [2,1]
    ...
    
    
# Two OSDs are stopped and two remain, which still satisfies min_size=2, so the cluster can still read and write
[root@node-1 ~]# ceph health detail
HEALTH_WARN 2 osds down; 1 host (2 osds) down; Degraded data redundancy: 54018/162054 objects degraded (33.333%), 208 pgs degraded, 441 pgs undersized
OSD_DOWN 2 osds down
    osd.0 (root=default,host=node-1) is down
    osd.3 (root=default,host=node-1) is down
OSD_HOST_DOWN 1 host (2 osds) down
    host node-1 (root=default) (2 osds) is down
PG_DEGRADED Degraded data redundancy: 54018/162054 objects degraded (33.333%), 208 pgs degraded, 441 pgs undersized
    pg 1.29 is stuck undersized for 222.261023, current state active+undersized, last acting [2,1]
    pg 1.2a is stuck undersized for 222.251868, current state active+undersized, last acting [2,1]
    pg 1.2b is stuck undersized for 222.246564, current state active+undersized, last acting [2,1]
    pg 1.2c is stuck undersized for 221.679774, current state active+undersized+degraded, last acting [1,2]
    
    
# Three OSDs are stopped and only one remains, which does not satisfy min_size=2; the cluster loses the ability to read and write and shows undersized+degraded+peered warnings
[root@node-2 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            3 osds down
            2 hosts (3 osds) down
            Reduced data availability: 192 pgs inactive
            Degraded data redundancy: 107832/161748 objects degraded (66.667%), 208 pgs degraded
 
  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 5h)
    mgr: node-1(active, since 20h)
    mds: cephfs:2 {0=mds2=up:active,1=mds1=up:active} 1 up:standby
    osd: 4 osds: 1 up (since 47s), 4 in (since 6d)
    rgw: 1 daemon active (node-2)
 
  task status:
 
  data:
    pools:   9 pools, 464 pgs
    objects: 53.92k objects, 803 MiB
    usage:   16 GiB used, 24 GiB / 40 GiB avail
    pgs:     100.000% pgs not active
             107832/161748 objects degraded (66.667%)
             256 undersized+peered
             208 undersized+degraded+peered
  
# When all four OSDs are stopped, the last remaining OSD is still shown as up by ceph -s even though its process has been stopped
# Checking the daemon with systemctl shows that the process is actually dead
# The PG status becomes stale+undersized+peered, and the cluster loses the ability to read and write
[root@node-1 ~]# systemctl status ceph-osd@0
● ceph-osd@0.service - Ceph object storage daemon osd.0
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: inactive (dead) since Thu 2021-10-14 15:36:14 CST; 1min 56s ago
  Process: 5528 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
  Process: 5524 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 5528 (code=exited, status=0/SUCCESS)
. . . 
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF 
-1       0.03918 root default                            
-3       0.01959     host node-1                         
 0   hdd 0.00980         osd.0       up  1.00000 1.00000 
 3   hdd 0.00980         osd.3     down  0.09999 1.00000 
-5       0.00980     host node-2                         
 1   hdd 0.00980         osd.1     down  1.00000 1.00000 
-7       0.00980     host node-3                         
 2   hdd 0.00980         osd.2     down  1.00000 1.00000 
 
 [root@node-1 ~]# ceph pg stat
464 pgs: 440 down, 14 stale+undersized+peered, 10 stale+undersized+degraded+peered; 801 MiB data, 12 GiB used, 24 GiB / 40 GiB avail; 3426/161460 objects degraded (2.122%)

Restart all the stopped OSDs and the cluster will slowly recover.
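On each node this simply means starting the stopped units again, for example (the OSD IDs are the ones stopped above):

# start the OSD daemons that were stopped, e.g. osd.0 and osd.3 on node-1
systemctl start ceph-osd@0 ceph-osd@3
# or start every OSD unit on the node at once
systemctl start ceph-osd.target
# then watch the cluster recover
ceph -s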

# All OSDs have been restarted; the PGs are peering, but reads and writes are already possible.
[root@node-1 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive, 2 pgs peering
            Degraded data redundancy: 16715/162054 objects degraded (10.314%), 65 pgs degraded
 
  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 5h)
    mgr: node-1(active, since 20h)
    mds: cephfs:2 {0=mds2=up:active,1=mds1=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 5s), 4 in (since 5s)
    rgw: 1 daemon active (node-2)
 
  task status:
 
  data:
    pools:   9 pools, 464 pgs
    objects: 54.02k objects, 803 MiB
    usage:   11 GiB used, 19 GiB / 30 GiB avail
    pgs:     65.302% pgs not active
             16715/162054 objects degraded (10.314%)
             294 peering
             75  active+undersized
             62  active+undersized+degraded
             21  active+clean
             9   remapped+peering
             2   active+recovery_wait+degraded
             1   active+recovering+degraded
             
# After a while, checking ceph health again shows the cluster is rebalancing. It can still read and write, and the PG status is "active+clean"
[root@node-1 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_OK
 ...
  progress:
    Rebalancing after osd.0 marked in
      [==............................]
 
[root@node-1 ~]# rados -p pool-1 put file_bench_cephfs.f file_bench_cephfs.f 
[root@node-1 ~]# rados -p pool-1 ls | grep file_bench
file_bench_cephfs.f

The PG states in which the cluster cannot serve reads or writes are:

  • stale: all of the PG's OSDs are down
  • peered: the number of surviving replicas is less than min_size
  • down: the data on the surviving OSDs is too old, and the other online OSDs are not sufficient to complete data recovery

The stale and peered states have been demonstrated above by stopping the OSD service.

A classic scenario that produces down, with three replicas A (primary), B, and C:

 a. First, kill B
 b. Write new data to A and C
 c. Kill A and C
 d. Bring B back up

At this point the only surviving replica, B, holds stale data (it is missing the new writes), and there are no other OSDs in the cluster that can help complete the data migration, so the PG is shown as down. Reference: https://zhuanlan.zhihu.com/p/138778000#:~:text=3.8.3%20PG%E4%B8%BADown%E7%9A%84OSD%E4%B8%A2%E5%A4%B1%E6%88%96%E6%97%A0%E6%B3%95%E6%8B%89%E8%B5%B7

The way to resolve down is still to bring the failed OSDs back up.
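To see exactly which OSDs a down PG is waiting for, query the PG; the peering section of the output lists the OSDs that block recovery (field names such as blocked_by and down_osds_we_would_probe vary slightly between releases, and 1.2f is a hypothetical PG ID):

# list PGs reported as down and inspect one of them
ceph health detail | grep -i down
ceph pg 1.2f query | grep -E 'blocked|down_osds_we_would_probe'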

2.3 PG data is damaged

Reference link: https://ceph.com/geen-categorie/ceph-manually-repair-object/

Generally, a damaged PG can be repaired manually with ceph pg repair {pgid}.

When a PG's status is inconsistent, it contains objects whose replicas differ. The cause may be a damaged OSD disk or a silent data error on the disk.

Next, manually construct an example of PG data corruption and repair it.

# 1. Close the OSD service
$ systemctl stop ceph-osd@{id}

# 2. Use ceph-objectstore-tool to mount /var/lib/ceph/osd/ceph-0 at /mnt/ceph-osd@0
[root@node-1 ceph-objectstore-tool-test]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op fuse --mountpoint /mnt/ceph-osd@0/
mounting fuse at /mnt/ceph-osd@0/ ...

# 3. Delete one of the directories (i.e. one object of PG 10.0) under /mnt/ceph-osd@0/10.0_head/all/ to corrupt that object
[root@node-1 all]# rm -rf \#10\:01ec679f\:\:\:10000011eba.00000000\:head#/
rm: cannot remove '#10:01ec679f:::10000011eba.00000000:head#/bitwise_hash': Operation not permitted
rm: cannot remove '#10:01ec679f:::10000011eba.00000000:head#/omap': Operation not permitted
rm: cannot remove '#10:01ec679f:::10000011eba.00000000:head#/attr': Operation not permitted

# 4. Unmount /mnt/ceph-osd@0, restart the OSD service, and wait for the cluster to return to normal
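# The commands for this step (using the mountpoint and OSD id from above):
umount /mnt/ceph-osd@0
systemctl start ceph-osd@0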

# 5. Manually scrub PG 10.0 with the command ceph pg scrub 10.0 and wait for the background scrub to finish
[root@node-1 ~]# ceph pg scrub 10.0
instructing pg 10.0 on osd.2 to scrub

# 6. The cluster now reports an error: PG 10.0 is in the active+clean+inconsistent state
[root@node-1 ~]# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 10.0 is active+clean+inconsistent, acting [2,1,0]

# 7. Execute repair. PG status: active+clean+scrubbing+deep+inconsistent+repair
[root@node-1 ~]# ceph pg repair 10.0
instructing pg 10.0 on osd.2 to repair
[root@node-1 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_ERR
            2 scrub errors
            Possible data damage: 1 pg inconsistent
  . . . 
  data:
    pools:   9 pools, 464 pgs
    objects: 53.99k objects, 802 MiB
    usage:   16 GiB used, 24 GiB / 40 GiB avail
    pgs:     463 active+clean
             1   active+clean+scrubbing+deep+inconsistent+repair

# 8. Wait for the cluster to recover
[root@node-1 ~]# ceph health detail
HEALTH_OK

If the ceph pg repair {pgid} command cannot repair the PG, you can use ceph-objectstore-tool to export and import the whole PG.

Reference link: https://www.jianshu.com/p/36c2d5682d87#:~:text=%E8%B5%B7%E5%A4%AF%E4%BD%8F%E3%80%82-,3.9%20Incomplete,-Peering%E8%BF%87%E7%A8%8B%E4%B8%AD

Construct the fault

# Construct the fault: use ceph-objectstore-tool to delete the same object on two of the three replicas.
# Note that before using ceph-objectstore-tool you must stop the OSD service with systemctl stop ceph-osd@{id}
# PG 10.0 is chosen, and the object 1000000d4dc.00000000 is deleted on both node-2 and node-3. The cluster has 3 replicas, and PG 10.0 is distributed across node-1, node-2 and node-3
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 1000000d4dc.00000000 remove
remove #10:03f57502:::1000000d4dc.00000000:head#
[root@node-3 ~]#  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 1000000d4dc.00000000 remove
remove #10:03f57502:::1000000d4dc.00000000:head#
[root@node-1 ~]# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 10.0 is active+clean+inconsistent, acting [2,1,0]

Repair with ceph-objectstore-tool

# Query and compare the data
# 1. Export the object list of the PG on every replica and gather all the lists in ~/export on node-1 for comparison
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list > ~/export/pg-10.0-osd0.txt
[root@node-1 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list > ~/export/pg-10.0-osd0.txt
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op list > ~/pg-10.0-osd1.txt
[root@node-3 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 --op list > ~/pg-10.0-osd2.txt
[root@node-1 export]# scp root@node-2:/root/pg-10.0-osd1.txt ./
pg-10.0-osd1.txt                                                          100%   97KB  19.5MB/s   00:00    
[root@node-1 export]# ls
pg-10.0-osd0.txt  pg-10.0-osd1.txt
[root@node-1 export]# scp root@node-3:/root/pg-10.0-osd2.txt ./
pg-10.0-osd2.txt                                                          100%   97KB  35.0MB/s   00:00    
[root@node-1 export]# ls
pg-10.0-osd0.txt  pg-10.0-osd1.txt  pg-10.0-osd2.txt


# 2. Count the objects in the PG on each replica: the copy of PG 10.0 on node-1 has the most objects, 833
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list | wc -l
[root@node-1 export]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list | wc -l
833
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op list | wc -l
832
[root@node-3 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 --op list | wc -l
832

# 3. Compare whether the object lists of the replicas are consistent. In this example node-2 and node-3 match each other, while node-1 differs from them but has the most objects
# In this example the PG copy on node-1 is used as the source and is imported into the OSDs on node-2 and node-3
# - After the diff comparison, check whether the object lists of the replicas (primary and secondaries) are consistent, and prefer the copy with the largest number of objects
# - If the counts differ and even the largest copy does not contain all objects, consider exporting without overwriting first and only importing once a complete set of objects has been assembled. Note that an import requires removing the existing PG beforehand, so it always overwrites it
# - If the data is consistent after the diff, use the copy with the most objects and import it into the replicas with fewer objects, then mark the replicas complete if necessary. Always export a backup of the PG on every replica's OSD first, so the PG can be restored if anything goes wrong
[root@node-1 export]# diff -u ./pg-10.0-osd0.txt ./pg-10.0-osd1.txt 
[root@node-1 export]# diff -u ./pg-10.0-osd0.txt ./pg-10.0-osd2.txt
[root@node-1 export]# diff -u ./pg-10.0-osd2.txt ./pg-10.0-osd1.txt 

# 4. Export the PG from node-1 (the export file name is arbitrary) and copy the file to node-2 and node-3
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op export --file ~/export/pg-10.0.obj
[root@node-1 export]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op export --file ~/export/pg-10.0.obj
Read #10:03f57502:::1000000d4dc.00000000:head#
Read #10:03f6b1a4:::100000091a0.00000000:head#
Read #10:03f6dfc2:::10000010b31.00000000:head#
Read #10:03f913b2:::10000010740.00000000:head#
Read #10:03f99080:::10000010f0f.00000000:head#
Read #10:03fc19a4:::10000011c5e.00000000:head#
Read #10:03fe3b90:::10000010166.00000000:head#
Read #10:03fe60e1:::10000011c44.00000000:head#
........
Export successful
[root@node-1 export]# ls
pg-10.0.obj  pg-10.0-osd0.txt  pg-10.0-osd1.txt  pg-10.0-osd2.txt
[root@node-1 export]# scp pg-10.0.obj root@node-2:/root/
pg-10.0.obj                                                               100% 4025KB  14.7MB/s   00:00    
[root@node-1 export]# scp pg-10.0.obj root@node-3:/root/
pg-10.0.obj 

# Note: the remaining steps are identical on node-2 and node-3; for brevity only node-2 is shown
# 5. Import the exported PG on node-2 and node-3
# Before importing, it is recommended to export the PG that is about to be replaced so it can be restored if problems occur later
# The import writes the specified PG data into the OSD. The existing PG must be removed first (export it as a backup before removing); otherwise the import fails with a message that the PG already exists

# 5.1 backup
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op export --file ~/pg-10.0-node-2.obj
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op export --file ~/pg-10.0-node-2.obj
Read #10:03f6b1a4:::100000091a0.00000000:head#
Read #10:03f6dfc2:::10000010b31.00000000:head#
Read #10:03f913b2:::10000010740.00000000:head#
Read #10:03f99080:::10000010f0f.00000000:head#
Read #10:03fc19a4:::10000011c5e.00000000:head#
Read #10:03fe3b90:::10000010166.00000000:head#
Read #10:03fe60e1:::10000011c44.00000000:head#
...
Export successful

# 5.2 deletion
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op remove --force
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op remove --force
 marking collection for removal
setting '_remove' omap key
finish_remove_pgs 10.0_head removing 10.0
Remove successful


# 5.3 import
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op import --file ~/pg-10.0.obj
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op import --file ~/pg-10.0.obj 
Write #10:03f6dfc2:::10000010b31.00000000:head#
snapset 1=[]:{}
Write #10:03f913b2:::10000010740.00000000:head#
snapset 1=[]:{}
Write #10:03f99080:::10000010f0f.00000000:head#
snapset 1=[]:{}
....
write_pg epoch 6727 info 10.0( v 5925'23733 (5924'20700,5925'23733] local-lis/les=6726/6727 n=814 ec=5833/5833 lis/c 6726/6726 les/c/f 6727/6727/0 6726/6726/6724)
Import successful

# 6. Check that the PG is no longer in the inconsistent state and that the cluster is slowly recovering
[root@node-3 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            Degraded data redundancy: 37305/161973 objects degraded (23.032%), 153 pgs degraded, 327 pgs undersized
 
  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 23m)
    mgr: node-1(active, since 23m)
    mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 6s), 4 in (since 16h)
    rgw: 1 daemon active (node-2)
 
  task status:
 
  data:
    pools:   9 pools, 464 pgs
    objects: 53.99k objects, 802 MiB
    usage:   16 GiB used, 24 GiB / 40 GiB avail
    pgs:     37305/161973 objects degraded (23.032%)
             182 active+undersized
             145 active+undersized+degraded
             129 active+clean
             8   active+recovering+degraded
 
  io:
    recovery: 0 B/s, 2 objects/s

2.4 A down OSD cannot be restarted

The sections above resolve cluster failures by restarting OSDs, but sometimes a down OSD cannot be restarted.

# OSD down causes PG failure. Check OSD and find osd.1 down on node-2
[root@node-2 ~]# ceph health detail
HEALTH_WARN Degraded data redundancy: 53852/161556 objects degraded (33.333%), 209 pgs degraded, 464 pgs undersized
PG_DEGRADED Degraded data redundancy: 53852/161556 objects degraded (33.333%), 209 pgs degraded, 464 pgs undersized
...
[root@node-2 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF 
-1       0.03918 root default                            
-3       0.01959     host node-1                         
 0   hdd 0.00980         osd.0       up  1.00000 1.00000 
 3   hdd 0.00980         osd.3       up  0.09999 1.00000 
-5       0.00980     host node-2                         
 1   hdd 0.00980         osd.1     down        0 1.00000 
-7       0.00980     host node-3                         
 2   hdd 0.00980         osd.2       up  1.00000 1.00000 

# Try to restart OSD and find that it cannot be restarted
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again.
[root@node-2 ~]# systemctl stop ceph-osd@1
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again.
[root@node-2 ~]# systemctl stop ceph-osd@1
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again
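Before giving up on the daemon, it can help to do what the error message itself suggests: clear systemd's start-rate limit, try once more, and read the OSD log to see why the process keeps exiting:

# clear the failed state, retry, and check the log for the real cause
systemctl reset-failed ceph-osd@1
systemctl start ceph-osd@1
journalctl -u ceph-osd@1 -n 50 --no-pager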

There are three solutions to the above problems:

  • If the down OSD does not prevent the cluster from writing, that is, the PG status is only degraded, you can wait a while and then try restarting the OSD again.
  • Alternatively, if no other important services run on the server, you can try rebooting the server.
  • If neither of the above works, you can manually delete the OSD and then rebuild it.

The following is an example of manually deleting an OSD and then recreating it:

# In this example, osd.1 on the node-2 node is deleted

# 1. Remove the OSD from the cluster
[root@node-1 ~]# ceph osd rm osd.1
removed osd.1
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF 
-1       0.03918 root default                            
-3       0.01959     host node-1                         
 0   hdd 0.00980         osd.0       up  1.00000 1.00000 
 3   hdd 0.00980         osd.3       up  0.09999 1.00000 
-5       0.00980     host node-2                         
 1   hdd 0.00980         osd.1      DNE        0         
-7       0.00980     host node-3                         
 2   hdd 0.00980         osd.2       up  1.00000 1.00000 
 
 # 2. Remove the OSD from the CRUSH map
[root@node-1 ~]# ceph osd crush rm osd.1
removed item id 1 name 'osd.1' from crush map
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF 
-1       0.02939 root default                            
-3       0.01959     host node-1                         
 0   hdd 0.00980         osd.0       up  1.00000 1.00000 
 3   hdd 0.00980         osd.3       up  0.09999 1.00000 
-5             0     host node-2                         
-7       0.00980     host node-3                         
 2   hdd 0.00980         osd.2       up  1.00000 1.00000
 
 # 3. Delete the OSD's authentication key
[root@node-1 ~]# ceph auth del osd.1
updated
[root@node-1 ~]# ceph auth ls | grep osd.1
installed auth entries:

# 4. Check the cluster status. The cluster uses 3 replicas, so with one OSD removed all PGs remain active and reads and writes are unaffected
[root@node-1 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            Degraded data redundancy: 53157/161724 objects degraded (32.869%), 207 pgs degraded, 461 pgs undersized
 
  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 21m)
    mgr: node-1(active, since 57m)
    mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
    osd: 3 osds: 3 up (since 2m), 3 in (since 24m); 4 remapped pgs
    rgw: 1 daemon active (node-2)
 
  task status:
 
  data:
    pools:   9 pools, 464 pgs
    objects: 53.91k objects, 802 MiB
    usage:   11 GiB used, 19 GiB / 30 GiB avail
    pgs:     53157/161724 objects degraded (32.869%)
             816/161724 objects misplaced (0.505%)
             254 active+undersized
             206 active+undersized+degraded
             3   active+clean+remapped
             1   active+undersized+degraded+remapped+backfilling
 
  io:
    recovery: 35 KiB/s, 7 objects/s


# 5. Unmount /var/lib/ceph/osd/ceph-1/ on node-2. This directory holds osd.1's metadata and a block symlink that points to the underlying disk
[root@node-1 ~]# ssh node-2
Last login: Mon Oct 18 09:48:25 2021 from node-1
[root@node-2 ~]# umount /var/lib/ceph/osd/ceph-1/
[root@node-2 ~]# rm -rf /var/lib/ceph/osd/ceph-1/

# 6. The disk backing osd.1 on node-2 is /dev/sdb, which is also the disk that will be rebuilt later
[root@node-2 ~]# ceph-volume lvm list


====== osd.1 =======

  [block]       /dev/ceph-847ed937-dfbb-485e-af90-9cf27bf08c99/osd-block-66119fd9-226d-4665-b2cc-2b6564b7d715

      block device              /dev/ceph-847ed937-dfbb-485e-af90-9cf27bf08c99/osd-block-66119fd9-226d-4665-b2cc-2b6564b7d715
      block uuid                9owOrT-EMVD-c2kY-53Xj-2ECv-0Kji-euIRkX
      cephx lockbox secret      
      cluster fsid              60e065f1-d992-4d1a-8f4e-f74419674f7e
      cluster name              ceph
      crush device class        None
      encrypted                 0
      osd fsid                  66119fd9-226d-4665-b2cc-2b6564b7d715
      osd id                    1
      osdspec affinity          
      type                      block
      vdo                       0
      devices                   /dev/sdb


# 7. Wipe the disk: run dmsetup remove {device-mapper name} first, then format the disk
[root@node-2 ~]# dmsetup ls
ceph--847ed937--dfbb--485e--af90--9cf27bf08c99-osd--block--66119fd9--226d--4665--b2cc--2b6564b7d715	(253:2)
centos-swap	(253:1)
centos-root	(253:0)
[root@node-2 ~]# dmsetup remove ceph--847ed937--dfbb--485e--af90--9cf27bf08c99-osd--block--66119fd9--226d--4665--b2cc--2b6564b7d715
[root@node-2 ~]# mkfs.xfs -f /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=4, agsize=655360 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=2621440, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# 8. Recreate the OSD
[root@node-2 ~]# ceph-volume lvm create --data /dev/sdb
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 1e8fb7ca-a870-414b-930d-1e22f32eb84b
Running command: /usr/sbin/vgcreate --force --yes ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254 /dev/sdb
 stdout: Wiping xfs signature on /dev/sdb.
 stdout: Physical volume "/dev/sdb" successfully created.
 stdout: Volume group "ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254" successfully created
Running command: /usr/sbin/lvcreate --yes -l 2559 -n osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254
 stdout: Logical volume "osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b" created.
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/chown -h ceph:ceph /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-2
Running command: /usr/bin/ln -s /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1/activate.monmap
 stderr: 2021-10-18 13:53:29.353 7f369a1e2700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-10-18 13:53:29.353 7f369a1e2700 -1 AuthRegistry(0x7f36940662f8) no keyring found at /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
 stderr: got monmap epoch 3
Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-1/keyring --create-keyring --name osd.1 --add-key AQDYC21hGa9KKRAAZoHogGn9ouEPpb9RY3/FXw==
 stdout: creating /var/lib/ceph/osd/ceph-1/keyring
added entity osd.1 auth(key=AQDYC21hGa9KKRAAZoHogGn9ouEPpb9RY3/FXw==)
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 1e8fb7ca-a870-414b-930d-1e22f32eb84b --setuser ceph --setgroup ceph
 stderr: 2021-10-18 13:53:29.899 7fa4e1df1a80 -1 bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
--> ceph-volume lvm prepare successful for: /dev/sdb
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b --path /var/lib/ceph/osd/ceph-1 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-2
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/systemctl enable ceph-volume@lvm-1-1e8fb7ca-a870-414b-930d-1e22f32eb84b
 stderr: Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-1-1e8fb7ca-a870-414b-930d-1e22f32eb84b.service to /usr/lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@1
Running command: /usr/bin/systemctl start ceph-osd@1
--> ceph-volume lvm activate successful for osd ID: 1
--> ceph-volume lvm create successful for: /dev/sdb

# 9. Check the cluster and wait for the data to migrate and recover slowly
[root@node-2 ~]# ceph -s
  cluster:
    id:     60e065f1-d992-4d1a-8f4e-f74419674f7e
    health: HEALTH_WARN
            Degraded data redundancy: 53237/161556 objects degraded (32.953%), 213 pgs degraded
 
  services:
    mon: 3 daemons, quorum node-1,node-2,node-3 (age 4h)
    mgr: node-1(active, since 5h)
    mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 38s), 4 in (since 4h); 111 remapped pgs
    rgw: 1 daemon active (node-2)
 
  task status:
 
  data:
    pools:   9 pools, 464 pgs
    objects: 53.85k objects, 802 MiB
    usage:   12 GiB used, 28 GiB / 40 GiB avail
    pgs:     53237/161556 objects degraded (32.953%)
             814/161556 objects misplaced (0.504%)
             241 active+clean
             103 active+recovery_wait+degraded
             63  active+undersized+degraded+remapped+backfill_wait
             46  active+recovery_wait+undersized+degraded+remapped
             9   active+recovery_wait
             1   active+recovering+undersized+degraded+remapped
             1   active+remapped+backfill_wait
 
  io:
    recovery: 0 B/s, 43 keys/s, 2 objects/s
[root@node-2 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS 
 0   hdd 0.00980  1.00000 10 GiB 4.8 GiB 3.8 GiB  20 MiB 1004 MiB 5.2 GiB 47.59 1.57 440     up 
 3   hdd 0.00980  0.09999 10 GiB 1.3 GiB 265 MiB 1.5 MiB 1022 MiB 8.7 GiB 12.59 0.42  25     up 
 1   hdd 0.00980  1.00000 10 GiB 1.2 GiB 212 MiB     0 B    1 GiB 8.8 GiB 12.08 0.40 361     up 
 2   hdd 0.00980  1.00000 10 GiB 4.9 GiB 3.9 GiB  22 MiB 1002 MiB 5.1 GiB 48.85 1.61 464     up 
                    TOTAL 40 GiB  12 GiB 8.1 GiB  44 MiB  4.0 GiB  28 GiB 30.28                 
MIN/MAX VAR: 0.40/1.61  STDDEV: 18.02

When rebuilding an OSD, note that if your cluster uses a customized CRUSH map, you also need to check that the new OSD is placed correctly in it.

During OSD recovery the I/O service provided by the cluster may be affected. The following configuration options can be tuned.

Reference link:
https://www.cnblogs.com/gzxbkk/p/7704464.html
http://strugglesquirrel.com/2019/02/02/ceph%E8%BF%90%E7%BB%B4%E5%A4%A7%E5%AE%9D%E5%89%91%E4%B9%8B%E9%9B%86%E7%BE%A4osd%E4%B8%BAfull/

To prevent OSDs from timing out and committing suicide once PG migration starts, add the following settings to the [global] section of the configuration file:

osd_op_thread_suicide_timeout = 900
osd_op_thread_timeout = 900
osd_recovery_thread_suicide_timeout = 900
osd_heartbeat_grace = 900
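Alternatively, these timeouts can be applied at runtime through the ceph config interface used later in this section (a sketch; the values match the ones above and can be tuned, assuming your release recognizes these options):

# apply the same timeouts at runtime instead of editing ceph.conf
ceph config set osd osd_op_thread_suicide_timeout 900
ceph config set osd osd_op_thread_timeout 900
ceph config set osd osd_recovery_thread_suicide_timeout 900
ceph config set osd osd_heartbeat_grace 900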

The recovery speed options below all have default values; if you want to speed up the migration, you can try adjusting them:

osd_recovery_max_single_start  # The larger the value, the faster OSDs recover and the bigger the impact on client I/O. Default: 1
osd_recovery_max_active        # The larger the value, the faster OSDs recover and the bigger the impact on client I/O. Default: 3
osd_recovery_op_priority       # The larger the value, the faster OSDs recover and the bigger the impact on client I/O. Default: 3
osd_max_backfills              # The larger the value, the faster OSDs recover and the bigger the impact on client I/O. Default: 1
osd_recovery_sleep             # The smaller the value, the faster OSDs recover and the bigger the impact on client I/O. Default: 0 seconds

[root@node-2 ~]# ceph config set osd osd_recovery_max_single_start 32
[root@node-2 ~]# ceph config set osd osd_max_backfills 32
[root@node-2 ~]# ceph config set osd osd_recovery_max_active 32
# After these settings, OSD recovery speeds up considerably. Note that the three options should be increased together; raising only one of them will not speed up recovery because the others remain the bottleneck

# Check the recovery speed with ceph -s. Before setting the parameters above, the recovery speed was only about 6 objects/s.
[root@node-2 ~]# ceph -s
  io:
    recovery: 133 KiB/s, 35 objects/s
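Once recovery has finished, it is usually a good idea to drop these overrides again so that client I/O is no longer affected by aggressive recovery settings (ceph config rm is described below):

# after recovery completes, remove the overrides so the defaults apply again
ceph config rm osd osd_recovery_max_single_start
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active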

The related configuration commands are attached below.

# View all configurations
ceph config ls

# View configuration default information
[root@node-2 ~]# ceph config help osd_recovery_sleep
osd_recovery_sleep - Time in seconds to sleep before next recovery or backfill op
  (float, advanced)
  Default: 0.000000
  Can update at runtime: true

# View customized configurations
[root@node-2 ~]# ceph config dump
WHO      MASK LEVEL    OPTION                                         VALUE RO 
  mon         advanced mon_warn_on_insecure_global_id_reclaim         false    
  mon         advanced mon_warn_on_insecure_global_id_reclaim_allowed false    
  mgr         advanced mgr/balancer/active                            false    
  mgr         advanced mgr/balancer/mode                              upmap    
  mgr         advanced mgr/balancer/sleep_interval                    60       
  mds         advanced mds_session_blacklist_on_evict                 true     
  mds         advanced mds_session_blacklist_on_timeout               true     
  client      advanced client_reconnect_stale                         true     
  client      advanced debug_client                                   20/20 
  
# Modify configuration
ceph config set {mon/osd/client/mgr/..} [config_name] [value]
ceph config set client client_reconnect_stale true

# Query the user-defined configuration of a given entity (written as osd.osd, mon.mon, mgr.mgr, and so on)
$ ceph config get client.client
WHO    MASK LEVEL    OPTION                 VALUE RO 
client      advanced client_reconnect_stale true     
client      advanced debug_client           20/20

# Delete custom configuration
$ ceph config rm [who] [name]

2.5 PG missing

Reference link: https://zhuanlan.zhihu.com/p/74323736

Generally speaking, losing a PG is very unlikely in a cluster with three replicas. If it does happen, the lost data cannot be recovered.

# 1. Find the missing PG
root@storage01-ib:~#  ceph pg dump_stuck unclean | grep unknown
20.37                         unknown     []         -1     []             -1 
20.29                         unknown     []         -1     []             -1 
20.16                         unknown     []         -1     []             -1 

# 2. Force-create the PG (the recreated PG is empty)
root@storage01-ib:~# ceph osd force-create-pg 20.37 --yes-i-really-mean-it
pg 20.37 now creating, ok

Note: do not run clusters with a single replica.

2.6 Too few or too many PGs

The warning "1 pools have many more objects per pg than average" indicates that one pool has too few PGs: each of its PGs carries more than 10 times the cluster-wide average number of objects per PG. The simplest fix is to increase the pool's PG count.

# Alarm: 1 pools have many more objects per pg than average
[root@lab8106 ~]# ceph -s
cluster fa7ec1a1-662a-4ba3-b478-7cb570482b62
health HEALTH_WARN
1 pools have many more objects per pg than average

# Increase the number of PGs in the pool. Since the Nautilus release you only need to adjust pg_num; pgp_num is adjusted automatically.
ceph osd pool set  cephfs_metadata pg_num 64
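On Nautilus and later you can also let the PG autoscaler manage pg_num for the pool instead of tuning it by hand (a sketch; cephfs_metadata is the pool from the example above):

# enable the autoscaler module and turn it on for the pool
ceph mgr module enable pg_autoscaler
ceph osd pool set cephfs_metadata pg_autoscale_mode on
# review what the autoscaler recommends or has done
ceph osd pool autoscale-status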
