The Four Heavenly Kings of Ceph Automation

Every Ceph novice has sooner or later been beaten by the four "heavenly kings" (read: pitfalls) covered in this article, and has either grown into a master or deleted the cluster and run away. It is therefore worth introducing these four pitfalls early, to pull you out of the sea of suffering sooner. This article assumes a basic understanding of Ceph's architecture and principles; if you are not yet familiar with Ceph, start with the material below.


1. Automatic addition of OSDs to the CRUSH map on start

Many novices skip the basics of the CRUSH algorithm and simply copy someone else's ceph.conf as homework. The most consequential line to copy blindly is this one:

osd crush update on start = false

Then, after a restart or after adding a new OSD, you find your cluster's PGs are abnormal: one OSD has become a motherless child, with little tadpoles searching for their mother everywhere. By default, the CRUSH map automatically places each starting OSD under its host bucket. Setting osd crush update on start = false turns that automation off, so OSDs are placed only according to user-defined CRUSH rules, and novices usually have no such rules, which is how they end up with a pile of orphan OSDs. So before copying someone's homework, work out how your environment differs from theirs; blind copying will cost you dearly.
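If you have already hit this and now have orphan OSDs, you can place them back into the CRUSH hierarchy by hand. A minimal sketch, assuming a host named node1 and an orphan osd.10 (both names, and the 1.0 weight, are illustrative; adjust them to your topology):

```shell
# Create the host bucket if it does not exist yet
ceph osd crush add-bucket node1 host
# Attach the host under the default root
ceph osd crush move node1 root=default
# Place the orphan OSD under its host with a weight of 1.0
ceph osd crush add osd.10 1.0 host=node1
# Verify that the OSD now sits under the expected host
ceph osd tree
```

With osd crush update on start = false these placements are never touched automatically, which is exactly why you must maintain them yourself.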

What every novice must learn is the basic principle of the CRUSH algorithm; study and simulate CRUSH map editing carefully using the documentation below.


A similar setting is osd class update on start = false.

2. Automatic PG balancer

Once you have mastered the basics of CRUSH, you will find that its pseudo-random placement cannot guarantee a perfectly balanced data distribution across OSDs. As the amount of data you write grows, OSD disk utilization often becomes very uneven, and you run into the problem of how to rebalance data across OSDs. Fortunately, upstream has kept working on this and shipped a series of automation tools.

The balancer can optimize the placement of PGs across OSDs in order to achieve a balanced distribution, either automatically or in a supervised fashion.

The earliest implementation was a mgr module called balancer, which was scheduled manually to optimize PG distribution across OSDs. You can check whether your Ceph supports this feature with the following command:

root@host:/home/demo# ceph mgr module ls
{
    "always_on_modules": [
        ...
    ],
    "enabled_modules": [
        ...
    ],
    ...
}

Newer versions do not allow the module to be disabled manually:

root@host:/home/demo# ceph mgr module disable balancer
Error EINVAL: module 'balancer' cannot be disabled (always-on)

You can check the module's current state with the following command:

root@host:/home/demo# ceph balancer status
{
    "last_optimize_duration": "0:00:22.402340",
    "plans": [],
    "mode": "crush-compat",
    "active": false,
    "optimize_result": "Unable to find further optimization, change balancer mode and retry might help",
    "last_optimize_started": "Tue Nov 16 18:29:38 2021"
}

Once the balancer shows active=true, the pit has been dug. As the data in your cluster changes, whenever the automatic balancing conditions are met the cluster will rearrange PGs across OSDs on its own. Automatic PG balancing sounds like a beautiful thing to many novices, but in real use you will find that balancing kicks in while the cluster is under heavy load, or is triggered frequently, and cluster performance keeps fluctuating, increasing business latency or causing stalls. The fatal part is that once a round of balancing has been triggered, you cannot cleanly stop it halfway; you can only wait for it to finish, and you can rarely predict when it will start or end, so cluster performance degrades intermittently. In production, this feature is best kept off: it looks beautiful but brings you endless trouble. See below for details.
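If you want the optimization without the surprises, you can drive the balancer in its supervised fashion instead of letting it fire on its own. A sketch of the workflow, where myplan is an arbitrary plan name:

```shell
# Stop automatic optimization
ceph balancer off
# upmap mode generally produces better results than crush-compat
# (it requires all clients to be Luminous or newer)
ceph balancer mode upmap
# Build an optimization plan by hand during a quiet window
ceph balancer optimize myplan
# Inspect the proposed changes before touching anything
ceph balancer show myplan
# Apply the plan only when you accept the data movement it implies
ceph balancer execute myplan
```

This way the rebalancing happens at a time of your choosing, one plan at a time.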


3. Automatic pool scaling (pg_autoscaler)

As a novice, are you still fantasizing that expanding a cluster is just adding machines and disks and then bumping pg_num as fiercely as a tiger? The cruel reality is that the moment you do this, data rebalancing will exhaust the cluster's performance. Balancing the performance impact of expansion is part of the art of Ceph maintenance. Upstream has since built an automatic PG scaling module for you: it sets pg_num automatically based on the current scale of the cluster, so novices need not worry about it any more. Fortunately, the default is warn mode, which only raises an alert under specific circumstances and takes no action. If you regularly expand or shrink capacity online and you turn on pg_autoscale_mode=on, you will experience unparalleled bitterness, comparable to the pain of childbirth, and it will be deeply impressed on your heart. If you want a good night's sleep, honestly turn this module off and stop fiddling. Automated things are never as beautiful as they look.

# The default policy is warn
root@host:/home/demo# ceph daemon /home/ceph/var/run/ceph-osd.10.asok config show|grep scale
    "osd_pool_default_pg_autoscale_mode": "warn",
    "rgw_rados_pool_autoscale_bias": "4.000000",

# Set the corresponding policy per pool via the pool set command
root@host:/home/demo# ceph osd pool set
Invalid command: missing required parameter pool(<poolname>)
osd pool set <poolname> size|min_size|pg_num|pgp_num|pgp_num_actual|crush_rule|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority|compression_mode|compression_algorithm|compression_required_ratio|compression_max_blob_size|compression_min_blob_size|csum_type|csum_min_block|csum_max_block|allow_ec_overwrites|fingerprint_algorithm|pg_autoscale_mode|pg_autoscale_bias|pg_num_min|target_size_bytes|target_size_ratio <val> {--yes-i-really-mean-it} :  set pool parameter <var> to <val>
root@host:/home/demo# ceph osd pool get .rgw.root pg_autoscale_mode
pg_autoscale_mode: warn
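To keep the autoscaler from resizing pools behind your back, pin both the default mode and each pool's mode explicitly. A sketch, where mypool is an illustrative pool name:

```shell
# Keep the default for newly created pools at warn (alert only, no action)
ceph config set global osd_pool_default_pg_autoscale_mode warn
# Pin an existing pool so its pg_num is never changed automatically
ceph osd pool set mypool pg_autoscale_mode off
# Review what the autoscaler thinks it would do to each pool
ceph osd pool autoscale-status
```

The autoscale-status output is worth reading even with the feature off: it tells you what pg_num the autoscaler would pick, which is a useful sanity check for your own manual choice.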

Refer to the documentation below for the specific settings.


4. Automatic RGW shard redistribution (dynamic resharding)

The last one only affects RGW users. As the number of objects in a single bucket keeps growing, the underlying bucket index will be resharded again and again. The whole reshard process is uncontrollable: the more objects, the longer it takes, and because resharding holds a lock, business IO stalls and the service becomes directly unavailable. So when this feature officially shipped, the first thing I did was set rgw_dynamic_resharding = false. Many articles have covered the basic principles and mechanics of sharding, so I will not repeat them here; interested readers can look at the earlier articles.
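The hedge here is the same as elsewhere: disable the automation and reshard by hand at a time you choose. A sketch, with mybucket as an illustrative bucket name and 64 shards chosen arbitrarily (pick a count based on your objects-per-shard target):

```shell
# Turn off automatic (dynamic) resharding for all RGW instances
ceph config set client.rgw rgw_dynamic_resharding false
# Check object counts and the current shard count to decide whether a reshard is due
radosgw-admin bucket stats --bucket=mybucket
# Queue a manual reshard for an off-peak window
radosgw-admin reshard add --bucket=mybucket --num-shards=64
# Run the queued reshard yourself, when you are ready to absorb the lock
radosgw-admin reshard process
```

Manual resharding still blocks writes to the bucket while it runs; the difference is that you decide when that happens.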



Whether you are a novice or an old driver, you should understand as much as possible of the logic and mechanism behind each automation feature in a distributed system like Ceph, and enable them with care. Much of the time these automatic, foolproof settings are just the developers' wishful thinking: they do not grasp the complexity of operations and production environments, and can hardly adapt 100% to your business scenario. So, to avoid being led into the ditch, start by digging into these fundamentals and verify the effectiveness of each feature in practice, instead of pinning everything on the official defaults from day one.

Posted by robviperx on Sat, 04 Dec 2021 18:14:46 -0800