C | Process Scheduling | Fair Scheduling: Lottery & CFS

In addition to the MLFQ covered in the previous article, another class of scheduling policies is called proportional share (or fair share) scheduling. The goal of this policy is to control the proportion of CPU time each process receives. An early implementation is lottery scheduling: a process that deserves more CPU time holds more lottery tickets and thus has a better chance of winning the draw. Inside Linux, CFS is another implementation of the same idea.

How can we design a scheduler to share the CPU in a proportional manner? What are the key mechanisms for doing so? How effective are they?

Basic concept: ticket = share

The tickets held by a process represent the share of the resource that the process should receive.

The scheduler randomly selects a winning ticket, and the process holding that ticket is scheduled. Although each draw is random, the law of large numbers ensures that over a long run, each process's probability of being scheduled approaches its proportion of the tickets.


Ticket Currency

Different users can distribute their own currency among their jobs; the currency is eventually converted into global tickets.

Ticket Transfer

A process can temporarily transfer its tickets to another process to deal with a sudden need (for example, a client handing tickets to a server that is processing a request on its behalf).

Ticket Inflation

A process can temporarily raise or lower its own ticket count. This is usually used among a group of mutually trusting processes, so that short-term changes in resource allocation do not require explicit communication.


// Pseudocode: one lottery decision
// counter: used to track if we've found the winner yet
int counter = 0;

// winner: use some call to a random number generator to
// get a value, between 0 and the total # of tickets
int winner = getrandom(0, totaltickets);

// current: use this to walk through the list of jobs
node_t *current = head;
while (current) {
    counter = counter + current->tickets;
    if (counter > winner)
        break; // found the winner
    current = current->next;
}
// 'current' is the winner: schedule it...

Simply generate a random number, then traverse the process list to find which process's ticket range the number falls into.

However, a computer-generated random number reduced to an interval by taking the modulus is not uniformly distributed (modulo bias), so a different method is needed, such as the following.

// Assumes 0 <= max <= RAND_MAX
// Returns in the closed interval [0, max]
long random_at_most(long max) {
  unsigned long
    // max <= RAND_MAX < ULONG_MAX, so this is okay.
    num_bins = (unsigned long) max + 1,
    num_rand = (unsigned long) RAND_MAX + 1,
    bin_size = num_rand / num_bins,
    defect   = num_rand % num_bins;

  long x;
  do {
    x = random();
  // This is carefully written not to overflow
  } while (num_rand - defect <= (unsigned long)x);

  // Truncated division is intentional
  return x/bin_size;
}

To minimize the expected traversal length, processes with more tickets should sit near the front of the list, so it is best to keep the list sorted by ticket count in descending order.

Ticket Assignment

How these tickets are allocated depends on the specific implementation.

Stride Scheduling

Divide a large constant (say 100) by each process's ticket count to get its step size, called its stride. If A, B and C hold 10 / 5 / 25 tickets respectively, their strides are 10 / 20 / 4.

Each process also maintains a counter called its pass value; every time the process runs, its pass value advances.

Scheduling method:

The process with the smallest pass value is selected each time; after it runs, its pass is incremented by its stride.

curr = remove_min(queue); // pick client with min pass
schedule(curr); // run for quantum
curr->pass += curr->stride; // update pass using stride
insert(queue, curr); // return curr to queue

Although this removes the overhead of generating random numbers, the pass value is awkward global state: when a new process joins, what should its pass be set to? Setting it to zero would let it monopolize the CPU until it catches up. This makes the strategy harder to implement than lottery scheduling, which keeps no such per-process state.

The Linux Completely Fair Scheduler (CFS)

Linux uses CFS as its scheduling algorithm. To allocate the CPU proportionally, it uses a counting-based technique called virtual runtime (vruntime).

Under normal circumstances, a process's vruntime grows in proportion to physical time. The scheduler selects the process with the smallest vruntime and gives each process a corresponding time slice. But this raises a question: when should the scheduler switch? Several parameters control this.

sched_latency

The period over which every runnable process should get a turn; each process's time slice is roughly sched_latency divided by the number of processes (48 ms is a typical value).

min_granularity

The minimum time slice a process can receive: the per-process slice is never divided below this floor, which avoids excessive context switching when many processes are runnable.

Because CFS is driven by a periodic timer interrupt, it can only make scheduling decisions at interrupt boundaries, so actual run intervals are multiples of the timer period even though the configured slices need not be. The accounting stays accurate nonetheless, because vruntime precisely records how long each process has actually run.


The nice value of each process ranges from -20 to +19 and is mapped to a weight via the table below; time slices are then allocated in proportion to weight. Note that the table is constructed so that a fixed difference in nice value yields roughly a constant ratio of CPU time (each step changes the weight by about 1.25x).

static const int prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};

R-B Tree

Linux uses a red-black tree to store the nodes of all runnable processes (sleeping processes are excluded), so the process with the smallest vruntime can be found quickly and efficiently re-inserted after it is scheduled.

To prevent a waking process's vruntime from lagging far behind the others (which would let it monopolize the CPU while catching up), when a process wakes up its vruntime is raised to at least the smallest vruntime currently in the tree. Of course, this sacrifices some fairness for processes that sleep frequently.

Posted by k4pil on Sun, 21 Nov 2021 22:09:02 -0800