Using the C++ concurrency API [async, thread, atomic, volatile] - Lecture 16 of C++2.0

Keywords: C++ Concurrent Programming thread async atomic

Concurrent API usage

0 Basics

  • Hardware thread (logical core): a thread that actually performs computation. Number of hardware threads = number of physical CPUs * cores per CPU * hardware threads per core (typically 2 with hyper-threading enabled).
  • Software thread: a thread that the operating system manages across all processes and schedules onto hardware threads. More software threads than hardware threads can usually be created, because when one software thread blocks (for example, on I/O or a mutex), throughput improves if other, unblocked software threads can run.
  • std::thread: an object in a C++ process that acts as a handle to an underlying software thread.
    • A std::thread object may be a null handle (no software thread behind it): when it is default-constructed (it has no function to execute), has been moved from, has been joined (the function it ran has finished), or has been detached (the connection between the std::thread object and its underlying software thread has been severed).
    • If different threads each need their own independent copy of a variable, declare it with the keyword thread_local.
    • The C++ concurrency API does not offer some features, such as thread priorities or CPU affinities (on Linux, for instance, a thread or process can be bound to one or more processors).
  • State-of-the-art thread schedulers use a system-wide thread pool to avoid oversubscription and a work-stealing algorithm to improve load balancing across hardware cores.
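The null-handle states listed above can be observed through std::thread::joinable(). The following is a minimal sketch; the function names nullHandleStatesAreUnjoinable and hardwareThreads are made up for illustration and do not come from the lecture.

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper: returns true if every "null handle" case
// reports joinable() == false.
bool nullHandleStatesAreUnjoinable() {
    std::thread defaulted;                  // default-constructed: no function to run
    bool ok = !defaulted.joinable();

    std::thread worker([] {});              // owns a software thread
    ok = ok && worker.joinable();

    std::thread target(std::move(worker));  // worker is now moved-from
    ok = ok && !worker.joinable() && target.joinable();

    target.join();                          // joined: the function has finished
    ok = ok && !target.joinable();

    std::thread detached([] {});
    detached.detach();                      // detached: the connection is severed
    ok = ok && !detached.joinable();
    return ok;
}

// Hypothetical helper: number of hardware threads, or 0 if the
// implementation cannot determine it.
unsigned hardwareThreads() {
    return std::thread::hardware_concurrency();
}
```

Note that std::thread::hardware_concurrency() is only a hint and is allowed to return 0, so code that sizes a pool from it should provide a fallback.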

1 Concurrent API usage

1.1 Prefer task-based programming to thread-based programming

Task based: pass the function to std::async;
Thread based: pass the function to std::thread;

Disadvantages of thread-based programming with std::thread:

  • There is no direct way to get the return value of the function;
  • The following problems have to be handled manually:
    • Thread exhaustion: if you request more threads than the system can provide, std::system_error is thrown. This is true even if the function to be run cannot throw (is noexcept).
      • Possible workarounds, each with a drawback:
      • Run the function on the current thread, which can cause load imbalance;
      • Wait for some existing software threads to finish before creating a new std::thread (but this can wait forever if the existing threads depend on the function the new thread is supposed to run).
    • Oversubscription: the number of ready-to-run (unblocked) software threads exceeds the number of hardware threads. The scheduler then time-slices the software threads on the hardware threads, and the resulting context switches are costly (especially when a software thread is rescheduled onto a different core, where the CPU caches are cold for it).
      • Avoiding this means tuning the ratio of software threads to hardware threads, which depends on how often the software threads are runnable, and that changes dynamically. The best ratio also depends on the cost of context switches, on how effectively the software threads use the CPU caches, and on the machine architecture.

Advantages of task-based programming with std::async:

  • The implementer of the C++ standard library takes over thread management:
    • The likelihood of thread exhaustion is greatly reduced, because std::async (with the default launch policy) is not guaranteed to create a new software thread; it may instead run the function on the thread that requests the result. This lets the runtime deal with oversubscription, thread exhaustion and load balancing.
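A further practical difference is that the future returned by std::async delivers both the function's return value and any exception it throws, whereas an exception escaping a function run by std::thread terminates the program. A minimal sketch, with computeAnswer, failingTask and describeResults as hypothetical names:

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <string>

int computeAnswer() { return 22; }

int failingTask() { throw std::runtime_error("task failed"); }

// Hypothetical helper: runs both tasks through std::async and
// reports what the two futures deliver.
std::string describeResults() {
    auto ok = std::async(computeAnswer);
    auto bad = std::async(failingTask);

    std::string report = "value=" + std::to_string(ok.get());
    try {
        bad.get();  // rethrows the exception thrown by failingTask
    } catch (const std::runtime_error& e) {
        report += " error=" + std::string(e.what());
    }
    return report;
}
```

Calling describeResults() yields "value=22 error=task failed", regardless of whether the runtime chose to run the tasks asynchronously or deferred.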

When std::thread should be used:

  • You need access to the API of the underlying threading implementation (such as pthreads or Windows threads); std::thread provides the native_handle member function for this;
  • You need to, and are able to, optimize thread usage for your application (for example, when developing server software);
  • You need to implement threading technology beyond the C++ concurrency API (for example, your own thread pool).

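Example usage:

#include <future>
#include <iostream>
#include <thread>

int doAsyncWork(){
    std::cout<<"hello"<<std::endl;
    return 22;
}

int main(){
    // Thread-based: the return value of doAsyncWork is lost
    std::thread t(doAsyncWork);
    t.join();

    // Task-based: the future gives access to the return value
    auto fut = std::async(doAsyncWork);
    std::cout<<fut.get()<<std::endl;
}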

1.2 Specify std::launch::async if asynchronicity is essential

Launch policies of std::async:

  • std::launch::async: function f must run asynchronously, i.e. on a different thread;
  • std::launch::deferred: function f runs only when get or wait is called on the future returned by std::async. f then runs synchronously on the calling thread; if neither get nor wait is ever called, f never runs.

The default launch policy of std::async is std::launch::async | std::launch::deferred, which allows f to run either asynchronously or synchronously. This flexibility is what lets the library take care of thread creation and destruction, avoiding oversubscription, and load balancing. In the following example, the two calls use the same policy:

#include <future>

void fun(){}

int main(){
    auto fut = std::async(fun);//Default launch policy
    auto fut2 = std::async(std::launch::async | std::launch::deferred, fun);//Same policy, spelled out
}
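The two policies can be told apart at run time. The sketch below (helper names deferredRunsOnCaller and asyncRunsElsewhere are hypothetical) checks that a deferred task reports std::future_status::deferred and runs on the thread that calls get, while a std::launch::async task runs on a different thread.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical helper: true if a deferred task ran on the thread
// that called get().
bool deferredRunsOnCaller() {
    using namespace std::chrono_literals;
    std::thread::id taskThread;
    auto fut = std::async(std::launch::deferred,
                          [&taskThread] { taskThread = std::this_thread::get_id(); });

    // A deferred task never becomes ready on its own.
    bool reportedDeferred = fut.wait_for(0s) == std::future_status::deferred;

    fut.get();  // runs the task here, synchronously
    return reportedDeferred && taskThread == std::this_thread::get_id();
}

// Hypothetical helper: true if a std::launch::async task ran on a
// thread other than the caller's.
bool asyncRunsElsewhere() {
    auto fut = std::async(std::launch::async,
                          [] { return std::this_thread::get_id(); });
    return fut.get() != std::this_thread::get_id();
}
```

With the default policy, either of these two behaviors may be chosen, which is the source of the uncertainty discussed next.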

Although the default policy is flexible, it introduces uncertainty. Suppose thread t executes this statement:

auto fut = std::async(f);

  • Then:
    • It is impossible to predict whether f will run concurrently with t, because f might be deferred;
    • It is impossible to predict whether f runs on a thread different from the one that calls get or wait on fut;
    • It may not even be possible to predict whether f runs at all, because there may be paths on which neither get nor wait is called on fut;
  • This can cause the following problems:
    • With thread-local storage (TLS) [variables that look global in the program but have a separate value in each thread], it is impossible to know which thread's variables f reads or writes;
    • It affects timeout-based wait calls: calling wait_for or wait_until on a deferred task yields std::future_status::deferred, so std::future_status::ready may never be reached.

In the following program, if fun is deferred rather than run concurrently with the thread that called std::async (for example, because the system is under heavy load and is oversubscribed or running out of threads), wait_for never returns std::future_status::ready, and the loop never terminates:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>
using namespace std::literals;

void fun(){
    std::this_thread::sleep_for(1s);
}

int main(){
    auto fut = std::async(fun);
    //Loops forever if fun was deferred
    while (fut.wait_for(200ms) != std::future_status::ready){
        std::cout<<"test"<<std::endl;
    }
}

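Solution: check whether the task is deferred before entering the loop. Calling wait_for with a zero timeout reports this without waiting:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>
using namespace std::literals;

void fun(){
    std::this_thread::sleep_for(1s);
}

int main(){
    auto fut = std::async(fun);
    if(fut.wait_for(0s) == std::future_status::deferred){//Task is deferred
        fut.wait();//Use wait or get to run fun synchronously
    } else{//Task is not deferred
        while (fut.wait_for(200ms) != std::future_status::ready){
            std::cout<<"test"<<std::endl;
        }
    }
}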

Using std::async with the default launch policy is fine for a task only if all of the following conditions hold:

  • The task need not run concurrently with the thread calling get or wait;
  • It does not matter which thread's thread_local variables are read or written;
  • Either get or wait is guaranteed to be called on the future returned by std::async, or it is acceptable that the task may never execute;
  • Code that uses wait_for or wait_until [block until the result is ready or until a timeout duration / time point has passed] takes into account that the task may be deferred.

If any of these conditions is not met, the task must be launched asynchronously:

auto fut = std::async(std::launch::async, fun);//Launch fun asynchronously

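This can be wrapped in a function template:

#include <future>
#include <type_traits>
#include <utility>

//C++11 (note: std::result_of was deprecated in C++17 and removed in C++20)
template<typename F, typename... Ts>
inline
std::future<typename std::result_of<F(Ts...)>::type>
reallyAsync(F&& f, Ts&&... params){
    return std::async(std::launch::async,
                      std::forward<F>(f),
                      std::forward<Ts>(params)...);
}

//C++14 (alternative to the C++11 version: the return type is deduced)
template<typename F, typename... Ts>
inline
auto
reallyAsync(F&& f, Ts&&... params){
    return std::async(std::launch::async,
                      std::forward<F>(f),
                      std::forward<Ts>(params)...);
}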
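A wrapper of this kind is called exactly like std::async. The sketch below repeats the C++14 form under the name reallyAsync so that the block is self-contained; add and sumOnAnotherThread are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// Same shape as the C++14 wrapper: always launches with std::launch::async.
template <typename F, typename... Ts>
inline auto reallyAsync(F&& f, Ts&&... params) {
    return std::async(std::launch::async,
                      std::forward<F>(f),
                      std::forward<Ts>(params)...);
}

int add(int a, int b) { return a + b; }

// Hypothetical helper: computes a + b on another thread and
// returns the result to the caller.
int sumOnAnotherThread(int a, int b) {
    auto fut = reallyAsync(add, a, b);  // never deferred
    return fut.get();
}
```

Because the policy is fixed to std::launch::async, a timeout-based wait loop on the returned future cannot spin forever on a deferred task.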

1.3 Storage and use of futures

1.3.1 Where the result behind a future is stored

A future is one end of a communication channel through which the callee passes its result to the caller. Where is that result stored?

  • Not in the callee (for example, in its std::promise):
    • The callee may finish, and its local objects be destroyed, before the caller calls get on the corresponding future.
  • Not in the caller's future:
    • A std::future can be used to create a std::shared_future, which may then be copied many times after the original std::future has been destroyed. Not every result type is copyable (some are move-only), and the result must live at least as long as the last future referring to it. A single result stored inside one future object cannot serve all those copies.
  • It is stored in the shared state, a location outside both caller and callee [typically a dynamically allocated object on the heap].
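The role of the shared state can be seen with std::shared_future: several copies of the future, living on different threads, all read the one result that the callee published. A minimal sketch, with readFromTwoCopies as a hypothetical name:

```cpp
#include <cassert>
#include <future>
#include <string>
#include <thread>

// Hypothetical helper: two readers on different threads each hold
// their own shared_future copy, yet both read the single result
// stored in the shared state.
std::string readFromTwoCopies() {
    std::promise<std::string> p;
    std::shared_future<std::string> sf = p.get_future().share();

    std::string first, second;
    std::thread r1([sf, &first] { first = sf.get(); });    // copy 1
    std::thread r2([sf, &second] { second = sf.get(); });  // copy 2

    p.set_value("result");  // the callee side publishes exactly once
    r1.join();
    r2.join();
    return first + "/" + second;
}
```

Neither the promise nor any single future copy owns the string; it lives in the shared state until the last object referring to it is destroyed.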

1.3.2 Behavior of future destructors

The behavior of a future's destructor is determined by the shared state associated with the future:

  • Normal case: the destructor simply destroys the future's data members (and decrements the reference count of the shared state);
  • Exception: the last future referring to the shared state of a non-deferred task launched via std::async blocks in its destructor until the task completes. In effect, it performs an implicit join on the thread running the task. [The advantage is that an asynchronously running task never outlives the last future that refers to its result.]
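The blocking destructor can be observed directly. In the sketch below (destructorWaitsForTask is a hypothetical name, and the 100 ms sleep is an arbitrary stand-in for real work), the future is destroyed without any call to get or wait, yet the task has finished by the time the scope is left.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical helper: true if leaving the inner scope waited for
// the asynchronous task to finish.
bool destructorWaitsForTask() {
    std::atomic<bool> finished(false);
    {
        auto fut = std::async(std::launch::async, [&finished] {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            finished = true;
        });
        // No get() or wait(): fut is the last future for this shared state.
    }  // the destructor of fut blocks here until the task completes
    return finished.load();
}
```

The same code with a future obtained from a std::packaged_task or std::promise would not block in the destructor.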

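1.3.3 Futures and their shared state

When does the special destructor behavior apply?

  • If the task is launched with std::async, the future's destructor blocks only when all of the following hold (otherwise the normal behavior applies):
    • The future refers to a shared state created by a call to std::async;
    • The launch policy is std::launch::async (chosen explicitly or by the runtime);
    • The future is the last future referring to the shared state.
  • If the thread is created with std::thread, a std::packaged_task (a move-only type) can be used to obtain a future for the function's result. Such a future's destructor behaves normally, so it is the code handling the std::thread that must decide whether to join or detach. The sample code is as follows.

#include <future>
#include <iostream>
#include <thread>
#include <utility>

int calValue(){return 9;}

int main(){
    std::packaged_task<int()> pt(calValue);//Wrap calValue so that it can run asynchronously
    auto fut = pt.get_future();//Future for the result of pt
    std::thread t(std::move(pt));//std::packaged_task is move-only
    //...
    //Option 1: do nothing with t: t is still joinable at the end of the scope and the program terminates
    //Option 2: join or detach t before the end of the scope
    t.join();
    std::cout<<fut.get()<<std::endl;//Prints 9
}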

1.3.4 Using a future as a communication channel

Purposes of the channel between a promise and a future:

  • The callee passes its result to the caller;
  • One task passes information (for example, a one-time event notification) to another.

Purpose 1 was illustrated in the example above; purpose 2 is described below.

For example, a detecting task needs to notify a reacting task that runs asynchronously. The example code is as follows.

std::promise version:

Disadvantages:
The shared state requires heap memory, and the channel works for one communication only (a std::promise can be set just once)

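#include <future>
#include <thread>

void react(){}//Reacting task
std::promise<void> p;//Channel that carries no data, only the notification

void detect2(){//Detecting task
    std::thread t([]{
                p.get_future().wait();//Suspend t until p is set
                react();
            });
    //... t is suspended here, before the call to react
    p.set_value();//Unsuspend t, which then calls react
    //... other work
    t.join();
}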
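The "one communication only" limitation is enforced by the library: a second set_value on the same std::promise throws std::future_error. A minimal sketch, with secondNotificationIsRejected as a hypothetical name:

```cpp
#include <cassert>
#include <future>

// Hypothetical helper: true if a second set_value is rejected with
// the error code promise_already_satisfied.
bool secondNotificationIsRejected() {
    std::promise<void> p;
    auto fut = p.get_future();
    p.set_value();  // first (and only possible) notification
    fut.wait();     // returns immediately: the shared state is ready
    try {
        p.set_value();  // a promise cannot be set twice
    } catch (const std::future_error& e) {
        return e.code() == std::future_errc::promise_already_satisfied;
    }
    return false;
}
```

A design that needs repeated notifications therefore has to create a new promise/future pair each time, or use a different mechanism such as a condition variable.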

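ThreadRAII version:

Disadvantages:
Same as the previous version. In addition, if an exception is thrown before p.set_value() is reached, the lambda never stops waiting, so the joining destructor of tr never returns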

#include <future>
#include <thread>
#include <utility>

class ThreadRAII{
public:
    enum class DtorAction{join, detach};
    ThreadRAII(std::thread&& t, DtorAction a):action(a),t(std::move(t)){}
    ~ThreadRAII(){
        if(t.joinable()){//Joinability test
            if(action == DtorAction::join){
                t.join();
            }else{
                t.detach();
            }
        }
    }
    //Because a destructor is declared, the compiler does not generate the move operations automatically
    ThreadRAII(ThreadRAII&&) = default;
    ThreadRAII& operator=(ThreadRAII&&) = default;
    std::thread& get(){return t;}//Access the underlying std::thread object
private:
    DtorAction action;
    std::thread t;
};

void react(){}//Reacting task
std::promise<void> p;//Channel that carries no data, only the notification
void detect(){//Detecting task
    ThreadRAII tr(
            std::thread([]{
                p.get_future().wait();
                react();
            }), ThreadRAII::DtorAction::join
            );
    //... the thread inside tr is suspended here, before the call to react
    p.set_value();//Unsuspend the thread, which then calls react
    //... other work
}

Condition variable version [needs improvement]:

Disadvantages:
1. Code smell: a mutex is used although the detecting and reacting tasks probably share no data that needs protecting [mutexes exist to control access to shared data].
2. The wait statement in the reacting task does not account for spurious wakeups (in many languages, code waiting on a condition variable may be woken even though the condition variable was not notified).
3. If the detecting task notifies the condition variable before the reacting task calls wait, the notification is lost and the reacting task hangs.

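#include <condition_variable>
#include <mutex>

std::condition_variable cv;
std::mutex m;

void detect(){//Detecting task
    // ... detect the event
    // Notify the reacting task
    cv.notify_one();
    //cv.notify_all();
}

void react(){//Reacting task
    std::unique_lock<std::mutex> lk(m);//Lock the mutex
    cv.wait(lk);//Wait for the notification (no predicate: vulnerable to spurious wakeups and lost notifications)
    // ... react to the event (m is locked)
}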

Shared flag version [needs improvement]:

Disadvantages:
The reacting task polls the flag in a loop, which can be costly (it occupies a hardware thread that another task could use). Ideally it would be blocked until the flag is set to true.

#include <atomic>

std::atomic<bool> flag(false);

void detect(){//Detecting task
    //... detect the event
    flag = true;//Notify the reacting task
}

void react(){//Reacting task
    while(!flag){}//Poll until the event is signaled
    //... react to the event
}

Combined condition variable and flag version [solves the problems of the previous two versions, but is not as clean as it could be]:

#include <condition_variable>
#include <mutex>

std::condition_variable cv;
std::mutex m;
bool flag(false);//Not atomic: it is only accessed while m is locked

void detect(){//Detecting task
    // ... detect the event
    {
        std::lock_guard<std::mutex> g(m);//Lock the mutex
        flag = true;//Record that the event has occurred
    }//Unlock the mutex before notifying
    // Notify the reacting task
    cv.notify_one();
    //cv.notify_all();
}

void react(){//Reacting task
    std::unique_lock<std::mutex> lk(m);//Lock the mutex
    cv.wait(lk, []{return flag;});//Wait for the notification [the lambda guards against spurious wakeups and lost notifications]
    // ... react to the event (m is locked)
}
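The combined design can be exercised end to end with two threads. The sketch below is self-contained; the demo namespace, the reacted variable and runOnce are hypothetical additions used only to make the outcome checkable.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

namespace demo {
std::condition_variable cv;
std::mutex m;
bool flag = false;     // guarded by m
bool reacted = false;  // written by the reacting thread, read after join

void detect() {
    {
        std::lock_guard<std::mutex> g(m);
        flag = true;   // record the event under the lock
    }
    cv.notify_one();   // then wake the reacting task
}

void react() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return flag; });  // predicate handles spurious wakeups
    reacted = true;
}

// Hypothetical driver: runs both tasks once and reports whether
// react() got past its wait.
bool runOnce() {
    std::thread reactor(react);
    std::thread detector(detect);
    detector.join();
    reactor.join();
    return reacted;
}
}  // namespace demo
```

The result does not depend on which thread runs first: if detect finishes before react starts waiting, the predicate is already true and wait returns immediately.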

1.4 Use std::atomic for concurrent data access

What std::atomic provides:

  • 1. Operations through its member functions are guaranteed to be atomic when seen from multiple threads, without using a mutex. The example code is as follows:

Example code 1:

If ai were declared volatile int instead of std::atomic<int>, concurrent accesses would be data races, and a reader could observe any value

std::atomic<int> ai(0);
ai = 10;//Atomically set ai to 10
std::cout<<ai;//Atomically read ai [other threads reading ai can only ever see 0, 10 or 11]
++ai;//Atomically increment ai to 11
--ai;//Atomically decrement ai to 10

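Example code 2:

In the following code, the unsynchronized increments of vc are a data race: two threads may both read vc as 0, both increment it to 1 and both write 1 back, so the final value of vc is 1 instead of 2 (formally, the behavior is undefined). The increments of ac are atomic, so ac always ends up as 2

std::atomic<int> ac(0);
volatile int vc(0);

//The following function is executed by two threads running at the same time
void dealValue(){
    ++vc;//Data race: the read-modify-write is not atomic
    ++ac;//Atomic read-modify-write
}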
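The guarantee for the atomic counter can be checked with more threads and more increments. The sketch below deliberately leaves out the volatile counter, since running a data race would be undefined behavior; countWithAtomics is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: increments one std::atomic<int> from several
// threads and returns the final total.
int countWithAtomics(int threads, int incrementsPerThread) {
    std::atomic<int> ac(0);
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([&ac, incrementsPerThread] {
            for (int j = 0; j < incrementsPerThread; ++j) {
                ++ac;  // atomic read-modify-write: no update is lost
            }
        });
    }
    for (auto& t : pool) t.join();
    return ac.load();
}
```

For example, countWithAtomics(4, 10000) always returns 40000.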

  • 2. It restricts how code may be reordered (with the default sequential consistency, no code that precedes a write of a std::atomic variable in the source may take place, or appear to other cores to take place, after that write), which preserves the intended ordering of operations;
    • Optimizations that are permitted for ordinary (non-atomic) code:
      • Unrelated assignments may be reordered;
      • Accesses to normal memory may be simplified (normal memory keeps a written value until it is overwritten, so redundant reads and dead stores can be eliminated)

Example code of assignment reordering [without any constraint]:

a = b;
x = y;

//The compiler may reorder them as follows
x = y;
a = b;

Example code of assignment reordering [with std::atomic and with volatile, respectively]:

//std::atomic: the computation of imptValue cannot be moved after the write to valAvailable
std::atomic<bool> valAvailable(false);
auto imptValue = computeImportantValue(); // Compute the value
valAvailable = true; // Tell other tasks that the value is available

//volatile: other threads may see valAvailable set to true before imptValue has been assigned (volatile imposes no such ordering restriction)
volatile bool valAvailable(false);
auto imptValue = computeImportantValue();
valAvailable = true;
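The std::atomic variant can be turned into a small two-thread sketch. It assumes the default sequentially consistent operations; computeImportantValue is given a placeholder body here, and publishAndRead is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int computeImportantValue() { return 42; }  // placeholder body (assumption)

// Hypothetical helper: publishes a plain int through an atomic flag
// and returns what the reading thread observed.
int publishAndRead() {
    int imptValue = 0;                     // ordinary, non-atomic data
    std::atomic<bool> valAvailable(false);
    int seen = -1;

    std::thread reader([&] {
        while (!valAvailable) {            // sequentially consistent load
            std::this_thread::yield();
        }
        seen = imptValue;                  // ordered after the writer's assignment
    });

    imptValue = computeImportantValue();   // cannot be moved after the store below
    valAvailable = true;                   // sequentially consistent store
    reader.join();
    return seen;
}
```

If valAvailable were a volatile bool instead, the concurrent accesses would be a data race and the reader would have no guarantee of seeing the computed value.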

Sample code for simplification of normal memory accesses:

int x = 2;
auto y = x;//Read x
y = x;//Read x again (redundant)
x = 10;//Write x (dead store)
x = 20;//Write x again

//Result after the compiler's simplification
auto y = x;
x = 20;

  • 3. It does not prevent these optimizations, so it is not suitable for accessing special memory [such as memory-mapped I/O (external sensors, displays, printers, network ports)];
  • In the example above, if x corresponds to the value reported by a temperature sensor, the second read of x is not redundant, because the temperature may have changed between the two reads [auto y = x; y = x;]. If x corresponds to the control port of a radio transmitter and the values 10 and 20 correspond to different commands, optimizing out the first assignment changes the sequence of commands sent to the radio.
  • The solution is to use volatile, which tells the compiler not to perform any optimizations on operations on this memory.
  • 4. To read and write std::atomic objects, use the member functions load and store [std::atomic objects cannot be copied from one another: the copy operations are deleted, and std::atomic provides neither a move constructor nor a move assignment operator]

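Incorrect example code for read / write operations:

    std::atomic<int> x;
    auto y = x;//Error: the copy constructor is deleted
    y = x;//Error: the copy assignment operator is deleted

Correct example code for read / write operations:

    std::atomic<int> x(0);
    std::atomic<int> y (x.load());//Read x, then initialize y with the value
    y.store(x.load());//Read x, then write the value to y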

Combining std::atomic and volatile:

//Operations on vai are atomic
//and cannot be optimized away
volatile std::atomic<int> vai;

2 Precautions

2.1 Make std::thread objects unjoinable on all paths

Every std::thread object is in one of two states:

  • Joinable: the std::thread corresponds to an underlying thread that
    • Is running asynchronously or could be;
    • Is blocked or waiting to be scheduled;
    • Has run to completion;
  • Unjoinable: any std::thread that is not joinable:
    • A default-constructed std::thread (there is no underlying thread);
    • A std::thread that has been moved from;
    • A std::thread that has been joined;
    • A std::thread that has been detached.

Destroying a joinable thread is dangerous either way: an implicit join [join called in the destructor can cause performance anomalies that are hard to debug] or an implicit detach [detach called in the destructor can cause undefined behavior that is hard to debug]. The Standards Committee therefore decided that destroying a joinable std::thread terminates the program (std::terminate is called).

The solution is to make sure yourself that every std::thread object is unjoinable on all paths out of the scope in which it is defined, using an RAII class (Resource Acquisition Is Initialization [although the crux of the technique is destruction, not initialization]). The example code is as follows.

ThreadRAII code:

#include <thread>
#include <utility>

class ThreadRAII{
public:
    enum class DtorAction{join, detach};
    ThreadRAII(std::thread&& t, DtorAction a):action(a),t(std::move(t)){}
    ~ThreadRAII(){
        if(t.joinable()){//Joinability test
            if(action == DtorAction::join){
                t.join();
            }else{
                t.detach();
            }
        }
    }
    //Because a destructor is declared, the compiler does not generate the move operations automatically
    ThreadRAII(ThreadRAII&&) = default;
    ThreadRAII& operator=(ThreadRAII&&) = default;
    std::thread& get(){return t;}//Access the underlying std::thread object
private:
    DtorAction action;
    std::thread t;
};

⚠️:

  • A std::thread object can change from joinable to unjoinable only through a member function call, such as join, detach, or a move operation. [In general, simultaneous member function calls on a single object are safe only if all of them are calls to const member functions.]
  • With the code above, a data race can occur if two member functions are invoked on the same object at the same time (one being the destructor, the other any other member function);
  • The std::thread data member is declared last in the member list. Since members are initialized in declaration order, all other members are already initialized when the std::thread is constructed and may start running its function.
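The point of the RAII class is that the join happens on every path out of the scope, including a path taken because of an exception. The sketch below restates the class compactly so the block is self-contained; joinsOnEveryPath is a hypothetical name, and throwing an int is just a simple way to force an early exit.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Compact restatement of the class above (join or detach on destruction).
class ThreadRAII {
public:
    enum class DtorAction { join, detach };
    ThreadRAII(std::thread&& t, DtorAction a) : action(a), t(std::move(t)) {}
    ~ThreadRAII() {
        if (t.joinable()) {
            if (action == DtorAction::join) t.join();
            else t.detach();
        }
    }
    ThreadRAII(ThreadRAII&&) = default;
    ThreadRAII& operator=(ThreadRAII&&) = default;
    std::thread& get() { return t; }
private:
    DtorAction action;
    std::thread t;  // declared last, so action is initialized first
};

// Hypothetical helper: true if the worker had finished by the time
// the scope was left, with or without an exception.
bool joinsOnEveryPath(bool throwEarly) {
    std::atomic<bool> done(false);
    try {
        ThreadRAII tr(std::thread([&done] { done = true; }),
                      ThreadRAII::DtorAction::join);
        if (throwEarly) throw 1;  // early exit: the destructor still joins
    } catch (int) {
        // the worker has already been joined during stack unwinding
    }
    return done.load();
}
```

With a bare std::thread in place of tr, the throwing path would destroy a joinable thread and terminate the program.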

Posted by Alex C on Sun, 28 Nov 2021 05:56:01 -0800