Detailed explanation of the Watchdog detection principle

Keywords: Android

Why do you need a watchdog?
I first came across the word "watchdog" in a university textbook on single-chip microcomputers (microcontrollers), in the section on the watchdog timer. Early microcontrollers were easily disturbed by their external environment, which could make the running program run away. The watchdog was introduced as a protection mechanism: the program has to "feed the dog" at regular intervals, and if it fails to do so, the watchdog triggers a restart. The general principle is that once the system is running, the watchdog counter starts counting automatically. If the counter is not cleared within a certain time, it overflows, which raises the watchdog interrupt and resets the system.
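
The counting principle can be sketched in a few lines of plain Java. This is only an illustrative simulation, not microcontroller code: the class name WatchdogTimerSketch, the tick-based counter, and the limit of 3 ticks are all assumptions made up for the example.

```java
/** Illustrative simulation of a hardware watchdog counter (all names are hypothetical). */
final class WatchdogTimerSketch {
    private final int limit;               // ticks allowed between two feeds
    private int counter = 0;               // counts up automatically on every tick
    private boolean resetTriggered = false;

    WatchdogTimerSketch(int limit) {
        this.limit = limit;
    }

    /** "Feeding the dog": the program clears the counter to prove it is still alive. */
    void feed() {
        counter = 0;
    }

    /** One timer tick; when the counter reaches the limit, the reset fires. */
    void tick() {
        counter++;
        if (counter >= limit) {
            resetTriggered = true;
        }
    }

    boolean isResetTriggered() {
        return resetTriggered;
    }

    public static void main(String[] args) {
        WatchdogTimerSketch healthy = new WatchdogTimerSketch(3);
        for (int i = 0; i < 10; i++) {
            healthy.tick();
            healthy.feed(); // a healthy program feeds in time
        }
        System.out.println("healthy program reset: " + healthy.isResetTriggered());

        WatchdogTimerSketch hung = new WatchdogTimerSketch(3);
        for (int i = 0; i < 10; i++) {
            hung.tick();    // a hung program never feeds
        }
        System.out.println("hung program reset: " + hung.isResetTriggered());
    }
}
```

The Android Watchdog described in the rest of this article follows the same idea, except that the "feeding" is done by the monitored threads answering a check message in time.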

A mobile phone is essentially a far more powerful microcontroller system: it runs many times faster and has many times more storage. Many threads run on it, and all kinds of software and hardware work together. As the saying goes, it is not the ten thousand expected cases that we fear but the one unexpected case. The system might deadlock, or the phone might suffer severe interference and a program might run away. Anything can happen, so Android needs a watchdog mechanism as well.

Note: this document is based on Android 10.0 source code

1: Creation of watchdog
The Watchdog is created and started by startBootstrapServices() in SystemServer.

SystemServer.java
private void startBootstrapServices() {
    .........
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    .......
    watchdog.init(mSystemContext, mActivityManagerService);// See section 2 for initialization
}

Watchdog.java

public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }
private Watchdog() {
        super("watchdog");

        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));// Checker that monitors the UI thread
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));// Checker that monitors the I/O thread
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));// Checker that monitors the display thread
        // And the animation thread.
        mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
                "animation thread", DEFAULT_TIMEOUT));
        // And the surface animation thread.
        mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
                "surface animation thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());

        mOpenFdMonitor = OpenFdMonitor.create();

        // See the notes on DEFAULT_TIMEOUT.
        assert DB ||
                DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
    }

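The constructor does not create any new threads. It creates one HandlerChecker for each important existing thread (foreground, main, UI, I/O, display, animation, and surface animation) and adds them all to the mHandlerCheckers list.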

2: Watchdog initialization

Watchdog.java
public void init(Context context, ActivityManagerService activity) {
        mActivity = activity;
        // context.registerReceiver() is implemented in ContextImpl.java
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

ContextImpl.java
@Override
    public Intent registerReceiver(BroadcastReceiver receiver, IntentFilter filter,
            String broadcastPermission, Handler scheduler) {
        return registerReceiverInternal(receiver, getUserId(),
                filter, broadcastPermission, scheduler, getOuterContext(), 0);
    }

The init method only registers a broadcast receiver (RebootRequestReceiver) that handles requests to reboot the system.

3: ActivityManagerService adds itself to the objects monitored by Watchdog

ActivityManagerService.java
public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
.......
        Watchdog.getInstance().addMonitor(this);// Register AMS itself as a monitor checked by the Watchdog
        Watchdog.getInstance().addThread(mHandler);// Add the thread behind the AMS handler to the threads checked by the Watchdog
.......
}
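
What "adding yourself as a monitor" means can be sketched in plain Java without any Android classes. The sketch below is an assumption-laden illustration: the Monitor interface is redeclared locally to stand in for Watchdog.Monitor, and FakeService and probe are hypothetical names. The real Watchdog calls monitor() from a HandlerChecker on the foreground thread rather than from a helper thread with a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class MonitorSketch {
    /** Stand-in for the Watchdog.Monitor callback, redeclared so the sketch is self-contained. */
    interface Monitor {
        void monitor();
    }

    /** Hypothetical service that guards its state with its own intrinsic lock. */
    static final class FakeService implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) { } // returns as soon as the lock can be taken
        }
    }

    /** Runs monitor() on a helper thread and reports whether it returned within timeoutMs. */
    static boolean probe(Monitor m, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            m.monitor();
            done.countDown();
        }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();
        System.out.println("lock free, monitor returned: " + probe(service, 1000));
        synchronized (service) { // simulate a thread that holds the service lock too long
            System.out.println("lock held, monitor returned: " + probe(service, 200));
        }
    }
}
```

If some thread holds the service lock for too long, monitor() cannot return, and that is exactly the symptom the Watchdog looks for.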

4: Startup and operation of watchdog
watchdog.start() starts the Watchdog thread, which then executes the run method, because Watchdog is a subclass of Thread.

public void run() {
    while (true) {
        // Abridged. In the real source, steps A to C run inside synchronized (this),
        // timeout starts at CHECK_INTERVAL, and start is SystemClock.uptimeMillis()
        // taken just before the wait loop.

        // A
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            hc.scheduleCheckLocked();
        }
        // *************

        // B
        while (timeout > 0) {
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            try {
                wait(timeout);
                // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
            } catch (InterruptedException e) {
                Log.wtf(TAG, e);
            }
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
        }
        // ****************

        // C
        if (!fdLimitTriggered) {
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    Slog.i(TAG, "WAITED_HALF");
                    // We've waited half the deadlock-detection interval.  Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(pids, null, null,
                        getInterestingNativePids());// Dump the stack traces here
                    waitedHalf = true;
                }
                continue;
            }

            // something is overdue!   This is more than 60 seconds
            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);// Description of the blocked checkers, used in the log
        }
        //*********
    }
}

A: Here, all HandlerCheckers are traversed and scheduleCheckLocked is called on each of them.

public final class HandlerChecker implements Runnable {
    ......
    public void scheduleCheckLocked() {
        ......
        if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                || (mPauseCount > 0)) {
            // Checkers without monitors (UI, IO, display threads, etc.) only need the
            // target looper to be polling for messages, which already shows the thread is not stuck
            mCompleted = true;
            return;
        }
        ......
        mHandler.postAtFrontOfQueue(this);// Once the target thread handles this message, run() below is executed
    }

    @Override
    public void run() {// Runs on the monitored thread; the monitors of important services such as PMS and AMS are called here
        // Once we get here, we ensure that mMonitors does not change even if we call
        // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
        // move them to mMonitors on the next schedule when mCompleted is true, at which
        // point we have completed execution of this method.
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {// Traverse all registered monitors
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();// Every monitored service must implement the Monitor interface
            // A typical monitor() implementation looks like this:
            /*
            public void monitor() {// Simply take the lock and release it at once. If the lock cannot be obtained for a long time, its holder is deadlocked or stuck
                synchronized (this) { }
            }
            */
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}
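
The schedule-then-run handshake can be imitated in plain Java. The following is a minimal sketch under stated assumptions, not the Android implementation: a single-thread ExecutorService stands in for the monitored thread's Handler and message queue, execute() appends to the end of the queue (whereas the real code uses postAtFrontOfQueue), and CheckerSketch, scheduleCheck, and awaitCompleted are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CheckerSketch implements Runnable {
    private final ExecutorService looper; // stands in for the monitored thread's Handler
    private boolean completed = true;

    CheckerSketch(ExecutorService looper) {
        this.looper = looper;
    }

    /** Posts this checker to the monitored thread unless a check is already in flight. */
    synchronized void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        looper.execute(this);
    }

    /** Executed by the monitored thread; reaching this point proves its queue is moving. */
    @Override
    public synchronized void run() {
        completed = true;
        notifyAll();
    }

    /** Waits up to timeoutMs for the monitored thread to answer the check. */
    synchronized boolean awaitCompleted(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!completed) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ExecutorService looper = Executors.newSingleThreadExecutor();
        CheckerSketch checker = new CheckerSketch(looper);

        checker.scheduleCheck();
        System.out.println("idle thread answered: " + checker.awaitCompleted(1000));

        CountDownLatch stuck = new CountDownLatch(1);
        looper.execute(() -> {
            try { stuck.await(); } catch (InterruptedException ignored) { }
        });
        checker.scheduleCheck();
        System.out.println("blocked thread answered: " + checker.awaitCompleted(200));

        stuck.countDown(); // unblock the thread so the program can exit
        looper.shutdown();
    }
}
```

The key point is that the checker never inspects the monitored thread directly. It only observes whether its own message was processed in time.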

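B: Here, the Watchdog waits for one full detection cycle, CHECK_INTERVAL, which is 30 seconds (half of DEFAULT_TIMEOUT). If wait() returns early, the remaining time is recomputed and the wait continues.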
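
The loop in step B is a common pattern for waiting a full interval even when wait() returns early. Below is a minimal plain-Java sketch of that pattern. The names IntervalWaitSketch and waitFullInterval are hypothetical, the interval is shortened to 300 ms for the demonstration, and System.nanoTime() is used in place of Android's SystemClock.uptimeMillis().

```java
final class IntervalWaitSketch {
    // Shortened for the demo; the article's CHECK_INTERVAL is 30 seconds.
    static final long CHECK_INTERVAL_MS = 300;

    /**
     * Blocks the caller for at least intervalMs, going back to wait after
     * every early wakeup. Returns the elapsed time in milliseconds.
     */
    static long waitFullInterval(Object lock, long intervalMs) {
        long start = System.nanoTime();
        long timeout = intervalMs;
        synchronized (lock) {
            while (timeout > 0) {
                try {
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // Like the original loop, keep waiting for the rest of the interval.
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                timeout = intervalMs - elapsedMs; // recompute what is left
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Object lock = new Object();
        // Wake the waiter early after about 50 ms to show that it resumes waiting.
        Thread notifier = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            synchronized (lock) { lock.notifyAll(); }
        });
        notifier.start();
        long elapsed = waitFullInterval(lock, CHECK_INTERVAL_MS);
        System.out.println("waited " + elapsed + " ms despite the early wakeup");
    }
}
```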

C: evaluateCheckerCompletionLocked is called to compute the overall check result. It calls getCompletionStateLocked on each HandlerChecker and keeps the most severe state.
There are four states:
COMPLETED: the monitored message queue is not blocked and the monitors can acquire their locks normally. As described in step A, mCompleted = true.
WAITING: the message queue has been blocked, or a monitor has been unable to acquire its lock, for between 0 and 30 s.
WAITED_HALF: the message queue has been blocked, or a monitor has been unable to acquire its lock, for between 30 and 60 s.
OVERDUE: the message queue has been blocked, or a monitor has been unable to acquire its lock, for longer than the default timeout of 60 s.
In the latter two states, the Watchdog logs the situation and saves the stack traces to a file.
Once the 60-second timeout has been exceeded, the Watchdog calls Process.killProcess(Process.myPid()) to kill its own process, system_server, so that the system services are restarted.
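
The mapping from blocking time to state can be written as a small pure function. This is a sketch rather than the source itself: CompletionStateSketch, stateOf, and evaluate are hypothetical names, the constant values are assumed to be ordered by severity as in the Android source (where the timed-out state is called OVERDUE), and times are passed in as plain milliseconds.

```java
final class CompletionStateSketch {
    // Ordered by severity so that the worst state is simply the maximum.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** State of one checker, given how long its check has been pending. */
    static int stateOf(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Overall result: the most severe state among all checkers. */
    static int evaluate(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // the default 60 s timeout
        int ui = stateOf(true, 0, waitMax);
        int io = stateOf(false, 10_000, waitMax);
        int fg = stateOf(false, 45_000, waitMax);
        System.out.println("overall state: " + evaluate(ui, io, fg));
    }
}
```

Because the overall result is the maximum over all checkers, a single blocked thread is enough to move the whole Watchdog into WAITED_HALF or OVERDUE.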

Posted by dmeade on Fri, 17 Sep 2021 23:28:49 -0700