Version 7 (modified by Tim Harvey, 4 years ago)

added note about Ubuntu based pre-built images having watchdog installed and configured

Watchdog Timer

Gateworks boards provide both a hardware boot watchdog timer, which power-cycles the board if the boot firmware fails to run, and SoC watchdogs.

The sections below describe the differences between the available hardware watchdogs and the software watchdog daemons.

Terminology used here:

  • SoC (System on Chip) - refers to the chip containing the CPU core as well as vendor peripherals. This is often referred to as the CPU, but technically the chips used on Gateworks products are SoCs, which combine ARM CPU cores with other peripherals inside the chip.
  • resetting the watchdog (aka tickling or petting) refers to restarting the 'countdown timer' in a watchdog.
  • watchdog reset, trigger or expiration - the event that occurs when the internal countdown timer of a watchdog expires which usually results in a chip-level reset, a board-level reset, or a board power-cycle depending on the board and watchdog used.
  • timeout or timeout period - the time before the watchdog will trigger.
  • frequency - the interval at which software resets (tickles) the watchdog.

Hardware

GSC Boot Watchdog

This is the most robust watchdog because it runs on the Gateworks System Controller and power-cycles the board's primary power supply when tripped. Note that this feature is Gateworks-specific.

Deficiencies of CPU/SoC watchdogs:

  • they are not enabled at power-up, and often not until fairly late, when the Linux kernel driver that controls them initializes. If the board hangs before that (because of software issues or even CPU chip errata), they do not help.
  • they issue a chip-level reset. Depending on the CPU and board design they may also assert an output signal from the chip, but often this does not or cannot reach every peripheral chip in the system. This can result in hangs following a chip-level reset.

In contrast, the benefits of the GSC boot watchdog are:

  • when it expires, it momentarily disables the board's primary power supply, acting as a full board power cycle.
  • it is enabled when the board comes out of reset, so it protects against any software or hardware issue from power-on until your software starts monitoring itself.
  • no driver is needed.

For more info:

Newport (ARM SBSA) CPU watchdog

The Cavium CN80XX SoC uses the ARM SBSA watchdog. The CN80XX external watchdog reset output is also an input to the GSC so that the GSC can power-cycle the board if the SoC watchdog expires.

The Linux kernel driver (drivers/watchdog/sbsa_gwdt.c) defaults to a 10 second timeout.

Ventana (imx6) CPU watchdog

The IMX6 SoC watchdog has an 8-bit timeout configuration ranging from 500ms to 128s in 500ms increments and issues a chip-level SoC reset when it expires. On some boards an external output is also present to reset other peripherals.

The Linux kernel driver (drivers/watchdog/imx2_wdt.c) defaults to 60 seconds and allows a timeout period between 1 and 128 seconds.
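The relationship between a timeout in seconds and the 8-bit hardware field can be sketched with a little shell arithmetic. This is only an illustration: it assumes the field encodes the timeout as (value + 1) x 500ms, which is consistent with the 500ms to 128s range described above, and the function names are hypothetical. Nothing here touches the hardware.

```shell
#!/bin/sh
# Sketch only: assumes the 8-bit field encodes timeout = (value + 1) * 500ms.
# imx6_wdt_field and imx6_wdt_ms are hypothetical helper names.

# Convert a timeout in whole seconds (1..128) to the 8-bit field value.
imx6_wdt_field() {
    secs=$1
    if [ "$secs" -lt 1 ] || [ "$secs" -gt 128 ]; then
        echo "timeout must be 1..128 seconds" >&2
        return 1
    fi
    echo $(( secs * 2 - 1 ))
}

# Convert an 8-bit field value (0..255) back to a timeout in milliseconds.
imx6_wdt_ms() {
    echo $(( ($1 + 1) * 500 ))
}
```

For example, imx6_wdt_field 60 gives the field value for the driver's default 60 second timeout, and imx6_wdt_ms 255 gives the longest timeout the field can express.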

Because of IMX6 chip errata that occasionally cause boot failures when booting from NAND flash (the primary boot device on all Ventana boards), a GSC 'boot' watchdog is used in a special mode to protect against boot failures. In this mode, the GSC 'boot' watchdog is disabled in the bootloader before launching the OS. If the GSC watchdog is enabled (not to be confused with the GSC 'boot' watchdog, which cannot be disabled), the watchdog remains enabled from power-up and must be handled by software in the OS to avoid tripping.

Laguna (cns3xxx) CPU watchdog

The cns3xxx SoC has a 32-bit count-down timer watchdog, provided by the ARM11-MPCORE, that issues a chip-level reset. An output from the cns3xxx is also used to reset other board peripherals such as the NOR FLASH.

The Linux kernel driver (drivers/watchdog/mpcore_wdt.c) defaults to 60 seconds and allows a timeout period between 0 and 65536 seconds.

Software

The software side of a watchdog involves the software that is responsible for periodically resetting the watchdog timer (aka tickling or petting) to avoid it triggering. This can be as simple as resetting it based on a timer (without any additional checks) or can be very complex based on a series of complicated system checks.

This is not to be confused with the concept of a 'software watchdog' which is simply code that will perform checks and issue a soft reboot if they are not met. This is usually useful when using boards that have no hardware watchdog(s) available which is not the case for Gateworks products.

The rule of thumb is to tickle the watchdog at least twice per timeout period (for example, every 15 seconds for a 30 second timeout). You may want to tickle more often if the system is heavily loaded and the watchdog process is not scheduled often enough; this varies greatly with your CPU, application load, and kernel configuration.
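As an illustration of the simplest case (tickling on a timer with no additional checks), here is a minimal shell sketch. The function names are hypothetical. It assumes the watchdog timeout has already been set by the driver default or by other means, and it relies on the standard behaviour of Linux watchdog devices that any write restarts the countdown.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Derive a tickle interval (seconds) of half the timeout, minimum 1 second.
tickle_interval() {
    interval=$(( $1 / 2 ))
    if [ "$interval" -lt 1 ]; then
        interval=1
    fi
    echo "$interval"
}

# Tickle DEVICE forever for a watchdog with the given TIMEOUT (seconds).
# The device is opened once and kept open; each write restarts the countdown.
tickle_forever() {
    dev=$1
    interval=$(tickle_interval "$2")
    exec 3>"$dev"
    while true; do
        printf '.' >&3
        sleep "$interval"
    done
}
```

It could be started with something like tickle_forever /dev/watchdog 30. Keeping the file descriptor open matters: opening and closing the device on every tickle would start and attempt to stop the watchdog each time.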

Linux Kernel Drivers and nowayout

The Linux kernel has a watchdog driver API that can be implemented to provide a common userspace API to a hardware watchdog.

Most Linux watchdog drivers have a nowayout kernel parameter, which can be defaulted at build time via the kernel config CONFIG_WATCHDOG_NOWAYOUT or passed in as a parameter during module loading or via bootargs. Drivers that support this should display the nowayout setting upon driver init. If nowayout=1, the driver does not allow the watchdog to be disabled (there is no way out of the situation). This is desirable in high-reliability cases, because the normal API behavior is to start the watchdog when /dev/watchdog is opened by the userspace app and to stop/disable the watchdog when it is closed (which can happen if the userspace watchdog process is killed or crashes).

Example trying to kill the watchdog:

root@ventana:~# ps
  PID USER       VSZ STAT COMMAND
    1 root      1676 S    init [5]
    2 root         0 SW   [kthreadd]
    3 root         0 SW   [ksoftirqd/0]
....
  467 root      1720 S    watchdog


root@ventana:~# kill -9 467
[   49.320282] watchdog watchdog0: nowayout prevents watchdog being stopped!
[   49.327081] watchdog watchdog0: watchdog did not stop!
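The following sketch shows two related operations: reading the nowayout setting and requesting a clean stop. It rests on assumptions: the sysfs path exists only on kernels that expose watchdog attributes in sysfs, the driver must support the 'magic close' convention (writing the character V just before closing), and the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the sysfs path is an assumption
# that holds only on kernels exposing watchdog attributes in sysfs.

# Print the nowayout setting (0 or 1) for a watchdog sysfs directory,
# or "unknown" if the attribute is not available.
wdt_nowayout() {
    dir=${1:-/sys/class/watchdog/watchdog0}
    if [ -r "$dir/nowayout" ]; then
        cat "$dir/nowayout"
    else
        echo unknown
    fi
}

# Open DEVICE, tickle it once, then request a clean stop by writing the
# magic character 'V' just before closing. With nowayout=1 the kernel
# ignores the request and the countdown keeps running.
wdt_magic_close() {
    exec 3>"$1"
    printf '.' >&3
    printf 'V' >&3
    exec 3>&-
}
```

With nowayout=0, wdt_magic_close /dev/watchdog would be expected to leave the watchdog stopped; with nowayout=1 the kernel logs messages like those in the example above and the board resets once the timeout expires.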

For more info:

Traditional Linux userspace watchdog daemon (Yocto / Ubuntu)

The traditional Linux userspace watchdog daemon (such as http://watchdog.sourceforge.net/) is an example of a very full-featured watchdog daemon. It can be configured to do controlled shutdowns before tripping the watchdog, and it can add all kinds of system-level checks that need to pass before tickling the watchdog, such as:

  • temperature checks
  • load checks
  • memory usage checks
  • process checks
  • network checks
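Several of the checks listed above map to options in /etc/watchdog.conf. The sketch below appends a few of them to a configuration file; write_watchdog_checks is a hypothetical helper, and every threshold, path, and address is a placeholder to adapt to your system. Consult the watchdog.conf man page for your installed version before relying on any option.

```shell
#!/bin/sh
# Sketch only: write_watchdog_checks is a hypothetical helper and every
# threshold, path, and address below is a placeholder to adapt.

# Append example system checks to the watchdog configuration file given as $1.
write_watchdog_checks() {
    cat << 'EOF' >> "$1"
# load check: trip if the 1-minute load average exceeds this value
max-load-1 = 24
# memory check: minimum free memory, counted in pages
min-memory = 1024
# process check: the PID recorded in this file must still be running
pidfile = /var/run/sshd.pid
# network check: this address must answer a ping
ping = 192.168.1.1
EOF
}
```

It could be used as write_watchdog_checks /etc/watchdog.conf, followed by a restart of the watchdog daemon so the new checks take effect.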

On the Gateworks Ubuntu based pre-built images the watchdog package is installed and configured to start on boot as below, however if you are using a different OS you may need to do these steps manually.

To configure the watchdog daemon to run on boot, set up the watchdog for a 30 second timeout, and simply tickle the watchdog every 5 seconds:

  1. Create a conf file:
    cat << EOF > /etc/watchdog.conf
    watchdog-device = /dev/watchdog
    realtime = yes
    priority = 1
    interval = 5
    watchdog-timeout = 30
    EOF
    
  2. create an executable init script (if not already present in your filesystem):
    cat << EOF > /etc/init.d/watchdog
    #!/bin/sh
    
    watchdog
    EOF
    chmod +x /etc/init.d/watchdog
    
  3. configure as a system service to start on boot, priority 1 (if not already present in your filesystem):
    update-rc.d watchdog defaults 1
    sync
    

For more details on configuring the traditional linux userspace watchdog see the man pages:

OpenWrt procd Watchdog

While older versions of OpenWrt used the watchdog daemon from busybox, newer versions (including the Gateworks BSPs from 13-06 onward) implement the watchdog daemon via procd, which is the init process (PID 1). Therefore, on modern OpenWrt you will never see a watchdog process when running ps.

Note that the procd watchdog functionality does not implement any specific system checks - if procd is simply running, it will tickle/reset the watchdog based on its configured period.

The procd watchdog code always uses the primary watchdog device, /dev/watchdog. You can choose which watchdog that is (i.e. the GSC watchdog or the SoC watchdog) by disabling all but the desired watchdog in the kernel configuration.

You can see the current configuration of the watchdog service via ubus:

root@OpenWrt:/# ubus call system watchdog
{
        "status": "running",
        "timeout": 30,
        "frequency": 5
}

While there is no uci configuration available for these options, you can change them in an rc script such as rc.local if you wish:

ubus call system watchdog '{ "timeout": 60 }'   # change to 60s timeout
ubus call system watchdog '{ "frequency": 1 }'   # change to 1s frequency

To stop the service:

ubus call system watchdog '{ "stop": true }'   # watchdog will cause a reset after it expires

Android watchdog daemon

The Android OS watchdog daemon is /sbin/watchdog and is implemented in /system/core/init/watchdogd.c. It is kicked off by init and does not perform any specific checks.
