Cisco Nexus Stateful Fault Recovery

CCNA Data Center

As we discussed in the previous post in this category, Cisco NX-OS Software isolates the control plane from the data plane within a Nexus device. This isolation means that a failure in one plane does not disrupt the other. Great!

In this post, let’s elaborate on the fact that a failed process can not only be restarted, but restarted statefully, meaning the new instance of the process retains the information that existed prior to the failure.

When a restartable service fails, it is restarted on the same supervisor. If the new instance of the service determines that the operating system abnormally terminated the previous instance, the service then determines whether a persistent context exists.

The initialization of the new instance attempts to read the persistent context to build a run-time context that makes the new instance appear like the previous one. After the initialization is complete, the service resumes the tasks that it was performing when it stopped. During the restart and initialization of the new instance, other services are unaware of the service failure. Any messages that are sent to the failed service by other services are available from the Message and Transaction Services (MTS) when the service resumes.
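To make the idea concrete, here is a minimal Python sketch of a service that rebuilds its run-time context from a persistent context and then picks up the messages that were queued while it was down. This is only an illustration and not NX-OS code: `PersistentStore`, `Mailbox`, and `RouteService` are hypothetical names, and the in-memory dictionary and queue merely stand in for the persistent context and for MTS.

```python
import json
from collections import deque


class PersistentStore:
    """Hypothetical stand-in for a persistent context store (not an NX-OS API)."""

    def __init__(self):
        self._blobs = {}

    def save(self, service_name, state):
        self._blobs[service_name] = json.dumps(state)

    def load(self, service_name):
        blob = self._blobs.get(service_name)
        return None if blob is None else json.loads(blob)


class Mailbox:
    """Hypothetical stand-in for a message service that holds messages
    addressed to a service until that service reads them."""

    def __init__(self):
        self._queue = deque()

    def send(self, message):
        self._queue.append(message)

    def drain(self):
        while self._queue:
            yield self._queue.popleft()


class RouteService:
    """Toy service whose run-time context is a prefix -> next-hop table."""

    def __init__(self, store, mailbox, abnormal_exit=False):
        self.store = store
        self.mailbox = mailbox
        # Only look for a persistent context if the previous instance
        # was terminated abnormally.
        saved = store.load("routes") if abnormal_exit else None
        self.stateful = saved is not None
        self.routes = saved if saved is not None else {}

    def handle(self, message):
        prefix, next_hop = message
        self.routes[prefix] = next_hop
        self.store.save("routes", self.routes)  # persist after every change

    def process_pending(self):
        for message in self.mailbox.drain():
            self.handle(message)


store, mailbox = PersistentStore(), Mailbox()

first = RouteService(store, mailbox)
mailbox.send(("10.0.0.0/8", "192.0.2.1"))
first.process_pending()
del first  # simulate the crash of the first instance

# Another service sends a message while the route service is down.
mailbox.send(("172.16.0.0/12", "192.0.2.2"))

second = RouteService(store, mailbox, abnormal_exit=True)
second.process_pending()
print(second.stateful, second.routes)
```

The sender never needs to know that the first instance went away: the message simply waits in the queue until the new instance drains it.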

The success of the new instance in surviving the stateful initialization depends on the cause of failure of the previous instance. If the service cannot survive a few subsequent restart attempts, the restart is considered to have failed.

In cases where the stateful restart fails, the System Manager performs the action that is specified by the high-availability (HA) policy of the service. This action forces one of the following:

  • Stateless restart
  • No restart
  • A supervisor switchover
  • A reset
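The fallback logic can be sketched in a few lines of Python. This is an assumption-laden illustration rather than how the System Manager is actually implemented: `FallbackAction`, `recover`, and the `max_attempts` parameter are hypothetical names, and the real HA policy is defined per service inside NX-OS.

```python
from enum import Enum


class FallbackAction(Enum):
    """The four outcomes listed above (hypothetical enum, for illustration)."""
    STATELESS_RESTART = "stateless restart"
    NO_RESTART = "no restart"
    SUPERVISOR_SWITCHOVER = "supervisor switchover"
    RESET = "reset"


def recover(try_stateful_restart, max_attempts, fallback):
    """Attempt a stateful restart up to max_attempts times.

    try_stateful_restart is a callable returning True on success.
    If every attempt fails, report the action named by the policy.
    """
    for attempt in range(1, max_attempts + 1):
        if try_stateful_restart():
            return f"stateful restart succeeded on attempt {attempt}"
    return f"stateful restart failed; policy action: {fallback.value}"


# A service that recovers on its third attempt.
outcomes = iter([False, False, True])
print(recover(lambda: next(outcomes), 3, FallbackAction.RESET))

# A service that never recovers, with a switchover policy.
print(recover(lambda: False, 3, FallbackAction.SUPERVISOR_SWITCHOVER))
```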

During a successful stateful restart, there is no delay while the system reaches a consistent state. Stateful restarts reduce the system recovery time after a failure.

Let’s examine a step-by-step example of a stateful restart in action!

  1. During normal operation, the running services make a checkpoint of their run-time state information to the Persistent Storage Service (PSS)
  2. During normal operation, the system manager monitors the health of the running services using heartbeats
  3. The service encounters a fatal error
  4. The system manager restarts the service instantly when it crashes or stops responding
  5. After restarting, the service recovers its state information from the PSS and resumes all pending transactions
  6. If the service does not resume stable operation after multiple restarts, the system manager initiates a reset or switchover of the supervisor
  7. Cisco NX-OS collects the process stack and core for debugging purposes with an option to transfer core files to a remote location
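
Steps 1 through 6 can be walked through with a small Python simulation. Treat it as a rough sketch under stated assumptions: the `pss` dictionary stands in for the Persistent Storage Service, a `True` return value from `work` stands in for a heartbeat, and `Service`, `run`, `fault_ticks`, and `max_restarts` are hypothetical names chosen for this example. Step 7 (core collection) is not modeled.

```python
class Service:
    """Toy service whose only state is a counter of completed work units."""

    def __init__(self, pss):
        self.pss = pss
        # Step 5: recover state from the last checkpoint, if there is one.
        self.counter = pss.get("counter", 0)

    def work(self, fail=False):
        """Do one unit of work; return True as a heartbeat, False on a fault."""
        if fail:
            return False                     # Step 3: fatal error, no heartbeat
        self.counter += 1
        self.pss["counter"] = self.counter   # Step 1: checkpoint the state
        return True


def run(ticks, fault_ticks, max_restarts):
    """Monitor one service for a number of ticks and restart it on failure."""
    pss = {}                                 # stand-in for the PSS
    events = []
    service = Service(pss)
    restarts = 0                             # consecutive restarts so far
    for tick in range(1, ticks + 1):
        # Step 2: the manager checks for a heartbeat on every tick.
        if service.work(fail=tick in fault_ticks):
            restarts = 0                     # stable again, clear the count
            continue
        if restarts == max_restarts:
            # Step 6: the service is not stabilizing, so escalate.
            events.append(f"tick {tick}: escalate to reset or switchover")
            break
        # Step 4: restart the service right away.
        restarts += 1
        service = Service(pss)
        events.append(f"tick {tick}: restarted with counter={service.counter}")
    return service.counter, events


# One transient fault: the service resumes from its checkpoint.
print(run(ticks=6, fault_ticks={3}, max_restarts=3))

# A persistent fault: restarts are exhausted and the manager escalates.
print(run(ticks=10, fault_ticks={3, 4, 5, 6}, max_restarts=3))
```

In the first run the restarted service continues from the checkpointed counter instead of starting over at zero, which is the whole point of a stateful restart.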

I hope this has been informative for you, and I would like to thank you for reading!
