
Cisco Nexus Stateful Fault Recovery

CCNA Data Center

As discussed in the previous post in this category, Cisco NX-OS Software isolates the control plane from the data plane within a Nexus device. This isolation means that a failure within one plane does not disrupt the other plane. Great!

In this post, let’s elaborate on the fact that a process can not only be restarted if it fails, but restarted statefully, meaning the new instance of the process has the state information that existed prior to the failure.

When a restartable service fails, it is restarted on the same supervisor. If the new instance of the service determines that the operating system abnormally terminated the previous instance, the service then determines whether a persistent context exists.

The initialization of the new instance attempts to read the persistent context to build a run-time context that makes the new instance appear like the previous one. After the initialization is complete, the service resumes the tasks that it was performing when it stopped. During the restart and initialization of the new instance, other services are unaware of the service failure. Any messages that other services send to the failed service are available from the Message and Transaction Service (MTS) when the service resumes.
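
To make the MTS behavior a little more concrete, here is a minimal Python sketch of the idea: messages are held by a broker rather than by the receiving service, so nothing is lost while the service restarts. The `MessageBroker` class and the "ospf" service name are assumptions made up for illustration; this is not how NX-OS implements MTS.

```python
from collections import deque


class MessageBroker:
    """Toy stand-in for MTS: holds messages per service until they are read."""

    def __init__(self):
        self._queues = {}

    def send(self, service_name, message):
        # Senders never check whether the receiver is alive.
        self._queues.setdefault(service_name, deque()).append(message)

    def drain(self, service_name):
        # Hand over everything queued for the service, oldest first.
        queue = self._queues.setdefault(service_name, deque())
        messages = list(queue)
        queue.clear()
        return messages


broker = MessageBroker()
broker.send("ospf", "hello-1")      # queued before the failure
# ... the hypothetical "ospf" service fails here ...
broker.send("ospf", "lsa-update")   # sent while the service is down
# ... a new instance starts and collects what it missed ...
pending = broker.drain("ospf")
print(pending)  # ['hello-1', 'lsa-update']
```

Because the sender only ever talks to the broker, it has no way of noticing that the receiver went away for a moment, which matches the "other services are unaware" behavior described above.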

The success of the new instance in surviving the stateful initialization depends on the cause of failure of the previous instance. If the service is unable to survive a few subsequent restart attempts, the restart is considered to have failed.

In cases where the stateful restart fails, the System Manager performs the action that is specified by the high-availability (HA) policy of the service. This action forces one of the following:

  • Stateless restart
  • No restart
  • A supervisor switchover
  • A reset
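
The escalation logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `recover` function, the `FallbackAction` names, and the limit of three attempts are hypothetical, since the actual threshold and policy encoding used by the System Manager are not covered here.

```python
from enum import Enum


class FallbackAction(Enum):
    STATELESS_RESTART = "stateless restart"
    NO_RESTART = "no restart"
    SUPERVISOR_SWITCHOVER = "supervisor switchover"
    RESET = "reset"


def recover(try_stateful_restart, ha_policy_action, max_attempts=3):
    """Try stateful restarts; fall back to the HA policy action if they fail.

    try_stateful_restart: callable returning True when the new instance is stable.
    max_attempts: hypothetical limit, chosen only for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        if try_stateful_restart():
            return f"stateful restart succeeded on attempt {attempt}"
    return f"stateful restart failed, HA policy action: {ha_policy_action.value}"


# A service that stabilizes on its second restart.
outcomes = iter([False, True])
print(recover(lambda: next(outcomes), FallbackAction.STATELESS_RESTART))

# A service that never stabilizes.
print(recover(lambda: False, FallbackAction.SUPERVISOR_SWITCHOVER))
```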

During a successful stateful restart, there is no delay while the system reaches a consistent state. Stateful restarts reduce the system recovery time after a failure.

Let’s examine a step-by-step example of a stateful restart in action!

  1. During normal operation, the running services make a checkpoint of their run-time state information to the Persistent Storage Service (PSS)
  2. During normal operation, the system manager monitors the health of the running services using heartbeats
  3. The service encounters a fatal error
  4. The system manager restarts the service instantly when it crashes or stops responding
  5. After restarting, the service recovers its state information from the PSS and resumes all pending transactions
  6. If the service does not resume a stable operation after multiple restarts, the system manager initiates a reset or switchover of the supervisor
  7. Cisco NX-OS collects the process stack and core for debugging purposes with an option to transfer core files to a remote location
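
Steps 1 through 5 can be modeled with a small Python simulation. Treat it as a rough sketch: `PersistentStore`, `Service`, `supervise`, and the route data are invented for illustration, and the real PSS and System Manager are far more involved.

```python
class PersistentStore:
    """Toy stand-in for the PSS: its contents survive service restarts."""

    def __init__(self):
        self._data = {}

    def checkpoint(self, service_name, state):
        self._data[service_name] = dict(state)  # store a copy

    def restore(self, service_name):
        return dict(self._data.get(service_name, {}))


class Service:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.alive = True
        # Step 5: a new instance rebuilds its run-time state from the store.
        self.state = store.restore(name)

    def learn_route(self, prefix, next_hop):
        self.state[prefix] = next_hop
        # Step 1: checkpoint after every state change.
        self.store.checkpoint(self.name, self.state)

    def heartbeat(self):
        return self.alive


def supervise(service):
    """Steps 2 and 4: check the heartbeat and restart the service if it is missing."""
    if service.heartbeat():
        return service
    return Service(service.name, service.store)


pss = PersistentStore()
svc = Service("routing", pss)
svc.learn_route("10.0.0.0/8", "192.0.2.1")
svc.alive = False      # Step 3: the service encounters a fatal error
svc = supervise(svc)   # Step 4: the supervisor starts a new instance
print(svc.state)       # {'10.0.0.0/8': '192.0.2.1'}
```

The key design point is that the state lives outside the process, so the replacement instance starts with everything the failed instance had checkpointed.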

I hope this has been informative for you, and I would like to thank you for reading!

Cisco Nexus Functional Planes

Cisco Nexus

One of the key Cisco Nexus switch features for ensuring high availability and high performance is the separation of traffic, and of the processing of that traffic, into what are called planes. The three main planes are:

  • Data
  • Control
  • Management

Data refers to packets that are being transferred between systems, for example, the packets that make up a website that a client is accessing. Control traffic is the traffic that makes the infrastructure functional and intelligent, for example, Spanning Tree Protocol traffic at Layer 2 and OSPF traffic at Layer 3. Finally, management traffic might consist of SSH access and SNMP packets.

Notice the illustration above: it shows different forms of traffic flowing through the device. From the bottom up, the traffic flows shown are data, services, control, and management traffic. Notice how interface Access Control Lists (ACLs) can restrict all of these traffic forms on ingress. Control Plane Policing (CoPP) rate-limits the control, services, and management traffic so that the CPU does not suffer a denial of service, malicious or otherwise.
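
To give a feel for what "policing" means, here is a generic token-bucket policer in Python. This is an assumption on my part about the general technique, not a description of how CoPP is implemented in Nexus hardware; the class name, the rate of 2 packets per second, and the burst of 3 are all made up for the example.

```python
class TokenBucketPolicer:
    """Token-bucket policer: admit packets while tokens remain, drop the rest."""

    def __init__(self, rate_pps, burst):
        self.rate_pps = rate_pps      # tokens added per second
        self.burst = burst            # bucket capacity
        self.tokens = float(burst)    # start with a full bucket
        self.last_time = 0.0

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last_time
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last_time = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical traffic class: 2 packets per second toward the CPU, burst of 3.
policer = TokenBucketPolicer(rate_pps=2, burst=3)
# Five packets arrive at t=0, then one more at t=1.0.
results = [policer.allow(0.0) for _ in range(5)] + [policer.allow(1.0)]
print(results)  # [True, True, True, False, False, True]
```

The burst of five packets is cut off after three, which is the point: a flood of control or management traffic gets dropped before it can monopolize the CPU, while normal traffic arriving later is admitted again.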

Notice also from the graphic the intentional separation of the control plane traffic and the data traffic. By design, the data traffic is switched through the system while bypassing the control plane. This adds stability and performance to the system.

Something else to consider in the Nexus architecture is the ability for failed services to restart and (hopefully) not affect forwarding on the device. A System Manager watches over the processes running on the system and can restart them in a stateful manner (thanks to a setting called the HA Policy). The process can restart with state information thanks to a Persistent Storage Service that the System Manager can access for the previous state information for the process.

This post is a high-level overview of a subject covered in detail in the 200-155 course at CBT Nuggets, releasing in June of 2018.