Failure Processing

The ERFT ring mode state machine performs the following failure processing. A failure is defined as an unrecoverable fault that prevents the EXNET-ONE card from participating on the ring. Anything that causes the Ring Controller State Machine to reset, other than the host bringing the ring out of service (OOS), is considered a fault.

Ring Status Report

When a failure occurs, the Ring Status Report (0x72) message notifies the host that the EXNET-ONE card has gone out of service. A global failure count is incremented and internal loopback diagnostics are performed. If the diagnostics succeed and the failure threshold has not been reached, the ring state machine is reset.

Reset

All slave nodes transition to the "waiting for addition" state, while the master node brings the ring back up. Before starting ring initialization, the master sends a "reset" message to all slaves. This generates an "in service" event, causing all slaves to re-participate on the ring. After the third unsuccessful attempt to bring the ring up, the master node will no longer be the master node.

A slave node will not intentionally bring down the ring. If a slave node cannot be brought onto the ring, it notifies the host that it has reset and is waiting to be passively added.

If there is a standby node available, a switchover is performed. If there is no standby node, mastership re-arbitration is performed.
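The master's retry and hand-off behavior described above can be sketched as follows. This is a minimal illustration only, not the actual firmware: the function name `master_recover`, the `try_bring_up` callback, and the returned strings are all hypothetical, and the sketch assumes the switchover or re-arbitration decision is taken once the third bring-up attempt has failed.

```python
MAX_BRING_UP_ATTEMPTS = 3  # the text gives up mastership after the third failed attempt


def master_recover(try_bring_up, standby_available):
    """Hypothetical model of the master's reset path.

    try_bring_up: callable taking the attempt number (1-based) and
                  returning True if the ring came up.
    standby_available: True if a standby node exists.
    Returns "ring_up", "switchover", or "re_arbitration".
    """
    for attempt in range(1, MAX_BRING_UP_ATTEMPTS + 1):
        if try_bring_up(attempt):
            return "ring_up"
    # All attempts failed: this node is no longer the master.
    # Assumption: the standby check happens at this point.
    return "switchover" if standby_available else "re_arbitration"
```

For example, `master_recover(lambda attempt: False, standby_available=True)` returns `"switchover"` after three failed attempts.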

Failed State

If at any time diagnostics fail or the failure count exceeds a threshold (currently 12), the EXNET-ONE raises an alarm to the host and transitions to the "failed" state. Once in the "failed" state, the EXNET-ONE will not participate on the ring until the "host service state" is toggled and the card passes diagnostics.
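The failure path as a whole (status report, failure count, diagnostics, then reset or "failed") can be modeled roughly as shown here. The class, method, and state names are hypothetical and exist only for illustration. The sketch assumes the card fails when the count is strictly greater than 12, and it leaves the count unchanged on recovery, since the text does not say whether the count is cleared.

```python
from enum import Enum, auto

FAILURE_THRESHOLD = 12  # "currently 12" per the text


class CardState(Enum):
    IN_SERVICE = auto()
    RESET = auto()    # ring state machine has been reset
    FAILED = auto()   # card stays off the ring


class FailureProcessor:
    """Hypothetical model of the failure path; not the actual firmware."""

    def __init__(self, run_diagnostics, notify_host):
        self.run_diagnostics = run_diagnostics  # callable, True when diagnostics pass
        self.notify_host = notify_host          # callable taking a message string
        self.failure_count = 0
        self.state = CardState.IN_SERVICE

    def on_failure(self):
        if self.state is CardState.FAILED:
            return self.state  # already off the ring
        self.notify_host("Ring Status Report (0x72): out of service")
        self.failure_count += 1
        diagnostics_ok = self.run_diagnostics()
        # Assumption: "exceeds" means strictly greater than the threshold.
        if not diagnostics_ok or self.failure_count > FAILURE_THRESHOLD:
            self.notify_host("alarm: card failed")
            self.state = CardState.FAILED
        else:
            self.state = CardState.RESET
        return self.state

    def on_host_service_toggle(self):
        # Leaving "failed" requires the toggle and a diagnostics pass.
        if self.state is CardState.FAILED and self.run_diagnostics():
            self.state = CardState.RESET
        return self.state
```

A card whose diagnostics always pass is reset on each of its first 12 failures under this model and enters the "failed" state on the 13th.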

Mastership Re-Arbitration Logic

With ERFT, a node can be removed from the ring without that node intentionally bringing down the ring. Only the master node can induce a re-arbitration. If the master node fails to bring the ring up after three attempts or transitions to the "failed" state, the master node will no longer be the master node. If the master EXNET-ONE is removed or brought out of service, the Matrix Controller card induces a re-arbitration.

When the master node induces a re-arbitration, it stays out of the arbitration for 30 seconds. If no other node has become master by the end of that period, the previous master node tries to become master again. This happens when only a single node is configurable as master.
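The 30-second hold-off can be expressed as a single predicate. The function and parameter names below are hypothetical, and treating the boundary as inclusive (exactly 30 seconds counts as elapsed) is an assumption.

```python
HOLD_OFF_SECONDS = 30.0  # hold-off period given in the text


def previous_master_may_contend(elapsed_seconds, new_master_declared):
    """Return True if the previous master should try to become master again.

    elapsed_seconds: time since this node induced the re-arbitration.
    new_master_declared: True once any other node has become master.
    """
    return (not new_master_declared) and elapsed_seconds >= HOLD_OFF_SECONDS
```

With a single master-configurable node, `new_master_declared` never becomes true, so the previous master rejoins once the hold-off expires.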

There is one case where a slave node can force a re-arbitration. This occurs if the slave node stops hearing from the master node (for example, the master node chassis is off). In this case, all slave nodes force a re-arbitration.

Real-time Software Link Validation

ERFT gives the state machine the ability to detect, isolate, and recover from a single point of failure without dropping existing connections. Software link validation is identical to ring initialization, except that the data layer is not validated. Only the physical and link layers are validated. A faulty link can be isolated and the ring healed while calls are maintained. Each link can be revalidated at a rate of approximately 20 milliseconds per node, resulting in an online revalidation of a 32-node system within about 700 milliseconds.
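The revalidation estimate follows directly from the per-node rate. The short calculation below uses a hypothetical helper name and assumes links are revalidated one after another with no fixed overhead, which the text does not state.

```python
PER_NODE_MS = 20  # approximate per-node revalidation time from the text


def revalidation_time_ms(node_count):
    """Estimated time to revalidate every link, assuming sequential validation."""
    return node_count * PER_NODE_MS


print(revalidation_time_ms(32))  # 640, consistent with "about 700 milliseconds"
```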