Post-Mortem Mar/3: end points down on shard 0 and shard 1

Post-mortem for the March 3rd end point outage

Summary

On March 3rd around 12:50pm, we found that 30+ mainnet nodes were "red" on UptimeRobot. Those nodes included the nodes behind the load balancers of the public end points, https://api.s0.t.hmny.io and https://api.s1.t.hmny.io, so the public end points were unavailable for a few hours before they were restored. External wallets like TrustWallet were unable to report the correct mainnet token balances to customers. The shard 0 end point was restored at 2:10pm; the shard 1 end point was restored around 6pm.

Customer Impact

All wallet customers, including TrustWallet and MathWallet users, were unable to check their balances. Consensus on both shard 0 and shard 1 was not impacted; foundational node runners were still able to participate in consensus and earn block rewards.

People Involved

  • Leo

  • Daniel

  • Andy

Timeline

  • Mar/3 11:50am On the devop host, Daniel used the run_on_shard.sh script and accidentally terminated the node process and removed the harmony_db directories on 30+ mainnet nodes.

  • Mar/3 12:50pm We found the offline mainnet nodes on UptimeRobot

  • Mar/3 1pm Foundational node runners reported RPC issues on mainnet

  • Mar/3 2:10pm RPC end point api.s0.t.hmny.io was restored

  • Mar/3 5:30pm RPC end point api.s1.t.hmny.io was restored

  • Mar/4 1am All nodes on mainnet were restored

Root Cause Analysis

The end points failed because the harmony node processes on the shard 0 and shard 1 explorer nodes, which serve the public end points, were terminated by a misdirected operational script. The engineer meant to terminate processes on the Open Staking testnet, but the script fell back to the default mainnet profile and terminated processes on mainnet instead. The engineer interrupted the script after it had been running for a while, but the damage to mainnet was already done. The same devop host is used for both testnet and mainnet operations, which is why the script was able to reach mainnet.
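
To illustrate the failure mode and one possible fix, here is a minimal sketch of a guard for a script like run_on_shard.sh. It assumes the script currently falls back to the HMY_PROFILE environment variable; the argument check and the mainnet confirmation prompt below are hypothetical additions, not the actual script.

    #!/usr/bin/env bash
    # Hypothetical guard for a run_on_shard.sh-style script: never fall back to a
    # silently inherited profile, and require explicit confirmation for mainnet.
    set -euo pipefail

    PROFILE="${1:-}"
    if [ -z "$PROFILE" ]; then
        echo "usage: $0 <profile> <command...>" >&2
        echo "refusing to fall back to HMY_PROFILE (currently: '${HMY_PROFILE:-unset}')" >&2
        exit 1
    fi

    if [ "$PROFILE" = "mainnet" ]; then
        read -r -p "About to run on MAINNET. Type 'mainnet' to confirm: " answer
        [ "$answer" = "mainnet" ] || { echo "aborted" >&2; exit 1; }
    fi

    shift
    echo "running on profile '$PROFILE': $*"
    # ... dispatch "$@" to the nodes of the selected profile here ...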

Questions to ask

1. How was the problem reported? How long did it take to respond to the issue? How could we cut that time in half?

The problem was found by our engineer monitoring the status.harmony.one page and was also reported by foundational node runners in the #foundational-node channel, roughly 45 minutes after the incident. The response time could be cut in half if PagerDuty paged the on-call engineer immediately when the processes were terminated.
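
As a sketch of the missing detective control, a small cron-able health check could probe the public RPC end point and open a PagerDuty incident when it stops answering. This assumes a PagerDuty Events API v2 routing key and that the end point serves the hmy_blockNumber JSON-RPC call; the routing key below is a placeholder.

    #!/usr/bin/env bash
    # Hypothetical liveness probe for a public RPC end point. If the node stops
    # answering, page the on-call engineer via the PagerDuty Events API v2.
    ENDPOINT="https://api.s0.t.hmny.io"
    ROUTING_KEY="REPLACE_WITH_PAGERDUTY_ROUTING_KEY"   # placeholder

    # Probe the end point (assumes it serves the hmy_blockNumber JSON-RPC method).
    if ! curl -sf -m 10 -X POST "$ENDPOINT" \
            -H 'Content-Type: application/json' \
            -d '{"jsonrpc":"2.0","method":"hmy_blockNumber","params":[],"id":1}' \
            | grep -q '"result"'; then
        # Trigger a PagerDuty incident so the on-call engineer is paged immediately.
        curl -s -X POST 'https://events.pagerduty.com/v2/enqueue' \
            -H 'Content-Type: application/json' \
            -d "{\"routing_key\":\"$ROUTING_KEY\",\"event_action\":\"trigger\",\"payload\":{\"summary\":\"RPC end point $ENDPOINT not responding\",\"source\":\"rpc-healthcheck\",\"severity\":\"critical\"}}"
    fi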

2. How long did it take to mitigate the issue? How could we cut that time in half?

It took about 1.5 hours to restore the shard 0 end point, which unblocked balance checking for TrustWallet and the exchanges. It took about 6 hours to restore the shard 1 end point. The time could be cut in half if we had a backup node for the end points; we had set up two end point nodes, but both of them were terminated and had their DBs removed. It took around 1 hour to fully rclone the full DB for an explorer node.
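
For reference, the DB restore step has roughly the following shape; the rclone remote name and bucket path below are placeholders, not the actual snapshot location.

    #!/usr/bin/env bash
    # Rough shape of the explorer-node DB restore, assuming an rclone remote
    # named "snapshot" that points at a DB snapshot bucket (placeholder names).
    set -euo pipefail

    SHARD=0
    NODE_DIR="$HOME"   # harmony node working directory (assumed)

    # Stop the node before replacing the database.
    sudo systemctl stop harmony || pkill -f harmony || true

    # Pull the latest harmony_db snapshot for the shard; -P shows progress.
    rclone sync -P "snapshot:harmony-db/mainnet/harmony_db_${SHARD}" \
        "${NODE_DIR}/harmony_db_${SHARD}"

    # Restart the node however it is normally launched, e.g. via systemd.
    sudo systemctl start harmony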

3. Are there any pending items that can help prevent this issue? If so, why wasn't it prioritized?

We have a pending item to create a separate devop host for mainnet only. In the original design, every developer should have their own devop host for operational tasks. Unfortunately, this policy wasn't enforced for everyone.

4. Can this issue be detected in a test environment? If not, why?

No. This was an operational mistake on production infrastructure, so it could not have been caught in a test environment.

5 Whys?

  1. Why were the end points of shard 0 and shard 1 down? Because the explorer node processes behind the RPC end points were accidentally killed and their harmony_db directories were cleared by an engineer, so the RPC end points could no longer serve clients.

  2. Why did the engineer terminate the node processes and clear the DB on the explorer nodes? Our devop engineer was trying to run a script to terminate the node processes and clear the DB directories on the Open Staking testnet. However, the script executed the commands on mainnet nodes.

  3. Why was the script executed on mainnet nodes instead of the Open Staking testnet nodes? The run_on_shard.sh script takes a default profile from the HMY_PROFILE environment variable. In the execution environment, HMY_PROFILE was set to the mainnet profile by default, and the devop engineer forgot to override it, so the script ran against mainnet nodes.

  4. Why could the script execute commands on both the Open Staking testnet and mainnet? All the nodes we launch share the same key pair, so the script can execute commands on any node, regardless of whether it is on testnet or mainnet (see the key separation sketch after this list).
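
A sketch of the corresponding preventive item, assuming nodes are reached over SSH; the host patterns and key file names below are hypothetical. With a separate key pair per environment, a devop host that only holds the testnet key cannot reach mainnet nodes at all.

    # Hypothetical ~/.ssh/config entries: one key per environment, so a devop
    # host that only holds the testnet key cannot reach mainnet nodes.
    {
        printf 'Host mainnet-*\n'
        printf '    IdentityFile ~/.ssh/id_ed25519_mainnet\n'
        printf '    IdentitiesOnly yes\n\n'
        printf 'Host ostn-*\n'
        printf '    IdentityFile ~/.ssh/id_ed25519_ostn\n'
        printf '    IdentitiesOnly yes\n'
    } >> ~/.ssh/config

    # Generate the two key pairs, one per environment.
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_mainnet -C "mainnet ops"
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_ostn    -C "open staking testnet ops"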

Action Items

Category   | Description                                  | Owner  | Tracker
-----------|----------------------------------------------|--------|--------
Corrective | Restore RPC end points                       | Leo    | Done
Preventive | Enforce profile parameter on script          | Daniel | PR
Detective  | Restore PagerDuty paging                     | Leo    | Done
Preventive | Build a separate devop host                  | Leo    | Issue
Preventive | Use a different key pair for mainnet/testnet | Leo    | Issue