At approximately 3:37pm Eastern Time on January 2nd, 2018, internal monitoring alerted our team to a critical failure of our search infrastructure. Parakeet's search infrastructure powers a wide range of capabilities, including call logs, call summaries, and contact lookups. This system is designed to be fault tolerant and highly available, and an outage of this kind should not have occurred. As a result, we want to share our post-mortem analysis of the failure with you, so you can understand what happened and how we will work to avoid this problem in the future.
In the week of December 18th, as part of routine maintenance, our team performed an upgrade of our search cluster. Per our standard operating procedures, we increased our cluster capacity from 3 instances to 6 using our standard cloud management tooling. After migrating data from the old nodes and shutting them down, we closed up and moved on to other tasks. Unfortunately, due to human error, the shutdown of the old nodes was never recorded in our cloud management software, which still listed a desired capacity of 6 instances. Seeing fewer running instances than expected, our cloud provider restarted 3 additional instances of the old version, leaving us in a hybrid state and causing a small number of errors to trickle in over the holidays.
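This failure mode is common to any declarative infrastructure tool: when the recorded desired state drifts from reality, the reconciler will "undo" changes that were made on purpose. The following is a minimal sketch of the drift, with hypothetical names that are not our actual tooling:

```python
# Minimal sketch of declarative reconciliation drift (hypothetical names,
# not our actual tooling). The desired state still says 6 nodes, so after
# 3 old nodes are shut down by hand, the reconciler launches 3 replacements
# from the old image.

def reconcile(desired_count, running, launch_image):
    """Launch or remove instances until len(running) == desired_count."""
    running = list(running)
    while len(running) < desired_count:
        running.append(launch_image)  # new instances boot from the *old* image
    while len(running) > desired_count:
        running.pop()
    return running

# Manual upgrade: 3 new nodes live, 3 old nodes shut down by hand,
# but desired_count was never updated from 6.
cluster = reconcile(desired_count=6,
                    running=["new-1", "new-2", "new-3"],
                    launch_image="old")
# The cluster is now a hybrid of new and old nodes.
```

The fix is procedural, not technical: every manual change has to be reflected back into the declared state before the reconciler next runs.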
On January 2nd, at approximately 3:30pm, our team traced the small increase in errors to the hybrid cluster. We took steps to shut down the incorrect instances by manually logging in to our cloud provider dashboard and terminating them. Immediately afterwards, while those nodes were still shutting down, we decreased the cluster size to 3 and committed the change to our infrastructure management tooling. Unfortunately, because the old nodes had not yet finished shutting down, the tooling counted them as running, concluded that we still had 6 live nodes, and terminated the 3 healthy nodes in an effort to bring the total count down to 3. The result was a complete shutdown of every node in our search cluster, resulting in a loss of service for all customers.
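The termination race can be reproduced in a few lines. The key mistake is that the reconciler counts nodes in a draining state as running, so it picks its victims from the healthy set. This is a simplified sketch with hypothetical instance states, not our real cloud API:

```python
# Simplified sketch of the termination race (hypothetical states, not a
# real cloud API). Draining nodes count toward the running total, so
# reducing desired_count to 3 terminates the healthy nodes instead.

def scale_down(nodes, desired_count):
    """Terminate nodes until the apparent running count equals desired_count."""
    # BUG: "shutting-down" is treated as if it were "running".
    apparent = [n for n, s in nodes.items() if s in ("running", "shutting-down")]
    excess = len(apparent) - desired_count
    for name in apparent:
        if excess == 0:
            break
        # Only fully running nodes can actually be terminated, and those
        # are exactly the healthy ones.
        if nodes[name] == "running":
            nodes[name] = "terminated"
            excess -= 1
    return nodes

nodes = {
    "new-1": "running", "new-2": "running", "new-3": "running",
    "old-1": "shutting-down", "old-2": "shutting-down", "old-3": "shutting-down",
}
scale_down(nodes, desired_count=3)
# All three healthy nodes are terminated; once the old nodes finish
# draining, zero nodes remain.
```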
Within minutes, our monitoring tools notified us of the problem. Thankfully, we have backups of critical infrastructure components available at all times. The team scrambled to restore the disks from the terminated instances and migrate the data to new instances. Within 10 minutes, we had restored two nodes in the cluster and brought service back to customers. The third node came up shortly after.
Going forward, we will be implementing new infrastructure change policies that require our cloud infrastructure to stabilize before any manual or automated changes are applied. Had we waited for the old nodes to finish shutting down, our infrastructure management tooling would have seen an accurate node count and left the healthy cluster running. Additionally, we will be working to reduce the hard dependency of other Parakeet features on this particular component in future system upgrades.
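In practice, the new policy amounts to a guard that blocks further changes until no instance is in a transitional state. A sketch of such a guard, assuming a hypothetical `list_instances` poller rather than any real cloud SDK:

```python
# Sketch of a "wait for stabilization" guard (hypothetical poller, not a
# real cloud SDK call). Changes are refused until no instance is in a
# transitional state such as "pending" or "shutting-down".

import time

TRANSITIONAL = {"pending", "shutting-down"}

def wait_for_stable(list_instances, timeout_s=600, poll_s=10):
    """Block until every instance is in a steady state, or raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        states = [inst["state"] for inst in list_instances()]
        if not any(s in TRANSITIONAL for s in states):
            return states
        time.sleep(poll_s)
    raise TimeoutError("cluster did not stabilize; refusing to apply changes")
```

Only after this guard returns would the tooling be permitted to commit a new desired node count, which would have prevented the race described above.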
Since our launch in August of 2017, this issue represents one of only two system-wide outages affecting all Parakeet customers. Each was resolved by our team within minutes, but each was painful for the customers we let down. Our team takes these issues seriously, and each one is an opportunity to learn from our mistakes and improve the service we offer you. I sincerely apologize to everyone affected by this outage, and I hope you will give us the opportunity to continue serving you in the future.
Founder & CEO