Elevated Frontpage Errors

Incident Report for Terms of Service; Didn't Read

Postmortem

Executive Summary

On the 11th of October, our organization experienced a service disruption due to an invalid reverse proxy configuration deployment. This incident resulted in a downtime of about 15 minutes and affected one service. The purpose of this postmortem is to document the incident, identify the root causes, and outline the steps taken to prevent a similar occurrence in the future.

Incident Overview

Incident Timeline (Central European Time):

21:44 - The reverse proxy configuration change was deployed.
21:57 - Users began reporting issues.
21:59 - Operations team was alerted to the incident.
21:59 - Investigation began.
22:00 - Invalid reverse proxy configuration identified.
22:04 - Rollback to the previous configuration initiated.
22:10 - Services were fully restored.
22:15 - Post-incident analysis commenced.

Description:

On the 11th of October, a deployment was made to update the reverse proxy configuration for tosdr.org. Shortly after the deployment, users started reporting errors and issues accessing the service. The operations team promptly initiated an investigation.

Incident Analysis

Root Cause:

Invalid Configuration: The root cause of this incident was an invalid reverse proxy configuration. The configuration change introduced errors, causing the reverse proxy to misroute requests and disrupt service availability.

Contributing Factors:

Lack of Validation: There was a lack of thorough validation and testing of the configuration change before deployment. The change was made without proper testing and validation procedures, which could have identified the issues in advance.
Inadequate Rollback Plan: While there was an automatic rollback procedure in place, it failed to execute immediately after the issues were detected, resulting in a longer downtime than necessary.

Incident Response

Immediate Actions Taken:

The operations team was alerted as soon as the issue was reported.
Investigation began immediately to identify the cause of the problem.
After identifying the invalid configuration, a rollback to the previous configuration was initiated to restore service.

Resolution Time:

The services were fully restored after 15 minutes.

Conclusion

By implementing the corrective actions and preventive measures such as investigating the automatic rollback system, we aim to reduce the likelihood of similar incidents in the future and improve the overall reliability and resilience of our systems.

We apologize for any inconvenience this incident may have caused and appreciate the dedication and hard work of the team members who responded swiftly to restore services.

Post-Incident Review

This postmortem will be reviewed periodically, and progress on corrective actions will be monitored. The incident response and resolution process will also be assessed and improved as needed.

Posted Oct 11, 2023 - 22:13 CEST

Resolved

A non self-healing error in a reverse proxy node configuration made tosdr.org unreachable for about 15 Minutes.

Posted Oct 11, 2023 - 22:00 CEST