Incident Report: 01/07/2024 AWS Server Inaccessibility and Data Migration

On 1st July 2024 at 8:30 AM IST, some of our customers' instances became inaccessible because one of our AWS servers repeatedly failed despite multiple reboot attempts. To resolve the issue and ensure service continuity, it was necessary to migrate all data to a different server, resulting in a few hours of downtime. There was no hack attempt or data breach on the server.

 

Incident Details
Parent Incident: #INC_DB_00003
Number of Child Incidents: 0
Date(s): 01/07/2024 to 02/07/2024
Start Time: 08:30 AM IST
End Time: 06:30 AM IST

Timeline:

[01/07/2024 08:30 AM IST]: Initial detection of server inaccessibility.

[01/07/2024 08:45 AM IST]: First reboot attempt by Server team.

[01/07/2024 09:00 AM IST]: Continued inaccessibility observed.

[01/07/2024 09:15 AM IST]: Subsequent reboot attempts and troubleshooting.

[01/07/2024 11:00 AM IST]: Decision made to migrate data to a new server.

[01/07/2024 11:15 AM IST]: Incident reported to the IT support team.

[01/07/2024 11:30 AM IST]: New server set up and configured.

[01/07/2024 12:30 PM IST]: Data migration initiated. Instances were restored and made available one by one.

[02/07/2024 02:30 AM IST]: Migration completed. 

[02/07/2024 03:00 AM IST]: Systems tested for stability and integrity.

[02/07/2024 05:30 AM IST]: Services gradually restored.

[02/07/2024 06:30 AM IST]: Full resolution and return to normal operations.
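
For reference, the elapsed times implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline, while the helper name hours_between and the variable names are hypothetical and not part of any internal tooling.

```python
from datetime import datetime

# Day/month/year with a 12-hour clock, matching the timeline entries (all IST).
FMT = "%d/%m/%Y %I:%M %p"


def hours_between(start: str, end: str) -> float:
    """Return the elapsed hours between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


# Timestamps taken from the timeline entries.
detected = "01/07/2024 08:30 AM"
migration_started = "01/07/2024 12:30 PM"
migration_completed = "02/07/2024 02:30 AM"
resolved = "02/07/2024 06:30 AM"

total_hours = hours_between(detected, resolved)
migration_hours = hours_between(migration_started, migration_completed)

print(f"Detection to full resolution: {total_hours:.0f} hours")
print(f"Migration start to completion: {migration_hours:.0f} hours")
```

Because instances were brought back one by one during the migration, an individual instance may have been unavailable for less than the full window.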

Root Cause:

The root cause of the repeated inaccessibility was determined to be underlying hardware issues. Despite multiple reboots and troubleshooting attempts, the server remained inaccessible, necessitating a complete data migration. There were no signs of any hack attempt or data breach.

Data Loss:

There was no data loss on any instance.

Resolution Steps

 

Immediate Actions:

Isolated the affected server to prevent further issues.

 

Diagnosis and Repair:

Conducted a thorough diagnostic to identify the issue.

Decided to migrate data to a new server because the original server remained inaccessible.

 

Data Migration:

 

  • Initiated the data migration process to a new AWS instance.
  • Ensured data integrity and consistency during migration.
  • Brought accounts live one by one as the data migration progressed, to minimise downtime for individual users.
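
This report does not describe how data integrity was verified. One common approach, shown here purely as an assumed sketch, is to compare checksums of each file on the old and new storage before an account is brought live. The function names sha256_of and find_mismatches and the directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_dir: Path, target_dir: Path) -> list[str]:
    """Return relative paths that are missing or differ on the target."""
    mismatches = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = target_dir / rel
        # A file fails the check if it was not copied or its content changed.
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatches.append(rel.as_posix())
    return mismatches
```

Under this assumption, an account would be brought live only when find_mismatches returns an empty list for its data directory.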

 

Testing and Verification:

Performed comprehensive tests to verify the new server's functionality.

Ensured all services were operating normally before bringing the new server online.

 

 

Preventive Measures:

Hardware Monitoring: Enhance monitoring systems to detect early signs of hardware failure.

Regular Maintenance: Schedule more frequent maintenance checks on critical AWS instances.

Backup and Redundancy: Enhance backup and redundancy strategies to minimize downtime in future incidents.
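
As an illustration of the hardware monitoring measure above, the sketch below flags instances whose EC2 status checks report "impaired". It assumes the input has the shape returned by the EC2 DescribeInstanceStatus API. The function name impaired_instances is hypothetical, and the report does not state which monitoring tooling is actually used.

```python
def impaired_instances(response: dict) -> list[str]:
    """Return IDs of instances with an impaired system or instance status check.

    `response` is assumed to follow the EC2 DescribeInstanceStatus format:
    {"InstanceStatuses": [{"InstanceId": ..., "SystemStatus": {"Status": ...},
                           "InstanceStatus": {"Status": ...}}, ...]}
    """
    flagged = []
    for item in response.get("InstanceStatuses", []):
        system = item.get("SystemStatus", {}).get("Status")
        instance = item.get("InstanceStatus", {}).get("Status")
        # The system status check covers the underlying host; the instance
        # status check covers the instance's own software and networking.
        if system == "impaired" or instance == "impaired":
            flagged.append(item["InstanceId"])
    return flagged


# Possible usage with boto3 (not run here; requires AWS credentials):
#   import boto3
#   response = boto3.client("ec2").describe_instance_status(IncludeAllInstances=True)
#   print(impaired_instances(response))
```

A scheduled job could run a check like this and alert the server team when the returned list is not empty.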

If you face any problems, please write to our support team, who will be happy to help.
