On 27th June 2025 at 5:45 PM IST, one of our application instances on the AWS server encountered a critical issue: the application cache was not working correctly, and multiple MySQL queries were running concurrently from the same instance. The combination of these factors caused a significant spike in server load, making the server temporarily inaccessible for approximately 30 minutes.
| Incident Details | |
| --- | --- |
| Parent Incident | #INC_DB_00004 |
| # of Child Incidents | 0 |
| Date(s) | 27/06/2025 |
| Start Time | 05:45 PM IST |
| End Time | 10:00 PM IST |
Timeline:
[27/06/2025 05:45 PM IST]: High server load detected due to multiple stuck MySQL queries from a single instance.
[27/06/2025 05:50 PM IST]: Server team started analysis of logs and performance metrics.
[27/06/2025 06:00 PM IST]: Server rebooted to temporarily restore access.
[27/06/2025 06:15 PM IST]: The problematic instance was identified and shut down to prevent further impact.
[27/06/2025 06:45 PM IST]: Initiated analysis of cache logic and slow MySQL queries from the identified instance.
[27/06/2025 08:30 PM IST]: Completed cache optimization and query fixes.
[27/06/2025 09:30 PM IST]: Post-fix testing and monitoring completed.
[27/06/2025 10:00 PM IST]: Instance brought back online after full validation.
Root Cause:
The issue was triggered by multiple inefficient MySQL queries from a single application instance, which caused a spike in server load. After isolating and shutting down the specific instance, further analysis revealed a cache logic flaw and unoptimised queries, both of which were resolved before bringing the instance back online.
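The cache implementation itself is not included in this report, but the failure mode described above (a cache miss letting many identical MySQL queries run at once) is typically closed with a get-or-set pattern guarded by a lock. The sketch below is a minimal illustration of that idea, assuming a Redis cache and the `mysql-connector-python` driver; the function name `get_report_data`, the table, the connection details, and the TTL values are illustrative placeholders, not the actual application code.

```python
import json
import time

import redis  # assumed cache backend; the report does not name one
import mysql.connector  # assumed MySQL driver

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

CACHE_TTL = 300  # seconds; illustrative value
LOCK_TTL = 30    # max time one caller may hold the refill lock

def get_report_data(report_id: int) -> list:
    """Return cached rows, refilling the cache at most once per miss."""
    key = f"report:{report_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # Only one caller per key recomputes; the rest wait and re-read the cache
    # instead of all hitting MySQL at once (the failure mode in this incident).
    lock = cache.lock(f"lock:{key}", timeout=LOCK_TTL)
    if lock.acquire(blocking=False):
        try:
            conn = mysql.connector.connect(
                host="db-host", user="app", password="***", database="appdb"
            )
            cur = conn.cursor(dictionary=True)
            cur.execute(
                "SELECT id, value FROM report_rows WHERE report_id = %s",
                (report_id,),
            )
            rows = cur.fetchall()
            cur.close()
            conn.close()
            cache.set(key, json.dumps(rows), ex=CACHE_TTL)
            return rows
        finally:
            lock.release()
    else:
        # Another caller is filling the cache; poll briefly instead of
        # issuing yet another copy of the same query against MySQL.
        for _ in range(10):
            time.sleep(0.5)
            cached = cache.get(key)
            if cached is not None:
                return json.loads(cached)
        raise TimeoutError("cache refill did not complete in time")
```

With this shape, only the first caller that misses the cache touches MySQL; concurrent callers wait for the refill instead of piling additional queries onto the database.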
Data Loss:
There was no data loss on any instance.
Resolution Steps:
- Immediate Actions:
Detected the root cause as query overload from one instance.
Rebooted the server to restore temporary access.
Shut down the affected instance to prevent repeated load spikes.
- Diagnosis and Repair:
Reviewed the logs and identified specific problematic queries (an illustrative check is sketched after this list).
Optimised query execution and improved cache logic.
Validated all fixes before resuming service.
- Service Recovery:
Brought the instance back online at 10:00 PM IST after applying all fixes.
- Testing and Verification:
Conducted service-level testing and stability checks.
Monitored query performance and server load post-recovery.
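The report does not show how the stuck queries were identified during diagnosis. As a rough sketch of the kind of check that can surface them, the snippet below lists statements that have been running longer than an assumed threshold by reading `INFORMATION_SCHEMA.PROCESSLIST`; the connection details and the 60-second cutoff are assumptions, not values taken from the incident.

```python
import mysql.connector  # assumed driver; connection details are illustrative

LONG_RUNNING_SECONDS = 60  # assumed threshold for a "stuck" query

def list_long_running_queries() -> list:
    """Return statements that have been executing longer than the threshold."""
    conn = mysql.connector.connect(
        host="db-host", user="admin", password="***"
    )
    cur = conn.cursor(dictionary=True)
    # INFORMATION_SCHEMA.PROCESSLIST shows every active connection, its current
    # statement, and how long that statement has been running.
    cur.execute(
        """
        SELECT id, user, host, db, time, state, info
        FROM information_schema.processlist
        WHERE command = 'Query' AND time > %s
        ORDER BY time DESC
        """,
        (LONG_RUNNING_SECONDS,),
    )
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows

if __name__ == "__main__":
    for q in list_long_running_queries():
        print(f"[{q['time']}s] {q['host']} {q['db']}: {q['info']}")
```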
Preventive Measures:
Application Optimisation: Fixed caching inefficiencies and optimised database queries to prevent similar issues.
Instance Monitoring: Enabled real-time monitoring of query volume per instance.
Load Threshold Alerts: Implemented automated alerts for high query load on any instance (a rough sketch of such a check follows below).
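The monitoring and alerting tooling behind the last two measures is not named in the report. As a hedged sketch of what a per-instance query-load check could look like, the snippet below counts active queries per client host and raises an alert above an assumed threshold; `send_alert`, the threshold of 50, and the connection details are hypothetical placeholders.

```python
from collections import Counter

import mysql.connector  # assumed driver; connection details are illustrative

QUERIES_PER_INSTANCE_THRESHOLD = 50  # hypothetical alert threshold

def send_alert(message: str) -> None:
    """Hypothetical hook; in practice this would page or post to a channel."""
    print(f"ALERT: {message}")

def check_per_instance_load() -> None:
    """Count active queries per client host and alert when a host exceeds the threshold."""
    conn = mysql.connector.connect(
        host="db-host", user="monitor", password="***"
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT host FROM information_schema.processlist WHERE command = 'Query'"
    )
    # HOST is reported as "instance-ip:port"; strip the port to group by instance.
    per_instance = Counter(host.split(":")[0] for (host,) in cur.fetchall())
    cur.close()
    conn.close()

    for instance, count in per_instance.items():
        if count > QUERIES_PER_INSTANCE_THRESHOLD:
            send_alert(f"{instance} is running {count} concurrent queries")
```

Run periodically (for example from a scheduler), a check of this shape would have flagged the single instance responsible for the load spike well before the server became unresponsive.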
In case you face any problems, please write to our support team and they will surely help you!