Resolved - Integrations Failure out of WD1 Data Center – Production Tenants

Posted: Aug 19, 2024 6:49 AM PDT
Updated: Aug 21, 2024 4:04 PM PDT
Retirement date: Aug 19, 2025

Resolved

Alert type: Operations
Release: 2024R1
Using Workday: Administration & Integrations, Custom Integrations & Apps
Product: Integration, Platform and Product Extensions
JIRA: IPESRE-88289
Affected Data Centers: Production: WD1
Resolution Information

Resolution Date: Wed, 08/21/2024 – 16:00 America/Los Angeles (GMT-0700)

We are resolving this alert since the situation has remained stable. We apologize for the disruption caused by this issue and appreciate your understanding. An RCA document will be attached to this alert within 10 business days.

UPDATE Tuesday, August 20, 2024 / 3:00 pm America/Los Angeles -07:00 GMT

The cleanup process that we ran on the backend server is showing sustained improvement and stability. Print processing is now working as expected. We will keep this alert open until 4 pm PT tomorrow to continue closely monitoring the print pool. Thank you for your understanding throughout this issue.


UPDATE Tuesday, August 20, 2024 / 12:30 pm America/Los Angeles -07:00 GMT

Good news! We triggered cleanup processes on our backend server and are seeing improvement in performance. We will continue to monitor this for the next few hours to ensure that the progress is sustained. We will update you again in three hours if everything continues to function well. If not, we will update you sooner. Thank you again for your patience throughout this issue. Rest assured that we will provide a full root cause analysis, including preventative actions, within ten business days of resolution.


UPDATE Tuesday, August 20, 2024 / 11:30 am America/Los Angeles -07:00 GMT

We are currently testing a potential remediation by engaging additional pools of servers to help handle the backlog of print processing tasks. This requires additional steps to ensure that other areas of the application are safe and continue normal processing.  We will have additional details for you within the next hour.  Thank you again for your continued patience. 


UPDATE Tuesday, August 20, 2024 / 10:30 am America/Los Angeles -07:00 GMT

We appreciate the impact that this issue is having on our customers' ability to do business, and we are treating it with the utmost urgency. We have identified extra volume from some of our subprocesses impacting database functionality, which is resulting in slow agent distribution for our print services. The teams that own these subprocesses are engaged and reviewing actions to relieve the stress and restore performance. We will provide another update within the next hour.


UPDATE Tuesday, August 20, 2024 / 09:30 am America/Los Angeles -07:00 GMT

The restart of print services did not correct the performance issues that we are seeing. Our investigation found latency in assigning print agents to tasks, which is impacting the capacity of the servers we have processing tasks. We are in the process of addressing the stress that we see impacting these services to alleviate pressure and allow the print nodes to stabilize. We will have further updates for you within the next hour. Thank you for your patience.


UPDATE Tuesday, August 20, 2024 / 08:30 am America/Los Angeles -07:00 GMT

Thank you for your patience. We are currently restarting the services that manage print processing. We will have further information in the next hour regarding the success of the restart in improving print performance. We will keep you updated as we get more information. The next update will be at 9:30 am PT.


UPDATE Tuesday, August 20, 2024 / 07:30 am America/Los Angeles -07:00 GMT

We are seeing an increase in queued print processes this morning and have reconvened a major incident to identify why the queue is climbing once more.  We will keep you updated as we get more information. We apologize for the disruption that this is causing and will keep you updated at hourly intervals until we have this issue resolved.


UPDATE Tuesday, August 20, 2024 / 02:00 am America/Los Angeles -07:00 GMT

Thank you for your continued patience. Our monitoring indicates that the print queue has drained and remains stable. We will continue to monitor and provide our next update at 11:00 AM PT.


UPDATE Monday, August 19, 2024 / 10:00 pm America/Los Angeles -07:00 GMT

Thank you for your continued patience. Our monitoring indicates that the print queue is continuing to drain gradually. We will continue to monitor and provide our next update at 2:00 am PT.


UPDATE Monday, August 19, 2024 / 5:30 pm America/Los Angeles -07:00 GMT

Thank you for your continued patience. We are actively monitoring the print queue, which is continuing to drain gradually. As we work through both new and backlogged print processes, we are still expecting full recovery to take approximately 24 more hours before print services return to normal processing speeds. We will provide our next update at 10 pm PT.


UPDATE Monday, August 19, 2024 / 4:30 pm America/Los Angeles -07:00 GMT

Thank you for your patience. We are still working to resolve the issue involving print services. At the moment we are seeing the queue gradually drain; however, in recovering from the earlier document storage issue, we built up a backlog of print processes. This caused a delay in system performance, as the system is working its way through not only new processes but also those that got backed up. We expect full recovery to take approximately 25 hours before print services are back up to normal speeds. We will update this Alert again by 5:30 pm PT.


UPDATE Monday, August 19, 2024 / 3:30 pm America/Los Angeles -07:00 GMT

We appreciate your understanding. Our engineering team is currently working on a solution to address the print processing services. We have restarted these services and are looking at potential actions to improve their processing capabilities in order to address the backlog of print tasks. We will update this Alert again at 4:30 pm PT.


UPDATE Monday, August 19, 2024 / 2:15 pm America/Los Angeles -07:00 GMT

Thank you for your patience. We are seeing an influx of catch-up jobs on the print servers, which is causing some print processes to queue and is delaying and, in some cases, failing those print processes. We are working at the highest priority to address this concern and will continue to keep you updated as we take action. We will update this Alert again by 3:30 pm PT.


UPDATE Monday, August 19, 2024 / 12:45 pm America/Los Angeles -07:00 GMT

While the majority of integrations are now completing successfully, we are still getting reports of print job failures out of the WD1 Data Center. Our development teams are engaged and treating this as a top priority to restore print services. We will update this Alert again by 2:00 pm PT.


UPDATE Monday, August 19, 2024 / 09:45 am America/Los Angeles -07:00 GMT

We have identified a change that went into effect during the last maintenance window and impacted performance on our backend document storage servers. We rolled back the change and are now seeing improved performance. At this time, please rerun any failed integrations. You may also use the Mass Actions feature to resubmit multiple integrations.


UPDATE Monday, August 19, 2024 / 09:00 am America/Los Angeles -07:00 GMT

Our tests for the potential fix were unsuccessful. We engaged additional teams to review code changes that went into effect during the last maintenance window. We will continue to provide you with updates every hour until this issue is resolved. We appreciate your understanding.


UPDATE Monday, August 19, 2024 / 08:00 am America/Los Angeles -07:00 GMT

Thank you for your patience. We want to assure you that we are working to get this resolved. We are in the process of testing a potential solution. Once testing is completed, it will take approximately an hour to determine whether the fix is successful. We will update by 9:00 am PT with our results.


ORIGINAL ALERT MESSAGE: 
We wanted to inform you of an issue detected by our internal monitoring regarding Integration failures in Production tenants.

Our document storage system is detecting poor server performance. We are working at the highest priority to address this issue and will keep you updated at hourly intervals until this is resolved.

We apologize for the disruption.  

URL: https://community.workday.com/alerts/customer/1204061