Over the last few days, I’ve noticed that some flows are getting stuck in an auto-retry loop.
I had not seen this before with these flows. To stop it, I canceled the queue and made sure that “Auto-resolve errors with matching trace key” was turned off.
I’m not sure whether others are seeing this as well, but I wanted to point it out and check whether this is a broader issue. It also seems to cause other flows on the same connection to get stuck in the queue, which increases the overall start-to-finish execution time significantly — basically, concurrency hell. I wanted to note this out here on the forum to double check with staff and community members.
Thanks for flagging this. We are working on getting a hotfix deployed as soon as possible. We'll update this thread once the fix is live.
This issue affects the intermittent error auto-retry functionality. What's happening is that retry job records are being created, but the corresponding messages are never actually posted to the queue. This results in orphaned jobs that appear stuck "in-queue" indefinitely.
A few important clarifications:
Your actual flow runs are not impacted. Even though the auto-retries are stuck, successful records will continue to process normally. The stuck retries should also not be blocking your connection queue, since the messages were never posted to it in the first place.
The connection-level auto-recovery for rate limits is a separate mechanism and is working fine — so no need to disable that.
Recommended workaround until the hotfix: We'd suggest keeping "Auto-resolve errors with matching trace key" turned off for now. This prevents new intermittent errors from hitting the same issue. Any errors that come in while it's off will surface normally and can be retried manually. Once the hotfix is deployed, you can safely re-enable it.
The hotfix has been deployed and auto-retry for new flow runs is working normally again in NA and EU. For the retry jobs that were already stuck in "in-queue" before the fix, we plan to run a migration script today to cancel them.
Quick update — the migration script has been run and all previously stuck "in-queue" retry jobs have been canceled. You can safely re-enable "Auto-resolve errors with matching trace key" if you haven't already. Everything should be back to normal.