Hello Ben,
Thank you for reaching out. We are investigating the cause of the refresh token request failures. We recognize that the OAuth2 token refresh flow is sensitive and take these issues very seriously.
I’ll be in touch as soon as I have any findings that are helpful.
@Michael B101 - turning off the process that was running every 15 minutes had no effect on the error (I was concerned we might be hitting the refresh twice at the same time, but that doesn’t seem to be the case); we’re continuing to get the errors when trying to refresh.
This is a production issue, I appreciate any assistance you can provide.
For the request you posted originally, here is the sequence of events that occurred. All times are in `UTC`, and `hashes` refer to non-identifying values that represent an instance of a token; we never have access to the actual token.
- 2024-04-24T00:31:49.000+00:00
  - A user token that had already been refreshed many times is successfully refreshed again, generating token hash `3de319811`.
  - We return a 200 success with the new token pair.
- 2024-04-24T01:32:22+00:00
  - A request to refresh token hash `3de319811` is received. Our system refreshes the token and updates the data store with the new pair.
  - While updating the token history, a data store connection issue occurs, leading to an unhandled exception.
  - We return a 500 internal server error.
- 2024-04-24T01:46:35+00:00
  - Another request is received to refresh token hash `3de319811`.
  - Because the client never received the refreshed token from the previous step, it still sees `3de319811` as the current token pair.
  - But because our data store was updated before the exception was thrown, we see `3de319811` as a previously used token outside the 1-minute grace period.
  - Per the OAuth2 spec, when a refresh request is made with a previously used token, the entire chain of user and refresh tokens is revoked (a rough sketch of this check follows the timeline).
  - We revoke the original user token and the current refresh token.
  - We return a 400 bad request (invalid client, …).
- Every request since:
  - We now consider the token invalid and return the 400.
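To make the grace period and reuse check above a little more concrete, here is a rough sketch of the general shape of that logic. To be clear, this is an illustrative sketch only, not our actual implementation; the names, the in-memory map, and the stand-in rotate/revoke functions are made up for the example.

```ts
// Illustrative sketch only -- not our actual implementation.
// Assumes a store that remembers every refresh-token hash we have already
// rotated, plus the pair that replaced it and when the rotation happened.

type TokenPair = { accessToken: string; refreshToken: string };

const GRACE_PERIOD_MS = 60_000; // the 1-minute grace period described above

// previously used refresh-token hash -> the pair that replaced it
const rotatedTokens = new Map<string, { newPair: TokenPair; rotatedAt: number }>();

// Stand-ins for the real rotation / revocation logic.
function rotateToken(oldHash: string): TokenPair {
  return { accessToken: `access-after-${oldHash}`, refreshToken: `refresh-after-${oldHash}` };
}
function revokeTokenChain(hash: string): void {
  rotatedTokens.delete(hash); // the real flow also revokes the user token
}

function handleRefresh(refreshTokenHash: string, now = Date.now()) {
  const rotated = rotatedTokens.get(refreshTokenHash);

  if (rotated && now - rotated.rotatedAt <= GRACE_PERIOD_MS) {
    // Within the grace period: hand back the same new pair, so a client that
    // never received the original response can recover.
    return { status: 200, body: rotated.newPair };
  }

  if (rotated) {
    // Outside the grace period: reuse of a previously used token revokes the
    // whole chain, and the request gets the 400 described in the timeline.
    revokeTokenChain(refreshTokenHash);
    return { status: 400 };
  }

  // Current token: rotate it, persist the new pair, and record the old hash.
  // In the incident above, the store was updated here but the exception meant
  // the client never received the new pair -- so its next request hit the
  // branch above.
  const newPair = rotateToken(refreshTokenHash);
  rotatedTokens.set(refreshTokenHash, { newPair, rotatedAt: now });
  return { status: 200, body: newPair };
}
```

The key point is the middle branch: once the data store has recorded the old hash and the grace period has elapsed, any reuse of that hash revokes the chain, which is exactly what the 01:46 request ran into.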
This is clearly an issue on our side, and an unfortunate sequence of events. The takeaways we are working on now:
1. We need to evaluate the connection issue, how it occurred, and why it was unhandled.
   - This investigation is underway.
2. We need to evaluate the OAuth2 refresh flow and ensure that anywhere it could fail will appropriately roll back any updates, or at least ensure the new pair is always returned.
   - This investigation is also underway.
3. We need to update our documentation (which is already going through an overhaul) to state that when a 5xx response is received (or the connection fails), the client should retry the refresh.
   - This is one of the goals of the 1-minute grace period: although we have already updated the refresh pair, there is one minute during which we still consider the previous pair valid and will return the new pair.
   - Even with the fixes we will implement for points 1 and 2, there is still the risk of a connection issue causing the client to not receive the new pair. This retry will protect against those scenarios.
I recognize that the recommendation in step 3 is a band-aid, but if you can introduce a retry mechanism in your auth flow (a rough sketch follows), it will protect against this happening again. Unfortunately, the original token (and any others that have had similar problems) is now invalid and can’t be restored; that is why you only see 400s now.
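To make that retry recommendation concrete, here is a rough sketch of the kind of client-side logic I mean. The endpoint URL, form fields, attempt count, and backoff are placeholders rather than a prescription; the important behavior is retrying on 5xx responses and connection failures quickly enough to land inside the 1-minute grace period.

```ts
// Illustrative client-side retry for the token refresh call.
// The URL, form fields, attempt count, and backoff are placeholders -- adapt
// them to your integration.

type TokenPair = { access_token: string; refresh_token: string };

async function refreshWithRetry(
  tokenUrl: string, // placeholder for the real token endpoint
  refreshToken: string,
  maxAttempts = 3,
): Promise<TokenPair> {
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let res: Response | undefined;
    try {
      res = await fetch(tokenUrl, {
        method: "POST",
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
        body: new URLSearchParams({
          grant_type: "refresh_token",
          refresh_token: refreshToken,
        }),
      });
    } catch (err) {
      lastError = err; // dropped connection, timeout, etc. -- worth retrying
    }

    if (res) {
      if (res.ok) return (await res.json()) as TokenPair;
      if (res.status < 500) {
        // A 4xx means the server genuinely rejected the token; repeating the
        // same request will not change the outcome.
        throw new Error(`Refresh rejected with status ${res.status}`);
      }
      lastError = new Error(`Server error ${res.status}`); // 5xx -- retry
    }

    // Small linear backoff between attempts, staying inside the grace period.
    await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
  }

  throw lastError;
}
```

A 4xx response is deliberately not retried: at that point the server has rejected the token, and repeating the request won’t help.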
One point of clarification. Have you seen this with multiple user token flows, or just one that is completely broken?
Thanks @Michael B101 - I appreciate the transparency and communication. I’ll add in a retry and see if that fixes things. Please let me know if there are any other recommendations.
This has been affecting multiple users. I don’t have access to the logs directly (I’m a consultant and don’t have production access), but reviewing the logs with the client yesterday and today, we’re seeing it with most if not all users.
Hmmm, that is concerning. My interpretation of your original post was that it was more than just the one token. I will expand my research to all tokens for your client.
I have received nebulous reports of dropped connections in some of our microservices that I am investigating now. When I originally saw your post, I thought it was going to be the same thing. It turned out not to be the case with that token, but I am concerned that your other failures might be. If it is dropped connections, I believe the retry will protect you as well, but that is clearly not the solution to the underlying problem.
I’ll be in touch with any additional information I learn. Let me know any status changes on your side, good or bad.
Ok, thanks @Michael B101. I added the retry and am just awaiting deployment to prod (someone else will be deploying it); we should have some results later today.
Sounds great. I am adding logic to make it easy to see all failures for a given client and why. Right now, I have to find and look them up one at a time via logs. I expect by sometime tomorrow to be able to immediately get a list of client refreshes, and the outcomes. That should make it easy to see macro trends that aren’t visible when looking at individual requests.
You mention trying to refresh twice, and that made me think of a possibility. We recently had an embed-based integration getting refresh token failures. It turned out that in cases where their application had many embeds on the same page, they would all attempt to separately refresh the underlying user token at the same time (each on its own thread). Because the multiple parallel requests were so close together, they hit a perfect timing window in our logic that led to a race condition and ultimately a duplication error in our data store.
A fix for that problem is underway now. We are refactoring the refresh token request to be entirely atomic, while also ensuring it can scale and protecting the endpoint from being overwhelmed.
I will look at your requests tomorrow to see if it could be the same thing, but wanted to send this out in case it rings any bells. This is another scenario where the retry should help, but a token can still end up in a bad state if the requests happen to be handled at just the wrong moment within a very narrow timing window.
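If your failures do turn out to involve parallel refreshes, one mitigation on your side is to funnel every refresh attempt through a single in-flight request, so multiple embeds or background tasks share one refresh instead of each presenting the same soon-to-be-stale token. A rough sketch, assuming a single JavaScript runtime (a multi-process client would need an external lock or shared store instead):

```ts
// Illustrative sketch: de-duplicating concurrent refresh attempts so that
// several embeds (or background tasks) share one refresh request.
// Assumes a single JavaScript runtime.

type TokenPair = { access_token: string; refresh_token: string };

// Placeholder initial pair and refresh call -- wire these to real storage and
// the real token endpoint (or the retry helper sketched earlier).
let currentPair: TokenPair = { access_token: "initial-access", refresh_token: "initial-refresh" };
async function doRefresh(refreshToken: string): Promise<TokenPair> {
  return { access_token: "new-access", refresh_token: "new-refresh" };
}

let refreshInFlight: Promise<TokenPair> | undefined;

async function getFreshPair(): Promise<TokenPair> {
  // If a refresh is already running, every caller awaits that same promise
  // rather than firing its own parallel request.
  if (!refreshInFlight) {
    refreshInFlight = doRefresh(currentPair.refresh_token)
      .then((pair) => {
        currentPair = pair; // persist the new pair before anyone uses it
        return pair;
      })
      .finally(() => {
        refreshInFlight = undefined; // allow future refreshes
      });
  }
  return refreshInFlight;
}
```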
@Michael B101 we got the retry deployed and operational around 6:15 PM CDT. We’ll have users testing it tonight and tomorrow, and I’ll try to confirm tomorrow afternoon, one way or another, what our logs and functionality are looking like.
After deploying the retry functionality, we removed all user tokens so users will re-authorize the integration and we can start from scratch with the auth reset flow.
The integration was working during our initial test but we didn’t do anything beyond embedding a single document with a single user.
We have left the scheduled process I mentioned earlier turned off (it authenticates on behalf of users every 15 minutes to retrieve metadata on their embedded documents). We plan to turn it on tomorrow afternoon so we can get most of the day tested without it, in case the backend API calls combined with user interactions were causing the problem.
Sounds great. Keep me posted.
**Reposting this response as I am not sure it went through:**
You mention trying to refresh twice, and that made me think of a possibility. We recently had an embed-based integration getting refresh token failures. It turned out that in cases where their application had many embeds on the same page, they would all attempt to separately refresh the underlying user token at the same time (each on its own thread). Because the multiple parallel requests were so close together, they hit a perfect timing window in our logic that led to a race condition and ultimately a duplication error in our data store.
A fix for that problem is underway now. We are refactoring the refresh token request to be entirely atomic, but it’s not an insignificant change, as we need to ensure it will be able to scale and protect the endpoint from being overwhelmed. I will look at your requests tomorrow to see if it could be the same thing, but wanted to send this out in case it rings any bells. This is another scenario where the retry should help, but a token can still end up in a bad state if the requests happen to be handled at just the wrong moment within a very narrow timing window.
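For context on what “entirely atomic” means here, the general technique is a conditional write: rotate the pair only if the presented hash is still the current one, so that when parallel requests race, exactly one performs the rotation and the others fall through to the grace-period lookup. The sketch below is a simplified illustration of that idea, not our actual code; the in-memory map stands in for a data store that supports a conditional update or compare-and-swap.

```ts
// Simplified illustration of the general technique -- not our actual code.
// The rotation is guarded by "only rotate if this hash is still current", so
// when parallel requests race, exactly one wins; the rest fall through to the
// grace-period / reuse-detection path instead of creating duplicate records.

type TokenPair = { accessToken: string; refreshToken: string };

interface TokenRecord {
  currentHash: string;   // hash of the refresh token considered current
  pair: TokenPair;
  rotatedAt: number;
  previousHash?: string; // kept for the grace-period lookup
}

const store = new Map<string, TokenRecord>(); // keyed by user/token-chain id

function rotateIfCurrent(chainId: string, presentedHash: string, now: number): TokenPair | undefined {
  const record = store.get(chainId);
  if (!record || record.currentHash !== presentedHash) {
    // Lost the race (or the token really is stale): no rotation happens here.
    // The caller falls back to the grace-period / reuse-detection logic.
    return undefined;
  }

  // In a real data store, this check-and-write would be a single conditional
  // UPDATE / compare-and-swap, so two racing requests cannot both pass the
  // check above and both write a new pair.
  const newPair: TokenPair = {
    accessToken: `access-${now}`,
    refreshToken: `refresh-${now}`,
  };
  store.set(chainId, {
    currentHash: `hash-of-${newPair.refreshToken}`,
    pair: newPair,
    rotatedAt: now,
    previousHash: presentedHash,
  });
  return newPair;
}
```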
Thanks for the additional thought @Michael B101. Our account-level authentication is my biggest concern because multiple users could be using it at a time (our embeds are authenticated at the account scope). We won’t have more than one operation per UI user at any given time (we only load one embed at a time, whichever one is actively being viewed), but between the system process and any number of users, there could be up to two refreshes at the user level (if a user happens to be creating a document while the system is updating metadata on their documents) and any number at the account level.
I added some additional logic to the retry so that it checks back against the saved token values if the call to Lucid fails. That way, if we do happen to have two processes running, they’ll check back in with each other if one fails. I think that this, combined with the 1-minute grace period you mentioned earlier in the thread, should handle things well. A rough sketch of what I added is below.
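The storage helpers and the refresh call in this sketch are placeholders for however the integration actually persists tokens and talks to the token endpoint; the important part is re-reading the saved pair before each attempt.

```ts
// Rough shape of the added logic: before each attempt, re-read the persisted
// tokens in case another process already completed a refresh and saved a new
// pair. The storage helpers and the refresh call itself are placeholders.

type TokenPair = { access_token: string; refresh_token: string };

// Placeholders for wherever the integration persists its tokens.
const savedTokens = new Map<string, TokenPair>();
const loadSavedPair = (key: string) => savedTokens.get(key);
const savePair = (key: string, pair: TokenPair) => savedTokens.set(key, pair);

// Placeholder for the actual refresh request to the token endpoint.
async function requestRefresh(refreshToken: string): Promise<TokenPair> {
  return { access_token: "new-access", refresh_token: "new-refresh" };
}

async function refreshWithFallback(key: string, maxAttempts = 2): Promise<TokenPair> {
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Always start from the currently saved pair: if another process
    // refreshed in the meantime, we pick up its result instead of presenting
    // a token that has already been rotated.
    const current = loadSavedPair(key);
    if (!current) throw new Error("No saved token pair; re-authorization needed");

    try {
      const fresh = await requestRefresh(current.refresh_token);
      savePair(key, fresh);
      return fresh;
    } catch (err) {
      lastError = err; // fall through, re-read the saved tokens, and try again
    }
  }

  throw lastError;
}
```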
So far today everything has been working well - no further issues since implementing the retry logic and resetting everything.
Thanks for all the help.
Super glad to hear it. As a matter of professional experience, I remain a bit concerned that you may still see problems; hopefully next week will go smoothly. Of course, you can resolve this thread whenever you feel the original question has been handled.
We have deployed a fix that ensures all requests to refresh an OAuth2 token are atomic, even when processed with sub-second timing. No action is needed; I just wanted to follow up, for posterity, that the underlying issue has been resolved.