Boosting Performance: Resolving Gemini 1.5 Flash Online Prediction Quota Errors

Published On Thu Nov 21 2024

Re: Gemini 1.5 Flash Online Prediction Quota Exceed Error - Google Cloud Community

We're encountering a significant roadblock with our Gemini 1.5 Flash models. We have five models working together, and we're constantly hitting the "Online prediction request quota exceeded for gemini-1.5-flash" error. This is severely impacting our project's progress.

We've checked our project's quotas and system limits in the Google Cloud console, and none of the relevant quotas appear to be anywhere near their maximum. We're struggling to find any documentation specifying what these online prediction quotas are or how to increase them.

Has anyone else encountered this issue? Does anyone know where we can find information about these quotas and how to request a limit increase? Any help or pointers would be greatly appreciated!

Hi @orkhestrai_1, welcome to Google Cloud Community! It seems that the number of your requests exceeds the capacity allocated to process them. This capacity is shared among a thousand users. If you are receiving error code 429, you may try sending the request again later, when resources have been freed. Also, as mentioned in this documentation, Gemini 1.5 Flash uses dynamic quota, which means that on-demand capacity is distributed among all queries being processed by Google Cloud services. As a workaround, I suggest reserving capacity with a Provisioned Throughput subscription. For a quota increase, you may check this documentation. Hope this helps.

Same problem here. So annoying. I have to switch to use another AI service provider to keep my services up and running.
I checked, and there is no quota warning in the console. I have also already requested a quota increase to 30,000.

I am sure that my case does not exceed the tokens-per-minute or requests-per-minute limits. We ran into the same issue last Monday, with version 002, after a week of successful Gemini requests. Our solution was to go back to 001 in the meantime. It is also possible to use LangChain to fall back to a backup region if one region is clogged. For a production environment, the suggestion is to buy dedicated GSUs via Provisioned Throughput. You will not receive a quota warning by email, as the limit is applied on the Google data center side.
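
For anyone who wants the backup-region and model-fallback idea without pulling in a framework, here is a minimal sketch in plain Python. It is not the poster's code and does not use LangChain. The `send` callable, the `QuotaExceeded` exception, and the region names are all hypothetical placeholders: `send` is assumed to wrap your actual Gemini call and to raise `QuotaExceeded` whenever the service answers with a 429.

```python
from typing import Callable, Iterable, Tuple


class QuotaExceeded(Exception):
    """Hypothetical stand-in for an HTTP 429 / quota-exceeded response."""


def call_with_fallback(
    send: Callable[[str, str], str],
    candidates: Iterable[Tuple[str, str]],
) -> str:
    """Try each (region, model) pair in order and return the first success.

    Only quota errors trigger a move to the next candidate; any other
    exception propagates immediately.
    """
    last_error = None
    for region, model in candidates:
        try:
            return send(region, model)
        except QuotaExceeded as exc:
            last_error = exc  # remember the failure and try the next pair
    raise RuntimeError("all regions and models are over quota") from last_error


# Example ordering (assumed values): preferred model in two regions first,
# then the older model version as a last resort.
CANDIDATES = [
    ("us-central1", "gemini-1.5-flash-002"),
    ("europe-west4", "gemini-1.5-flash-002"),
    ("us-central1", "gemini-1.5-flash-001"),
]
```

Putting the 001 model last keeps the better 002 results whenever any region has capacity, and only degrades quality when every region is clogged.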

You have made a good point about falling back to 001. I didn't think of that approach. Anyway, I have stopped using Gemini for production until it becomes stable.

Hi everyone. Thanks for all the answers and insights. We've mitigated the issue by using backoff and spreading requests across several regions. It's working better now, although more slowly. Unfortunately, we can't fall back to 001 because the results are much worse than with 002. Cheers.

Would you mind sharing your backoff approach? Do you set a delay for each request? That sounds like an interesting solution, and I am also interested in the backoff approach.
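
The thread does not include the original poster's implementation, so the following is only a sketch of what "backoff" usually means here: exponential backoff with jitter, where a delay is added only after a failed request and grows with each retry, rather than a fixed delay before every request. The names `call_with_backoff`, `QuotaExceeded`, and all default values are assumptions for illustration. With the Vertex AI Python SDK, a 429 typically surfaces as `google.api_core.exceptions.ResourceExhausted`, which you would catch in place of the placeholder exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QuotaExceeded(Exception):
    """Hypothetical stand-in for an HTTP 429 / quota-exceeded response."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run request(), retrying on quota errors with exponential backoff.

    The wait ceiling doubles after every failure (1s, 2s, 4s, ...) up to
    max_delay, and the actual wait is a random value below that ceiling
    ("full jitter") so that parallel workers do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except QuotaExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the quota error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

Usage would look like `call_with_backoff(lambda: model.generate_content(prompt))`, where `model` is whatever client object you already have. This explains the "more slowly" trade-off mentioned above: requests succeed more often, but each retry adds waiting time.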
