OpenAI at Scale: Azure API Management Circuit Breaker and Load Balancing
In this blog post, I will demonstrate how to leverage Azure API Management to enhance the resiliency and capacity of your OpenAI Service.
Azure API Management is a tool that assists in creating, publishing, managing, and securing APIs. It offers features like routing, caching, throttling, authentication, transformation, and more.
API Management with Circuit Breaker Implementation
By utilizing Azure API Management, you can:
- Increase capacity by distributing requests across a pool of Azure OpenAI backends.
- Enhance resiliency with circuit breaker rules that divert traffic away from unhealthy backends.
- Prioritize backends, keeping lower-priority instances in reserve as failover capacity.
Diagram 1: API Management with circuit breaker implementation. Note: Backends in lower priority groups will only be used when all backends in higher priority groups are unavailable because circuit breaker rules are tripped.
Circuit Breaker Deployment with API Management and Azure OpenAI Services
In the following section, I will guide you through circuit breaker deployment with API Management and Azure OpenAI services. You can use the same solution with the native OpenAI service. The GitHub repository for this article can be found at github.com/eladtpro/api-management-ai-policies.
Note: You can learn more about the Microsoft.ApiManagement service/backends resource type and its CircuitBreakerRule object in the Azure documentation.
Note: To view failed operations, filter operations with the 'Failed' state:
az deployment operation group list --resource-group <resource-group-name> --name apim-deployment --query "[?properties.provisioningState=='Failed']"
The deploy.bicep backend circuit breaker and load balancer configuration:
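The shape of that configuration looks roughly like the following sketch. The service name, backend URL, and thresholds here are illustrative placeholders; the actual values live in deploy.bicep in the repository:

```bicep
// One Azure OpenAI backend with a circuit breaker rule (illustrative values).
resource apimService 'Microsoft.ApiManagement/service@2023-09-01-preview' existing = {
  name: 'apim-ai-features'
}

resource openaiBackend1 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apimService
  name: 'openai-backend-1'
  properties: {
    url: 'https://my-openai-eastus.openai.azure.com/openai'   // placeholder endpoint
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          name: 'openai-breaker-rule'
          failureCondition: {
            count: 3              // trip after 3 failures...
            interval: 'PT1M'      // ...within a one-minute window
            statusCodeRanges: [
              { min: 429, max: 429 }   // throttling
              { min: 500, max: 599 }   // server errors
            ]
          }
          tripDuration: 'PT1M'    // keep the circuit open for one minute
          acceptRetryAfter: true  // honor the Retry-After header from the service
        }
      ]
    }
  }
}
```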
And the part for the backend pool:
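A sketch of the pool definition follows, assuming three backends declared like the one above (openaiBackend2 and openaiBackend3 are assumed to be defined the same way as openaiBackend1):

```bicep
// A backend pool that load balances across the individual backends.
// Lower 'priority' numbers are preferred; 'weight' splits traffic among
// backends that share the same priority.
resource openaiBackendPool 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apimService
  name: 'openai-backend-pool'
  properties: {
    type: 'Pool'
    pool: {
      services: [
        { id: openaiBackend1.id, priority: 1, weight: 1 }
        { id: openaiBackend2.id, priority: 1, weight: 1 }
        { id: openaiBackend3.id, priority: 2, weight: 1 }   // used only as failover
      ]
    }
  }
}
```

This matches the behavior described in Diagram 1: the priority-2 backend receives traffic only while the circuit breakers of all priority-1 backends are open.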
Note: The following policy can be used with existing or new APIs. The important part is to set the backend service to the backend pool created in the previous step. All you need to do is add the set-backend-service and retry policies shown below, which activate the Load Balancer with Circuit Breaker module.
Note: The URL suffix is the path that will be appended to the API Management URL. For example, if the API Management URL is 'https://apim-ai-features.azure-api.net', the URL suffix is 'openai', and the full URL will be 'https://apim-ai-features.azure-api.net/openai'.
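For illustration, the URL suffix corresponds to the path property on the API resource (the names here are placeholders):

```bicep
// The 'path' property is the URL suffix appended to the gateway URL,
// e.g. https://apim-ai-features.azure-api.net/openai
resource openaiApi 'Microsoft.ApiManagement/service/apis@2023-09-01-preview' = {
  parent: apimService
  name: 'azure-openai-api'
  properties: {
    displayName: 'Azure OpenAI'
    path: 'openai'
    protocols: [ 'https' ]
    subscriptionRequired: true
  }
}
```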
Note: The 'catch all' operation is designed to match all OpenAI requests; we achieve this by setting the URL template to '/{*path}'. For example:
Base URL will be: https://my-apim.azure-api.net/openai
Postfix URL will be: /deployments/gpt-4o/chat/completions?api-version=2024-06-01
The full URL will be: https://my-apim.azure-api.net/openai/deployments/gpt-4o/chat/completions?api-version=2024-06-01
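A sketch of that wildcard operation, assuming the API resource shown earlier:

```bicep
// A catch-all operation: the wildcard template parameter 'path' makes
// the URL template '/{*path}' match every request under the API suffix.
resource catchAllOperation 'Microsoft.ApiManagement/service/apis/operations@2023-09-01-preview' = {
  parent: openaiApi
  name: 'catch-all-post'
  properties: {
    displayName: 'Catch-all (POST)'
    method: 'POST'
    urlTemplate: '/{*path}'
    templateParameters: [
      {
        name: 'path'
        type: 'string'
        required: true
      }
    ]
  }
}
```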
This policy is set up to distribute requests across the backend pool and retry requests if the backend service is unavailable:
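Here is a sketch of the policy; the backend-id and the exact retry condition are assumptions on my part, so check the repository for the authoritative version:

```xml
<policies>
  <inbound>
    <base />
    <!-- Route every request to the load-balanced backend pool -->
    <set-backend-service backend-id="openai-backend-pool" />
  </inbound>
  <backend>
    <!-- Retry on throttling (429) or server errors (5xx). 'count' should
         equal the number of backends in the pool; interval="0" retries
         immediately against the next available backend. -->
    <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)" count="3" interval="0" first-fast-retry="true">
      <forward-request buffer-request-body="true" />
    </retry>
  </backend>
  <outbound>
    <base />
    <!-- Surface which backend actually served the request -->
    <set-header name="backend-host" exists-action="override">
      <value>@(context.Request.Url.Host)</value>
    </set-header>
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>
```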
Important: The main policies that distribute requests to the backend pool created in the previous step are:
set-backend-service: sets the backend service to the backend pool.
retry: retries the request if the backend service is unavailable. If a circuit breaker is tripped, the request is immediately retried against the next available backend service.
Important: The value of count should be equal to the number of backend services in the backend pool.
Note: The 'backend-host' header is the host of the backend service that the request was actually sent to. The 'Retry-After' header, sent by the OpenAI service, is the time in seconds that the client should wait before retrying the request; it overrides the tripDuration setting of the backend circuit breaker.
Note: You can also log the request and response bodies of HTTP requests under the 'Advanced Options' section.
Important: In order to use the load balancer configuration seamlessly, all the OpenAI services should have the same model deployed. The model should be deployed with the same name and version across all the services.
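One way to check this, for example, is with the Azure CLI (the account and resource group names below are placeholders):

```bash
# List deployment names, model names, and model versions for each
# Azure OpenAI account; the output should be identical across accounts.
az cognitiveservices account deployment list \
  --name my-openai-eastus \
  --resource-group my-rg \
  --query "[].{deployment:name, model:properties.model.name, version:properties.model.version}" \
  --output table
```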