Queuing

Every Gradio app comes with a built-in queuing system that can scale to thousands of concurrent users. Because many event listeners may involve heavy processing, Gradio automatically creates a queue in the backend for every event listener, so incoming requests are processed in order rather than overwhelming your server.
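For instance, in a minimal app like the sketch below (using a hypothetical echo handler), the click listener is queued without any extra setup:

import gradio as gr

# A hypothetical event handler used only for illustration.
def echo(text):
    return text

with gr.Blocks() as demo:
    inp = gr.Textbox(label="Input")
    out = gr.Textbox(label="Output")
    btn = gr.Button("Run")
    # This listener automatically gets its own queue; no queue() call is required.
    btn.click(echo, inp, out)

demo.launch()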

Configuring the Queue

By default, each event listener has its own queue, which handles one request at a time. This can be configured via two arguments:

  • concurrency_limit: This sets the maximum number of concurrent executions for an event listener. By default, the limit is 1 unless configured otherwise in Blocks.queue(). You can also set it to None for no limit (i.e., an unlimited number of concurrent executions). For example:
import gradio as gr

# Placeholder image-generation function; in a real app this would call your model.
def image_gen(prompt):
    ...

with gr.Blocks() as demo:
    prompt = gr.Textbox()
    image = gr.Image()
    generate_btn = gr.Button("Generate Image")
    generate_btn.click(image_gen, prompt, image, concurrency_limit=5)

demo.launch()

In the code above, up to 5 requests can be processed simultaneously for this event listener. Additional requests will be queued until a slot becomes available.

If you want to manage multiple event listeners using a shared queue, you can use the concurrency_id argument:

  • concurrency_id: This allows event listeners to share a queue by assigning them the same ID. For example, if your setup has only 2 GPUs but multiple functions require GPU access, you can create a shared queue for all those functions. Here's how that might look:
import gradio as gr

# Placeholder generation functions; in a real app these would call your GPU-backed models.
def image_gen_1(prompt): ...
def image_gen_2(prompt): ...
def image_gen_3(prompt): ...

with gr.Blocks() as demo:
    prompt = gr.Textbox()
    image = gr.Image()
    generate_btn_1 = gr.Button("Generate Image via model 1")
    generate_btn_2 = gr.Button("Generate Image via model 2")
    generate_btn_3 = gr.Button("Generate Image via model 3")
    generate_btn_1.click(image_gen_1, prompt, image, concurrency_limit=2, concurrency_id="gpu_queue")
    generate_btn_2.click(image_gen_2, prompt, image, concurrency_id="gpu_queue")
    generate_btn_3.click(image_gen_3, prompt, image, concurrency_id="gpu_queue")

demo.launch()

In this example, all three event listeners share a queue identified by "gpu_queue". The shared queue can handle up to 2 concurrent requests at a time, as set by concurrency_limit=2 on the first listener.

Notes

  • To ensure unlimited concurrency for an event listener, set concurrency_limit=None. This is useful when your function calls, for example, an external API that handles rate limiting itself.
  • The default concurrency limit for all queues can be set globally using the default_concurrency_limit parameter in Blocks.queue(), as shown in the sketch after these notes.
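
For example, the sketch below (with hypothetical transcribe and call_api handlers) combines a global default set in Blocks.queue() with a per-listener concurrency_limit=None override:

import gradio as gr

# Hypothetical handlers used only for illustration.
def transcribe(audio):
    ...

def call_api(text):
    ...

with gr.Blocks() as demo:
    audio = gr.Audio()
    text_in = gr.Textbox()
    text_out = gr.Textbox()
    transcribe_btn = gr.Button("Transcribe")
    api_btn = gr.Button("Call API")
    # Uses the global default concurrency limit set below.
    transcribe_btn.click(transcribe, audio, text_out)
    # Unlimited concurrency: the external API is assumed to rate-limit itself.
    api_btn.click(call_api, text_in, text_out, concurrency_limit=None)

# Any listener without an explicit concurrency_limit allows up to 4 concurrent executions.
demo.queue(default_concurrency_limit=4)
demo.launch()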

These configurations make it easy to manage the queuing behavior of your Gradio app.