Building QuiLLMan: How We Used Modal's Serverless Platform to Create a Voice Chat App

Published on Mon May 08 2023

QuiLLMan is a complete voice chat app: it transcribes the user's speech in real time with Whisper, streams back a response from a language model, and synthesizes that response as natural-sounding speech. It is powered by Vicuna, one of the latest open-source chatbots to approach the quality of proprietary models like GPT-4, with the added benefit that it can be self-hosted at a fraction of the cost.

The project has a simple structure:

  • src/
    • backend/
    • frontend/
    • models/
      • quillman/

Models

GPTQ-for-LLaMa is used to quantize the language model to 4 bits for faster inference. In this file, we use 'from_dockerhub' to select the official CUDA container as the base image, then install Python and the build requirements. We also clone the FastChat repo and build GPTQ-for-LLaMa.
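As a rough sketch, the image definition might look like the following; the base image tag, package list, and build commands here are illustrative assumptions, not the project's exact values:

```python
import modal

# Sketch only: base tag, packages, and build steps are illustrative assumptions.
image = (
    modal.Image.from_dockerhub(
        "nvidia/cuda:11.7.1-devel-ubuntu22.04",
        setup_dockerfile_commands=[
            # The CUDA base image ships without Python, so install it first.
            "RUN apt-get update && apt-get install -y python3 python3-pip git",
            "RUN ln -sf /usr/bin/python3 /usr/bin/python",
        ],
    )
    .pip_install("torch", "transformers", "sentencepiece")
    .run_commands(
        # Clone FastChat (Vicuna tooling) and build GPTQ-for-LLaMa's CUDA kernel.
        "git clone https://github.com/lm-sys/FastChat /FastChat",
        "git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa /GPTQ-for-LLaMa",
        "cd /GPTQ-for-LLaMa && python setup_cuda.py install",
    )
)
```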

The 'generate' function constructs a prompt using the current input, previous history, and a prompt template. Then, it simply yields tokens as they are produced. Python generators just work out-of-the-box in Modal, so building streaming interactions is easy.
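Here is a minimal sketch of that pattern, assuming Modal's class-based interface as it existed at the time of writing; `load_quantized_model` and `build_prompt` are hypothetical helpers standing in for the real GPTQ and prompt-template code:

```python
import modal
from modal import method

stub = modal.Stub("quillman-llm-sketch")

@stub.cls(gpu="A10G")  # GPU type is an assumption
class Vicuna:
    def __enter__(self):
        # Load the 4-bit quantized model once per container, not per request.
        self.model = load_quantized_model()  # hypothetical helper

    @method()
    def generate(self, input: str, history: list = []):
        # Combine the new input, previous turns, and the prompt template.
        prompt = build_prompt(input, history)  # hypothetical helper
        for token in self.model.stream(prompt):  # hypothetical streaming API
            # Yielding makes this a generator; Modal streams tokens to callers.
            yield token
```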

A separate class uses OpenAI's Whisper to transcribe audio in real time.
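A transcription function in this style might look like the sketch below; the model size and exact signature are assumptions, not the project's actual code:

```python
import modal

stub = modal.Stub("quillman-whisper-sketch")

@stub.function(gpu="any")
def transcribe(audio_bytes: bytes) -> str:
    import tempfile
    import whisper  # the openai-whisper package

    model = whisper.load_model("base.en")  # model size is an assumption
    # Whisper's transcribe API takes a file path, so write the bytes to disk.
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        f.write(audio_bytes)
        f.flush()
        result = model.transcribe(f.name)
    return result["text"]
```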

The text-to-speech service is adapted from tortoise-tts-modal-api, a Modal deployment of Tortoise TTS, and is used to synthesize natural-sounding speech.

Backend API

The backend API is a FastAPI app. Modal provides an '@asgi_app' decorator that lets us serve this app on the internet without any extra effort.
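Hooking the FastAPI app up to Modal takes only a few lines. The following sketch assumes the '@asgi_app' decorator as it existed at the time of writing; the app wiring and route are illustrative:

```python
import modal
from modal import asgi_app
from fastapi import FastAPI

stub = modal.Stub("quillman-web-sketch")
web_app = FastAPI()

@web_app.get("/health")
def health():
    return {"status": "ok"}

@stub.function()
@asgi_app()
def app():
    # Modal serves this ASGI app at a public URL on `modal serve` / `modal deploy`.
    return web_app
```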

Of the four endpoints in the file, POST /generate is the most interesting. It calls Vicuna.generate and streams the text results back. When a sentence is completed, it also calls Tortoise.speak asynchronously to generate audio and returns a handle to the function call. This handle can be used to poll for the audio later. If Tortoise is not enabled, we return the sentence directly so that the frontend can use the browser's built-in text-to-speech.

In order to send different types of messages over the same stream, each is sent as a serialized JSON object consisting of a type and a payload. The ASCII record separator character (\x1e) is used to delimit the messages, since it cannot appear in serialized JSON. The function also checks whether the body contains a noop flag, which is used to warm the containers when the user first loads the page, so that the models can be loaded into memory ahead of time.
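Putting those pieces together, the endpoint might look roughly like this sketch. `Vicuna` and `Tortoise` stand in for the model classes from the earlier sketches, and the '.call()' / '.spawn()' method names reflect Modal's client library at the time of writing:

```python
import json
from fastapi import Request
from fastapi.responses import StreamingResponse

@web_app.post("/generate")
async def generate(request: Request):
    body = await request.json()
    if body.get("noop"):
        # Warm the containers on page load without doing any real work.
        return

    def stream():
        sentence = ""
        for token in Vicuna().generate.call(body["input"]):
            # Each message is JSON followed by the ASCII record separator.
            yield json.dumps({"type": "text", "payload": token}) + "\x1e"
            sentence += token
            if sentence.rstrip().endswith((".", "!", "?")):
                # Start TTS without blocking the text stream; the handle's id
                # lets the frontend poll GET /audio/{call_id} for the result.
                call = Tortoise().speak.spawn(sentence)
                yield json.dumps({"type": "audio", "payload": call.object_id}) + "\x1e"
                sentence = ""

    return StreamingResponse(stream())
```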

Frontend

The frontend maintains a state machine, implemented with the XState library, to manage the state of the conversation and transcription progress. The Web Audio API is used to record snippets of audio from the user’s microphone. Pending text-to-speech syntheses are stored in a queue; for the next item in the queue, the GET /audio/{call_id} endpoint is polled for the audio data.
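Although the real frontend is JavaScript, the polling protocol is simple enough to sketch in Python. The convention that a pending call returns 202 is an assumption about the endpoint's behavior, not documented fact:

```python
import time
import requests

def poll_audio(base_url: str, call_id: str, interval: float = 0.25) -> bytes:
    """Poll GET /audio/{call_id} until the text-to-speech result is ready.

    Assumes the endpoint returns 202 while the Tortoise call is still
    running and 200 with the audio bytes once it has finished.
    """
    while True:
        resp = requests.get(f"{base_url}/audio/{call_id}")
        if resp.status_code == 202:
            time.sleep(interval)
            continue
        resp.raise_for_status()
        return resp.content
```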

The code for this entire example is available on GitHub. Follow the instructions in the README to run or deploy it yourself on Modal.