LLM APIs for Integrating Large Language Models
This article explores the most popular large language models and their integration capabilities for building chatbots, natural language search, and other LLM-based products. We’ll explain how to choose the right LLM for your business goals and look into real-world use cases, including AltexSoft's experience.
Flagship LLMs Comparison for Product Integration: Main Features and Capabilities
Understanding the core metrics and features is essential when building a business application on top of a large language model. Here, we explore the key attributes that can impact both performance and user experience.
Size in Parameters
Parameters are the internal variables that a generative AI model learns during training. Their number indicates the model's capacity to understand human language and other data: bigger models can capture more intricate patterns and nuances. When is a model considered large? The definition is vague, but one of the first models recognized as an LLM was BERT (110M parameters). Modern LLMs have hundreds of billions of parameters.
Number of Languages Supported
Note that some models work with only 4-5 languages, while others are real polyglots. This matters if, for example, you want to offer multilingual customer support.
Context Window (Max. Input)
The context window is the amount of text or other data (images, code, audio) that a language model can take in when generating a response. A bigger context window lets users input more custom, relevant data (for example, project documentation), so the system considers the full context and gives a more precise answer. For textual data, the context window is measured in tokens. You can think of tokens as chunks of words, where 1,000 tokens correspond to roughly 750 English words.
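The 1,000-tokens-per-750-words ratio gives a quick back-of-the-envelope check of whether your data fits a model's context window. Below is a minimal sketch of that heuristic in Python; the exact count depends on the model's tokenizer, so treat the numbers as rough estimates, and the 128K window default is just one common size.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~750 words per 1,000 tokens rule of thumb."""
    words = len(text.split())
    return round(words * 1000 / 750)

def fits_context_window(text: str, window: int = 128_000) -> bool:
    """Check whether the text likely fits inside a 128K-token context window."""
    return estimate_tokens(text) <= window

doc = "word " * 3000             # a 3,000-word document
print(estimate_tokens(doc))      # about 4,000 tokens
print(fits_context_window(doc))  # True
```

For production use, count tokens with the provider's actual tokenizer (OpenAI publishes one as the `tiktoken` library) rather than a word-count heuristic.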
Access
You can integrate with most of the popular LLMs via API. However, some of them are available for download and can be deployed on-premises.
Input and Output Modality
Modality is the type of data the model can process. LLMs primarily handle text and code, but multimodal ones can take images, video, and audio as input; the output is mainly text.
Fine-tuning
An LLM can gain specific domain knowledge and become more effective for your business tasks via fine-tuning. This option is typically available for downloadable models, but some providers allow users to customize their LLMs in the cloud, defining the maximum size of the fine-tuning dataset. In any case, this process requires investment and well-prepared data.
Pricing
The price is usually counted per million tokens; image files and other non-text inputs can also be tokenized, or billed per unit or per second. Here is an OpenAI pricing calculator for the most popular LLMs. The cheapest on the list is the Llama 3.2 11B Vision Instruct API, and the most expensive is the GPT-4o Realtime API, which supports voice generation.
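Per-million-token pricing is easy to turn into a budget estimate. The sketch below uses GPT-4o's list rates quoted later in this article ($2.50/1M input tokens, $10/1M output tokens) as illustrative defaults; always check the provider's current pricing page before budgeting.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 2.50, output_rate: float = 10.00) -> float:
    """Cost in USD for one request, given per-1M-token rates (GPT-4o defaults)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 2,000-token prompt that produces a 500-token answer:
print(f"${request_cost(2_000, 500):.4f}")  # $0.0100
```

Note that output tokens are often several times more expensive than input tokens, so long generated responses dominate the bill.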
OpenAI released its first GPT model in 2018, and since then, it has set the industry standard for performance in complex language tasks. The GPT family remains a leader in performance, reasoning skills, and ease of fine-tuning. The current flagship model is GPT-4o, which has a smaller, faster, and cheaper version called GPT-4o mini. Both variations understand over 50 languages and are multimodal: they take text and image input and generate text output, including code, mathematical equations, and JSON data.
Recently, OpenAI introduced a new series of o1 models (o1 and o1-mini), currently in beta. They are trained through reinforcement learning, which enables deeper reasoning and tackling more complex tasks, especially in science, coding, and math. However, for most common use cases, GPT-4o will remain more capable in the near future, as the new generation lacks many features (like browsing the Internet).
Besides LLMs, OpenAI offers the DALL-E image models and the Whisper and TTS audio models. To learn more about how they work, read our article on AI image generators and another one explaining sound, music, and voice generation.
GPT-4o is estimated to have hundreds of billions of parameters; some sources claim 1.8 trillion, although exact details are proprietary. In both versions, the context window can hold up to 128,000 tokens, which is the equivalent of 300 pages of text.
GPT models are only available as a service in the cloud; you can't deploy them on-premises. They are accessible via OpenAI APIs, using Python, Node.js, and .NET, or through Azure OpenAI Service, which also supports C#, Go, and Java. You can call the API from other languages as well, thanks to community libraries.
Below, we list the API products offered directly by OpenAI.
The Chat Completions API allows you to quickly embed text generation capabilities into your app, chatbot, or other conversational interface.
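A Chat Completions request pairs a model name with a list of role-tagged messages. The sketch below builds such a payload offline; the model name, system prompt, and helper function are illustrative placeholders, and the actual SDK call (which requires the `openai` package and an API key) is shown in comments.

```python
def build_chat_request(user_message: str, model: str = "gpt-4o-mini") -> dict:
    """Assemble a request body for the Chat Completions endpoint
    (POST /v1/chat/completions). Prompts here are placeholders."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful travel assistant."},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request("Suggest three hotels in Lisbon.")

# With the official SDK (pip install openai, OPENAI_API_KEY in the environment):
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(**payload)
#   print(response.choices[0].message.content)
```

The system message is where you set the assistant's persona and constraints; the conversation history is replayed in `messages` on every call, which is one reason context window size matters.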
The Assistants API (in beta) is designed for building powerful virtual assistants. It comes with built-in tools like file search, which retrieves relevant content from your documents, and a code interpreter, which helps solve complex math and coding problems. The API can access multiple tools in parallel.
The Batch API is perfect for tasks that don't require immediate responses, like sentiment analysis of hotel reviews or large-scale text processing. A single batch may include up to 50,000 requests, and a batch input file shouldn't exceed 100 MB. The Batch API costs 50 percent less than its synchronous counterparts.
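A batch input file is a JSONL file where each line is one self-contained request. Here is a minimal sketch of preparing such a file for the hotel-review sentiment task mentioned above; the model name, prompts, and reviews are illustrative.

```python
import json

reviews = ["Great location, tiny rooms.", "Spotless and friendly staff."]

# One JSONL line per request, following the Batch API input format.
with open("batch_input.jsonl", "w") as f:
    for i, review in enumerate(reviews):
        request = {
            "custom_id": f"review-{i}",   # your own ID to match results back later
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system",
                     "content": "Classify the review as positive or negative."},
                    {"role": "user", "content": review},
                ],
            },
        }
        f.write(json.dumps(request) + "\n")
```

The file is then uploaded via the Files API and referenced when creating the batch; results come back asynchronously, keyed by `custom_id`.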
The Realtime API (in beta) supports text and audio as both input and output, which means you can build a low-latency speech-to-speech, text-to-speech, or speech-to-text chatbot that supports audio conversations with clients. You can choose one of six male or female voices. The virtual interlocutor will use any tone you like, such as warm, engaging, or thoughtful; "he" or "she" can even laugh and whisper.
To customize the GPT-4o/GPT-4o mini models, you have to prepare a dataset that contains at least 10 conversational examples, though OpenAI recommends starting with 50. The fine-tuning dataset must be in JSONL format and up to 1 GB in size (though you don't need a set that large to see improvements). It can also contain images. To upload the dataset, use the Files API, or the Uploads API for files larger than 512 MB. Fine-tuning can be performed via the OpenAI UI or programmatically, using the OpenAI SDK. Azure OpenAI currently supports only text-to-text fine-tuning.

GPT-4o API usage for business costs $2.50/1M input tokens and $10/1M output tokens, while GPT-4o mini is much cheaper: $0.15/1M input tokens and $0.60/1M output tokens.
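Each fine-tuning example is a complete conversation serialized as one JSON object per line (JSONL). A minimal sketch of assembling such a file follows; the conversation content is purely illustrative, and a real dataset needs at least the 10 examples noted above.

```python
import json

# Illustrative training examples; a real set needs 10+ (OpenAI suggests ~50).
examples = [
    {"messages": [
        {"role": "system", "content": "You are a polite hotel-booking assistant."},
        {"role": "user", "content": "Do you have rooms for this weekend?"},
        {"role": "assistant",
         "content": "Yes, we have several rooms available. How many guests?"},
    ]},
]

# Write one JSON object per line -- the JSONL format fine-tuning expects.
with open("fine_tune.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

Each example should end with the assistant turn you want the model to imitate; the system message teaches the persona, and consistency across examples matters more than raw volume.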
The o1-preview model costs $15.00/1M input tokens and $60.00/1M output tokens. You can find the detailed pricing information on OpenAI's pricing page.
The Gemini (formerly Bard) model family is optimized for high-level reasoning and for understanding not only text but also image, video, and audio data. The model's name was inspired by NASA's early moonshot program, the breakthrough Project Gemini. It was also associated with the Gemini astrology sign, since people born under it are said to be highly adaptable, effortlessly connect with diverse individuals, and naturally view situations from multiple perspectives.
The flagship products are Gemini 1.5 Pro and Gemini 1.5 Flash. Flash is a mid-size multimodal model optimized for a wide range of reasoning tasks. Pro can handle large amounts of data. Both models support over 100 languages.
Estimates suggest Gemini models operate with about 1.56 trillion parameters. Gemini 1.5 Pro has an unprecedented context window of two million tokens, which allows it to fit 10 Harry Potter novels (the existing seven plus three fan-dreamed ones) in one prompt, or one Harry Potter movie (2 hours of video), or 19 hours of audio. Gemini 1.5 Flash's context window is one million tokens.
Gemini models are cloud-based only. Google provides two ways to access its LLMs: Google AI and Vertex AI (Google's end-to-end AI development platform). Both APIs support function calling. The Google AI Gemini API provides a fast way to explore the LLM capabilities and get started with prototyping and crea