Vector Databases Are the Wrong Abstraction
Vector databases have been developed as specialized systems to manage large volumes of vector embeddings for text, image, and multi-modal data. As the use of vector embeddings has grown, many general-purpose databases like PostgreSQL (with pgvector), MySQL, MongoDB, and Oracle have added support for vector search either officially or through extensions.
However, the vector search abstraction is flawed, whether it lives in a standalone system or inside an existing database. Once embeddings are inserted into the database, the connection between the unstructured source data and the embeddings computed from it is lost. Embeddings end up being treated as standalone data that developers must manage by hand, when they are really derived data.
The Problem with Current Vector Database Abstractions
Vector databases treat embeddings as independent data, separate from the source data they are derived from. In production, this pushes the job of keeping the two consistent onto application code. Engineering teams commonly end up juggling several systems for a single AI application, and every additional system adds synchronization work and maintenance cost.
For example, a team may use one system for vector embeddings, another for application data, and a third for lexical search. Every insert, update, and delete then has to be propagated to all three. A missed or failed write leaves a stale embedding behind, and the pipelines built to guard against that are error-prone and expensive to operate.
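To make the failure mode concrete, here is a minimal sketch of how embeddings drift when synchronization lives in application code. Everything in it is hypothetical: the three dictionaries stand in for separate systems, and `toy_embed` is a deterministic placeholder for a real embedding model.

```python
import hashlib

# Three in-memory dicts stand in for three separate systems (assumption for illustration).
app_db = {}        # application data: doc_id -> text
vector_store = {}  # vector embeddings: doc_id -> embedding
search_index = {}  # lexical search: doc_id -> set of lowercase tokens


def toy_embed(text):
    """Deterministic placeholder for an embedding model (not a real model)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:8]]


def create_document(doc_id, text):
    """The careful code path: writes to all three systems."""
    app_db[doc_id] = text
    vector_store[doc_id] = toy_embed(text)
    search_index[doc_id] = set(text.lower().split())


def update_document_forgetful(doc_id, text):
    """A later code path that updates two systems and forgets the vector store."""
    app_db[doc_id] = text
    search_index[doc_id] = set(text.lower().split())


def find_stale_embeddings():
    """Return ids whose stored embedding no longer matches the current text."""
    return [
        doc_id
        for doc_id, text in app_db.items()
        if vector_store.get(doc_id) != toy_embed(text)
    ]
```

Calling `create_document` and then `update_document_forgetful` on the same id leaves that id in the list returned by `find_stale_embeddings()`. Nothing in the vector store itself signals the problem, because it has no knowledge of the text its vectors came from.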
A Better Approach: The "Vectorizer" Abstraction
To address these shortcomings, we propose a different abstraction: the "vectorizer." A vectorizer treats embeddings like a database index. It is tied to its source data and is kept in sync automatically as that data is inserted, updated, or deleted. Responsibility for synchronization moves out of application code and into the system that owns the data, which is where the reduction in maintenance cost comes from.
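The index analogy can be sketched in a few lines. This is a toy in-memory model and not the pgai Vectorizer API: `VectorizedTable` and `length_embed` are hypothetical names, and the embedding function is a trivial stand-in for a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Embedding = List[float]


def length_embed(text: str) -> Embedding:
    """Trivial stand-in for an embedding model: [character count, word count]."""
    return [float(len(text)), float(len(text.split()))]


@dataclass
class VectorizedTable:
    """Toy table whose embeddings are maintained as derived data, like an index."""

    embed: Callable[[str], Embedding]
    rows: Dict[int, str] = field(default_factory=dict)
    embeddings: Dict[int, Embedding] = field(default_factory=dict)

    def upsert(self, row_id: int, text: str) -> None:
        # Every write to the source row also refreshes the derived embedding.
        self.rows[row_id] = text
        self.embeddings[row_id] = self.embed(text)

    def delete(self, row_id: int) -> None:
        # Removing the source row removes the derived embedding with it.
        self.rows.pop(row_id, None)
        self.embeddings.pop(row_id, None)
```

The embedding function is supplied once, when the table is created. After that, callers only touch source rows through `upsert` and `delete`, so there is no separate code path that can forget to refresh an embedding.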
For PostgreSQL, we have built an open-source tool called pgai Vectorizer, which works alongside the open-source pgvector and pgvectorscale extensions for vector search. Because PostgreSQL already handles a wide variety of data types, it is a natural platform for implementing vectorizers and supporting AI applications.
The Benefits of the "Vectorizer" Abstraction
By treating embeddings as derived data and implementing the vectorizer abstraction, developers can automate the process of generating and updating embeddings as the underlying data changes. This automation reduces the burden on developers, eliminates the need for manual synchronization logic, and ensures that embeddings remain up to date without requiring constant intervention.
Treating embeddings as derived data and managing them through the "vectorizer" abstraction simplifies the management of AI systems and improves their reliability and scalability. Developers can focus on core business objectives instead of writing and maintaining synchronization code.