Vectors: Building support for a nasty data type
Wednesday, September 11 at 15:50–16:40
Vectors are a centuries-old, well-studied mathematical concept, yet they pose many challenges around efficient storage and retrieval in database systems. The growing ease-of-use of AI/ML has led to a surge of interest in storing vector data alongside application data, which brings some unique challenges.
While we've seen other data types emerge and PostgreSQL handle them effectively (geospatial, JSON), vectors pose unique challenges:
- Comparing two vectors requires computing a "distance" across every dimension. This is costly when your vectors have 1,536 dimensions
- You can't compress or reduce the size of vectors without losing information
- To find the most similar vectors, you have to look at EVERY vector in your data set (a sequential scan). Approximation techniques may miss results.
In other words: there are no shortcuts.
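To make the cost above concrete, here is a minimal sketch (not pgvector's implementation) of exact nearest-neighbor search: every comparison touches every dimension, and an exact answer requires scanning every stored vector.

```python
import math

def l2_distance(a, b):
    # Exact Euclidean distance touches every dimension: O(d) per comparison.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, vectors):
    # An exact answer requires a sequential scan of all n vectors: O(n * d).
    return min(vectors, key=lambda v: l2_distance(query, v))

# Tiny 2-dimensional example; a real embedding might have 1,536 dimensions.
data = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(nearest([0.9, 1.2], data))  # → [1.0, 1.0]
```

Index structures like the ones pgvector provides avoid the full scan by approximating this search, trading some recall for speed.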
pgvector, an open source PostgreSQL extension, has emerged as one of the most popular tools for storing and retrieving vector data. Over the course of 2023, pgvector made big advances in performance and recall, and it can now efficiently handle workloads of billions of vectors. However, there is still more work we can do in PostgreSQL and pgvector to efficiently store and process vectors.
In this session, we'll begin with a brief overview of how vectors are used with databases. We'll then dive deep into the challenges vectors pose to databases and how existing pgvector functionality addresses them. Finally, we'll focus on emerging challenges with vector data and how we can work to resolve them in both PostgreSQL and pgvector.