Using Pgvector in PostgreSQL for Vector similarity Search

Last update on December 23 2024 07:42:19 (UTC/GMT +8 hours)

Pgvector in PostgreSQL: Vector Similarity for Machine Learning

The pgvector extension in PostgreSQL enables storage, manipulation, and similarity search for high-dimensional vectors directly in a PostgreSQL database. This is highly useful in machine learning and recommendation systems, where storing embeddings (numerical representations of items like images or text) is essential. With pgvector, PostgreSQL can efficiently perform vector similarity searches, making it a viable option for tasks involving nearest neighbor searches and AI-based recommendations.

Syntax and Setup for Pgvector

1. Install Pgvector:

If your PostgreSQL installation supports extensions, you can add pgvector with the following command:

-- Add pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

Ensure you have permissions to create extensions and that your PostgreSQL version supports pgvector.

2. Creating a Table with a Vector Column:

Define a vector column using the vector data type, specifying the dimension (number of values in each vector).

Code:

-- Create a table to store item embeddings (vectors)
CREATE TABLE items (
  id serial PRIMARY KEY,     -- Primary key for each item
  embedding vector(3)        -- Vector column with 3 dimensions
);

3. Inserting Data into Pgvector Table:

Store high-dimensional vectors in the table, typically generated by machine learning models as embeddings.

Code:

-- Insert an example vector into the items table
INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0.3]');

Example: Vector Similarity Search with Pgvector

To perform vector similarity searches, use the <-> operator. This operator computes the Euclidean distance between vectors, which can be used to find the most similar items.

Code:

-- Query to find the top 5 similar items based on vector similarity
SELECT id, embedding
FROM items
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'  -- Using a vector as the search reference
LIMIT 5;

Explanation:

<-> operator: Calculates Euclidean distance between stored vectors and the target vector [0.1, 0.2, 0.3].
Order by similarity: Sorts items by proximity to the target vector, returning the closest matches.

Additional Vector Operations

1. Cosine Similarity:

pgvector also supports cosine similarity, which is often preferred for normalized vectors.

Code:

-- Retrieve items based on cosine similarity
SELECT id, embedding
FROM items
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;

2. Indexing for Performance:

For faster similarity search on large datasets, create an index on the vector column.

Code:

-- Create an index to speed up similarity searches
CREATE INDEX ON items USING ivfflat (embedding);

Note: You must specify a vector index type such as ivfflat to support similarity search.

Explanation of Pgvector and use Cases

1. What is pgvector?

pgvector is a PostgreSQL extension that adds vector data types and similarity search capabilities, essential for handling embeddings directly in SQL.

2. Use Cases:

Recommendation Systems: Use embeddings of items (e.g., user preferences, products) to recommend similar items.
Image/Document Search: Store vector representations of documents or images for fast, similarity-based retrieval.
AI and ML Pipelines: Integrate with AI applications by storing and querying high-dimensional vector embeddings.

All PostgreSQL Questions, Answers, and Code Snippets Collection.