Working with AI data stored in Postgres tables

Suggest edits

The examples on this page are about working with AI data stored in columns in the Postgres table.

To see how to use AI data stored in S3-compatible object storage, skip to working with AI data in S3.

Begin by creating a Postgres table for some test AI data:

CREATE TABLE products (
   product_id SERIAL PRIMARY KEY,
   product_name TEXT NOT NULL,
   description TEXT,
   last_updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
Output
CREATE TABLE

Working with auto embedding

Next, you are going to create a retriever with the just created products table as the source using the pgai.create_pg_retriever function which has this syntax:

pgai.create_pg_retriever(
    retriever_name text,
    schema_name text,
    primary_key text,
    model_name text,
    data_type text,
    source_table text,
    columns text[],
    auto_embedding boolean
)
  • The retriever_name is used to identify and reference the retriever; set it to product_embeddings_auto for this example.
  • The schema_name is the schema where the source table is located; set this to public.
  • The primary_key is the primary key column of the source table.
  • The model_name is the name of the embeddings encoder model for similarity data; set it to all-MiniLM-L6-v2 to use the open encoder model for text data from HuggingFace.
  • The data_type is the type of data in the source table, which could be either img or text. Set it to text.
  • The source_table is the name of the source table. The source table created previously, is products so set it to that.
  • The columns is an array of columns to use for the similarity search by the retriever. Set this to ARRAY['product_name', 'description'] to use the product_name and description columns.
  • The auto_embedding is a boolean value to set a trigger for auto embeddings. Set it to TRUE so that any future insert, update or delete to the source table shall automatically generate, update or delete also the corresponding embedding.

This gives the following SQL command:

SELECT pgai.create_pg_retriever(
    'product_embeddings_auto', -- Retriever name
    'public', -- Schema
    'product_id', -- Primary key
    'all-MiniLM-L6-v2', -- embedding model
    'text', -- data type
    'products', -- Source table
    ARRAY['product_name', 'description'], -- Columns to vectorize
    TRUE -- auto embeddings TRUE to set trigger
);
Output
 create_pg_retriever 
---------------------
 
(1 row)

You have now created a retriever for the products table. The next step is to insert some AI data records into the products table. Since you set auto_embedding to true, the retriever shall automatically generate all embeddings in real-time for each inserted record:

INSERT INTO products (product_name, description) VALUES
      ('Hamburger', 'Tasty'),
      ('Cheesburger', 'Very tasty'),
      ('Fish n Chips', 'Naa'),
      ('Fries', 'Dunno'),
      ('Burrito', 'Always'),
      ('Pizza', 'Mkay'),
      ('Sandwich', 'So what'),
      ('Veggie Burger', 'Go away'),
      ('Kebab', 'Maybe');
Output
INSERT 0 9

Now you can use the retriever, by specifying the retriever name, to perform a similarity retrieval of the top K most relevant, in this case most similar, AI data items. You can do this by running the pgai.retrieve function with the required parameters:

pgai.retrieve(
    query text,
    top_k integer,
    retriever_name text
)
  • The query is the text to use to retrieve the top similar data. Set it to I like it.
  • The top_k is the number of top similar data items to retrieve. Set this to 5
  • The retriever_name is the name of the retriever. The retriever's name is product_embeddings_auto.

This gives the following SQL command:

SELECT data FROM pgai.retrieve(
'I like it', -- The query text to retrieve the top similar data
 5, -- top K
'product_embeddings_auto' -- retriever's name
);
Output
data                 
-------------------------------------
 {'data': 'Hamburger - Tasty'}
 {'data': 'Cheesburger - Very tasty'}
 {'data': 'Fries - Dunno'}
 {'data': 'Sandwich - So what'}
 {'data': 'Kebab - Maybe'}
(5 rows)

Working without auto embedding

You can now create a retriever without auto embedding. This means that the application has control over when the embeddings computation occurs. It also means that the computation is a bulk operation. For demonstration you can simply create a second retriever for the same products table that you just previously created the first retriever for, but setting auto_embedding to false.

SELECT pgai.create_pg_retriever(
    'product_embeddings_bulk', -- Retriever name
    'public', -- Schema
    'product_id', -- Primary key
    'all-MiniLM-L6-v2', -- embedding model
    'text', -- data type
    'products', -- Source table
    ARRAY['product_name', 'description'], -- Columns to vectorize
    FALSE -- auto embeddings FALSE
);
Output
 create_pg_retriever 
---------------------
 
(1 row)

The AI records are already in the table though. As this second retriever is newly created, it won't have created any embeddings. Running pgai.retrieve using the retriever now doesn't return any results:

SELECT data FROM pgai.retrieve(
'I like it', -- The query text to retrieve the top similar data
 5, -- top K
'product_embeddings_bulk' -- retriever's name
);
Output
data 
------
(0 rows)

You need to run a bulk generation of embeddings before performing any retrieval. You can do this using the pgai.refresh_retriever function:

pgai.refresh_retriever(
    retriever_name text
)

The retriever_name is the name of the retriever. Our retriever's name is product_embeddings_bulk. So the SQL command is:

SELECT pgai.refresh_retriever(
'product_embeddings_bulk' -- name of the retriever
);
Output
INFO:  inserted table name public._pgai_embeddings_product_embeddings_bulk
 refresh_retriever
-------------------

(1 row)

You can now run that retrieve operation using the second retriever and get the same results as with the first retriever:

SELECT data FROM pgai.retrieve(
'I like it', -- The query text to retrieve the top similar data
 5, -- top K
'product_embeddings_bulk' -- retriever's name
);
Output
data                 
-------------------------------------
 {'data': 'Hamburger - Tasty'}
 {'data': 'Cheesburger - Very tasty'}
 {'data': 'Fries - Dunno'}
 {'data': 'Sandwich - So what'}
 {'data': 'Kebab - Maybe'}
(5 rows)

The next step is to see what happens if when you add more AI data records:

INSERT INTO products (product_name, description) VALUES
      ('Chicken Nuggets', 'Never'),
      ('Ramen', 'Delicious');
Output
INSERT 0 2

The new data is automatically picked up in the retrieval from the first retriever with auto embeddings:

SELECT data FROM pgai.retrieve(
'I like it', -- The query text to retrieve the top similar data
 5, -- top K
'product_embeddings_auto' -- retriever's name
);
Output
data                 
--------------------------------------
 {'data': 'Hamburger - Tasty'}
 {'data': 'Cheesburger - Very tasty'}
 {'data': 'Sandwich - So what'}
 {'data': 'Kebab - Maybe'}
 {'data': 'Ramen - Delicious'}
(5 rows)

The second retriever without auto embedding doesn't reflect the new data. It can only do so when once there has been another explicit call to pgai.refresh_retriever. Until then, the results don't change:

SELECT data FROM pgai.retrieve(
'I like it', -- The query text to retrieve the top similar data
 5, -- top K
'product_embeddings_bulk' -- retriever's name
);
Output
data                 
-------------------------------------
 {'data': 'Hamburger - Tasty'}
 {'data': 'Cheesburger - Very tasty'}
 {'data': 'Fries - Dunno'}
 {'data': 'Sandwich - So what'}
 {'data': 'Kebab - Maybe'}
(5 rows)

If you now call pgai.refresh_retriever() again, the embeddings computation uses the new data to refresh the embeddings:

SELECT pgai.refresh_retriever(
'product_embeddings_bulk' -- name of the retriever
);
Output
INFO:  inserted table name public._pgai_embeddings_product_embeddings_bulk
 refresh_retriever
-------------------

And the new data shows up in the results of the query when you call the pgai.retrieve function again:

SELECT data FROM pgai.retrieve(
'I like it', -- The query text to retrieve the top similar data
 5, -- top K
'product_embeddings_bulk' -- retriever's name
);
Output
data                 
--------------------------------------
 {'data': 'Hamburger - Tasty'}
 {'data': 'Cheesburger - Very tasty'}
 {'data': 'Sandwich - So what'}
 {'data': 'Kebab - Maybe'}
 {'data': 'Ramen - Delicious'}
(5 rows)

You used the two different retrievers for the same source data just to demonstrate the workings of auto embedding compared to explicit refresh_retriever().

In practice you may want to combine auto embedding and refresh_retriever() in a single retriever to conduct an initial embedding of data that existed before you created the retriever and then rely on auto embedding for any future data that's ingested, updated, or deleted.

You should consider relying on pgai.refresh_retriever, and not using auto embedding, if you typically ingest a lot of AI data at once as a batch.


Could this page be better? Report a problem or suggest an addition!