This notebook shows how to use Postgres as a memory store in Semantic Kernel.
The code below pulls the most recent papers from arXiv, creates embeddings from the paper abstracts, and stores them in a Postgres database.
In a future update, we'll use the Postgres vector store to search the database for similar papers based on those embeddings - stay tuned!
import textwrap
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime
from typing import Annotated, Any
import numpy as np
import requests
from semantic_kernel.connectors.ai.open_ai.prompt_execution_settings.open_ai_prompt_execution_settings import (
    OpenAIEmbeddingPromptExecutionSettings,
)
from semantic_kernel.connectors.ai.open_ai.services.azure_text_embedding import AzureTextEmbedding
from semantic_kernel.connectors.ai.open_ai.services.open_ai_text_embedding import OpenAITextEmbedding
from semantic_kernel.connectors.memory.postgres.postgres_collection import PostgresCollection
from semantic_kernel.data.const import DistanceFunction, IndexKind
from semantic_kernel.data.vector_store_model_decorator import vectorstoremodel
from semantic_kernel.data.vector_store_record_fields import (
    VectorStoreRecordDataField,
    VectorStoreRecordKeyField,
    VectorStoreRecordVectorField,
)
from semantic_kernel.data.vector_store_record_utils import VectorStoreRecordUtils
from semantic_kernel.kernel import Kernel
You'll need to set up your environment to provide connection information to Postgres, as well as OpenAI or Azure OpenAI. To do this, copy the .env.example file to .env and fill in the necessary information.

You'll need to provide a connection string to a Postgres database. You can use a local Postgres instance or a cloud-hosted one. Either supply a full connection string, or set individual environment variables with the connection information; see the .env.example file for the POSTGRES_ settings.
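For example, the Postgres entries in .env might look like the following. The variable names here are illustrative, not authoritative; check .env.example for the exact names Semantic Kernel reads:

# Hypothetical .env entries for Postgres (names illustrative; see .env.example):
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DBNAME=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=example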
You can also use Docker to bring up a Postgres instance by following the steps below.

Create an init.sql file that contains the following:
CREATE EXTENSION IF NOT EXISTS vector;
Now you can start a Postgres instance with the following:
docker pull pgvector/pgvector:pg16
docker run --rm -it --name pgvector -p 5432:5432 -v ./init.sql:/docker-entrypoint-initdb.d/init.sql -e POSTGRES_PASSWORD=example pgvector/pgvector:pg16
Note: Use .\init.sql on Windows and ./init.sql on WSL or Linux/Mac.
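To check that the container is up and that init.sql installed the vector extension, you can optionally list the installed extensions with psql inside the container (the container name pgvector comes from the docker run command above):

docker exec -it pgvector psql -U postgres -c "\dx"

You should see vector in the output.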
Then you can use the connection string:
POSTGRES_CONNECTION_STRING="host=localhost port=5432 dbname=postgres user=postgres password=example"
You can use either the OpenAI or Azure OpenAI APIs. Provide the API key and other configuration in the .env file, setting either the OPENAI_ or AZURE_OPENAI_ settings.
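For example, the Azure OpenAI entries might look like the following. These variable names are assumptions; .env.example lists the exact names:

# Hypothetical .env entries (names illustrative; see .env.example):
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
AZURE_OPENAI_API_KEY="..."
# Or, if using OpenAI directly:
OPENAI_API_KEY="..."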
# Path to the environment file
env_file_path = ".env"
Here we set some additional configuration.
# -- ArXiv settings --
# The search term to use when searching for papers on arXiv. All metadata fields for the papers are searched.
SEARCH_TERM = "generative ai"
# The category of papers to search for on arXiv. See https://arxiv.org/category_taxonomy for a list of categories.
ARXIV_CATEGORY = "cs.AI"
# The maximum number of papers to search for on arXiv.
MAX_RESULTS = 10
# -- OpenAI settings --
# Set this flag to False to use the OpenAI API instead of Azure OpenAI
USE_AZURE_OPENAI = True
# The name of the OpenAI model or Azure OpenAI deployment to use
EMBEDDING_MODEL = "text-embedding-3-small"
Here we define a vector store model. This model defines the table and column names for storing the embeddings. We use the @vectorstoremodel decorator to tell Semantic Kernel to create a vector store definition from the model. The VectorStoreRecordField annotations define the fields that will be stored in the database, including key and vector fields.
@vectorstoremodel
@dataclass
class ArxivPaper:
    id: Annotated[str, VectorStoreRecordKeyField()]
    title: Annotated[str, VectorStoreRecordDataField()]
    abstract: Annotated[str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name="abstract_vector")]
    published: Annotated[datetime, VectorStoreRecordDataField()]
    authors: Annotated[list[str], VectorStoreRecordDataField()]
    link: Annotated[str | None, VectorStoreRecordDataField()]
    abstract_vector: Annotated[
        np.ndarray | None,
        VectorStoreRecordVectorField(
            embedding_settings={"embedding": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},
            index_kind=IndexKind.HNSW,
            dimensions=1536,
            distance_function=DistanceFunction.COSINE,
            property_type="float",
            serialize_function=np.ndarray.tolist,
            deserialize_function=np.array,
        ),
    ] = None

    @classmethod
    def from_arxiv_info(cls, arxiv_info: dict[str, Any]) -> "ArxivPaper":
        return cls(
            id=arxiv_info["id"],
            title=arxiv_info["title"].replace("\n ", " "),
            abstract=arxiv_info["abstract"].replace("\n ", " "),
            published=arxiv_info["published"],
            authors=arxiv_info["authors"],
            link=arxiv_info["link"],
        )
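As a quick illustration of the factory method, from_arxiv_info maps a metadata dictionary (shaped like the ones returned by the query function below) onto the model. The values here are dummies:

# Dummy metadata, just to show the shape from_arxiv_info expects:
sample = ArxivPaper.from_arxiv_info({
    "id": "2401.00001v1",
    "title": "An Example Title",
    "abstract": "An example abstract.",
    "published": "2024-01-01T00:00:00Z",
    "authors": ["A. Author"],
    "link": "http://arxiv.org/abs/2401.00001v1",
})
print(sample.id, sample.title)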
Below is a function that queries the arXiv API for the most recent papers matching our search query and category.
def query_arxiv(search_query: str, category: str = "cs.AI", max_results: int = 10) -> list[dict[str, Any]]:
    """Query the arXiv API and return a list of dictionaries with relevant metadata for each paper.

    Args:
        search_query: The search term or topic to query for.
        category: The category to restrict the search to (default is "cs.AI").
            See https://arxiv.org/category_taxonomy for a list of categories.
        max_results: Maximum number of results to retrieve (default is 10).
    """
    response = requests.get(
        "http://export.arxiv.org/api/query?"
        f"search_query=all:%22{search_query.replace(' ', '+')}%22"
        f"+AND+cat:{category}&start=0&max_results={max_results}&sortBy=lastUpdatedDate&sortOrder=descending"
    )
    root = ET.fromstring(response.content)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    return [
        {
            "id": entry.find("atom:id", ns).text.split("/")[-1],
            "title": entry.find("atom:title", ns).text,
            "abstract": entry.find("atom:summary", ns).text,
            "published": entry.find("atom:published", ns).text,
            "link": entry.find("atom:id", ns).text,
            "authors": [author.find("atom:name", ns).text for author in entry.findall("atom:author", ns)],
            "categories": [category_tag.get("term") for category_tag in entry.findall("atom:category", ns)],
            "pdf_link": next(
                (link_tag.get("href") for link_tag in entry.findall("atom:link", ns) if link_tag.get("title") == "pdf"),
                None,
            ),
        }
        for entry in root.findall("atom:entry", ns)
    ]
We use this function to query papers and store them in memory as our model types.
arxiv_papers: list[ArxivPaper] = [
    ArxivPaper.from_arxiv_info(paper)
    for paper in query_arxiv(SEARCH_TERM, category=ARXIV_CATEGORY, max_results=MAX_RESULTS)
]
print(f"Found {len(arxiv_papers)} papers on '{SEARCH_TERM}'")
Create a PostgresCollection, which represents the table in Postgres where we will store the paper information and embeddings.
collection = PostgresCollection[str, ArxivPaper](
    collection_name="arxiv_papers", data_model_type=ArxivPaper, env_file_path=env_file_path
)
Create a Kernel and add the TextEmbedding service, which will be used to generate embeddings of the abstract for each paper.
kernel = Kernel()

if USE_AZURE_OPENAI:
    text_embedding = AzureTextEmbedding(
        service_id="embedding", deployment_name=EMBEDDING_MODEL, env_file_path=env_file_path
    )
else:
    text_embedding = OpenAITextEmbedding(
        service_id="embedding", ai_model_id=EMBEDDING_MODEL, env_file_path=env_file_path
    )

kernel.add_service(text_embedding)
Here we use VectorStoreRecordUtils to add embeddings to our models.
records = await VectorStoreRecordUtils(kernel).add_vector_to_records(arxiv_papers, data_model_type=ArxivPaper)
Now that the models have embeddings, we can write them into the Postgres database.
async with collection:
    await collection.create_collection_if_not_exists()
    keys = await collection.upsert_batch(records)
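As an optional sanity check, upsert_batch returns the keys of the upserted records, so a simple count confirms the write:

# Confirm how many records made it into the table:
print(f"Upserted {len(keys)} records")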
Here we retrieve the first few models from the database and print out their information.
async with collection:
    results = await collection.get_batch(keys[:3])
    if results:
        for result in results:
            print(f"# {result.title}")
            print()
            wrapped_abstract = textwrap.fill(result.abstract, width=80)
            print(f"Abstract: {wrapped_abstract}")
            print(f"Published: {result.published}")
            print(f"Link: {result.link}")
            print(f"Authors: {', '.join(result.authors)}")
            print(f"Embedding: {result.abstract_vector}")
            print()
            print()
...searching Postgres memory coming soon, to be continued!