OpenAI Agent + Query Engine Experimental Cookbook¶

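In this notebook, we try out the OpenAIAgent across a variety of query engine tools and datasets. We explore how OpenAIAgent compares to, and can replace, existing workflows solved by our retrievers and query engines: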

Auto retrieval
Joint SQL and vector search

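NOTE: Any Text-to-SQL application should be aware that executing arbitrary SQL queries can be a security risk. It is recommended to take precautions as needed, such as using restricted roles, read-only databases, sandboxing, etc.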

Auto Retrieval from a Vector Database¶

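Our existing "auto-retrieval" capabilities (in VectorIndexAutoRetriever) allow an LLM to infer the right query parameters for a vector database, including both the query string and the metadata filters.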

Since the OpenAI Function API can infer function parameters, we explore its ability to perform auto-retrieval here.
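To make the idea of "inferring function parameters" concrete, here is a minimal sketch of the kind of JSON Schema tool definition a function-calling model is given, and of how the returned arguments might be checked. The tool name `search_celebrities` and its fields are hypothetical and written by hand for illustration; this is not the schema LlamaIndex generates.

```python
import json

# Hypothetical tool definition in the JSON Schema style used for function calling.
# The function name and parameter names below are illustrative assumptions.
tool_definition = {
    "name": "search_celebrities",
    "description": "Look up celebrity biographies with optional metadata filters.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "natural language query"},
            "filter_key_list": {
                "type": "array",
                "items": {"type": "string"},
                "description": "metadata field names to filter on",
            },
        },
        "required": ["query", "filter_key_list"],
    },
}

# The model returns its chosen arguments as a JSON string, which must be parsed.
example_arguments = '{"query": "athletes", "filter_key_list": ["category"]}'
parsed = json.loads(example_arguments)

# Check that every required parameter is present before calling the function.
required = tool_definition["parameters"]["required"]
missing = [name for name in required if name not in parsed]
print(missing)
```

Because the model only sees the schema and the descriptions, the wording of each `description` field is effectively part of the prompt.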

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [ ]:

%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
%pip install llama-index-readers-wikipedia
%pip install llama-index-vector-stores-pinecone

In [ ]:

!pip install llama-index

In [ ]:

import pinecone
import os

api_key = os.environ["PINECONE_API_KEY"]
pinecone.init(api_key=api_key, environment="us-west4-gcp-free")

In [ ]:

import os
import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
import openai

openai.api_key = "sk-<your-key>"

In [ ]:

# dimensions are for text-embedding-ada-002
try:
    pinecone.create_index(
        "quickstart-index", dimension=1536, metric="euclidean", pod_type="p1"
    )
except Exception:
    # most likely the index already exists
    pass

In [ ]:

pinecone_index = pinecone.Index("quickstart-index")

In [ ]:

# Optional: delete data in your pinecone index
pinecone_index.delete(deleteAll=True, namespace="test")

Out[ ]:

{}

In [ ]:

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore

In [ ]:

from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
            "gender": "male",
            "born": 1963,
        },
    ),
    TextNode(
        text=(
            "Angelina Jolie is an American actress, filmmaker, and"
            " humanitarian. She has received numerous awards for her acting"
            " and is known for her philanthropic work."
        ),
        metadata={
            "category": "Entertainment",
            "country": "United States",
            "gender": "female",
            "born": 1975,
        },
    ),
    TextNode(
        text=(
            "Elon Musk is a business magnate, industrial designer, and"
            " engineer. He is the founder, CEO, and lead designer of SpaceX,"
            " Tesla, Inc., Neuralink, and The Boring Company."
        ),
        metadata={
            "category": "Business",
            "country": "United States",
            "gender": "male",
            "born": 1971,
        },
    ),
    TextNode(
        text=(
            "Rihanna is a Barbadian singer, actress, and businesswoman. She"
            " has achieved significant success in the music industry and is"
            " known for her versatile musical style."
        ),
        metadata={
            "category": "Music",
            "country": "Barbados",
            "gender": "female",
            "born": 1988,
        },
    ),
    TextNode(
        text=(
            "Cristiano Ronaldo is a Portuguese professional footballer who is"
            " considered one of the greatest football players of all time. He"
            " has won numerous awards and set multiple records during his"
            " career."
        ),
        metadata={
            "category": "Sports",
            "country": "Portugal",
            "gender": "male",
            "born": 1985,
        },
    ),
]

In [ ]:

vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index, namespace="test"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [ ]:

index = VectorStoreIndex(nodes, storage_context=storage_context)

Upserted vectors: 100%|██████████| 5/5 [00:00<00:00,  5.79it/s]

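Define Function Tool¶

Here we define the function interface, which is passed to OpenAI to perform auto-retrieval.

We were not able to get OpenAI to work with nested pydantic objects or tuples as arguments, so we converted the metadata filter keys and values into lists for the function API to work with.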
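The flattening described above can be sketched in plain Python. The helper names `flatten_filters` and `rebuild_filters` are hypothetical and are not part of LlamaIndex; they only show how a list of (key, operator, value) triples maps to parallel lists and back.

```python
from typing import Any, List, Tuple

Triple = Tuple[str, str, Any]


def flatten_filters(filters: List[Triple]):
    """Split (key, operator, value) triples into three parallel lists."""
    keys = [key for key, _, _ in filters]
    operators = [op for _, op, _ in filters]
    values = [value for _, _, value in filters]
    return keys, operators, values


def rebuild_filters(keys: List[str], operators: List[str], values: List[Any]) -> List[Triple]:
    """Zip parallel lists back into triples, rejecting mismatched lengths."""
    if not (len(keys) == len(operators) == len(values)):
        raise ValueError("filter lists must all have the same length")
    return list(zip(keys, operators, values))


nested = [("category", "==", "Business"), ("born", ">", 1950)]
keys, operators, values = flatten_filters(nested)
print(keys, operators, values)

roundtrip = rebuild_filters(keys, operators, values)
print(roundtrip == nested)
```

The explicit length check matters because Python's `zip` silently truncates to the shortest input, so a model that returns two keys but only one operator would otherwise lose a filter without any error.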

In [ ]:

# define function tool
from llama_index.core.tools import FunctionTool
from llama_index.core.vector_stores import (
    VectorStoreInfo,
    MetadataInfo,
    MetadataFilter,
    MetadataFilters,
    FilterCondition,
    FilterOperator,
)
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

from typing import List, Tuple, Any
from pydantic import BaseModel, Field

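# hardcode top k for now
top_k = 3

# define vector store info describing the schema of the vector store
vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description=(
                "Category of the celebrity, one of [Sports, Entertainment,"
                " Business, Music]"
            ),
        ),
        MetadataInfo(
            name="country",
            type="str",
            description=(
                "Country of the celebrity, one of [United States, Barbados,"
                " Portugal]"
            ),
        ),
        MetadataInfo(
            name="gender",
            type="str",
            description=("Gender of the celebrity, one of [male, female]"),
        ),
        MetadataInfo(
            name="born",
            type="int",
            description=("Birth year of the celebrity, can be any integer"),
        ),
    ],
)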

In [ ]:

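# define pydantic model for the auto-retrieval function
class AutoRetrieveModel(BaseModel):
    query: str = Field(..., description="natural language query string")
    filter_key_list: List[str] = Field(
        ..., description="List of metadata filter field names"
    )
    filter_value_list: List[Any] = Field(
        ...,
        description=(
            "List of metadata filter field values (corresponding to names"
            " specified in filter_key_list)"
        ),
    )
    filter_operator_list: List[str] = Field(
        ...,
        description=(
            "List of metadata filter operators, one per filter (each one of"
            " <, <=, >, >=, ==, !=)"
        ),
    )
    filter_condition: str = Field(
        ...,
        description=("Condition used to combine the filters (AND or OR)"),
    )


description = f"""\
Use this tool to look up biographical information about celebrities.
The vector database schema is given below:
{vector_store_info.json()}
"""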

Define AutoRetrieve Functions

In [ ]:

def auto_retrieve_fn(
    query: str,
    filter_key_list: List[str],
    filter_value_list: List[Any],
    filter_operator_list: List[str],
    filter_condition: str,
):
    """Auto retrieval function.

    Performs auto-retrieval from a vector database, applying a set of metadata filters.

    """
    query = query or "Query"

    metadata_filters = [
        MetadataFilter(key=k, value=v, operator=op)
        for k, v, op in zip(
            filter_key_list, filter_value_list, filter_operator_list
        )
    ]
    retriever = VectorIndexRetriever(
        index,
        filters=MetadataFilters(
            filters=metadata_filters, condition=filter_condition
        ),
        similarity_top_k=top_k,
    )
    query_engine = RetrieverQueryEngine.from_args(retriever)

    response = query_engine.query(query)
    return str(response)


auto_retrieve_tool = FunctionTool.from_defaults(
    fn=auto_retrieve_fn,
    name="celebrity_bios",
    description=description,
    fn_schema=AutoRetrieveModel,
)
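To see what the filter arguments mean independently of any vector store, the following dependency-free sketch applies the same key/value/operator lists to the metadata of the five nodes defined earlier. The helpers `matches` and `select` are hypothetical; a real vector store evaluates these filters server-side and combines them with similarity search.

```python
import operator
from typing import Any, Dict, List

# Map the operator strings used in the tool schema to Python comparisons.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

# Metadata copied from the nodes defined earlier in this notebook.
CELEBRITIES: Dict[str, Dict[str, Any]] = {
    "Michael Jordan": {"category": "Sports", "country": "United States", "gender": "male", "born": 1963},
    "Angelina Jolie": {"category": "Entertainment", "country": "United States", "gender": "female", "born": 1975},
    "Elon Musk": {"category": "Business", "country": "United States", "gender": "male", "born": 1971},
    "Rihanna": {"category": "Music", "country": "Barbados", "gender": "female", "born": 1988},
    "Cristiano Ronaldo": {"category": "Sports", "country": "Portugal", "gender": "male", "born": 1985},
}


def matches(metadata: Dict[str, Any], keys: List[str], values: List[Any],
            ops: List[str], condition: str = "and") -> bool:
    """Evaluate every filter against one metadata dict and combine the results."""
    results = [
        key in metadata and OPERATORS[op](metadata[key], value)
        for key, value, op in zip(keys, values, ops)
    ]
    return all(results) if condition.lower() == "and" else any(results)


def select(keys: List[str], values: List[Any], ops: List[str],
           condition: str = "and") -> List[str]:
    """Return the names whose metadata satisfies the combined filters."""
    return [name for name, md in CELEBRITIES.items()
            if matches(md, keys, values, ops, condition)]


born_after_1980 = select(["born"], [1980], [">"])
business_after_1950 = select(["category", "born"], ["Business", 1950], ["==", ">"])
print(born_after_1980)
print(business_after_1950)
```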

Initialize Agent¶

In [ ]:

from llama_index.agent.openai import OpenAIAgent
from llama_index.llms.openai import OpenAI

agent = OpenAIAgent.from_tools(
    [auto_retrieve_tool],
    llm=OpenAI(temperature=0, model="gpt-4-0613"),
    verbose=True,
)

In [ ]:

response = agent.chat("Tell me about two celebrities from the United States. ")
print(str(response))

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: celebrity_bios with args: {
"query": "celebrities from the United States",
"filter_key_list": ["country"],
"filter_value_list": ["United States"],
"filter_operator_list": ["=="],
"filter_condition": "and"
}
Got output: Angelina Jolie and Michael Jordan are both celebrities from the United States.
========================

STARTING TURN 2
---------------

Here are two celebrities from the United States:

1. **Angelina Jolie**: She is an American actress, filmmaker, and humanitarian. The recipient of numerous accolades, including an Academy Award and three Golden Globe Awards, she has been named Hollywood's highest-paid actress multiple times.

2. **Michael Jordan**: He is a former professional basketball player and the principal owner of the Charlotte Hornets of the National Basketball Association (NBA). He played 15 seasons in the NBA, winning six championships with the Chicago Bulls. He is considered one of the greatest players in the history of the NBA.

In [ ]:

response = agent.chat("Tell me about two celebrities born after 1980. ")
print(str(response))

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: celebrity_bios with args: {
"query": "celebrities born after 1980",
"filter_key_list": ["born"],
"filter_value_list": [1980],
"filter_operator_list": [">"],
"filter_condition": "and"
}
Got output: Rihanna and Cristiano Ronaldo are both celebrities who were born after 1980.
========================

STARTING TURN 2
---------------

Here are two celebrities who were born after 1980:

1. **Rihanna**: She is a Barbadian singer, actress, and businesswoman. Born in Saint Michael and raised in Bridgetown, Barbados, Rihanna was discovered by American record producer Evan Rogers who invited her to the United States to record demo tapes. She rose to fame with her debut album "Music of the Sun" and its follow-up "A Girl like Me".

2. **Cristiano Ronaldo**: He is a Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal

In [ ]:

response = agent.chat(
    "Tell me about few celebrities under category business and born after 1950. "
)
print(str(response))

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: celebrity_bios with args: {
"query": "business celebrities born after 1950",
"filter_key_list": ["category", "born"],
"filter_value_list": ["Business", 1950],
"filter_operator_list": ["==", ">"],
"filter_condition": "and"
}
Got output: Elon Musk is a notable business celebrity who was born in 1971.
========================

STARTING TURN 2
---------------

Elon Musk is a business celebrity who was born after 1950. He is a business magnate and investor. He is the founder, CEO, CTO, and chief designer of SpaceX; early investor, CEO and product architect of Tesla, Inc.; founder of The

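Joint Text-to-SQL and Semantic Search¶

This is currently handled by our SQLAutoVectorQueryEngine.

Let's try implementing this by giving our OpenAIAgent access to two query tools: SQL and Vector.

NOTE: Any Text-to-SQL application should be aware that executing arbitrary SQL queries can be a security risk. It is recommended to take precautions as needed, such as using restricted roles, read-only databases, sandboxing, etc.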

Load and Index Structured Data¶

We load sample structured datapoints into a SQL database and index it.

In [ ]:

from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    select,
    column,
)
from llama_index.core import SQLDatabase
from llama_index.core.indices import SQLStructStoreIndex

engine = create_engine("sqlite:///:memory:", future=True)
metadata_obj = MetaData()

In [ ]:

# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)

metadata_obj.create_all(engine)

In [ ]:

# print tables
metadata_obj.tables.keys()

Out[ ]:

dict_keys(['city_stats'])

In [ ]:

from sqlalchemy import insert

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        cursor = connection.execute(stmt)

In [ ]:

with engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM city_stats")
    print(cursor.fetchall())

[('Toronto', 2930000, 'Canada'), ('Tokyo', 13960000, 'Japan'), ('Berlin', 3645000, 'Germany')]
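For reference, the kind of SQL a text-to-SQL engine is expected to produce for a question like "which city has the highest population?" can be run directly. This sketch rebuilds the same three rows with the standard `sqlite3` module rather than SQLAlchemy, purely to stay dependency-free; the query text is hand-written, not captured from the LLM.

```python
import sqlite3

# Recreate the city_stats table from this notebook in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name TEXT PRIMARY KEY, population INTEGER, country TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2930000, "Canada"),
        ("Tokyo", 13960000, "Japan"),
        ("Berlin", 3645000, "Germany"),
    ],
)

# Hand-written SQL for "the city with the highest population".
largest = conn.execute(
    "SELECT city_name, population FROM city_stats "
    "ORDER BY population DESC LIMIT 1"
).fetchone()
conn.close()

print(largest)
```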

In [ ]:

sql_database = SQLDatabase(engine, include_tables=["city_stats"])

In [ ]:

from llama_index.core.query_engine import NLSQLTableQueryEngine

In [ ]:

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["city_stats"],
)

Load and Index Unstructured Data¶

We load unstructured data into a vector index backed by Pinecone.

In [ ]:

# install wikipedia python package
!pip install wikipedia

In [ ]:

from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

In [ ]:

cities = ["Toronto", "Berlin", "Tokyo"]
wiki_docs = WikipediaReader().load_data(pages=cities)

In [ ]:

# define pinecone index
import pinecone
import os

api_key = os.environ["PINECONE_API_KEY"]
pinecone.init(api_key=api_key, environment="us-west1-gcp")

# dimensions are for text-embedding-ada-002
# pinecone.create_index("quickstart", dimension=1536, metric="euclidean", pod_type="p1")
pinecone_index = pinecone.Index("quickstart")

In [ ]:

# OPTIONAL: delete all
pinecone_index.delete(deleteAll=True)

Out[ ]:

{}

In [ ]:

from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.llms.openai import OpenAI

# define node parser and LLM
Settings.llm = OpenAI(temperature=0, model="gpt-4")
Settings.node_parser = TokenTextSplitter(chunk_size=1024)

# define pinecone vector index
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index, namespace="wiki_cities"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex([], storage_context=storage_context)

In [ ]:

# Insert documents into the vector index
# Each document has the city's metadata attached
for city, wiki_doc in zip(cities, wiki_docs):
    nodes = Settings.node_parser.get_nodes_from_documents([wiki_doc])
    # add metadata to each node
    for node in nodes:
        node.metadata = {"title": city}
    vector_index.insert_nodes(nodes)

Upserted vectors: 100%|█████████████████████████████████████████████████| 20/20 [00:00<00:00, 38.13it/s]
Upserted vectors: 100%|████████████████████████████████████████████████| 21/21 [00:00<00:00, 101.89it/s]
Upserted vectors: 100%|█████████████████████████████████████████████████| 13/13 [00:00<00:00, 97.91it/s]
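The insertion loop above splits each document into chunks and tags every chunk with its city title. A simplified sketch of that idea follows; it splits on whitespace instead of using a real tokenizer, and the helper names and sample sentence are made up for illustration, so chunk boundaries will differ from what `TokenTextSplitter` produces.

```python
from typing import Any, Dict, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size whitespace tokens."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


def tag_chunks(title: str, text: str, chunk_size: int) -> List[Dict[str, Any]]:
    """Attach the same title metadata to every chunk of one document."""
    return [
        {"text": chunk, "metadata": {"title": title}}
        for chunk in chunk_words(text, chunk_size)
    ]


# Made-up sample text; the notebook uses full Wikipedia articles instead.
sample = "Berlin is the capital and largest city of Germany"
records = tag_chunks("Berlin", sample, chunk_size=4)
for record in records:
    print(record)
```

Tagging every chunk with `title` is what later lets the auto-retriever restrict a search to a single city through a metadata filter.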

Define Query Engines / Tools¶

In [ ]:

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.tools import QueryEngineTool


vector_store_info = VectorStoreInfo(
    content_info="articles about different cities",
    metadata_info=[
        MetadataInfo(
            name="title", type="str", description="The name of the city"
        ),
    ],
)
vector_auto_retriever = VectorIndexAutoRetriever(
    vector_index, vector_store_info=vector_store_info
)

retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever,
)

In [ ]:

sql_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="sql_tool",
    description=(
        "Useful for translating a natural language query into a SQL query over"
        " a table containing: city_stats, containing the population/country of"
        " each city"
    ),
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    name="vector_tool",
    description=(
        "Useful for answering semantic questions about different cities"
    ),
)

Initialize Agent¶

In [ ]:

from llama_index.agent.openai import OpenAIAgent
from llama_index.llms.openai import OpenAI

agent = OpenAIAgent.from_tools(
    [sql_tool, vector_tool],
    llm=OpenAI(temperature=0, model="gpt-4-0613"),
    verbose=True,
)

In [ ]:

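# NOTE: gpt-3.5 gives the wrong answer, but gpt-4 is able to reason over both loops
response = agent.chat(
    "Tell me about the arts and culture of the city with the highest"
    " population"
)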
print(str(response))

=== Calling Function ===
Calling function: sql_tool with args: {
  "input": "SELECT city FROM city_stats ORDER BY population DESC LIMIT 1"
}
Got output:  The city with the highest population is Tokyo.
========================
=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "Tell me about the arts and culture of Tokyo"
}
Got output: Tokyo has a rich arts and culture scene, with many theaters for performing arts, including national and private theaters for traditional forms of Japanese drama. Noteworthy theaters are the National Noh Theatre for noh and the Kabuki-za for Kabuki. Symphony orchestras and other musical organizations perform modern and traditional music. The New National Theater Tokyo in Shibuya is the national center for the performing arts, including opera, ballet, contemporary dance, and drama. Tokyo also hosts modern Japanese and international pop and rock music at various venues, ranging from intimate clubs to internationally known areas such as the Nippon Budokan.

Many different festivals occur throughout Tokyo, with major events including the Sannō at Hie Shrine, the Sanja at Asakusa Shrine, and the biennial Kanda Festivals. Annually on the last Saturday of July, a massive fireworks display over the Sumida River attracts over a million viewers. Once cherry blossoms bloom in spring, residents gather in Ueno Park, Inokashira Park, and the Shinjuku Gyoen National Garden for picnics under the blossoms. Harajuku, a neighborhood in Shibuya, is known internationally for its youth style, fashion, and cosplay.

Tokyo is also renowned for its fine dining, with Michelin awarding a significant number of stars to the city's restaurants. As of 2017, 227 restaurants in Tokyo have been awarded Michelin stars, surpassing the number awarded in Paris.
========================
Tokyo, the city with the highest population, has a rich arts and culture scene. It is home to many theaters for performing arts, including national and private theaters for traditional forms of Japanese drama such as Noh and Kabuki. The New National Theater Tokyo in Shibuya is the national center for the performing arts, including opera, ballet, contemporary dance, and drama.

Tokyo also hosts modern Japanese and international pop and rock music at various venues, ranging from intimate clubs to internationally known areas such as the Nippon Budokan.

The city is known for its festivals, with major events including the Sannō at Hie Shrine, the Sanja at Asakusa Shrine, and the biennial Kanda Festivals. Once cherry blossoms bloom in spring, residents gather in Ueno Park, Inokashira Park, and the Shinjuku Gyoen National Garden for picnics under the blossoms.

Harajuku, a neighborhood in Shibuya, is known internationally for its youth style, fashion, and cosplay. Tokyo is also renowned for its fine dining, with Michelin awarding a significant number of stars to the city's restaurants. As of 2017, 227 restaurants in Tokyo have been awarded Michelin stars, surpassing the number awarded in Paris.
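The trace above shows a two-hop plan: the agent first calls `sql_tool`, then feeds its result into a `vector_tool` call. A minimal sketch of the dispatch step that sits between the model and the tools is shown below. Both tool functions are stubs returning canned strings, and `dispatch` is a hypothetical helper, not the OpenAIAgent implementation.

```python
import json
from typing import Callable, Dict


def sql_tool(input: str) -> str:
    # Stub: a real tool would translate the question to SQL and execute it.
    return "Tokyo"


def vector_tool(input: str) -> str:
    # Stub: a real tool would run retrieval and synthesis over the vector index.
    return f"(retrieved passages for: {input})"


# Registry mapping the tool names the model may emit to Python callables.
TOOLS: Dict[str, Callable[..., str]] = {
    "sql_tool": sql_tool,
    "vector_tool": vector_tool,
}


def dispatch(name: str, arguments_json: str) -> str:
    """Look up a tool by name and call it with JSON-encoded keyword arguments."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return TOOLS[name](**kwargs)


# Hop 1: structured lookup. Hop 2: semantic lookup that uses the first result.
first = dispatch("sql_tool", json.dumps({"input": "city with the highest population"}))
second = dispatch("vector_tool", json.dumps({"input": f"arts and culture of {first}"}))
print(second)
```

In the real agent, the model decides after each tool result whether to call another tool or answer, so the second hop is chosen by the LLM rather than hardcoded as it is here.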

In [ ]:

response = agent.chat("Tell me about the history of Berlin")
print(str(response))

=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "Tell me about the history of Berlin"
}
Got output: Berlin's history dates back to the 15th century when it was established as the capital of the Margraviate of Brandenburg. The Hohenzollern family ruled Berlin until 1918, first as electors of Brandenburg, then as kings of Prussia, and eventually as German emperors. In 1443, Frederick II Irontooth started the construction of a new royal palace in the twin city Berlin-Cölln, which later became the permanent residence of the Brandenburg electors of the Hohenzollerns.

The Thirty Years' War between 1618 and 1648 devastated Berlin, with the city losing half of its population. Frederick William, known as the "Great Elector", initiated a policy of promoting immigration and religious tolerance. In 1701, the dual state of the Margraviate of Brandenburg and the Duchy of Prussia formed the Kingdom of Prussia, with Berlin as its capital. Under the rule of Frederick II, Berlin became a center of the Enlightenment.

The Industrial Revolution in the 19th century transformed Berlin, expanding its economy and population. In 1871, Berlin became the capital of the newly founded German Empire. The early 20th century saw Berlin as a fertile ground for the German Expressionist movement. At the end of the First World War in 1918, a republic was proclaimed, and in 1920, the Greater Berlin Act incorporated dozens of suburban cities, villages, and estates around Berlin.
========================

Out[ ]:

Response(response='Berlin\'s history dates back to the 15th century when it was established as the capital of the Margraviate of Brandenburg. The Hohenzollern family ruled Berlin until 1918, first as electors of Brandenburg, then as kings of Prussia, and eventually as German emperors. In 1443, Frederick II Irontooth started the construction of a new royal palace in the twin city Berlin-Cölln.\n\nThe Thirty Years\' War between 1618 and 1648 devastated Berlin, with the city losing half of its population. Frederick William, known as the "Great Elector", initiated a policy of promoting immigration and religious tolerance. In 1701, the dual state of the Margraviate of Brandenburg and the Duchy of Prussia formed the Kingdom of Prussia, with Berlin as its capital. Under the rule of Frederick II, Berlin became a center of the Enlightenment.\n\nThe Industrial Revolution in the 19th century transformed Berlin, expanding its economy and population. In 1871, Berlin became the capital of the newly founded German Empire. The early 20th century saw Berlin as a fertile ground for the German Expressionist movement. At the end of the First World War in 1918, a republic was proclaimed, and in 1920, the Greater Berlin Act incorporated dozens of suburban cities, villages, and estates around Berlin.', source_nodes=[], extra_info=None)

In [ ]:

response = agent.chat(
    "Can you give me the country corresponding to each city?"
)
print(str(response))

=== Calling Function ===
Calling function: sql_tool with args: {
  "input": "SELECT city, country FROM city_stats"
}
Got output:  The cities Toronto, Tokyo, and Berlin are located in the countries Canada, Japan, and Germany respectively.
========================

Out[ ]:

Response(response='Sure, here are the countries corresponding to each city:\n\n- Toronto is in Canada\n- Tokyo is in Japan\n- Berlin is in Germany', source_nodes=[], extra_info=None)