
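Building hotel room search with self-querying retrieval

In this example we'll walk through how to build and iterate on a hotel room search service that uses an LLM to generate structured filter queries, which are then passed to a vector store.

For an introduction to self-querying retrieval, see the documentation.

Imports and data prep

In this example we use ChatOpenAI as the model and ElasticsearchStore as the vector store, but these can be swapped for any LLM/ChatModel and any VectorStore that supports self-querying.

Download the data from: https://www.kaggle.com/datasets/keshavramaiah/hotel-recommendation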

In [ ]:

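# Install the required Python packages (langchain-openai provides ChatOpenAI and OpenAIEmbeddings)
!pip install langchain langchain-openai langchain-elasticsearch lark openai elasticsearch pandas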

In [1]:

# Import pandas under its conventional alias pd
import pandas as pd

In [2]:

# Read Hotel_details.csv, drop rows with a duplicate hotelid, and index by hotelid
details = (
    pd.read_csv("~/Downloads/archive/Hotel_details.csv")
    .drop_duplicates(subset="hotelid")
    .set_index("hotelid")
)

# Read Hotel_Room_attributes.csv, using the id column as the index
attributes = pd.read_csv(
    "~/Downloads/archive/Hotel_Room_attributes.csv", index_col="id"
)

# Read hotels_RoomPrice.csv, using the id column as the index
price = pd.read_csv("~/Downloads/archive/hotels_RoomPrice.csv", index_col="id")

In [3]:

# Deduplicate the price data on "refid", keeping the last row for each, and select the columns we need
latest_price = price.drop_duplicates(subset="refid", keep="last")[
    [
        "hotelcode",
        "roomtype",
        "onsiterate",
        "roomamenities",
        "maxoccupancy",
        "mealinclusiontype",
    ]
]

# Look up "ratedescription" in the attributes table by row id and add it as a column
latest_price["ratedescription"] = attributes.loc[latest_price.index]["ratedescription"]

# Join the hotel-level columns from details, matching hotelcode to hotelid
latest_price = latest_price.join(
    details[["hotelname", "city", "country", "starrating"]], on="hotelcode"
)

# Rename "ratedescription" to "roomdescription"
latest_price = latest_price.rename({"ratedescription": "roomdescription"}, axis=1)

# Add a boolean "mealsincluded" column: True when a meal inclusion type is present
latest_price["mealsincluded"] = ~latest_price["mealinclusiontype"].isnull()

# Drop the "hotelcode" and "mealinclusiontype" columns
latest_price.pop("hotelcode")
latest_price.pop("mealinclusiontype")

# Reset the index, discarding the old one
latest_price = latest_price.reset_index(drop=True)

# Show the first few rows of the processed data
latest_price.head()

Out[3]:

	roomtype	onsiterate	roomamenities	maxoccupancy	roomdescription	hotelname	city	country	starrating	mealsincluded
0	Vacation Home	636.09	Air conditioning: ;Closet: ;Fireplace: ;Free W...	4	Shower, Kitchenette, 2 bedrooms, 1 double bed ...	Pantlleni	Beddgelert	United Kingdom	3	False
1	Vacation Home	591.74	Air conditioning: ;Closet: ;Dishwasher: ;Firep...	4	Shower, Kitchenette, 2 bedrooms, 1 double bed ...	Willow Cottage	Beverley	United Kingdom	3	False
2	Guest room, Queen or Twin/Single Bed(s)	0.00	NaN	2	NaN	AC Hotel Manchester Salford Quays	Manchester	United Kingdom	4	False
3	Bargemaster King Accessible Room	379.08	Air conditioning: ;Free Wi-Fi in all rooms!: ;...	2	Shower	Lincoln Plaza London, Curio Collection by Hilton	London	United Kingdom	4	True
4	Twin Room	156.17	Additional toilet: ;Air conditioning: ;Blackou...	2	Room size: 15 m²/161 ft², Non-smoking, Shower,...	Ibis London Canning Town	London	United Kingdom	3	True

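Describe data attributes

We'll use a self-query retriever, which requires us to describe the metadata we can filter on.

Or if we're feeling lazy we can have a model write a draft of the descriptions for us :)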

In [4]:

# Import the ChatOpenAI chat model wrapper
from langchain_openai import ChatOpenAI

# Create the model, using "gpt-4"
model = ChatOpenAI(model="gpt-4")

# Ask the model for a JSON list describing each column of the table preview.
# invoke() returns a message object; .content holds the reply text.
res = model.invoke(
    "Below is a table with information about hotel rooms. "
    "Return a JSON list with an entry for each column. Each entry should have "
    '{"name": "column name", "description": "column description", "type": "column data type"}'
    f"\n\n{latest_price.head()}\n\nJSON:\n"
).content

In [5]:

import json

# Parse the model's reply into a Python list of dicts
attribute_info = json.loads(res)
# Display the parsed attribute descriptions
attribute_info

Out[5]:

[{'name': 'roomtype', 'description': 'The type of the room', 'type': 'string'},
 {'name': 'onsiterate',
  'description': 'The rate of the room',
  'type': 'float'},
 {'name': 'roomamenities',
  'description': 'Amenities available in the room',
  'type': 'string'},
 {'name': 'maxoccupancy',
  'description': 'Maximum number of people that can occupy the room',
  'type': 'integer'},
 {'name': 'roomdescription',
  'description': 'Description of the room',
  'type': 'string'},
 {'name': 'hotelname', 'description': 'Name of the hotel', 'type': 'string'},
 {'name': 'city',
  'description': 'City where the hotel is located',
  'type': 'string'},
 {'name': 'country',
  'description': 'Country where the hotel is located',
  'type': 'string'},
 {'name': 'starrating',
  'description': 'Star rating of the hotel',
  'type': 'integer'},
 {'name': 'mealsincluded',
  'description': 'Whether meals are included or not',
  'type': 'boolean'}]
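Note that json.loads raises an error if the model wraps its answer in a Markdown code fence or adds a sentence around the list, which chat models sometimes do. A minimal sketch of a more forgiving parser is shown below. `extract_json_list` is a hypothetical helper name, not part of LangChain, and it assumes the reply contains exactly one top-level JSON list with no other square brackets around it.

```python
import json


def extract_json_list(text: str) -> list:
    """Parse the JSON list embedded in a model reply.

    Hypothetical helper: assumes one JSON list, possibly surrounded
    by a Markdown fence or a short sentence without square brackets.
    """
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON list found in model output")
    parsed = json.loads(text[start : end + 1])
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON list")
    return parsed


# Build a sample reply wrapped in a Markdown fence.
fence = "`" * 3
payload = '[{"name": "city", "description": "City", "type": "string"}]'
reply = "Here is the list:\n" + fence + "json\n" + payload + "\n" + fence

columns = extract_json_list(reply)
print(columns[0]["name"])
```

In the notebook, `attribute_info = extract_json_list(res)` would take the place of the direct `json.loads(res)` call.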

For low cardinality features, let's include the valid values in the description.

In [6]:

# Count the unique values in each column and keep the columns with fewer than 40
latest_price.nunique()[latest_price.nunique() < 40]

Out[6]:

maxoccupancy     19
country          29
starrating        3
mealsincluded     2
dtype: int64

In [7]:

# Append the sorted unique 'starrating' values to the description of the second-to-last entry (starrating)
attribute_info[-2]["description"] += (
    f". Valid values are {sorted(latest_price['starrating'].value_counts().index.tolist())}"
)

# Append the sorted unique 'maxoccupancy' values to the description of the fourth entry (maxoccupancy)
attribute_info[3]["description"] += (
    f". Valid values are {sorted(latest_price['maxoccupancy'].value_counts().index.tolist())}"
)

# Append the sorted unique 'country' values to the description of the third-to-last entry (country)
attribute_info[-3]["description"] += (
    f". Valid values are {sorted(latest_price['country'].value_counts().index.tolist())}"
)

In [8]:

attribute_info

Out[8]:

[{'name': 'roomtype', 'description': 'The type of the room', 'type': 'string'},
 {'name': 'onsiterate',
  'description': 'The rate of the room',
  'type': 'float'},
 {'name': 'roomamenities',
  'description': 'Amenities available in the room',
  'type': 'string'},
 {'name': 'maxoccupancy',
  'description': 'Maximum number of people that can occupy the room. Valid values are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 24]',
  'type': 'integer'},
 {'name': 'roomdescription',
  'description': 'Description of the room',
  'type': 'string'},
 {'name': 'hotelname', 'description': 'Name of the hotel', 'type': 'string'},
 {'name': 'city',
  'description': 'City where the hotel is located',
  'type': 'string'},
 {'name': 'country',
  'description': "Country where the hotel is located. Valid values are ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom']",
  'type': 'string'},
 {'name': 'starrating',
  'description': 'Star rating of the hotel. Valid values are [2, 3, 4]',
  'type': 'integer'},
 {'name': 'mealsincluded',
  'description': 'Whether meals are included or not',
  'type': 'boolean'}]
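The positional indices used above (`attribute_info[-2]`, `attribute_info[3]`, `attribute_info[-3]`) only work while the model returns the columns in the same order. As a sketch of a less fragile alternative, the hypothetical helper `add_valid_values` below looks the entry up by name. It is shown on a toy attribute list so that it runs on its own.

```python
def add_valid_values(attribute_info, name, values):
    """Append the sorted valid values to the description of the named attribute."""
    for entry in attribute_info:
        if entry["name"] == name:
            entry["description"] += f". Valid values are {sorted(set(values))}"
            return entry
    raise KeyError(f"no attribute named {name!r}")


# Toy stand-in for the model-generated attribute list.
info = [
    {"name": "starrating", "description": "Star rating of the hotel", "type": "integer"},
    {"name": "mealsincluded", "description": "Whether meals are included or not", "type": "boolean"},
]

add_valid_values(info, "starrating", [4, 2, 3, 3])
print(info[0]["description"])
# Star rating of the hotel. Valid values are [2, 3, 4]
```

With the notebook's data, a call might look like `add_valid_values(attribute_info, "country", latest_price["country"].dropna().unique().tolist())`.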

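Creating a query constructor chain

Let's take a look at the chain that will convert natural language requests into structured queries.

To start, we can just load the prompt and see what it looks like.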

In [9]:

from langchain.chains.query_constructor.base import (
    get_query_constructor_prompt,
    load_query_constructor_runnable,
)

In [10]:

# Short description of what each document contains
doc_contents = "Detailed description of a hotel room"

# Build the query constructor prompt from the document description and the attribute info
prompt = get_query_constructor_prompt(doc_contents, attribute_info)

# Print the prompt, keeping "{query}" as a literal placeholder
print(prompt.format(query="{query}"))

Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling timestamp data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.

<< Example 1. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre

Structured Request:
```json
{
    "query": "teenager love",
    "filter": "and(or(eq(\"artist\", \"Taylor Swift\"), eq(\"artist\", \"Katy Perry\")), lt(\"length\", 180), eq(\"genre\", \"pop\"))"
}
```


<< Example 2. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs that were not published on Spotify

Structured Request:
```json
{
    "query": "",
    "filter": "NO_FILTER"
}
```


<< Example 3. >>
Data Source:
```json
{
    "content": "Detailed description of a hotel room",
    "attributes": {
    "roomtype": {
        "description": "The type of the room",
        "type": "string"
    },
    "onsiterate": {
        "description": "The rate of the room",
        "type": "float"
    },
    "roomamenities": {
        "description": "Amenities available in the room",
        "type": "string"
    },
    "maxoccupancy": {
        "description": "Maximum number of people that can occupy the room. Valid values are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 24]",
        "type": "integer"
    },
    "roomdescription": {
        "description": "Description of the room",
        "type": "string"
    },
    "hotelname": {
        "description": "Name of the hotel",
        "type": "string"
    },
    "city": {
        "description": "City where the hotel is located",
        "type": "string"
    },
    "country": {
        "description": "Country where the hotel is located. Valid values are ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom']",
        "type": "string"
    },
    "starrating": {
        "description": "Star rating of the hotel. Valid values are [2, 3, 4]",
        "type": "integer"
    },
    "mealsincluded": {
        "description": "Whether meals are included or not",
        "type": "boolean"
    }
}
}
```

User Query:
{query}

Structured Request:

In [11]:

# Build the query constructor chain, using ChatOpenAI with model="gpt-3.5-turbo" and temperature=0
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0), doc_contents, attribute_info
)

In [12]:

# Invoke the chain with a dict that holds the natural language query
chain.invoke({"query": "I want a hotel in Southern Europe and my budget is 200 bucks."})

Out[12]:

StructuredQuery(query='hotel', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Italy'), Comparison(comparator=<Comparator.LTE: 'lte'>, attribute='onsiterate', value=200)]), limit=None)
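The structured query above is what the parser produces from a filter string written in the grammar that the prompt describes. As a rough illustration of that grammar, the sketch below assembles such strings from `comp(attr, val)` and `op(statement1, ...)` pieces. The helpers `comp` and `op` are hypothetical names that only build strings. They are not LangChain functions, and the exact string the model emitted for this query is not shown in the notebook.

```python
import json

COMPARATORS = {"eq", "ne", "gt", "gte", "lt", "lte", "contain", "like", "in", "nin"}
OPERATORS = {"and", "or", "not"}


def comp(comparator: str, attr: str, val) -> str:
    """Render a comparison statement such as lt("length", 180)."""
    if comparator not in COMPARATORS:
        raise ValueError(f"unknown comparator: {comparator}")
    # json.dumps quotes strings and renders booleans as true/false.
    return f"{comparator}({json.dumps(attr)}, {json.dumps(val)})"


def op(operator: str, *statements: str) -> str:
    """Render a logical operation over one or more statements."""
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    if not statements:
        raise ValueError("at least one statement is required")
    return f"{operator}({', '.join(statements)})"


# Rebuild the filter from Example 1 of the prompt.
song_filter = op(
    "and",
    op("or", comp("eq", "artist", "Taylor Swift"), comp("eq", "artist", "Katy Perry")),
    comp("lt", "length", 180),
    comp("eq", "genre", "pop"),
)
print(song_filter)

# A filter string that would parse to the hotel query shown above.
hotel_filter = op("and", comp("eq", "country", "Italy"), comp("lte", "onsiterate", 200))
print(hotel_filter)
```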

In [13]:

# Invoke the chain with a second, more detailed query
chain.invoke(
    {
        "query": "Find a 2-person room in Vienna or London, preferably with meals included and AC"
    }
)

Out[13]:

StructuredQuery(query='2-person room', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='Vienna'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='London')]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='maxoccupancy', value=2), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='mealsincluded', value=True), Comparison(comparator=<Comparator.CONTAIN: 'contain'>, attribute='roomamenities', value='AC')]), limit=None)

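Refining attribute descriptions

We can see at least two issues above. First, when we ask for a Southern European destination we only get a filter for Italy. Second, when we ask for AC we get a literal string lookup for "AC" (which isn't so bad, but will miss things like "Air conditioning").

As a first step, let's try to update our description of the 'country' attribute to emphasize that equality should only be used when a specific country is mentioned.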

In [14]:

# Append a usage note to the description of the third-to-last entry (country)
attribute_info[-3]["description"] += (
    ". NOTE: Only use the 'eq' operator if a specific country is mentioned. If a region is mentioned, include all relevant countries in filter."
)

# Rebuild the query constructor chain so it picks up the updated description
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    doc_contents,
    attribute_info,
)

In [15]:

# Re-run the Southern Europe query against the updated chain
chain.invoke({"query": "I want a hotel in Southern Europe and my budget is 200 bucks."})

Out[15]:

StructuredQuery(query='hotel', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='mealsincluded', value=False), Comparison(comparator=<Comparator.LTE: 'lte'>, attribute='onsiterate', value=200), Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Italy'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Spain'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Greece'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Portugal'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Croatia'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Cyprus'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Malta'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Bulgaria'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Romania'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Slovenia'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Czech Republic'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Slovakia'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Hungary'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Poland'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Estonia'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Latvia'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Lithuania')])]), limit=None)

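Refining which attributes to filter on

This seems to have helped: the filter now spans several countries rather than just Italy, although it also pulls in some that are not in Southern Europe (for example Poland and Estonia) and adds a `mealsincluded` filter that was not requested. Now let's try to narrow the set of attributes we filter on. More free-form attributes can be left to the main query, which is better at capturing semantic meaning than a search for specific substrings.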

In [16]:

# Free-form attributes that should be matched through the query text instead of filters
content_attr = ["roomtype", "roomamenities", "roomdescription", "hotelname"]
# Updated description of the document contents
doc_contents = "A detailed description of a hotel room, including information about the room type and room amenities."
# Keep only the attributes that remain available for filtering
filter_attribute_info = tuple(
    ai for ai in attribute_info if ai["name"] not in content_attr
)
# Rebuild the chain with the reduced set of filterable attributes
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    doc_contents,
    filter_attribute_info,
)

In [17]:

# Re-run the Vienna/London query against the updated chain
chain.invoke(
    {
        "query": "Find a 2-person room in Vienna or London, preferably with meals included and AC"
    }
)

Out[17]:

StructuredQuery(query='2-person room', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='Vienna'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='London')]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='maxoccupancy', value=2), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='mealsincluded', value=True)]), limit=None)

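Adding examples specific to our use case

We've removed the strict filter for "AC", but it still isn't included in the query string. Our chain prompt is a few-shot prompt with some default examples. Let's see if adding use case-specific examples will help: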

In [18]:

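# Few-shot examples as (user query, structured request) pairs.
# Filters must use the real metadata attribute names (e.g. "onsiterate", "maxoccupancy").
examples = [
    (
        "I want a hotel in the Balkans with a king sized bed and a hot tub. Budget is $300 a night",
        {
            "query": "king-sized bed, hot tub",
            "filter": 'and(in("country", ["Bulgaria", "Greece", "Croatia", "Serbia"]), lte("onsiterate", 300))',
        },
    ),
    (
        "A room with breakfast included for 3 people, at a Hilton",
        {
            "query": "Hilton",
            "filter": 'and(eq("mealsincluded", true), gte("maxoccupancy", 3))',
        },
    ),
]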
prompt = get_query_constructor_prompt(
    doc_contents, filter_attribute_info, examples=examples
)
print(prompt.format(query="{query}"))

Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling timestamp data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.

<< Data Source >>
```json
{
    "content": "A detailed description of a hotel room, including information about the room type and room amenities.",
    "attributes": {
    "onsiterate": {
        "description": "The rate of the room",
        "type": "float"
    },
    "maxoccupancy": {
        "description": "Maximum number of people that can occupy the room. Valid values are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 24]",
        "type": "integer"
    },
    "city": {
        "description": "City where the hotel is located",
        "type": "string"
    },
    "country": {
        "description": "Country where the hotel is located. Valid values are ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom']. NOTE: Only use the 'eq' operator if a specific country is mentioned. If a region is mentioned, include all relevant countries in filter.",
        "type": "string"
    },
    "starrating": {
        "description": "Star rating of the hotel. Valid values are [2, 3, 4]",
        "type": "integer"
    },
    "mealsincluded": {
        "description": "Whether meals are included or not",
        "type": "boolean"
    }
}
}
```


<< Example 1. >>
User Query:
I want a hotel in the Balkans with a king sized bed and a hot tub. Budget is $300 a night

Structured Request:
```json
{
    "query": "king-sized bed, hot tub",
    "filter": "and(in(\"country\", [\"Bulgaria\", \"Greece\", \"Croatia\", \"Serbia\"]), lte(\"onsiterate\", 300))"
}
```


<< Example 2. >>
User Query:
A room with breakfast included for 3 people, at a Hilton

Structured Request:
```json
{
    "query": "Hilton",
    "filter": "and(eq(\"mealsincluded\", true), gte(\"maxoccupancy\", 3))"
}
```


<< Example 3. >>
User Query:
{query}

Structured Request:
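Because few-shot examples are copied into the prompt verbatim, an example filter that names a nonexistent attribute would teach the model the wrong schema. The sketch below is one way to check example filters before using them. `unknown_attributes` is a hypothetical helper, and the regular expression is a heuristic for the `comp("attr", ...)` form, not a full parser of the filter grammar.

```python
import re

# Matches a comparator call and captures its quoted attribute name.
ATTRIBUTE_PATTERN = re.compile(
    r'\b(?:eq|ne|gte|gt|lte|lt|contain|like|nin|in)\(\s*"([^"]+)"'
)


def unknown_attributes(filter_string: str, allowed: set) -> list:
    """Return attribute names used in filter_string that are not in allowed."""
    used = ATTRIBUTE_PATTERN.findall(filter_string)
    return sorted({name for name in used if name not in allowed})


allowed = {"onsiterate", "maxoccupancy", "city", "country", "starrating", "mealsincluded"}

good = 'and(in("country", ["Bulgaria", "Greece", "Croatia", "Serbia"]), lte("onsiterate", 300))'
bad = 'and(eq("starrating", 4), contain("description", "coast"))'

print(unknown_attributes(good, allowed))  # []
print(unknown_attributes(bad, allowed))   # ['description']
```

In the notebook, the allowed set could be derived from `filter_attribute_info`, for example `{ai["name"] for ai in filter_attribute_info}`.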

In [19]:

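# Rebuild the query constructor chain, this time with our own few-shot examples
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),  # gpt-3.5-turbo at temperature 0
    doc_contents,  # description of the document contents
    filter_attribute_info,  # attributes available for filtering
    examples=examples,  # use case-specific few-shot examples
)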

In [20]:

# Re-run the Vienna/London query with the new examples in the prompt
chain.invoke(
    {
        "query": "Find a 2-person room in Vienna or London, preferably with meals included and AC"
    }
)

Out[20]:

StructuredQuery(query='2-person room, meals included, AC', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='Vienna'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='London')]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='mealsincluded', value=True)]), limit=None)

This seems to have helped! Let's try another complex query:

In [21]:

# Invoke the chain with a query that mixes a rating, a location and amenities
chain.invoke(
    {
        "query": "I want to stay somewhere highly rated along the coast. I want a room with a patio and a fireplace."
    }
)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/langchain/libs/langchain/langchain/chains/query_constructor/base.py:53, in StructuredQueryOutputParser.parse(self, text)
     52 else:
---> 53     parsed["filter"] = self.ast_parse(parsed["filter"])
     54 if not parsed.get("limit"):

File ~/langchain/.venv/lib/python3.9/site-packages/lark/lark.py:652, in Lark.parse(self, text, start, on_error)
    635 """Parse the given text, according to the options provided.
    636 
    637 Parameters:
   (...)
    650 
    651 """
--> 652 return self.parser.parse(text, start=start, on_error=on_error)

File ~/langchain/.venv/lib/python3.9/site-packages/lark/parser_frontends.py:101, in ParsingFrontend.parse(self, text, start, on_error)
    100 stream = self._make_lexer_thread(text)
--> 101 return self.parser.parse(stream, chosen_start, **kw)

File ~/langchain/.venv/lib/python3.9/site-packages/lark/parsers/lalr_parser.py:41, in LALR_Parser.parse(self, lexer, start, on_error)
     40 try:
---> 41     return self.parser.parse(lexer, start)
     42 except UnexpectedInput as e:

File ~/langchain/.venv/lib/python3.9/site-packages/lark/parsers/lalr_parser.py:171, in _Parser.parse(self, lexer, start, value_stack, state_stack, start_interactive)
    170     return InteractiveParser(self, parser_state, parser_state.lexer)
--> 171 return self.parse_from_state(parser_state)

File ~/langchain/.venv/lib/python3.9/site-packages/lark/parsers/lalr_parser.py:184, in _Parser.parse_from_state(self, state, last_token)
    183 for token in state.lexer.lex(state):
--> 184     state.feed_token(token)
    186 end_token = Token.new_borrow_pos('$END', '', token) if token else Token('$END', '', 0, 1, 1)

File ~/langchain/.venv/lib/python3.9/site-packages/lark/parsers/lalr_parser.py:150, in ParserState.feed_token(self, token, is_end)
    148     s = []
--> 150 value = callbacks[rule](s)
    152 _action, new_state = states[state_stack[-1]][rule.origin.name]

File ~/langchain/.venv/lib/python3.9/site-packages/lark/parse_tree_builder.py:153, in ChildFilterLALR_NoPlaceholders.__call__(self, children)
    152         filtered.append(children[i])
--> 153 return self.node_builder(filtered)

File ~/langchain/.venv/lib/python3.9/site-packages/lark/parse_tree_builder.py:325, in apply_visit_wrapper.<locals>.f(children)
    323 @wraps(func)
    324 def f(children):
--> 325     return wrapper(func, name, children, None)

File ~/langchain/.venv/lib/python3.9/site-packages/lark/visitors.py:501, in _vargs_inline(f, _data, children, _meta)
    500 def _vargs_inline(f, _data, children, _meta):
--> 501     return f(*children)

File ~/langchain/.venv/lib/python3.9/site-packages/lark/visitors.py:479, in _VArgsWrapper.__call__(self, *args, **kwargs)
    478 def __call__(self, *args, **kwargs):
--> 479     return self.base_func(*args, **kwargs)

File ~/langchain/libs/langchain/langchain/chains/query_constructor/parser.py:79, in QueryTransformer.func_call(self, func_name, args)
     78 if self.allowed_attributes and args[0] not in self.allowed_attributes:
---> 79     raise ValueError(
     80         f"Received invalid attributes {args[0]}. Allowed attributes are "
     81         f"{self.allowed_attributes}"
     82     )
     83 return Comparison(comparator=func, attribute=args[0], value=args[1])

ValueError: Received invalid attributes description. Allowed attributes are ['onsiterate', 'maxoccupancy', 'city', 'country', 'starrating', 'mealsincluded']

During handling of the above exception, another exception occurred:

OutputParserException                     Traceback (most recent call last)
Cell In[21], line 1
----> 1 chain.invoke({"query": "I want to stay somewhere highly rated along the coast. I want a room with a patio and a fireplace."})

File ~/langchain/libs/langchain/langchain/schema/runnable/base.py:1113, in RunnableSequence.invoke(self, input, config)
   1111 try:
   1112     for i, step in enumerate(self.steps):
-> 1113         input = step.invoke(
   1114             input,
   1115             # mark each step as a child run
   1116             patch_config(
   1117                 config, callbacks=run_manager.get_child(f"seq:step:{i+1}")
   1118             ),
   1119         )
   1120 # finish the root run
   1121 except BaseException as e:

File ~/langchain/libs/langchain/langchain/schema/output_parser.py:173, in BaseOutputParser.invoke(self, input, config)
    169 def invoke(
    170     self, input: Union[str, BaseMessage], config: Optional[RunnableConfig] = None
    171 ) -> T:
    172     if isinstance(input, BaseMessage):
--> 173         return self._call_with_config(
    174             lambda inner_input: self.parse_result(
    175                 [ChatGeneration(message=inner_input)]
    176             ),
    177             input,
    178             config,
    179             run_type="parser",
    180         )
    181     else:
    182         return self._call_with_config(
    183             lambda inner_input: self.parse_result([Generation(text=inner_input)]),
    184             input,
    185             config,
    186             run_type="parser",
    187         )

File ~/langchain/libs/langchain/langchain/schema/runnable/base.py:633, in Runnable._call_with_config(self, func, input, config, run_type, **kwargs)
    626 run_manager = callback_manager.on_chain_start(
    627     dumpd(self),
    628     input,
    629     run_type=run_type,
    630     name=config.get("run_name"),
    631 )
    632 try:
--> 633     output = call_func_with_variable_args(
    634         func, input, run_manager, config, **kwargs
    635     )
    636 except BaseException as e:
    637     run_manager.on_chain_error(e)

File ~/langchain/libs/langchain/langchain/schema/runnable/config.py:173, in call_func_with_variable_args(func, input, run_manager, config, **kwargs)
    171 if accepts_run_manager(func):
    172     kwargs["run_manager"] = run_manager
--> 173 return func(input, **kwargs)

File ~/langchain/libs/langchain/langchain/schema/output_parser.py:174, in BaseOutputParser.invoke.<locals>.<lambda>(inner_input)
    169 def invoke(
    170     self, input: Union[str, BaseMessage], config: Optional[RunnableConfig] = None
    171 ) -> T:
    172     if isinstance(input, BaseMessage):
    173         return self._call_with_config(
--> 174             lambda inner_input: self.parse_result(
    175                 [ChatGeneration(message=inner_input)]
    176             ),
    177             input,
    178             config,
    179             run_type="parser",
    180         )
    181     else:
    182         return self._call_with_config(
    183             lambda inner_input: self.parse_result([Generation(text=inner_input)]),
    184             input,
    185             config,
    186             run_type="parser",
    187         )

File ~/langchain/libs/langchain/langchain/schema/output_parser.py:225, in BaseOutputParser.parse_result(self, result, partial)
    212 def parse_result(self, result: List[Generation], *, partial: bool = False) -> T:
    213     """Parse a list of candidate model Generations into a specific format.
    214 
    215     The return value is parsed from only the first Generation in the result, which
   (...)
    223         Structured output.
    224     """
--> 225     return self.parse(result[0].text)

File ~/langchain/libs/langchain/langchain/chains/query_constructor/base.py:60, in StructuredQueryOutputParser.parse(self, text)
     56     return StructuredQuery(
     57         **{k: v for k, v in parsed.items() if k in allowed_keys}
     58     )
     59 except Exception as e:
---> 60     raise OutputParserException(
     61         f"Parsing text\n{text}\n raised following error:\n{e}"
     62     )

OutputParserException: Parsing text
```json
{
    "query": "highly rated, coast, patio, fireplace",
    "filter": "and(eq(\"starrating\", 4), contain(\"description\", \"coast\"), contain(\"description\", \"patio\"), contain(\"description\", \"fireplace\"))"
}
```
 raised following error:
Received invalid attributes description. Allowed attributes are ['onsiterate', 'maxoccupancy', 'city', 'country', 'starrating', 'mealsincluded']

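Automatically ignoring invalid queries

It seems our model got tripped up on this more complex query and tried to filter on an attribute ('description') that doesn't exist. By setting fix_invalid=True in our query constructor chain, we can automatically remove any parts of the filter that are invalid (meaning they use disallowed operations, comparisons or attributes).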

In [22]:

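# ChatOpenAI comes from langchain_openai (the openai package does not export it)
from langchain_openai import ChatOpenAI

# Rebuild the query constructor chain with invalid filter parts dropped automatically
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),  # gpt-3.5-turbo at temperature 0
    doc_contents,  # description of the document contents
    filter_attribute_info,  # attributes available for filtering
    examples=examples,  # use case-specific few-shot examples
    fix_invalid=True,  # remove invalid operations, comparisons and attributes
)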

In [23]:

# Re-run the query that previously raised an OutputParserException
chain.invoke(
    {
        "query": "I want to stay somewhere highly rated along the coast. I want a room with a patio and a fireplace."
    }
)

Out[23]:

StructuredQuery(query='highly rated, coast, patio, fireplace', filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='starrating', value=4), limit=None)
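
The output above keeps only the `starrating` comparison: the three `contain` clauses on the unknown `description` attribute are gone, and the `and` that wrapped them no longer appears. A minimal sketch of this kind of pruning, using plain tuples instead of LangChain's own classes, might look like the following. The `prune` function and the tuple representation are hypothetical illustrations, not the library's implementation.

```python
# Hypothetical filter representation (not LangChain's classes):
#   comparison: ("cmp", comparator, attribute, value)
#   operation:  ("op", operator, [child filters])
ALLOWED_ATTRIBUTES = {
    "onsiterate", "maxoccupancy", "city", "country", "starrating", "mealsincluded",
}


def prune(node, allowed=ALLOWED_ATTRIBUTES):
    """Drop comparisons on unknown attributes and collapse trivial and/or nodes."""
    if node[0] == "cmp":
        _, _comparator, attribute, _value = node
        return node if attribute in allowed else None

    _, operator, children = node
    kept = [c for c in (prune(child, allowed) for child in children) if c is not None]
    if not kept:
        return None  # nothing valid is left under this operation
    if len(kept) == 1 and operator in ("and", "or"):
        return kept[0]  # and/or over a single clause is just that clause
    return ("op", operator, kept)


# The filter the model produced in the failing example, in tuple form
raw = ("op", "and", [
    ("cmp", "eq", "starrating", 4),
    ("cmp", "contain", "description", "coast"),
    ("cmp", "contain", "description", "patio"),
    ("cmp", "contain", "description", "fireplace"),
])

pruned = prune(raw)
print(pruned)  # ('cmp', 'eq', 'starrating', 4)
```

Note that pruning silently discards part of the user's intent (here, the patio and fireplace requirements no longer act as filters), so those terms only influence the results through the free-text `query`.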

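Using the self-query retriever¶

Now that our query construction chain is mostly in place, let's try using it with an actual retriever. In this example we'll use ElasticsearchStore.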

In [24]:

from langchain_elasticsearch import ElasticsearchStore  # vector store backed by Elasticsearch
from langchain_openai import OpenAIEmbeddings  # OpenAI embedding model wrapper

embeddings = OpenAIEmbeddings()  # embedding model used to index and query documents

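Populating the vector store¶

The first time you run this notebook, run the cell below to index the data. On later runs you can skip it and connect to the existing index in the following cell.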

In [25]:

import json

from langchain_core.documents import Document

# Collect one Document per room
docs = []

# Iterate over the rows of latest_price, replacing missing values with ""
for _, room in latest_price.fillna("").iterrows():
    # page_content is the row serialized as JSON (this text gets embedded);
    # metadata is the same row as a dict (these fields are used for filtering)
    doc = Document(
        page_content=json.dumps(room.to_dict(), indent=2),
        metadata=room.to_dict()
    )
    docs.append(doc)

# Embed the documents and store them in the "hotel_rooms" Elasticsearch index
vecstore = ElasticsearchStore.from_documents(
    docs,
    embeddings,
    es_url="http://localhost:9200",
    index_name="hotel_rooms",
    # strategy=ElasticsearchStore.ApproxRetrievalStrategy(
    #     hybrid=True,
    # )
)
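
To make the shape of each indexed document concrete, here is a standard-library sketch of what the loop above produces for a single row. The sample row is abbreviated from a room that appears in this notebook's search results, and `make_doc` is a hypothetical helper that returns a plain dict rather than a LangChain `Document`.

```python
import json

# Abbreviated sample row (values taken from a result shown in this notebook)
row = {
    "roomtype": "Three-Bedroom House With Sea View",
    "onsiterate": 341.75,
    "maxoccupancy": 6,
    "city": "Downings",
    "country": "Ireland",
    "starrating": 4,
    "mealsincluded": False,
}


def make_doc(room: dict) -> dict:
    """Hypothetical stand-in for Document(page_content=..., metadata=...)."""
    return {
        "page_content": json.dumps(room, indent=2),  # text that gets embedded
        "metadata": dict(room),  # structured fields available to filters
    }


doc = make_doc(row)
print(doc["page_content"])
print(doc["metadata"]["starrating"])
```

Storing the row twice is a deliberate trade-off in this setup: the JSON text gives the embedding model something to match free-text queries against, while the metadata copy keeps typed values such as `starrating` available for exact filters.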

In [26]:

# Connect to the existing "hotel_rooms" index
# embedding: the embedding model used for queries
# es_url: URL of the Elasticsearch instance
vecstore = ElasticsearchStore(
    "hotel_rooms",
    embedding=embeddings,
    es_url="http://localhost:9200",
    # strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=True) # does not seem to be available in the community edition
)

In [27]:

from langchain.retrievers import SelfQueryRetriever

retriever = SelfQueryRetriever(
    query_constructor=chain, vectorstore=vecstore, verbose=True
)

In [28]:

# Run the retriever on a natural-language query
results = retriever.invoke(
    "I want to stay somewhere highly rated along the coast. I want a room with a patio and a fireplace."
)

# Print each result's page content, followed by a separator line
for res in results:
    print(res.page_content)
    print("\n" + "-" * 20 + "\n")

{
  "roomtype": "Three-Bedroom House With Sea View",
  "onsiterate": 341.75,
  "roomamenities": "Additional bathroom: ;Additional toilet: ;Air conditioning: ;Closet: ;Clothes dryer: ;Coffee/tea maker: ;Dishwasher: ;DVD/CD player: ;Fireplace: ;Free Wi-Fi in all rooms!: ;Full kitchen: ;Hair dryer: ;Heating: ;High chair: ;In-room safe box: ;Ironing facilities: ;Kitchenware: ;Linens: ;Microwave: ;Private entrance: ;Refrigerator: ;Seating area: ;Separate dining area: ;Smoke detector: ;Sofa: ;Towels: ;TV [flat screen]: ;Washing machine: ;",
  "maxoccupancy": 6,
  "roomdescription": "Room size: 125 m\u00b2/1345 ft\u00b2, 2 bathrooms, Shower and bathtub, Shared bathroom, Kitchenette, 3 bedrooms, 1 double bed or 2 single beds or 1 double bed",
  "hotelname": "Downings Coastguard Cottages - Type B-E",
  "city": "Downings",
  "country": "Ireland",
  "starrating": 4,
  "mealsincluded": false
}

--------------------

{
  "roomtype": "Three-Bedroom House With Sea View",
  "onsiterate": 774.05,
  "roomamenities": "Additional bathroom: ;Additional toilet: ;Air conditioning: ;Closet: ;Clothes dryer: ;Coffee/tea maker: ;Dishwasher: ;DVD/CD player: ;Fireplace: ;Free Wi-Fi in all rooms!: ;Full kitchen: ;Hair dryer: ;Heating: ;High chair: ;In-room safe box: ;Ironing facilities: ;Kitchenware: ;Linens: ;Microwave: ;Private entrance: ;Refrigerator: ;Seating area: ;Separate dining area: ;Smoke detector: ;Sofa: ;Towels: ;TV [flat screen]: ;Washing machine: ;",
  "maxoccupancy": 6,
  "roomdescription": "Room size: 125 m\u00b2/1345 ft\u00b2, 2 bathrooms, Shower and bathtub, Shared bathroom, Kitchenette, 3 bedrooms, 1 double bed or 2 single beds or 1 double bed",
  "hotelname": "Downings Coastguard Cottages - Type B-E",
  "city": "Downings",
  "country": "Ireland",
  "starrating": 4,
  "mealsincluded": false
}

--------------------

{
  "roomtype": "Four-Bedroom Apartment with Sea View",
  "onsiterate": 501.24,
  "roomamenities": "Additional toilet: ;Air conditioning: ;Carpeting: ;Cleaning products: ;Closet: ;Clothes dryer: ;Clothes rack: ;Coffee/tea maker: ;Dishwasher: ;DVD/CD player: ;Fireplace: ;Free Wi-Fi in all rooms!: ;Full kitchen: ;Hair dryer: ;Heating: ;High chair: ;In-room safe box: ;Ironing facilities: ;Kitchenware: ;Linens: ;Microwave: ;Private entrance: ;Refrigerator: ;Seating area: ;Separate dining area: ;Smoke detector: ;Sofa: ;Toiletries: ;Towels: ;TV [flat screen]: ;Wake-up service: ;Washing machine: ;",
  "maxoccupancy": 9,
  "roomdescription": "Room size: 110 m\u00b2/1184 ft\u00b2, Balcony/terrace, Shower and bathtub, Kitchenette, 4 bedrooms, 1 single bed or 1 queen bed or 1 double bed or 2 single beds",
  "hotelname": "1 Elliot Terrace",
  "city": "Plymouth",
  "country": "United Kingdom",
  "starrating": 4,
  "mealsincluded": false
}

--------------------

{
  "roomtype": "Three-Bedroom Holiday Home with Terrace and Sea View",
  "onsiterate": 295.83,
  "roomamenities": "Air conditioning: ;Dishwasher: ;Free Wi-Fi in all rooms!: ;Full kitchen: ;Heating: ;In-room safe box: ;Kitchenware: ;Private entrance: ;Refrigerator: ;Satellite/cable channels: ;Seating area: ;Separate dining area: ;Sofa: ;Washing machine: ;",
  "maxoccupancy": 1,
  "roomdescription": "Room size: 157 m\u00b2/1690 ft\u00b2, Balcony/terrace, 3 bathrooms, Shower, Kitchenette, 3 bedrooms, 1 queen bed or 1 queen bed or 1 queen bed or 1 sofa bed",
  "hotelname": "Seaside holiday house Artatore (Losinj) - 17102",
  "city": "Mali Losinj",
  "country": "Croatia",
  "starrating": 4,
  "mealsincluded": false
}

--------------------
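
All four results have `starrating` 4, which is consistent with the structured filter. The coastal and fireplace details are not part of that filter, so any match on them has to come from the similarity search over `page_content`. Conceptually, retrieval combines a hard metadata filter with a similarity ranking. The sketch below imitates that with plain Python; the word-overlap `score` is a crude stand-in for embedding similarity, and `search` and the sample rooms are made up for illustration, not a description of how ElasticsearchStore works internally.

```python
import re

# Made-up sample rooms for illustration
rooms = [
    {"roomtype": "Three-Bedroom House With Sea View",
     "amenities": "Fireplace; Full kitchen", "starrating": 4},
    {"roomtype": "Standard Double Room",
     "amenities": "Air conditioning", "starrating": 4},
    {"roomtype": "Cabin With Fireplace And Patio",
     "amenities": "Fireplace; Patio", "starrating": 2},
]


def tokens(text: str) -> set:
    """Lowercase word set; a stand-in for a real text embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def score(query: str, room: dict) -> int:
    """Count words shared by the query and the room's text fields."""
    text = f"{room['roomtype']} {room['amenities']}"
    return len(tokens(query) & tokens(text))


def search(query: str, starrating: int, k: int = 2) -> list:
    """Apply the hard metadata filter first, then rank what is left."""
    candidates = [r for r in rooms if r["starrating"] == starrating]
    return sorted(candidates, key=lambda r: score(query, r), reverse=True)[:k]


hits = search("highly rated, coast, patio, fireplace", starrating=4)
for hit in hits:
    print(hit["roomtype"])
```

In this toy data the cabin matches the most query words but is excluded by the filter, which illustrates why a wrong or overly strict filter can hide otherwise relevant rooms.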

In [ ]: