This guide benchmarks the Retrieval tool in the OpenAI Assistants API using our OpenAIAssistantAgent.
We run it over the Llama 2 paper and compare generation quality against a naive RAG pipeline.
%pip install llama-index-readers-file pymupdf
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
!pip install llama-index
import nest_asyncio
nest_asyncio.apply()
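The steps below call the OpenAI API. If you haven't configured credentials yet, one common way (not shown in the original notebook) is to set them via an environment variable; the key value below is a placeholder.

import os

# Assumption: you manage OpenAI credentials via an environment variable.
# Replace the placeholder, or set OPENAI_API_KEY in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder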
Here we load the Llama 2 paper and chunk it up.
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
--2023-11-08 21:53:52--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’

data/llama2.pdf     100%[===================>]  13.03M   141KB/s    in 1m 48s

2023-11-08 21:55:42 (123 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]
from pathlib import Path
from llama_index.core import Document, VectorStoreIndex
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)
len(nodes)
89
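As an optional sanity check (not part of the original flow), you can peek at one of the parsed chunks:

# Optional: inspect the first ~300 characters of the first chunk.
print(nodes[0].get_content()[:300])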
We set up the evaluation modules, including the dataset and the evaluators.
Here we load a "golden" dataset.

NOTE: We pull the dataset from Dropbox. For details on how to generate a dataset like this, see our DatasetGenerator module.
!wget "https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1" -O data/llama2_eval_qr_dataset.json
--2023-11-08 22:20:10--  https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6057:18::a27d:d12, 162.125.13.18
Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6057:18::a27d:d12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1# [following]
--2023-11-08 22:20:11--  https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1
Resolving uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)... 2620:100:6057:15::a27d:d0f, 162.125.13.15
Connecting to uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)|2620:100:6057:15::a27d:d0f|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60656 (59K) [application/binary]
Saving to: ‘data/llama2_eval_qr_dataset.json’

data/llama2_eval_qr 100%[===================>]  59.23K  --.-KB/s    in 0.02s

2023-11-08 22:20:12 (2.87 MB/s) - ‘data/llama2_eval_qr_dataset.json’ saved [60656/60656]
from llama_index.core.evaluation import QueryResponseDataset

# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)
Alternatively, you can generate a new dataset from scratch. This lets you tweak our DatasetGenerator settings to make sure they suit your needs.
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI
# NOTE: run this if the dataset isn't already saved
# NOTE: we only generate from the first 20 nodes, since the rest are references
llm = OpenAI(model="gpt-4-1106-preview")
dataset_generator = DatasetGenerator(
    nodes[:20],
    llm=llm,
    show_progress=True,
    num_questions_per_chunk=3,
)
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")
# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)
We define two evaluation modules: correctness and semantic similarity. Both compare the quality of the predicted response against the reference (ground-truth) response.
from llama_index.core.evaluation.eval_utils import (
get_responses,
get_results_df,
)
from llama_index.core.evaluation import (
CorrectnessEvaluator,
SemanticSimilarityEvaluator,
BatchEvalRunner,
)
from llama_index.llms.openai import OpenAI
eval_llm = OpenAI(model="gpt-4-1106-preview")
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
evaluator_s = SemanticSimilarityEvaluator(llm=eval_llm)
evaluator_dict = {
"correctness": evaluator_c,
"semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
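For reference, the batch runner essentially fans out per-example calls like the single evaluation sketched below; the query/response/reference strings are placeholders for illustration only.

# Illustrative sketch of a single correctness evaluation (placeholder strings,
# not taken from the dataset).
sample_result = evaluator_c.evaluate(
    query="What is Llama 2?",
    response="Llama 2 is a collection of pretrained and fine-tuned LLMs.",
    reference="Llama 2 is a family of open large language models released by Meta.",
)
print(sample_result.score, sample_result.feedback)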
import numpy as np
import time
import os
import pickle
from tqdm import tqdm


def get_responses_sync(
    eval_qs, query_engine, show_progress=True, save_path=None
):
    if show_progress:
        eval_qs_iter = tqdm(eval_qs)
    else:
        eval_qs_iter = eval_qs
    pred_responses = []
    start_time = time.time()
    for eval_q in eval_qs_iter:
        print(f"eval q: {eval_q}")
        # query the passed-in query engine (the agent, in the synchronous case)
        pred_response = query_engine.query(eval_q)
        print(f"predicted response: {pred_response}")
        pred_responses.append(pred_response)
        if save_path is not None:
            # save intermediate responses (to cache in case something breaks)
            avg_time = (time.time() - start_time) / len(pred_responses)
            pickle.dump(
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            )
    return pred_responses


async def run_evals(
    query_engine,
    eval_qa_pairs,
    batch_runner,
    disable_async_for_preds=False,
    save_path=None,
):
    # then evaluate
    # TODO: evaluate a sample of generated results
    eval_qs = [q for q, _ in eval_qa_pairs]
    eval_answers = [a for _, a in eval_qa_pairs]

    if save_path is not None:
        if not os.path.exists(save_path):
            start_time = time.time()
            if disable_async_for_preds:
                pred_responses = get_responses_sync(
                    eval_qs,
                    query_engine,
                    show_progress=True,
                    save_path=save_path,
                )
            else:
                pred_responses = get_responses(
                    eval_qs, query_engine, show_progress=True
                )
            avg_time = (time.time() - start_time) / len(eval_qs)
            pickle.dump(
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            )
        else:
            # [optional] load
            pickled_dict = pickle.load(open(save_path, "rb"))
            pred_responses = pickled_dict["pred_responses"]
            avg_time = pickled_dict["avg_time"]
    else:
        start_time = time.time()
        pred_responses = get_responses(
            eval_qs, query_engine, show_progress=True
        )
        avg_time = (time.time() - start_time) / len(eval_qs)

    eval_results = await batch_runner.aevaluate_responses(
        eval_qs, responses=pred_responses, reference=eval_answers
    )
    return eval_results, {"avg_time": avg_time}
from llama_index.agent.openai import OpenAIAssistantAgent
agent = OpenAIAssistantAgent.from_new(
name="SEC Analyst",
instructions="You are a QA assistant designed to analyze sec filings.",
openai_tools=[{"type": "retrieval"}],
instructions_prefix="Please address the user as Jerry.",
files=["data/llama2.pdf"],
verbose=True,
)
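Note that from_new creates (and uploads the file to) a brand-new assistant every time it runs. If you already created one earlier, you can reattach to it instead of re-uploading; a minimal sketch, assuming you saved the assistant id from a previous run (the id below is a placeholder).

# NOTE: only run this if you already have an assistant id saved from a prior run.
# "asst_..." is a placeholder; substitute your own id.
existing_agent = OpenAIAssistantAgent.from_existing(
    assistant_id="asst_...",
    verbose=True,
)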
response = agent.query(
"What are the key differences between Llama 2 and Llama 2-Chat?"
)
print(str(response))
The key differences between Llama 2 and Llama 2-Chat, as indicated by the document, focus on their performance in safety evaluations, particularly when tested with adversarial prompts. Here are some of the differences highlighted within the safety evaluation section of Llama 2-Chat:

1. Safety Human Evaluation: Llama 2-Chat was assessed with roughly 2000 adversarial prompts, among which 1351 were single-turn and 623 were multi-turn. The responses were judged for safety violations on a five-point Likert scale, where a rating of 1 or 2 indicated a violation. The evaluation aimed to gauge the model’s safety by its rate of generating responses with safety violations and its helpfulness to users.

2. Violation Percentage and Mean Rating: Llama 2-Chat exhibited a low overall violation percentage across different model sizes and a high mean rating for safety and helpfulness, which suggests a strong performance in safety evaluations.

3. Inter-Rater Reliability: The reliability of the safety assessments was measured using Gwet’s AC1/2 statistic, showing a high degree of agreement among annotators with an average inter-rater reliability score of 0.92 for Llama 2-Chat annotations.

4. Single-turn and Multi-turn Conversations: The evaluation revealed that multi-turn conversations generally lead to more safety violations across models, but Llama 2-Chat performed well compared to baselines, particularly in multi-turn scenarios.

5. Violation Percentage Per Risk Category: Llama 2-Chat had a relatively higher number of violations in the unqualified advice category, possibly due to a lack of appropriate disclaimers in its responses.

6. Improvements in Fine-Tuned Llama 2-Chat: The document also mentions that the fine-tuned Llama 2-Chat showed significant improvement over the pre-trained Llama 2 in terms of truthfulness and toxicity. The percentage of toxic generations dropped to effectively 0% for Llama 2-Chat of all sizes, which was the lowest among all compared models, indicating a notable enhancement in safety.

These points detail the evaluations and improvements emphasizing safety that distinguish Llama 2-Chat from Llama 2【9†source】.
We run the agent over our evaluation dataset and benchmark it against a standard top-k RAG pipeline (k=2) powered by gpt-4-turbo.

NOTE: At the time of testing (November 2023), the Assistants API was heavily rate-limited; generating responses for 60+ data points could take around 1-2 hours.
llm = OpenAI(model="gpt-4-1106-preview")
base_index = VectorStoreIndex(nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=2, llm=llm)
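Optionally, you can spot-check the baseline pipeline on a single question before kicking off the full evaluation run (this check isn't part of the original flow):

# Optional spot check: ask the baseline RAG pipeline the same question we asked the agent.
sample_response = base_query_engine.query(
    "What are the key differences between Llama 2 and Llama 2-Chat?"
)
print(str(sample_response))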
This section runs the baseline evaluation so we have a reference point to compare against.
base_eval_results, base_extra_info = await run_evals(
base_query_engine,
eval_dataset.qr_pairs,
batch_runner,
save_path="data/llama2_preds_base.pkl",
)
results_df = get_results_df(
[base_eval_results],
["Base Query Engine"],
["correctness", "semantic_similarity"],
)
display(results_df)
|   | names             | correctness | semantic_similarity |
|---|-------------------|-------------|---------------------|
| 0 | Base Query Engine | 4.05        | 0.964245            |
assistant_eval_results, assistant_extra_info = await run_evals(
agent,
eval_dataset.qr_pairs[:55],
batch_runner,
save_path="data/llama2_preds_assistant.pkl",
disable_async_for_preds=True,
)
Here we see that... our base RAG pipeline performs better.

Take these numbers with a grain of salt. The goal here is to give you a script so you can run it on your own data.

That said, it's surprising that the Retrieval API doesn't give better out-of-the-box performance.
results_df = get_results_df(
[assistant_eval_results, base_eval_results],
["Retrieval API", "Base Query Engine"],
["correctness", "semantic_similarity"],
)
display(results_df)
print(f"Base Avg Time: {base_extra_info['avg_time']}")
print(f"Assistant Avg Time: {assistant_extra_info['avg_time']}")
|   | names             | correctness | semantic_similarity |
|---|-------------------|-------------|---------------------|
| 0 | Retrieval API     | 3.536364    | 0.952647            |
| 1 | Base Query Engine | 4.050000    | 0.964245            |
Base Avg Time: 0.25683316787083943
Assistant Avg Time: 75.43605598536405
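To put the latency gap in perspective, you can compute the ratio directly from the timings captured above (a small optional addition):

# Optional: relative per-query latency of the Assistant agent vs. the base query engine.
ratio = assistant_extra_info["avg_time"] / base_extra_info["avg_time"]
print(f"Retrieval API is ~{ratio:.0f}x slower per query than the base query engine")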