This guide benchmarks the Retrieval tool in the OpenAI Assistants API using our OpenAIAssistantAgent.
We run it over the Llama 2 paper and compare generation quality against a naive RAG pipeline.
%pip install llama-index-readers-file pymupdf
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
!pip install llama-index
import nest_asyncio
nest_asyncio.apply()
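The steps below call the OpenAI API. If you haven't configured credentials yet, one common way (not shown in the original notebook) is to set them via an environment variable; the key value below is a placeholder.

import os

# Assumption: you manage OpenAI credentials via an environment variable.
# Replace the placeholder, or set OPENAI_API_KEY in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder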
Here we load the Llama 2 paper and chunk it up.
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
--2023-11-08 21:53:52--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’

data/llama2.pdf     100%[===================>]  13.03M   141KB/s    in 1m 48s

2023-11-08 21:55:42 (123 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]
from pathlib import Path
from llama_index.core import Document, VectorStoreIndex
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)
len(nodes)
89
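As an optional sanity check (not part of the original flow), you can peek at one of the parsed chunks:

# Optional: inspect the first ~300 characters of the first chunk.
print(nodes[0].get_content()[:300])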
We set up the evaluation modules, including the dataset and the evaluators.
Here we load a "golden" dataset.

NOTE: We pull the dataset from Dropbox. For details on how to generate a dataset like this, see our DatasetGenerator module.
!wget "https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1" -O data/llama2_eval_qr_dataset.json
--2023-11-08 22:20:10--  https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6057:18::a27d:d12, 162.125.13.18
Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6057:18::a27d:d12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1# [following]
--2023-11-08 22:20:11--  https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1
Resolving uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)... 2620:100:6057:15::a27d:d0f, 162.125.13.15
Connecting to uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)|2620:100:6057:15::a27d:d0f|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60656 (59K) [application/binary]
Saving to: ‘data/llama2_eval_qr_dataset.json’

data/llama2_eval_qr 100%[===================>]  59.23K  --.-KB/s    in 0.02s

2023-11-08 22:20:12 (2.87 MB/s) - ‘data/llama2_eval_qr_dataset.json’ saved [60656/60656]
from llama_index.core.evaluation import QueryResponseDataset

# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)
Alternatively, you can generate a new dataset from scratch. This lets you tweak our DatasetGenerator settings to make sure they suit your needs.
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI
# NOTE: run this if the dataset isn't already saved
# NOTE: we only generate from the first 20 nodes, since the rest are references
llm = OpenAI(model="gpt-4-1106-preview")
dataset_generator = DatasetGenerator(
    nodes[:20],
    llm=llm,
    show_progress=True,
    num_questions_per_chunk=3,
)
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")
# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)
We define two evaluation modules: correctness and semantic similarity. Both compare the quality of the predicted response against the reference (ground-truth) response.
from llama_index.core.evaluation.eval_utils import (
get_responses,
get_results_df,
)
from llama_index.core.evaluation import (
CorrectnessEvaluator,
SemanticSimilarityEvaluator,
BatchEvalRunner,
)
from llama_index.llms.openai import OpenAI
eval_llm = OpenAI(model="gpt-4-1106-preview")
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
evaluator_s = SemanticSimilarityEvaluator(llm=eval_llm)
evaluator_dict = {
"correctness": evaluator_c,
"semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
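For reference, the batch runner essentially fans out per-example calls like the single evaluation sketched below; the query/response/reference strings are placeholders for illustration only.

# Illustrative sketch of a single correctness evaluation (placeholder strings,
# not taken from the dataset).
sample_result = evaluator_c.evaluate(
    query="What is Llama 2?",
    response="Llama 2 is a collection of pretrained and fine-tuned LLMs.",
    reference="Llama 2 is a family of open large language models released by Meta.",
)
print(sample_result.score, sample_result.feedback)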
import numpy as np
import time
import os
import pickle
from tqdm import tqdm


def get_responses_sync(
    eval_qs, query_engine, show_progress=True, save_path=None
):
    if show_progress:
        eval_qs_iter = tqdm(eval_qs)
    else:
        eval_qs_iter = eval_qs
    pred_responses = []
    start_time = time.time()
    for eval_q in eval_qs_iter:
        print(f"eval q: {eval_q}")
        # query the passed-in query engine (the agent, in the synchronous case)
        pred_response = query_engine.query(eval_q)
        print(f"predicted response: {pred_response}")
        pred_responses.append(pred_response)
        if save_path is not None:
            # save intermediate responses (to cache in case something breaks)
            avg_time = (time.time() - start_time) / len(pred_responses)
            pickle.dump(
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            )
    return pred_responses


async def run_evals(
    query_engine,
    eval_qa_pairs,
    batch_runner,
    disable_async_for_preds=False,
    save_path=None,
):
    # then evaluate
    # TODO: evaluate a sample of generated results
    eval_qs = [q for q, _ in eval_qa_pairs]
    eval_answers = [a for _, a in eval_qa_pairs]

    if save_path is not None:
        if not os.path.exists(save_path):
            start_time = time.time()
            if disable_async_for_preds:
                pred_responses = get_responses_sync(
                    eval_qs,
                    query_engine,
                    show_progress=True,
                    save_path=save_path,
                )
            else:
                pred_responses = get_responses(
                    eval_qs, query_engine, show_progress=True
                )
            avg_time = (time.time() - start_time) / len(eval_qs)
            pickle.dump(
                {"pred_responses": pred_responses, "avg_time": avg_time},
                open(save_path, "wb"),
            )
        else:
            # [optional] load
            pickled_dict = pickle.load(open(save_path, "rb"))
            pred_responses = pickled_dict["pred_responses"]
            avg_time = pickled_dict["avg_time"]
    else:
        start_time = time.time()
        pred_responses = get_responses(
            eval_qs, query_engine, show_progress=True
        )
        avg_time = (time.time() - start_time) / len(eval_qs)

    eval_results = await batch_runner.aevaluate_responses(
        eval_qs, responses=pred_responses, reference=eval_answers
    )
    return eval_results, {"avg_time": avg_time}
from llama_index.agent.openai import OpenAIAssistantAgent
agent = OpenAIAssistantAgent.from_new(
name="SEC Analyst",
instructions="You are a QA assistant designed to analyze sec filings.",
openai_tools=[{"type": "retrieval"}],
instructions_prefix="Please address the user as Jerry.",
files=["data/llama2.pdf"],
verbose=True,
)
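Note that from_new creates (and uploads the file to) a brand-new assistant every time it runs. If you already created one earlier, you can reattach to it instead of re-uploading; a minimal sketch, assuming you saved the assistant id from a previous run (the id below is a placeholder).

# NOTE: only run this if you already have an assistant id saved from a prior run.
# "asst_..." is a placeholder; substitute your own id.
existing_agent = OpenAIAssistantAgent.from_existing(
    assistant_id="asst_...",
    verbose=True,
)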
response = agent.query(
"What are the key differences between Llama 2 and Llama 2-Chat?"
)
print(str(response))
The key differences between Llama 2 and Llama 2-Chat, as indicated by the document, focus on their performance in safety evaluations, particularly when tested with adversarial prompts. Here are some of the differences highlighted within the safety evaluation section of Llama 2-Chat:

1. Safety Human Evaluation: Llama 2-Chat was assessed with roughly 2000 adversarial prompts, among which 1351 were single-turn and 623 were multi-turn. The responses were judged for safety violations on a five-point Likert scale, where a rating of 1 or 2 indicated a violation. The evaluation aimed to gauge the model’s safety by its rate of generating responses with safety violations and its helpfulness to users.

2. Violation Percentage and Mean Rating: Llama 2-Chat exhibited a low overall violation percentage across different model sizes and a high mean rating for safety and helpfulness, which suggests a strong performance in safety evaluations.

3. Inter-Rater Reliability: The reliability of the safety assessments was measured using Gwet’s AC1/2 statistic, showing a high degree of agreement among annotators with an average inter-rater reliability score of 0.92 for Llama 2-Chat annotations.

4. Single-turn and Multi-turn Conversations: The evaluation revealed that multi-turn conversations generally lead to more safety violations across models, but Llama 2-Chat performed well compared to baselines, particularly in multi-turn scenarios.

5. Violation Percentage Per Risk Category: Llama 2-Chat had a relatively higher number of violations in the unqualified advice category, possibly due to a lack of appropriate disclaimers in its responses.

6. Improvements in Fine-Tuned Llama 2-Chat: The document also mentions that the fine-tuned Llama 2-Chat showed significant improvement over the pre-trained Llama 2 in terms of truthfulness and toxicity. The percentage of toxic generations dropped to effectively 0% for Llama 2-Chat of all sizes, which was the lowest among all compared models, indicating a notable enhancement in safety.

These points detail the evaluations and improvements emphasizing safety that distinguish Llama 2-Chat from Llama 2【9†source】.
We run the agent over our evaluation dataset and benchmark it against a standard top-k RAG pipeline (k=2) powered by gpt-4-turbo.

NOTE: At the time of testing (November 2023), the Assistants API was heavily rate-limited; generating responses for 60+ data points could take around 1-2 hours.
llm = OpenAI(model="gpt-4-1106-preview")
base_index = VectorStoreIndex(nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=2, llm=llm)
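Optionally, you can spot-check the baseline pipeline on a single question before kicking off the full evaluation run (this check isn't part of the original flow):

# Optional spot check: ask the baseline RAG pipeline the same question we asked the agent.
sample_response = base_query_engine.query(
    "What are the key differences between Llama 2 and Llama 2-Chat?"
)
print(str(sample_response))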
This section runs the baseline evaluation so we have a reference point to compare against.
base_eval_results, base_extra_info = await run_evals(
base_query_engine,
eval_dataset.qr_pairs,
batch_runner,
save_path="data/llama2_preds_base.pkl",
)
results_df = get_results_df(
[base_eval_results],
["Base Query Engine"],
["correctness", "semantic_similarity"],
)
display(results_df)
|   | names             | correctness | semantic_similarity |
|---|-------------------|-------------|---------------------|
| 0 | Base Query Engine | 4.05        | 0.964245            |
assistant_eval_results, assistant_extra_info = await run_evals(
agent,
eval_dataset.qr_pairs[:55],
batch_runner,
save_path="data/llama2_preds_assistant.pkl",
disable_async_for_preds=True,
)
Here we see that... our base RAG pipeline performs better.

Take these numbers with a grain of salt. The goal here is to give you a script so you can run it on your own data.

That said, it's surprising that the Retrieval API doesn't give better out-of-the-box performance.
results_df = get_results_df(
[assistant_eval_results, base_eval_results],
["Retrieval API", "Base Query Engine"],
["correctness", "semantic_similarity"],
)
display(results_df)
print(f"Base Avg Time: {base_extra_info['avg_time']}")
print(f"Assistant Avg Time: {assistant_extra_info['avg_time']}")
|   | names             | correctness | semantic_similarity |
|---|-------------------|-------------|---------------------|
| 0 | Retrieval API     | 3.536364    | 0.952647            |
| 1 | Base Query Engine | 4.050000    | 0.964245            |
Base Avg Time: 0.25683316787083943
Assistant Avg Time: 75.43605598536405
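To put the latency gap in perspective, you can compute the ratio directly from the timings captured above (a small optional addition):

# Optional: relative per-query latency of the Assistant agent vs. the base query engine.
ratio = assistant_extra_info["avg_time"] / base_extra_info["avg_time"]
print(f"Retrieval API is ~{ratio:.0f}x slower per query than the base query engine")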