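This notebook shows how to implement a question-answering system using LangChain, Deep Lake as the vector store, and OpenAI embeddings. The steps are: load a Deep Lake text dataset, initialize a Deep Lake vector store through LangChain, add the texts to the vector store, and run queries against it.
You can also follow other tutorials, such as question answering over any type of data (PDF, json, csv, text) stored in Deep Lake: chatting with any data, code understanding, question answering over PDFs, or recommending songs.
Let's install the following packages.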
!pip install deeplake langchain openai tiktoken
Provide your OpenAI API key here:
import getpass
import os
os.environ['OPENAI_API_KEY'] = getpass.getpass()
In this example we will use a 20,000-sample subset of the cohere-wikipedia-22 dataset.
import deeplake
ds = deeplake.load("hub://activeloop/cohere-wikipedia-22-sample")
ds.summary()
Opening dataset in read-only mode as you don't have write permissions.
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/cohere-wikipedia-22-sample
hub://activeloop/cohere-wikipedia-22-sample loaded successfully.
Dataset(path='hub://activeloop/cohere-wikipedia-22-sample', read_only=True, tensors=['ids', 'metadata', 'text'])

  tensor     htype      shape      dtype  compression
  -------   -------    -------    -------  -------
   ids       text    (20000, 1)     str     None
 metadata    json    (20000, 1)     str     None
   text      text    (20000, 1)     str     None
Let's look at a few samples:
ds[:3].text.data()["value"]
['The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.', 'A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say "Tuesday at 24:00" and "Wednesday at 00:00" to mean exactly the same time.', 'However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say "23:59", which is one minute before midnight.']
Let's define a dataset_path, which is where your Deep Lake vector store will keep the text embeddings.
dataset_path = 'wikipedia-embeddings-deeplake'
We will set OpenAI's text-embedding-3-small as our embedding function and initialize a Deep Lake vector store at dataset_path...
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
embedding = OpenAIEmbeddings(model="text-embedding-3-small")
db = DeepLake(dataset_path, embedding=embedding, overwrite=True)
...and populate it with samples one batch at a time using the add_texts method.
from tqdm.auto import tqdm
batch_size = 100
nsamples = 10  # for testing; replace this with len(ds) to append everything
for i in tqdm(range(0, nsamples, batch_size)):
    # find the end of the batch
    i_end = min(nsamples, i + batch_size)
    batch = ds[i:i_end]
    id_batch = batch.ids.data()["value"]
    text_batch = batch.text.data()["value"]
    meta_batch = batch.metadata.data()["value"]
    db.add_texts(text_batch, metadatas=meta_batch, ids=id_batch)
  0%|          | 0/1 [00:00<?, ?it/s]
creating embeddings: 100%|██████████| 1/1 [00:02<00:00, 2.11s/it]
100%|██████████| 10/10 [00:00<00:00, 462.42it/s]
Dataset(path='wikipedia-embeddings-deeplake', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  -------
   text       text      (10, 1)      str     None
 metadata     json      (10, 1)      str     None
 embedding  embedding  (10, 1536)  float32   None
    id        text      (10, 1)      str     None
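The batching loop above walks the dataset in windows of batch_size and clamps the last window to nsamples. Here is a minimal standalone sketch of that index arithmetic. The helper name batch_bounds is hypothetical, introduced only for illustration, and is not part of Deep Lake or LangChain.

```python
def batch_bounds(nsamples, batch_size):
    """Yield (start, end) index pairs that cover range(nsamples) in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, nsamples, batch_size):
        # Clamp the end so the last batch never runs past nsamples.
        yield start, min(nsamples, start + batch_size)


# The notebook's test settings: 10 samples, batches of 100.
test_run = list(batch_bounds(10, 100))

# A larger run that ends with a partial batch.
partial_run = list(batch_bounds(250, 100))
```

With nsamples = 10 and batch_size = 100 this yields a single batch, which matches the one-step outer progress bar (0/1) in the output above.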
The underlying Deep Lake dataset object is accessible through db.vectorstore.dataset, and the data structure can be summarized with db.vectorstore.summary(), which shows 4 tensors with 10 samples:
db.vectorstore.summary()
Dataset(path='wikipedia-embeddings-deeplake', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  -------
   text       text      (10, 1)      str     None
 metadata     json      (10, 1)      str     None
 embedding  embedding  (10, 1536)  float32   None
    id        text      (10, 1)      str     None
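The shape and dtype of the embedding tensor in this summary give a rough idea of how the raw embedding data grows once nsamples is raised to the full dataset. The sketch below does that arithmetic. It assumes uncompressed float32 values (4 bytes each, consistent with the compression column showing None) and ignores any storage overhead Deep Lake adds, so treat the numbers as an approximation only.

```python
BYTES_PER_FLOAT32 = 4


def embedding_bytes(num_samples, dim=1536):
    """Raw size in bytes of a (num_samples, dim) float32 embedding tensor."""
    return num_samples * dim * BYTES_PER_FLOAT32


test_size = embedding_bytes(10)      # the 10-sample test run
full_size = embedding_bytes(20_000)  # all 20,000 samples in the subset

print(f"10 samples:     {test_size:,} bytes")
print(f"20,000 samples: {full_size:,} bytes ({full_size / 1e6:.1f} MB)")
```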
We will now set up question answering over our vector store with GPT-3.5-Turbo.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
# Re-load the vector store in case it's no longer initialized
# db = DeepLake(dataset_path=dataset_path, embedding=embedding)
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model='gpt-3.5-turbo'), chain_type="stuff", retriever=db.as_retriever())
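The chain_type="stuff" setting means that all retrieved documents are placed ("stuffed") into a single prompt together with the question. The sketch below illustrates the idea with plain strings. The function build_stuff_prompt and the prompt wording are hypothetical and are not LangChain's actual template; the two documents are shortened excerpts from the samples shown earlier.

```python
def build_stuff_prompt(question, documents, separator="\n\n"):
    """Join every retrieved document into one context block, then add the question."""
    context = separator.join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Shortened excerpts standing in for the retriever's results.
docs = [
    "However, the US military prefers not to say 24:00 - they do not like "
    "to have two names for the same thing.",
    "Under the 24-hour clock system, the day begins at midnight, 00:00.",
]
prompt = build_stuff_prompt("Why does the military not say 24:00?", docs)
print(prompt)
```

Because every document goes into one prompt, this approach only works while the combined text fits in the model's context window.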
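Let's try running a prompt and check the output. Internally, this API performs an embedding search to find the most relevant data and feeds it into the LLM context.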
query = 'Why does the military not say 24:00?'
qa.run(query)
'The military prefers not to say 24:00 because they do not like to have two names for the same thing. Instead, they always say "23:59", which is one minute before midnight.'
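The embedding search mentioned above can be pictured as ranking the stored vectors by their similarity to the query vector. Below is a minimal sketch using cosine similarity over toy 3-dimensional vectors. The vectors, the top_k helper, and the choice of cosine similarity are assumptions made for illustration; the search that Deep Lake actually runs may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors standing in for real 1536-dimensional embeddings.
stored = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.0, 0.1]
nearest = top_k(query_vec, stored, k=2)
```

In the real pipeline, the texts behind the top-ranked ids are what the retriever hands to the LLM as context.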
Et voilà!