本笔记本介绍了在本地使用llama-2与LlamaIndex的正确设置步骤。请注意,您需要一块性能良好的GPU来运行本笔记本,最好是具有至少40GB内存的A100型号。
具体来说,我们将研究如何使用向量存储索引。
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface
!pip install llama-index ipywidgets
重要提示:请使用具有访问lama2模型权限的帐户登录到HF hub,使用命令huggingface-cli login
在您的控制台中。更多详情,请参阅:https://ai.meta.com/resources/models-and-libraries/llama-downloads/%E3%80%82
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from IPython.display import Markdown, display
import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import PromptTemplate
# 模型名称(确保您在HF上有访问权限)
LLAMA2_7B = "meta-llama/Llama-2-7b-hf"
LLAMA2_7B_CHAT = "meta-llama/Llama-2-7b-chat-hf"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"
LLAMA2_70B = "meta-llama/Llama-2-70b-hf"
LLAMA2_70B_CHAT = "meta-llama/Llama-2-70b-chat-hf"
selected_model = LLAMA2_13B_CHAT
SYSTEM_PROMPT = """您是一个AI助手,以友好的方式回答问题,基于给定的源文件。以下是您始终遵循的一些规则:
- 生成可读的人类输出,避免生成无意义的文本。
- 仅生成请求的输出,不要在请求的输出之前或之后包含任何其他语言。
- 不要说谢谢,不要说你乐意帮助,不要说你是一个AI代理等。直接回答问题。
- 生成在北美商业文件中通常使用的专业语言。
- 不生成冒犯性或粗言秽语。
"""
query_wrapper_prompt = PromptTemplate(
"[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
llm = HuggingFaceLLM(
context_window=4096,
max_new_tokens=2048,
generate_kwargs={"temperature": 0.0, "do_sample": False},
query_wrapper_prompt=query_wrapper_prompt,
tokenizer_name=selected_model,
model_name=selected_model,
device_map="auto",
# 根据您的GPU更改以下设置
model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
下载数据
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
from llama_index.core import SimpleDirectoryReader
# 加载文档
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
# 将日志级别设置为DEBUG,以获得更详细的输出
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
import time
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("What happened at interleaf?")
start_time = time.time()
token_count = 0
for token in response.response_gen:
print(token, end="")
token_count += 1
time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed
print(f"\n\nStreamed output at {tokens_per_second} tokens/s")
At Interleaf, a group of people worked on projects for customers. One of the employees told the narrator about a new thing called HTML, which was a derivative of SGML. The narrator left Interleaf to pursue art school at RISD, but continued to do freelance work for the group. Eventually, the narrator and two of his friends, Robert and Trevor, started a new company called Viaweb to create a web app that allowed users to build stores through the browser. They opened for business in January 1996 with 6 stores. The software had three main parts: the editor, the shopping cart, and the manager. Streamed output at 26.923490295496002 tokens/s