
Local Llama2 + VectorStoreIndex

This notebook walks through the setup steps for running Llama 2 locally with LlamaIndex. Note that you need a capable GPU to run it, ideally an A100 with at least 40GB of memory.
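As a rough sanity check on the GPU recommendation, the sketch below estimates the memory needed just to hold the model weights. It assumes the nominal parameter counts in the model names (7B, 13B, 70B) and ignores activations and the KV cache, so real usage will be higher.

```python
# Rough estimate of the GPU memory needed just to hold model weights.
# Parameter counts are the nominal sizes from the model names; actual
# usage is higher once activations and the KV cache are included.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params_billion, dtype):
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, "float16")
    int8 = weight_memory_gb(size, "int8")
    print(f"Llama-2-{size}b: ~{fp16:.0f} GB in float16, ~{int8:.0f} GB in 8-bit")
```

By this estimate the 13B model needs about 26 GB for its weights in float16 and about 13 GB in 8-bit, which is consistent with recommending a 40GB card.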

Specifically, we look at how to use a vector store index.

Setup

In [ ]:

%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface

In [ ]:

!pip install llama-index ipywidgets

Initialization

Important: sign in to the HF Hub with an account that has access to the Llama 2 models by running huggingface-cli login in your console. For more details, see: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
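If a console login is inconvenient (for example on a remote notebook server), one possible alternative is to log in from Python. The sketch below assumes the `huggingface_hub` package is installed and that your access token is stored in an environment variable named `HF_TOKEN`; the helper name `login_from_env` is made up for this example.

```python
import os

try:
    from huggingface_hub import login
except ImportError:  # the package is optional for this sketch
    login = None


def login_from_env(var_name="HF_TOKEN"):
    """Log in to the HF Hub with a token read from an environment variable.

    Returns True if a login was attempted, False if the token or the
    huggingface_hub package is missing.
    """
    token = os.environ.get(var_name)
    if not token or login is None:
        return False
    login(token=token)
    return True
```

Calling `login_from_env()` in a cell before loading the model would then take the place of the console command.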

In [ ]:

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


from IPython.display import Markdown, display

In [ ]:

import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import PromptTemplate

# Model names (make sure you have access on HF)
LLAMA2_7B = "meta-llama/Llama-2-7b-hf"
LLAMA2_7B_CHAT = "meta-llama/Llama-2-7b-chat-hf"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"
LLAMA2_70B = "meta-llama/Llama-2-70b-hf"
LLAMA2_70B_CHAT = "meta-llama/Llama-2-70b-chat-hf"

selected_model = LLAMA2_13B_CHAT

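SYSTEM_PROMPT = """You are an AI assistant that answers questions in a friendly manner, based on the given source documents. Here are some rules you always follow:
- Generate human-readable output and avoid producing gibberish text.
- Generate only the requested output; do not include any other language before or after it.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""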

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="auto",
    # Change the settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)
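To see what the model actually receives, the sketch below expands the same wrapper template with plain `str.format`, used here as a stand-in for `PromptTemplate`. The shortened system prompt is a placeholder, not the full `SYSTEM_PROMPT` from the cell above.

```python
# Sketch of how the query wrapper template expands into a Llama 2 chat prompt.
# A shortened placeholder system prompt is used instead of the full one.
system_prompt = (
    "You are an AI assistant that answers questions based on the given "
    "source documents.\n"
)

template = "[INST]<<SYS>>\n" + system_prompt + "<</SYS>>\n\n{query_str}[/INST] "


def wrap_query(query_str):
    """Fill the {query_str} placeholder with the user's question."""
    return template.format(query_str=query_str)


print(wrap_query("What did the author do growing up?"))
```

The system prompt sits between the `<<SYS>>` markers and the question comes right before `[/INST]`, which is the layout the Llama 2 chat models were trained on.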

In [ ]:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [ ]:

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

Download Data

In [ ]:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

In [ ]:

from llama_index.core import SimpleDirectoryReader

# Load the documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

In [ ]:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
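Conceptually, a vector store index embeds each text chunk, stores the vectors, and at query time returns the chunks whose vectors are closest to the query's. The toy sketch below illustrates that idea only: `toy_embed` is a bag-of-words stand-in for a real embedding model such as bge-small, `ToyVectorIndex` is a hypothetical class rather than LlamaIndex's implementation, and the chunks are hand-written example sentences, not the essay text.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs and retrieves by cosine similarity."""

    def __init__(self, chunks):
        self.entries = [(chunk, toy_embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=2):
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


# Hand-written example chunks for illustration only.
chunks = [
    "I wrote short stories and programmed on an IBM 1401.",
    "We started a company called Viaweb to build online stores.",
    "I studied painting at RISD.",
]
toy_index = ToyVectorIndex(chunks)
print(toy_index.retrieve("Who programmed on the IBM 1401?", top_k=1))
```

A real embedding model captures meaning rather than word overlap, but the store-then-rank structure is the same.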

Querying

In [ ]:

# Set the logging level to DEBUG for more detailed output
query_engine = index.as_query_engine()

In [ ]:

response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

Growing up, the author wrote short stories, programmed on an IBM 1401, and eventually convinced his father to buy him a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He studied philosophy in college, but eventually switched to AI. He wrote essays and published them online, and worked on spam filters and painting. He also hosted dinners for a group of friends every Thursday night and bought a building in Cambridge.
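Behind a call like the one above, a query engine typically retrieves the most relevant chunks and places them in a prompt together with the question before calling the LLM. The sketch below shows that assembly step. The template wording and the function name `build_qa_prompt` are illustrative assumptions, not necessarily what LlamaIndex uses internally.

```python
def build_qa_prompt(context_chunks, query_str):
    """Combine retrieved chunks and the question into one prompt string."""
    context_str = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context_str}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        "answer the query.\n"
        f"Query: {query_str}\n"
        "Answer: "
    )


# Placeholder chunks standing in for retrieved passages.
prompt = build_qa_prompt(
    ["First retrieved passage.", "Second retrieved passage."],
    "What did the author do growing up?",
)
print(prompt)
```

In the notebook's setup, a prompt of this kind would then be wrapped in the `[INST]` template configured earlier before reaching the model.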

Streaming Support

In [ ]:

import time

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("What happened at interleaf?")

start_time = time.time()

token_count = 0
for token in response.response_gen:
    print(token, end="")
    token_count += 1

time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed

print(f"\n\nStreamed output at {tokens_per_second} tokens/s")

At Interleaf, a group of people worked on projects for customers. One of the employees told the narrator about a new thing called HTML, which was a derivative of SGML. The narrator left Interleaf to pursue art school at RISD, but continued to do freelance work for the group. Eventually, the narrator and two of his friends, Robert and Trevor, started a new company called Viaweb to create a web app that allowed users to build stores through the browser. They opened for business in January 1996 with 6 stores. The software had three main parts: the editor, the shopping cart, and the manager.

Streamed output at 26.923490295496002 tokens/s
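The timing loop above counts the items yielded by the stream, which may not correspond exactly to model tokens, so the reported figure is best read as chunks per second. The sketch below packages the same measurement as a reusable function. `fake_token_stream` is a hypothetical stand-in for `response.response_gen` so the example runs without a model.

```python
import time


def fake_token_stream(text, delay=0.001):
    """Hypothetical stand-in for a streaming response: yields text pieces."""
    for piece in text.split(" "):
        time.sleep(delay)  # simulate generation latency
        yield piece + " "


def measure_stream(stream):
    """Consume a stream and return (text, chunk_count, chunks_per_second)."""
    start = time.perf_counter()
    pieces = []
    for piece in stream:
        pieces.append(piece)
    elapsed = time.perf_counter() - start
    rate = len(pieces) / elapsed if elapsed > 0 else float("inf")
    return "".join(pieces), len(pieces), rate


text, count, rate = measure_stream(fake_token_stream("a b c d e"))
print(f"{count} chunks at {rate:.1f} chunks/s")
```

`time.perf_counter` is used instead of `time.time` because it is monotonic and intended for measuring short intervals.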