Chroma is an AI-native open-source vector database focused on developer productivity and happiness. It is licensed under Apache 2.0.
Chroma is fully typed, fully tested, and fully documented.
Install Chroma with:
pip install chromadb
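Chroma can run in several modes. See the examples below, each integrated with LlamaIndex.
- in-memory: in a Python script or Jupyter notebook
- in-memory with persistence: in a script or notebook, saving to and loading from disk
- in a Docker container: as a server running on your local machine or in the cloud
Like any other database, you can: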
.add
.get
.update
.upsert
.delete
.peek
.query
You can also run similarity search. See the Chroma documentation for full details.
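To make "similarity search" concrete, here is a minimal sketch of what such a query computes, assuming cosine similarity over small toy vectors. The names `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical and are not part of Chroma's API; Chroma's own index structure and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings keyed by id (real embeddings have hundreds of dimensions).
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

nearest = top_k([1.0, 0.0, 0.1], toy_vectors, k=2)
print(nearest)
```

This brute-force scan compares the query against every stored vector, which is fine for a handful of items but is not how a production vector database would typically serve large collections.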
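In this basic example, we take Paul Graham's essay, split it into chunks, embed the chunks with an open-source embedding model, load them into Chroma, and then query them.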
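The splitting step can be pictured with a small sketch. This is only a simplified word-window splitter, assuming whitespace-separated words and a fixed overlap; the function `split_into_chunks` and its parameters are hypothetical, and LlamaIndex applies its own splitting logic when it builds the index.

```python
def split_into_chunks(text, chunk_size=5, overlap=2):
    """Split text into windows of chunk_size words that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight"
chunks = split_into_chunks(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Overlapping windows keep some shared context between neighbouring chunks, so a sentence that straddles a boundary is still partly visible in both.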
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-chroma
%pip install llama-index-embeddings-huggingface
!pip install llama-index
# !pip install llama-index chromadb --quiet
# !pip install chromadb
# !pip install sentence-transformers
# !pip install pydantic==1.10.11
# import
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from IPython.display import Markdown, display
import chromadb
# set up OpenAI
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")
# define embedding function
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)
# query data
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
The author worked on writing and programming growing up. They wrote short stories and tried writing programs on an IBM 1401 computer. Later, they got a microcomputer and started programming more extensively.
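Extending the previous example, if you want to save to disk, simply initialize the Chroma client with the directory where you want the data saved.
Note: Chroma makes a best effort to save data to disk automatically, but multiple in-memory clients can interfere with each other's work. As a best practice, have only one client per path running at any given time.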
# save to disk
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)
# load from disk
db2 = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db2.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embed_model,
)
# query data from the persisted index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
The author worked on writing and programming growing up. They wrote short stories and tried writing programs on an IBM 1401 computer. Later, they got a microcomputer and started programming games and a word processor.
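You can also run the Chroma server in a separate Docker container, create a client to connect to it, and then pass that client to LlamaIndex.
Here is how to clone, build, and run the Docker image: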
git clone git@github.com:chroma-core/chroma.git
cd chroma
docker-compose up -d --build
# create the chroma client and add our data
import chromadb
remote_db = chromadb.HttpClient()
chroma_collection = remote_db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)
# query data from the Chroma Docker index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
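In a real application, you will want to update and delete data as well as add it.
Chroma has users provide ids to simplify the bookkeeping here. An id can be the name of the file, or a combined hash such as filename_paragraphNumber.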
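One way to build such ids is sketched below, assuming each chunk is identified by its source file name and paragraph index. The helpers `make_chunk_id` and `make_hashed_id` are hypothetical names introduced for illustration; they are not provided by Chroma or LlamaIndex.

```python
import hashlib


def make_chunk_id(filename, paragraph_number):
    """Readable id of the form filename_paragraphNumber."""
    return f"{filename}_{paragraph_number}"


def make_hashed_id(filename, paragraph_number):
    """Fixed-length variant: SHA-256 hex digest of the readable id."""
    readable = make_chunk_id(filename, paragraph_number)
    return hashlib.sha256(readable.encode("utf-8")).hexdigest()


# Hypothetical paragraphs taken from one source file.
paragraphs = ["First paragraph.", "Second paragraph."]
ids = [make_chunk_id("paul_graham_essay.txt", n) for n, _ in enumerate(paragraphs)]
print(ids)
```

Because both helpers are deterministic, re-processing the same file yields the same ids, which is what makes later update and delete calls easy to target.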
Here is a basic example showing how to perform these operations:
# fetch one record, merge a new field into its metadata, and write it back
doc_to_update = chroma_collection.get(limit=1)
doc_to_update["metadatas"][0] = {
    **doc_to_update["metadatas"][0],
    **{"author": "Paul Graham"},
}
chroma_collection.update(
    ids=[doc_to_update["ids"][0]], metadatas=[doc_to_update["metadatas"][0]]
)
updated_doc = chroma_collection.get(limit=1)
print(updated_doc["metadatas"][0])
# delete the record that was just updated
print("count before", chroma_collection.count())
chroma_collection.delete(ids=[doc_to_update["ids"][0]])
print("count after", chroma_collection.count())
{'_node_content': '{"id_": "be08c8bc-f43e-4a71-ba64-e525921a8319", "embedding": null, "metadata": {}, "excluded_embed_metadata_keys": [], "excluded_llm_metadata_keys": [], "relationships": {"1": {"node_id": "2cbecdbb-0840-48b2-8151-00119da0995b", "node_type": null, "metadata": {}, "hash": "4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35"}, "3": {"node_id": "6a75604a-fa76-4193-8f52-c72a7b18b154", "node_type": null, "metadata": {}, "hash": "d6c408ee1fbca650fb669214e6f32ffe363b658201d31c204e85a72edb71772f"}}, "hash": "b4d0b960aa09e693f9dc0d50ef46a3d0bf5a8fb3ac9f3e4bcf438e326d17e0d8", "text": "", "start_char_idx": 0, "end_char_idx": 4050, "text_template": "{metadata_str}\\n\\n{content}", "metadata_template": "{key}: {value}", "metadata_seperator": "\\n"}', 'author': 'Paul Graham', 'doc_id': '2cbecdbb-0840-48b2-8151-00119da0995b', 'document_id': '2cbecdbb-0840-48b2-8151-00119da0995b', 'ref_doc_id': '2cbecdbb-0840-48b2-8151-00119da0995b'}
count before 20
count after 19
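In the printed metadata, the `_node_content` value is a JSON-encoded string rather than a nested dictionary. The sketch below decodes it with the standard library, using a shortened, hypothetical copy of that metadata; the exact layout of `_node_content` is assumed here from the output shown and may vary between LlamaIndex versions.

```python
import json

# Shortened, hypothetical copy of the metadata shape printed above.
metadata = {
    "_node_content": (
        '{"id_": "be08c8bc-f43e-4a71-ba64-e525921a8319", "metadata": {}, '
        '"start_char_idx": 0, "end_char_idx": 4050}'
    ),
    "author": "Paul Graham",
    "doc_id": "2cbecdbb-0840-48b2-8151-00119da0995b",
}

# Decode the serialized node so its fields can be read as ordinary Python values.
node_content = json.loads(metadata["_node_content"])
print(node_content["id_"], node_content["end_char_idx"])
```

The top-level keys such as `author` stay plain strings, so they can be read directly without decoding.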