本笔记涵盖了如何使用LangChain的SQLDatabase包连接Databricks运行时和Databricks SQL。它分为三个部分:安装和设置、连接到Databricks以及示例。
# 安装databricks-sql-connector库
!pip install databricks-sql-connector
您可以使用SQLDatabase.from_databricks()
方法连接到Databricks运行时和Databricks SQL。
SQLDatabase.from_databricks(
catalog: str,
schema: str,
host: Optional[str] = None,
api_token: Optional[str] = None,
warehouse_id: Optional[str] = None,
cluster_id: Optional[str] = None,
engine_args: Optional[dict] = None,
**kwargs: Any)
catalog
: Databricks数据库中的目录名称。schema
: 目录中的模式名称。以下参数是可选的。在Databricks笔记本中执行该方法时,在大多数情况下不需要提供它们。
host
: Databricks工作区主机名,不包括'https://%27%E9%83%A8%E5%88%86%E3%80%82%E9%BB%98%E8%AE%A4%E4%B8%BA%27DATABRICKS_HOST%27%E7%8E%AF%E5%A2%83%E5%8F%98%E9%87%8F%E6%88%96%E5%BD%93%E5%89%8D%E5%B7%A5%E4%BD%9C%E5%8C%BA%EF%BC%88%E5%A6%82%E6%9E%9C%E5%9C%A8Databricks%E7%AC%94%E8%AE%B0%E6%9C%AC%E4%B8%AD%EF%BC%89%E3%80%82api_token
: 用于访问Databricks SQL仓库或集群的Databricks个人访问令牌。默认为'DATABRICKS_TOKEN'环境变量,如果在Databricks笔记本中,则生成一个临时令牌。warehouse_id
: Databricks SQL中的仓库ID。cluster_id
: Databricks运行时中的集群ID。如果在Databricks笔记本中运行,并且'warehouse_id'和'cluster_id'都为None,则使用附加到笔记本的集群的ID。engine_args
: 连接Databricks时要使用的参数。**kwargs
: SQLDatabase.from_uri
方法的其他关键字参数。# 使用SQLDatabase包连接到Databricks
from langchain_community.utilities import SQLDatabase
# 创建SQLDatabase对象,并连接到Databricks
db = SQLDatabase.from_databricks(catalog="samples", schema="nyctaxi")
# 创建一个OpenAI Chat LLM包装器
from langchain_openai import ChatOpenAI
# 创建ChatOpenAI对象,并设置temperature为0,model_name为"gpt-4"
llm = ChatOpenAI(temperature=0, model_name="gpt-4")
from langchain_community.utilities import SQLDatabaseChain
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
# 调用db_chain的run函数,查询在午夜到早上6点之间开始的出租车行程的平均持续时间
db_chain.run(
"What is the average duration of taxi rides that start between midnight and 6am?"
)
> Entering new SQLDatabaseChain chain... What is the average duration of taxi rides that start between midnight and 6am? SQLQuery:SELECT AVG(UNIX_TIMESTAMP(tpep_dropoff_datetime) - UNIX_TIMESTAMP(tpep_pickup_datetime)) as avg_duration FROM trips WHERE HOUR(tpep_pickup_datetime) >= 0 AND HOUR(tpep_pickup_datetime) < 6 SQLResult: [(987.8122786304605,)] Answer:The average duration of taxi rides that start between midnight and 6am is 987.81 seconds. > Finished chain.
'The average duration of taxi rides that start between midnight and 6am is 987.81 seconds.'
from langchain.agents import create_sql_agent
from langchain_community.agent_toolkits import SQLDatabaseToolkit
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)
# 运行代理程序,并向其传递一个问题作为参数
agent.run("What is the longest trip distance and how long did it take?")
> Entering new AgentExecutor chain... Action: list_tables_sql_db Action Input: Observation: trips Thought:I should check the schema of the trips table to see if it has the necessary columns for trip distance and duration. Action: schema_sql_db Action Input: trips Observation: CREATE TABLE trips ( tpep_pickup_datetime TIMESTAMP, tpep_dropoff_datetime TIMESTAMP, trip_distance FLOAT, fare_amount FLOAT, pickup_zip INT, dropoff_zip INT ) USING DELTA /* 3 rows from trips table: tpep_pickup_datetime tpep_dropoff_datetime trip_distance fare_amount pickup_zip dropoff_zip 2016-02-14 16:52:13+00:00 2016-02-14 17:16:04+00:00 4.94 19.0 10282 10171 2016-02-04 18:44:19+00:00 2016-02-04 18:46:00+00:00 0.28 3.5 10110 10110 2016-02-17 17:13:57+00:00 2016-02-17 17:17:55+00:00 0.7 5.0 10103 10023 */ Thought:The trips table has the necessary columns for trip distance and duration. I will write a query to find the longest trip distance and its duration. Action: query_checker_sql_db Action Input: SELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1 Observation: SELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1 Thought:The query is correct. I will now execute it to find the longest trip distance and its duration. Action: query_sql_db Action Input: SELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1 Observation: [(30.6, '0 00:43:31.000000000')] Thought:I now know the final answer. Final Answer: The longest trip distance is 30.6 miles and it took 43 minutes and 31 seconds. > Finished chain.
'The longest trip distance is 30.6 miles and it took 43 minutes and 31 seconds.'