This notebook demonstrates how to deploy and serve DeepSeek's DeepSeek-R1-Distill-Qwen-32B model on Vast.ai's cloud GPU platform, and how to effectively parse its outputs using Langchain.
DeepSeek is an open-source language model series that aims to advance open-source AI. The R1-Distill-Qwen-32B model we'll be using is a distilled version of their larger models, optimized for efficiency while maintaining strong performance. It's particularly notable for its reasoning responses, which include separate 'thinking' and 'response' sections that we can parse programmatically.
Vast.ai provides a marketplace for renting GPU compute power, offering a cost-effective alternative to major cloud providers. It allows us to find and rent specific GPU configurations that match our model's requirements, making it ideal for serving AI models like DeepSeek.
In this notebook, we will:

- Install and configure the Vast AI CLI
- Search for a GPU offer that meets the model's requirements
- Deploy DeepSeek-R1-Distill-Qwen-32B behind vLLM's OpenAI-compatible server
- Verify the deployment with a test request
- Build a Langchain parser that separates the model's 'thinking' and 'response' sections
This setup provides a production-ready environment for serving the DeepSeek model, with the ability to handle reasoning outputs in a programmatic way.
First, we will install and set up the Vast AI CLI.
You can get your API key on the Account page of the Vast Console and set it below in `VAST_API_KEY`.
%%bash
#In an environment of your choice
pip install --upgrade vastai
%%bash
# Here we will set our api key
export VAST_API_KEY="" #Your key here
vastai set api-key $VAST_API_KEY
Now we are going to search for GPUs on Vast.ai to run the DeepSeek-R1-Distill-Qwen-32B model. This model requires specific hardware capabilities to run efficiently with vLLM's optimizations. Here are our requirements:
- A minimum of 80 GB of GPU RAM to accommodate the 32B-parameter weights plus vLLM's KV cache and activation overhead (a rough estimate is sketched just after this list)
- A single GPU, as DeepSeek-R1-Distill-Qwen-32B can be served efficiently on one GPU (multi-GPU configurations are supported if higher throughput is needed)
- A static IP address, so the API endpoint stays stable and clients can connect reliably
- At least one direct port that we can forward, so the vLLM server is reachable from outside the instance
- At least 120 GB of disk space to hold the model weights and anything else we might want to download
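As a back-of-the-envelope sanity check on the 80 GB figure, the sketch below assumes the weights are stored in BF16 (2 bytes per parameter); actual memory use also depends on context length, batch size, and vLLM's settings.

# Rough VRAM estimate (assumption: 32B parameters in BF16, 2 bytes each)
params_billions = 32
bytes_per_param = 2                              # BF16 / FP16
weights_gb = params_billions * bytes_per_param   # ~64 GB for the weights alone
print(f"Weights: ~{weights_gb} GB; the remaining headroom on an 80 GB card goes to the KV cache and activations.")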
%%bash
vastai search offers "compute_cap >= 750 \
gpu_ram >= 80 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 120 \
rentable = true"
Choose a machine and copy and paste its id below to set `INSTANCE_ID`.
We will deploy a template that:

- Uses the `vllm/vllm-openai:latest` docker image. This gives us an OpenAI-compatible server.
- Forwards port `8000` to the outside of the container, which is the default OpenAI server port.
- Passes `--model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --max-model-len 8192 --enforce-eager` on to the default entrypoint (the server itself).
- Uses `--tensor-parallel-size 1` by default.
- Uses `--gpu-memory-utilization 0.90` by default.

These settings balance performance and stability for serving the DeepSeek model.
%%bash
export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID --disk 120 --template_hash eda062b3e0c9c36f09d9d9a294405ded
Now, we need to verify that our setup is working. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done.
Next, go to the Instances tab in the Vast AI Console and find the instance you just created.
At the top of the instance, there is a button with an IP address on it. Click it, and a panel will appear showing the IP address and the forwarded ports. You should see something like:
Open Ports
XX.XX.XXX.XX:YYYY -> 8000/tcp
Copy and paste the IP address (XX.XX.XXX.XX) into `VAST_IP_ADDRESS` and the port (YYYY) into `VAST_PORT` as inputs to the curl command below.

This curl command sends an OpenAI-compatible request to your vLLM server. You should see a response if everything is set up correctly.

Note: It may take a few minutes for the OpenAI server to initialize.
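If you'd rather wait programmatically than watch the logs, the sketch below polls the server's `/v1/models` endpoint (exposed by vLLM's OpenAI-compatible server) until it responds, using only the Python standard library. Fill in the same IP address and port placeholders you'll use for the curl command.

# Poll the vLLM server until it's ready (sketch; standard library only).
import time
import urllib.request

VAST_IP_ADDRESS = "<your-ip-address>"
VAST_PORT = "<your-port-here>"
url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1/models"

for attempt in range(60):  # up to ~10 minutes at 10-second intervals
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            if resp.status == 200:
                print("Server is up:", resp.read().decode()[:200])
                break
    except OSError as exc:  # connection refused, timeouts, etc. while the server starts
        print(f"Not ready yet (attempt {attempt + 1}): {exc}")
    time.sleep(10)
else:
    print("Server did not come up in time; check the instance logs in the Vast console.")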
%%bash
export VAST_IP_ADDRESS="<your-ip-address>"
export VAST_PORT="<your-port-here>"
curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/v1/completions -H "Content-Type: application/json" -d '{"model" : "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "prompt": "Hello, how are you?", "max_tokens": 50}'
%%bash
pip install --upgrade langchain langchain-openai openai
DeepSeek's response contains two parts:

1. A 'thinking' (reasoning) section wrapped in `<think>` tags
2. The final response text that follows the closing `</think>` tag

Let's create a parser to handle this format.
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from typing import Optional, Tuple
from langchain.schema import BaseOutputParser
class R1OutputParser(BaseOutputParser[Tuple[Optional[str], str]]):
    """Parser for DeepSeek R1 model output that includes thinking and response sections."""

    def parse(self, text: str) -> Tuple[Optional[str], str]:
        """Parse the model output into thinking and response sections.

        Args:
            text: Raw text output from the model

        Returns:
            Tuple containing (thinking_text, response_text)
            - thinking_text will be None if no thinking section is found
        """
        if "</think>" in text:
            # Split on </think> tag
            parts = text.split("</think>")
            # Extract thinking text (remove <think> tag)
            thinking_text = parts[0].replace("<think>", "").strip()
            # Get response text
            response_text = parts[1].strip()
            return thinking_text, response_text

        # If no thinking tags found, return None for thinking and full text as response
        return None, text.strip()

    @property
    def _type(self) -> str:
        """Return type key for serialization."""
        return "r1_output_parser"
VAST_IP_ADDRESS="<your-ip-address>"
VAST_PORT="<your-port-here>"
openai_api_key = "EMPTY"
openai_api_base = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
model = ChatOpenAI(
    base_url=openai_api_base,
    api_key=openai_api_key,
    model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    max_tokens=8000,
    temperature=0.7,
)
# Create prompt template
prompt = ChatPromptTemplate.from_messages([
("user", "{input}")
])
# Create parser
parser = R1OutputParser()
# Create chain
chain = (
{"input": RunnablePassthrough()}
| prompt
| model
| parser
)
prompt_text = "Explain quantum computing to a curious 10-year-old who loves video games."
thinking, response = chain.invoke(prompt_text)
print("\nTHINKING:\n")
print(thinking)
print("\nRESPONSE:\n")
print(response)