This notebook demonstrates how to deploy and serve DeepSeek's DeepSeek-R1-Distill-Qwen-32B model on Vast.ai's cloud GPU platform, and how to effectively parse its outputs using Langchain.
DeepSeek is an open-source language model series that aims to advance open-source AI. The R1-Distill-Qwen-32B model we'll be using is a distilled version of their larger models, optimized for efficiency while maintaining strong performance. It's particularly notable for its reasoning responses, which include separate 'thinking' and 'response' sections that we can parse programmatically.
Vast.ai provides a marketplace for renting GPU compute power, offering a cost-effective alternative to major cloud providers. It allows us to find and rent specific GPU configurations that match our model's requirements, making it ideal for serving AI models like DeepSeek.
In this notebook, we will:

- Install and configure the Vast AI CLI
- Search for a GPU offer that meets the model's requirements
- Deploy DeepSeek-R1-Distill-Qwen-32B behind vLLM's OpenAI-compatible server
- Verify the deployment with a test request
- Build a Langchain parser that separates the model's 'thinking' and 'response' sections
This setup provides a production-ready environment for serving the DeepSeek model, with the ability to handle reasoning outputs in a programmatic way.
First, we will install and set up the Vast AI CLI.
You can get your API key on the Account page of the Vast Console and set it below in `VAST_API_KEY`.
%%bash
#In an environment of your choice
pip install --upgrade vastai
%%bash
# Here we will set our api key
export VAST_API_KEY="" #Your key here
vastai set api-key $VAST_API_KEY
Now we are going to search for GPUs on Vast.ai to run the DeepSeek-R1-Distill-Qwen-32B model. This model requires specific hardware capabilities to run efficiently with vLLM's optimizations. Here are our requirements:
- A minimum of 80 GB of GPU RAM to accommodate the 32B-parameter weights plus vLLM's KV cache and activation overhead (a rough estimate is sketched just after this list)
- A single GPU, as DeepSeek-R1-Distill-Qwen-32B can be served efficiently on one GPU (multi-GPU configurations are supported if higher throughput is needed)
- A static IP address, so the API endpoint stays stable and clients can connect reliably
- At least one direct port that we can forward, so the vLLM server is reachable from outside the instance
- At least 120 GB of disk space to hold the model weights and anything else we might want to download
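As a back-of-the-envelope sanity check on the 80 GB figure, the sketch below assumes the weights are stored in BF16 (2 bytes per parameter); actual memory use also depends on context length, batch size, and vLLM's settings.

# Rough VRAM estimate (assumption: 32B parameters in BF16, 2 bytes each)
params_billions = 32
bytes_per_param = 2                              # BF16 / FP16
weights_gb = params_billions * bytes_per_param   # ~64 GB for the weights alone
print(f"Weights: ~{weights_gb} GB; the remaining headroom on an 80 GB card goes to the KV cache and activations.")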
%%bash
vastai search offers "compute_cap >= 750 \
gpu_ram >= 80 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 120 \
rentable = true"
Choose a machine and copy and paste its id below to set `INSTANCE_ID`.
We will deploy a template that:

- Uses the `vllm/vllm-openai:latest` docker image. This gives us an OpenAI-compatible server.
- Forwards port `8000` to the outside of the container, which is the default OpenAI server port.
- Passes `--model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --max-model-len 8192 --enforce-eager` on to the default entrypoint (the server itself).
- Uses `--tensor-parallel-size 1` by default.
- Uses `--gpu-memory-utilization 0.90` by default.

These settings balance performance and stability for serving the DeepSeek model.
%%bash
export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID --disk 120 --template_hash eda062b3e0c9c36f09d9d9a294405ded
Now, we need to verify that our setup is working. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done.
Next, go to the Instances tab in the Vast AI Console and find the instance you just created.
At the top of the instance, there is a button with an IP address on it. Click it, and a panel will appear showing the IP address and the forwarded ports. You should see something like:
Open Ports
XX.XX.XXX.XX:YYYY -> 8000/tcp
Copy and paste the IP address (XX.XX.XXX.XX) into `VAST_IP_ADDRESS` and the port (YYYY) into `VAST_PORT` as inputs to the curl command below.

This curl command sends an OpenAI-compatible request to your vLLM server. You should see a response if everything is set up correctly.

Note: It may take a few minutes for the OpenAI server to initialize.
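If you'd rather wait programmatically than watch the logs, the sketch below polls the server's `/v1/models` endpoint (exposed by vLLM's OpenAI-compatible server) until it responds, using only the Python standard library. Fill in the same IP address and port placeholders you'll use for the curl command.

# Poll the vLLM server until it's ready (sketch; standard library only).
import time
import urllib.request

VAST_IP_ADDRESS = "<your-ip-address>"
VAST_PORT = "<your-port-here>"
url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1/models"

for attempt in range(60):  # up to ~10 minutes at 10-second intervals
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            if resp.status == 200:
                print("Server is up:", resp.read().decode()[:200])
                break
    except OSError as exc:  # connection refused, timeouts, etc. while the server starts
        print(f"Not ready yet (attempt {attempt + 1}): {exc}")
    time.sleep(10)
else:
    print("Server did not come up in time; check the instance logs in the Vast console.")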
%%bash
export VAST_IP_ADDRESS="<your-ip-address>"
export VAST_PORT="<your-port-here>"
curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/v1/completions -H "Content-Type: application/json" -d '{"model" : "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "prompt": "Hello, how are you?", "max_tokens": 50}'
%%bash
pip install --upgrade langchain langchain-openai openai
DeepSeek's response contains two parts:

1. A 'thinking' (reasoning) section wrapped in `<think>` tags
2. The final response text that follows the closing `</think>` tag

Let's create a parser to handle this format.
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from typing import Optional, Tuple
from langchain.schema import BaseOutputParser
class R1OutputParser(BaseOutputParser[Tuple[Optional[str], str]]):
    """Parser for DeepSeek R1 model output that includes thinking and response sections."""

    def parse(self, text: str) -> Tuple[Optional[str], str]:
        """Parse the model output into thinking and response sections.

        Args:
            text: Raw text output from the model

        Returns:
            Tuple containing (thinking_text, response_text)
            - thinking_text will be None if no thinking section is found
        """
        if "</think>" in text:
            # Split on </think> tag
            parts = text.split("</think>")
            # Extract thinking text (remove <think> tag)
            thinking_text = parts[0].replace("<think>", "").strip()
            # Get response text
            response_text = parts[1].strip()
            return thinking_text, response_text

        # If no thinking tags found, return None for thinking and full text as response
        return None, text.strip()

    @property
    def _type(self) -> str:
        """Return type key for serialization."""
        return "r1_output_parser"
VAST_IP_ADDRESS="<your-ip-address>"
VAST_PORT="<your-port-here>"
openai_api_key = "EMPTY"
openai_api_base = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
model = ChatOpenAI(
    base_url=openai_api_base,
    api_key=openai_api_key,
    model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    max_tokens=8000,
    temperature=0.7,
)
# Create prompt template
prompt = ChatPromptTemplate.from_messages([
("user", "{input}")
])
# Create parser
parser = R1OutputParser()
# Create chain
chain = (
{"input": RunnablePassthrough()}
| prompt
| model
| parser
)
prompt_text = "Explain quantum computing to a curious 10-year-old who loves video games."
thinking, response = chain.invoke(prompt_text)
print("\nTHINKING:\n")
print(thinking)
print("\nRESPONSE:\n")
print(response)