#!/usr/bin/env python
# coding: utf-8

# # Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM

# This notebook provides a step-by-step guide on how to optimize `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
#
# TensorRT-LLM supports both models:
# - `gpt-oss-20b`
# - `gpt-oss-120b`
#
# In this guide, we will run `gpt-oss-20b`. If you want to try the larger model or want more customization, refer to [this](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md) deployment guide.
#
# Note: Your input prompts should use the [harmony response](http://cookbook.openai.com/articles/openai-harmony) format for the model to work properly, though following this guide does not require it.

# #### Launch on NVIDIA Brev

# You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.
#
# Once deployed, click on the "Open Notebook" button to get started with this guide.
#
# [![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-30i1YjHsRWT109HL6eYxLUeHIwF)

# ## Prerequisites

# ### Hardware

# To run the `gpt-oss-20b` model, you will need an NVIDIA GPU with at least 20 GB of VRAM.
#
# Recommended GPUs: NVIDIA Hopper (e.g., H100, H200), NVIDIA Blackwell (e.g., B100, B200), NVIDIA RTX PRO, NVIDIA RTX 50 Series (e.g., RTX 5090).
#
# ### Software
# - CUDA Toolkit 12.8 or later
# - Python 3.12 or later

# ## Installing TensorRT-LLM
#
# There are multiple ways to install TensorRT-LLM. In this guide, we'll cover using a pre-built Docker container from NVIDIA NGC as well as building from source.
#
# If you're using NVIDIA Brev, you can skip this section.

# ## Using NVIDIA NGC
#
# Pull the pre-built [TensorRT-LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for GPT-OSS from [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/).
# This is the easiest way to get started and ensures all dependencies are included.
#
# ```bash
# docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
# docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
# ```
#
# ## Using Docker (Build from Source)
#
# Alternatively, you can build the TensorRT-LLM container from source.
# This approach is useful if you want to modify the source code or use a custom branch.
# For detailed instructions, see the [official documentation](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker).
# TensorRT-LLM will be available through pip soon.

# > Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch.
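
# Before loading the model, you can optionally confirm that your GPU, driver, and PyTorch build agree on the CUDA capability mentioned in the note above. The cell below is a minimal sanity-check sketch (it is not part of the TensorRT-LLM API) and assumes PyTorch is installed, which is the case inside the NGC container.

# In[ ]:


import torch

# Report whether a CUDA device is visible and which compute capability (sm_XX) it exposes.
# A capability-mismatch warning from PyTorch usually points to an outdated driver or CUDA Toolkit.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability (sm):", torch.cuda.get_device_capability(0))
    print("PyTorch built with CUDA:", torch.version.cuda)
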
# # Verifying TensorRT-LLM Installation

# In[ ]:


from tensorrt_llm import LLM, SamplingParams


# # Utilizing the TensorRT-LLM Python API

# In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to:
# 1. Download the specified model weights from Hugging Face (using your HF_TOKEN for authentication).
# 2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.
# 3. Load the model and prepare it for inference.
# 4. Run a simple text generation example to verify everything is working.
#
# **Note**: The first run may take several minutes as it downloads the model and builds the engine.
# Subsequent runs will be much faster, as the engine will be cached.

# In[ ]:


llm = LLM(model="openai/gpt-oss-20b")


# In[ ]:


prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


# # Conclusion and Next Steps

# Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.
#
# In this notebook, you have learned how to:
# - Set up your environment with the necessary dependencies.
# - Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.
# - Automatically build a high-performance TensorRT engine tailored to your GPU.
# - Run inference with the optimized model.
#
# You can explore more advanced features to further improve performance and efficiency:
#
# - Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time, as sketched in the final cell below.
#
# - Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (like INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware.
#
# - Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.
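
# The cell below is a minimal timing sketch for the benchmarking idea above; it is not the `trtllm-bench` tool linked in the list. It reuses the `llm` object and the `SamplingParams` import from the earlier cells, repeats the example prompts to form a larger batch, and reports wall-clock throughput. The batch size and `max_tokens` value are illustrative assumptions, not tuned recommendations.

# In[ ]:


import time

# Build a larger batch by repeating the example prompts (32 requests is arbitrary).
bench_prompts = ["Hello, my name is", "The capital of France is"] * 16

# Cap the generation length so every request does a comparable amount of work.
bench_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(bench_prompts, bench_params)
elapsed = time.perf_counter() - start

print(f"Completed {len(outputs)} requests in {elapsed:.2f} s "
      f"({len(outputs) / elapsed:.2f} requests/s)")
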