Qwen2VL is the latest addition to the QwenVL series of multimodal large language models.
Key Enhancements of Qwen2VL:
- SoTA understanding of images of various resolution & ratio: Qwen2VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2VL can be integrated with devices like mobile phones and robots for automatic operation based on the visual environment and text instructions.
- Multilingual Support: besides English and Chinese, Qwen2VL supports understanding texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
Model Architecture Details:
- Naive Dynamic Resolution: Qwen2VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
- Multimodal Rotary Position Embedding (M-ROPE): decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
More details about the model can be found in the model card, blog, and original repo.
In this tutorial we consider how to convert and optimize the Qwen2VL model for creating a multimodal chatbot. Additionally, we demonstrate how to apply stateful transformation on the LLM part and model optimization techniques like weights compression using NNCF.
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
%pip install -q "transformers>=4.45" "torch>=2.1" "torchvision" "qwen-vl-utils" "Pillow" "gradio>=4.36" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -qU "openvino>=2024.4.0" "nncf>=2.13.0"
from pathlib import Path
import requests
if not Path("ov_qwen2_vl.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/qwen2-vl/ov_qwen2_vl.py")
open("ov_qwen2_vl.py", "w").write(r.text)
if not Path("notebook_utils.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py")
open("notebook_utils.py", "w").write(r.text)
There are multiple Qwen2VL models available in the models collection. You can select one of them for conversion and optimization in the notebook using the widget below:
from ov_qwen2_vl import model_selector
model_id = model_selector()
model_id
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
Dropdown(description='Model:', options=('Qwen/Qwen2-VL-2B-Instruct', 'Qwen/Qwen2-VL-7B-Instruct'), value='Qwen…
print(f"Selected {model_id.value}")
pt_model_id = model_id.value
model_dir = Path(pt_model_id.split("/")[-1])
Selected Qwen/Qwen2-VL-7B-Instruct
Qwen2VL is a PyTorch model. OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR); the OpenVINO model conversion API should be used for this purpose. The ov.convert_model function accepts the original PyTorch model instance and an example input for tracing and returns an ov.Model object representing this model in the OpenVINO framework. The converted model can be saved to disk using the ov.save_model function or loaded directly on a device using core.compile_model.
The ov_qwen2_vl.py script contains helper functions for model conversion; please check its content if you are interested in the conversion details.
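For illustration, here is a minimal sketch of that convert/save/compile flow on a toy module (the Linear model and file name are placeholders; the actual Qwen2VL conversion is wrapped inside ov_qwen2_vl.py):
import openvino as ov
import torch

# Toy example: convert a small PyTorch module to OpenVINO IR.
pt_model = torch.nn.Linear(8, 4).eval()
example_input = torch.randn(1, 8)

ov_model = ov.convert_model(pt_model, example_input=example_input)  # trace to ov.Model
ov.save_model(ov_model, "linear.xml")                # serialize IR to disk
compiled = ov.Core().compile_model(ov_model, "CPU")  # or load directly on a device
print(compiled(example_input.numpy()))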
The inference flow differs between the first step and the subsequent ones. On the first step, the model accepts the preprocessed input instruction and image, which are transformed into the unified embedding space using the input_embedding and image_encoder models; after that, the language_model, the LLM-based part of the model, runs on the input embeddings to predict the probability of the next generated tokens. On subsequent steps, the language_model accepts only the id of the next token, selected based on the sampling strategy and processed by the input_embedding model, together with the cached attention keys and values. Since the output side is auto-regressive, an output token's hidden state remains the same once computed for every further generation step, so recomputing it every time you want to generate a new token is wasteful. With the cache, the model saves the hidden state once it has been computed and, at each time step, computes it only for the most recently generated output token, re-using the saved states for all previous tokens. This reduces the generation complexity from $O(n^3)$ to $O(n^2)$ for a transformer model. More details about how it works can be found in this article.
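To make the benefit of the cache concrete, here is a toy numpy sketch of a single attention head with a growing key/value cache (illustrative only, not the notebook's code): each step computes the key and value for just the newest token and reuses everything already cached.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size of the toy model
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    # Attention of one query vector against all cached keys/values.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
x = rng.normal(size=d)  # embedding of the current token
for step in range(5):
    # Only the newest token's key/value are computed; older ones come from the cache.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    x = attend(x @ Wq, K_cache, V_cache)  # hidden state feeds the next step
print("final hidden state shape:", x.shape)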
To sum up the above, the model consists of 4 parts:
To reduce memory consumption, weights compression optimization can be applied using NNCF. Weight compression benefits large memory-bound models, such as LLMs, in the following ways:
- enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
- improving the inference performance of the models by reducing the latency of memory access when computing operations with weights, for example, Linear layers.
Neural Network Compression Framework (NNCF) provides 4-bit / 8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weights compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weights compression, which leads to better accuracy. Weight compression for LLMs provides a solid inference performance improvement which is on par with the performance of full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
The nncf.compress_weights function can be used for performing weights compression. The function accepts an OpenVINO model and other compression parameters. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.
More details about weights compression can be found in the OpenVINO documentation.
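For reference, here is a minimal sketch of applying nncf.compress_weights directly to an already converted OpenVINO model (the notebook performs this step inside convert_qwen2vl_model; "your_model.xml" is a placeholder path):
import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model("your_model.xml")  # placeholder: any converted IR

# INT4 asymmetric compression: groups of 128 weights share one quantization
# scale, and ratio=1.0 compresses all ratio-defining layers to 4 bits.
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=128,
    ratio=1.0,
)
ov.save_model(compressed_model, "your_model_int4.xml")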
from ov_qwen2_vl import convert_qwen2vl_model
# uncomment these lines to see model conversion code
# convert_qwen2vl_model??
import nncf
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,  # 4-bit asymmetric weight quantization
    "group_size": 128,  # 128 weights share one quantization scale
    "ratio": 1.0,  # compress all ratio-defining layers to 4 bits
}
convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)
⌛ Qwen/Qwen2-VL-7B-Instruct conversion started. Be patient, it may takes some time. ⌛ Load Original model
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
✅ Original model successfully loaded ⌛ Convert Input embedding model WARNING:nncf:NNCF provides best results with torch==2.4.*, while current torch version is 2.5.1+cu124. If you encounter issues, consider switching to torch==2.4.* ✅ Input embedding model successfully converted ⌛ Convert Language model
/home/ea/work/py311/lib/python3.11/site-packages/transformers/modeling_utils.py:4779: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead warnings.warn( /home/ea/work/py311/lib/python3.11/site-packages/transformers/cache_utils.py:447: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results. or len(self.key_cache[layer_idx]) == 0 # the layer has no cache /home/ea/work/py311/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py:477: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if sequence_length != 1: /home/ea/work/py311/lib/python3.11/site-packages/transformers/cache_utils.py:432: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results. elif len(self.key_cache[layer_idx]) == 0: # fills previously skipped layers; checking for tensor causes errors
✅ Language model successfully converted ⌛ Weights compression with int4_asym mode started INFO:nncf:Statistics of the bitwidth distribution: ┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑ │ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │ ┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥ │ 8 │ 8% (1 / 197) │ 0% (0 / 196) │ ├────────────────┼─────────────────────────────┼────────────────────────────────────────┤ │ 4 │ 92% (196 / 197) │ 100% (196 / 196) │ ┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
✅ Weights compression finished ⌛ Convert Image embedding model ⌛ Weights compression with int4_asym mode started INFO:nncf:Statistics of the bitwidth distribution: ┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑ │ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │ ┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥ │ 8 │ 3% (1 / 130) │ 0% (0 / 129) │ ├────────────────┼─────────────────────────────┼────────────────────────────────────────┤ │ 4 │ 97% (129 / 130) │ 100% (129 / 129) │ ┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
✅ Weights compression finished ✅ Image embedding model successfully converted ✅ Qwen/Qwen2-VL-7B-Instruct model conversion finished. You can find results in Qwen2-VL-7B-Instruct
As discussed, the model comprises an Image Encoder and an LLM (with a separate text embedding part) that generates the answer. In ov_qwen2_vl.py we defined the inference class OVQwen2VLModel, which implements the generation cycle. It is based on the HuggingFace Transformers GenerationMixin and looks similar to the Optimum Intel OVModelForCausalLM class that is used for LLM inference.
from ov_qwen2_vl import OVQwen2VLModel
# Uncomment below lines to see the model inference class code
# OVQwen2VLModel??
from notebook_utils import device_widget
device = device_widget(default="AUTO", exclude=["NPU"])
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
model = OVQwen2VLModel(model_dir, device.value)
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info
from transformers import TextStreamer
min_pixels = 256 * 28 * 28  # each visual token covers a 28x28 pixel patch,
max_pixels = 1280 * 28 * 28  # so these bounds keep the per-image token count between 256 and 1280
processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)
if processor.chat_template is None:
    # fall back to the tokenizer's chat template if the processor does not define one
    tok = AutoTokenizer.from_pretrained(model_dir)
    processor.chat_template = tok.chat_template
example_image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
example_image_path = Path("demo.jpeg")
if not example_image_path.exists():
    Image.open(requests.get(example_image_url, stream=True).raw).save(example_image_path)
image = Image.open(example_image_path)
question = "Describe this image."
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": f"file://{example_image_path}",
            },
            {"type": "text", "text": question},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
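The resulting inputs object is dict-like; you can quickly inspect what the processor produced (key names depend on the processor version; for Qwen2VL they typically include input_ids, attention_mask, pixel_values, and image_grid_thw):
# Print the name and shape of each tensor prepared for inference.
for name, value in inputs.items():
    print(name, tuple(value.shape))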
display(image)
print("Question:")
print(question)
print("Answer:")
generated_ids = model.generate(**inputs, max_new_tokens=100, streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True))
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Question: Describe this image. Answer: The image depicts a serene beach scene at sunset. A woman and her dog are sitting on the sandy shore, enjoying each other's company. The woman is wearing a plaid shirt and has long hair. She is holding the dog's paw, and the dog is wearing a colorful harness. The dog appears to be a large breed, possibly a Labrador Retriever. The ocean is visible in the background, with gentle waves and a clear sky. The sun is setting, casting a warm glow over
if not Path("gradio_helper.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/qwen2-vl/gradio_helper.py")
open("gradio_helper.py", "w").write(r.text)
Now you can try to chat with the model. Upload an image or video using the Upload button, type your text message into the Input field, and click Submit to start the conversation.
from gradio_helper import make_demo
demo = make_demo(model, processor)
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(debug=True, share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/