
XCP-ng VM GPU Passthrough

Problem Statement

Since we use XCP-ng as the hypervisor to manage our infrastructure, VMs by default have no access to specialized hardware such as GPUs. However, some workloads, such as deep learning training, require a GPU. We therefore need a systematic way to provision GPU resources to specific VMs when needed.

Solution Overview

In general, we use XCP-ng's PCIe passthrough feature to assign GPUs to specific VMs. This differs from vGPU technology, which virtualizes one physical GPU into multiple virtual GPUs. Most of our in-house GPUs are NVIDIA RTX series cards, which do not support vGPU anyway. Note that the assignment is exclusive: once a physical GPU is assigned to a VM, it is not available to any other VM or to the host OS.

Step by Step Instructions

Step 0: Log into the XCP-ng host machine

ssh root@192.168.1.14 # cloudlet4 for example

Step 1: Stop all the VMs running on the host

This step is critical; otherwise the remaining VMs will be halted when the host machine reboots.

xe vm-list # list the existing VMs on the host along with their UUIDs
xe vm-shutdown uuid=<vm uuid> # shut down the VMs one by one
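
When the host runs many VMs, shutting them down one by one gets tedious. The sketch below is one possible way to generate the shutdown commands from a comma-separated UUID list. The helper name shutdown_cmds is hypothetical, and it only prints the commands so they can be reviewed before anything is executed.

```shell
# Hypothetical helper: print one "xe vm-shutdown" command per UUID.
# Assumption: $1 is a comma-separated UUID list, as printed by
#   xe vm-list power-state=running is-control-domain=false params=uuid --minimal
shutdown_cmds() {
  # Split on commas and emit one command per non-empty UUID.
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r uuid; do
    if [ -n "$uuid" ]; then
      printf 'xe vm-shutdown uuid=%s\n' "$uuid"
    fi
  done
}

# Usage on the host (review the printed commands first, then pipe them to sh):
#   shutdown_cmds "$(xe vm-list power-state=running is-control-domain=false params=uuid --minimal)"
```

Printing instead of executing is a deliberate choice here: it leaves a chance to spot a VM that should be migrated rather than stopped.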

Step 2: List all available PCI devices

lspci
# Example output:
# 0b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] (rev a1)
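
On a host with many PCI devices, it can help to narrow the lspci output down to the NVIDIA entries. The following is a minimal sketch; the function name nvidia_bdfs is hypothetical, and it simply matches any line containing "NVIDIA", which may include more than the VGA controller.

```shell
# Hypothetical helper: read lspci output on stdin and print the PCI address
# (first column) of every line that mentions NVIDIA.
nvidia_bdfs() {
  awk '/NVIDIA/ { print $1 }'
}

# Usage on the host:
#   lspci | nvidia_bdfs
```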

Step 3: Hide the device from the host (dom0) so that XCP-ng does not use it

/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:0b:00.0)"

Note that the ID 0000:0b:00.0 comes from the lspci output in the previous step; in this case it is the PCI address of the 3080 Ti. The 0000 prefix is the PCI domain (segment) number, which lspci omits by default on machines that only have domain 0. Run lspci -D to display it.
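
Since lspci may print the short form (0b:00.0) while xen-pciback expects the full form (0000:0b:00.0), a small normalizer can avoid copy-paste mistakes. This is a sketch only; full_bdf is a hypothetical name, and it assumes the device lives in PCI domain 0000, which should be checked with lspci -D on hosts with more than one domain.

```shell
# Hypothetical helper: turn a short PCI address into the full form.
# Assumption: a missing domain means domain 0000.
full_bdf() {
  case "$1" in
    *:*:*) printf '%s\n' "$1" ;;        # two colons: domain already present
    *)     printf '0000:%s\n' "$1" ;;   # one colon: prepend the assumed domain
  esac
}

# Usage:
#   full_bdf 0b:00.0
```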

For multiple GPUs, put each device in its own pair of parentheses:

/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:0b:00.0)(0000:0c:00.0)"
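
The parenthesized list is easy to mistype when several GPUs are involved. As an illustration, the argument can be assembled from a list of full PCI addresses; build_hide_arg is a hypothetical helper name, not part of XCP-ng.

```shell
# Hypothetical helper: build the xen-pciback.hide argument from full PCI
# addresses, wrapping each one in its own pair of parentheses.
build_hide_arg() {
  arg="xen-pciback.hide="
  for bdf in "$@"; do
    arg="${arg}(${bdf})"
  done
  printf '%s\n' "$arg"
}

# Usage on the host:
#   /opt/xensource/libexec/xen-cmdline --set-dom0 "$(build_hide_arg 0000:0b:00.0 0000:0c:00.0)"
```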

Step 4: Reboot the XCP-ng host

reboot

Step 5: Check assignable PCI devices

The device 0000:0b:00.0 should now appear in the list of assignable devices:

xl pci-assignable-list
# 0000:0b:00.0
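
If this check is scripted, for example before assigning devices automatically, something like the following could be used. It is a sketch: is_assignable is a hypothetical helper that reads the list on stdin and assumes each assignable device is printed as a full PCI address on its own line, as in the example output above.

```shell
# Hypothetical helper: succeed if the given PCI address appears in the
# assignable list read from stdin.
is_assignable() {
  grep -qF -- "$1"
}

# Usage on the host:
#   xl pci-assignable-list | is_assignable 0000:0b:00.0 && echo "ready for passthrough"
```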

Step 6: Assign the PCI device to a specific VM

xe vm-param-set other-config:pci=0/0000:0b:00.0 uuid=<vm uuid>

To assign two GPUs to one VM (for example, devices 0000:0b:00.0 and 0000:0c:00.0), prefix each address with 0/ and separate the entries with commas:

xe vm-param-set other-config:pci=0/0000:0b:00.0,0/0000:0c:00.0 uuid=<vm uuid>
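
The value follows a simple pattern: each address gets a 0/ prefix and entries are joined with commas. A minimal sketch of building it from a list of addresses is shown below; build_pci_param is a hypothetical helper name.

```shell
# Hypothetical helper: build the other-config:pci value from full PCI
# addresses, e.g. "0/0000:0b:00.0,0/0000:0c:00.0".
build_pci_param() {
  value=""
  for bdf in "$@"; do
    # Add a comma separator only when there is already an entry.
    value="${value:+${value},}0/${bdf}"
  done
  printf '%s\n' "$value"
}

# Usage on the host (replace <vm uuid> with the real UUID):
#   xe vm-param-set other-config:pci="$(build_pci_param 0000:0b:00.0 0000:0c:00.0)" uuid=<vm uuid>
```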

Step 7: Start the VM

xe vm-start uuid=<vm uuid>
# or via the XOA UI

Step 8: Log into the VM

ssh xen@192.168.1.21 # gpu-dev for example

Step 9: Find the latest NVIDIA driver version

sudo apt update
apt-cache search nvidia-driver
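
The search usually returns many packages. One possible way to pick the highest numbered nvidia-driver-NNN package from that list is sketched below. The helper name latest_driver is hypothetical, and the highest number is not necessarily the best driver for a given card, so treat the result as a starting point.

```shell
# Hypothetical helper: read "apt-cache search" output on stdin and print the
# nvidia-driver-NNN package with the highest version number.
latest_driver() {
  awk '{ print $1 }' \
    | grep -E '^nvidia-driver-[0-9]+$' \
    | sort -t- -k3,3n \
    | tail -n 1
}

# Usage inside the VM:
#   apt-cache search nvidia-driver | latest_driver
```

Variants such as nvidia-driver-515-server are skipped on purpose by the pattern, which only accepts a purely numeric suffix.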

Step 10: Install the NVIDIA driver and check whether the passthrough succeeded

sudo apt install nvidia-driver-515 # latest version as of Aug 2022
nvidia-smi # check whether the GPU was passed through successfully (sometimes requires a forced reboot)

You should then see output similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:05.0 Off |                  N/A |
|  0%   27C    P8    12W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1340      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
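
For a VM that received more than one GPU, a quick scripted check can confirm that all of them are visible. The sketch below counts the devices listed by nvidia-smi -L, which prints one "GPU N: ..." line per device; count_gpus is a hypothetical helper name.

```shell
# Hypothetical helper: count the GPUs in "nvidia-smi -L" output read from stdin.
count_gpus() {
  grep -cE '^GPU [0-9]+:'
}

# Usage inside the VM:
#   nvidia-smi -L | count_gpus
```

Comparing this count with the number of devices assigned in Step 6 is a simple way to notice a GPU that did not come up after the driver installation.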
