基于DHCP服务器使用 PXE 引导和 kickstart (UEFI) 自动安装 ESXi 7

基于DHCP服务器使用 PXE 引导和 kickstart (UEFI) 自动安装 ESXi 7
基于UEFI的PXE安装 ESXi 7
PXE网络引导自动化安装 ESXi 7 详解
PXE+Kickstart 批量安装 ESXi 7
使用PXE 和TFTP 引导ESXi 安装ESXi 7 程序

1 基于DHCP服务器使用 PXE 引导和 kickstart (UEFI) 自动安装 ESXi 7 录制视频

2 浪潮NF5280M5服务器基于DHCP服务器使用 PXE 引导和 kickstart (UEFI) 自动安装 ESXi 7 录制视频

#CentOS 7 系统 安装dhcp服务器 、 tftp-server服务器 、 web服务器
yum install -y dhcp tftp-server httpd

1、配置DHCP服务器

守护进程 :/etc/sbin/dhcpd
脚本:/etc/init.d/dhcpd
端口:67(bootps) 服务端,68(bootpc) 客户端
配置文件:/etc/dhcp/dhcpd.conf
租约信息:/var/lib/dhcpd/dhcpd.leases
配置文件:cp /usr/share/doc/dhcp-4.2.5/dhcpd.conf.example /etc/dhcp/dhcpd.conf
修改DHCP监听特定端口 vi /etc/sysconfig/dhcpd

#编辑DHCP服务器配置文件
vi /etc/dhcp/dhcpd.conf

default-lease-time 600;
max-lease-time 7200;
log-facility local7;

subnet 10.53.220.0 netmask 255.255.255.0 {
  range 10.53.220.41 10.53.220.49;
  option domain-name-servers 10.1.0.1, 114.114.114.114;   #DNS服务器IP地址
  option routers 10.53.220.1;                             #网关地址
  next-server 10.53.220.224;                              # 指定TFTP服务器地址
  filename "/mboot.efi";                                  # 指定网络引导映像文件
}
#绑定IP地址
host zhangfangzhou {                                      #zhangfangzhou 随意指定,但必须唯一
  hardware ethernet 00:50:56:99:06:b7;
  fixed-address 10.53.220.48;
}

#DHCP的服务设置开机启动、启动DHCP服务
systemctl enable dhcpd && systemctl start dhcpd

########防火墙设置
firewall-cmd --zone=public --add-port=67/udp --permanent
firewall-cmd --zone=public --add-port=68/udp --permanent
#重新载入配置
firewall-cmd --reload
firewall-cmd --list-all #查看防火墙规则,只显示/etc/firewalld/zones/public.xml中防火墙策略

2、配置 tftp 服务器

启用 tftp 服务
vi /etc/xinetd.d/tftp

# default: off
# description: The tftp server serves files using the trivial file transfer \
#       protocol.  The tftp protocol is often used to boot diskless \
#       workstations, download configuration files to network-aware printers, \
#       and to start the installation process for some operating systems.
service tftp
{
        socket_type             = dgram
        protocol                = udp
        wait                    = yes
        user                    = root
        server                  = /usr/sbin/in.tftpd
        server_args             = -s /var/lib/tftpboot # TFTP服务器顶级目录
        disable                 = no    # 从yes修改为no
        per_source              = 11
        cps                     = 100 2
        flags                   = IPv4
}

#tftp的服务设置开机启动、启动tftp服务
systemctl enable tftp && systemctl start tftp

########防火墙设置
firewall-cmd --zone=public --add-port=69/tcp --permanent
firewall-cmd --zone=public --add-port=69/udp --permanent
#重新载入配置
firewall-cmd --reload
firewall-cmd --list-all #查看防火墙规则,只显示/etc/firewalld/zones/public.xml中防火墙策略

3、挂载ESXi 7 镜像

#创建文件夹,主要是下载ESXi 7和挂载ESXi 7使用
mkdir -p /var/lib/tftpboot/{iso,ESXi70u2}
wget -P /var/lib/tftpboot/iso  http://10.53.123.144/ISO/ESXi/VMware-VMvisor-Installer-7.0U2a-17867351.x86_64.iso
cd /var/lib/tftpboot
mount /var/lib/tftpboot/iso/VMware-VMvisor-Installer-7.0U2a-17867351.x86_64.iso /var/lib/tftpboot/ESXi70u2     #重启后需要重新mount

#1、UEFI启动 ESXi 7
将上面DHCP服务器指定的启动镜像文件复制到指定目录下。上面的设置示例名为“mboot.efi”,因此将其复制为该名称。
cd /var/lib/tftpboot/
cp -p /var/lib/tftpboot/ESXi70u2/efi/boot/bootx64.efi /var/lib/tftpboot/mboot.efi

直接在 tftpboot 下复制名为 boot.cfg 的文件,该文件描述了引导设置。
cp -p /var/lib/tftpboot/ESXi70u2/efi/boot/boot.cfg /var/lib/tftpboot/boot.cfg
 #2、传统BIOS启动 ESXi 7
如果要使用旧版 BIOS,则需要3.86版的 syslinux 软件包。从 https://www.kernel.org/pub/linux/utils/boot/syslinux/ 下载包
cd /tmp
wget https://mirrors.edge.kernel.org/pub/linux/utils/boot/syslinux/3.xx/syslinux-3.86.tar.gz
tar xvzf syslinux-3.86.tar.gz
cp -p syslinux-3.86/core/mboot.efi /var/lib/tftpboot/
#3、修改boot.cfg
删除 boot.cfg 中文件路径的“/”
sed -i 's|/||g' boot.cfg

4、检查boot.cfg

根据您的环境更改“标题”和“前缀”。为“title”使用一个描述性名称是个好主意,它是自动安装过程中显示的标题字符。“前缀prefix”指定存储 ESXi 安装程序的目录。在这种情况下,它将是“ESXi70u2”。

修改title和refix名称
prefix=ESXi70u2 (ESXi70u2就是上面ESXi ISO文件mount的文件夹)

# vi /var/lib/tftpboot/boot.cfg
bootstate=0
title=Loading ESXi installer www.zhangfangzhou.cn
timeout=5
prefix=ESXi70u2
kernel=b.b00
kernelopt=runweasel cdromBoot
modules=jumpstrt.gz --- useropts.gz --- features.gz --- k.b00 --- uc_intel.b00 --- uc_amd.b00 --- uc_hygon.b00 --- procfs.b00 --- vmx.v00 --- vim.v00 --- tpm.v00 --- sb.v00 --- s.v00 --- atlantic.v00 --- bnxtnet.v00 --- bnxtroce.v00 --- brcmfcoe.v00 --- brcmnvme.v00 --- elxiscsi.v00 --- elxnet.v00 --- i40enu.v00 --- iavmd.v00 --- icen.v00 --- igbn.v00 --- irdman.v00 --- iser.v00 --- ixgben.v00 --- lpfc.v00 --- lpnic.v00 --- lsi_mr3.v00 --- lsi_msgp.v00 --- lsi_msgp.v01 --- lsi_msgp.v02 --- mtip32xx.v00 --- ne1000.v00 --- nenic.v00 --- nfnic.v00 --- nhpsa.v00 --- nmlx4_co.v00 --- nmlx4_en.v00 --- nmlx4_rd.v00 --- nmlx5_co.v00 --- nmlx5_rd.v00 --- ntg3.v00 --- nvme_pci.v00 --- nvmerdma.v00 --- nvmxnet3.v00 --- nvmxnet3.v01 --- pvscsi.v00 --- qcnic.v00 --- qedentv.v00 --- qedrntv.v00 --- qfle3.v00 --- qfle3f.v00 --- qfle3i.v00 --- qflge.v00 --- rste.v00 --- sfvmk.v00 --- smartpqi.v00 --- vmkata.v00 --- vmkfcoe.v00 --- vmkusb.v00 --- vmw_ahci.v00 --- clusters.v00 --- crx.v00 --- elx_esx_.v00 --- btldr.v00 --- esx_dvfi.v00 --- esx_ui.v00 --- esxupdt.v00 --- tpmesxup.v00 --- weaselin.v00 --- loadesx.v00 --- lsuv2_hp.v00 --- lsuv2_in.v00 --- lsuv2_ls.v00 --- lsuv2_nv.v00 --- lsuv2_oe.v00 --- lsuv2_oe.v01 --- lsuv2_oe.v02 --- lsuv2_sm.v00 --- native_m.v00 --- qlnative.v00 --- vdfs.v00 --- vmware_e.v00 --- vsan.v00 --- vsanheal.v00 --- vsanmgmt.v00 --- tools.t00 --- xorg.v00 --- gc.v00 --- imgdb.tgz --- basemisc.tgz --- resvibs.tgz --- imgpayld.tgz
build=7.0.2-0.0.17867351
updated=0

到这一步可以实现通过PXE网络安装ESXi

5、配置web 服务器

#默认网站目录/var/www/html

systemctl enable httpd && systemctl start httpd

########防火墙设置
firewall-cmd --zone=public --add-port=80/tcp --permanent
firewall-cmd --zone=public --add-port=443/tcp --permanent
#重新载入配置
firewall-cmd --reload
firewall-cmd --list-all #查看防火墙规则,只显示/etc/firewalld/zones/public.xml中防火墙策略

6、配置 ESXi 7 的 kickstart

示例 kickstart 文件存储在 ESXi 的 /etc/vmware/weasel/ks.cfg 中。将此复制到 http 服务器上的 /var/www/html

#示例 1 kickstart 文件,给ESXi 7 配置 静态IP地址
vi /var/www/html/ks.cfg
#####################
# Accept the VMware End User License Agreement
vmaccepteula
 
# Set the root password for the DCUI and Tech Support Mode
rootpw P@ssw0rd
 
# The install media is in the CD-ROM drive
install --firstdisk --overwritevmfs
 
# Set the network on the first network adapter
network --bootproto=static --device=vmnic0 --ip=10.53.220.199 --netmask=255.255.255.0 --vlanid=0 --gateway=10.53.220.1 --hostname=10.53.220.199 --nameserver=10.1.0.1

#vmserialnum
vmserialnum --esx=5U4TK-DML1M-M8550-XK1QP-1A052
# reboot after install
reboot
 
# run the following command only on the firstboot
%firstboot --interpreter=busybox

sleep 10
# enable & start remote ESXi Shell (SSH)
vim-cmd hostsvc/enable_ssh
vim-cmd hostsvc/start_ssh
sleep 1
# enable & start ESXi Shell (TSM)
vim-cmd hostsvc/enable_esx_shell
vim-cmd hostsvc/start_esx_shell
sleep 1
# enable High Performance
esxcli system settings advanced set --option=/Power/CpuPolicy --string-value="High Performance"
sleep 1
#Disable ipv6
esxcli network ip set --ipv6-enabled=0
sleep 1
#不加入体验
esxcli system settings advanced set -o /UserVars/HostClientCEIPOptIn -i 2
sleep 1
#enable ntp
/bin/esxcli system ntp set --server=ntp.aliyun.com
sleep 1
/bin/chkconfig ntpd on
sleep 1
#Disable ShellWarning
esxcli system settings advanced set -o /UserVars/SuppressShellWarning -i 1
sleep 1
cat << EOF > /etc/ssh/keys-root/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAgEA6PP8/RDdYIqUq6DE3zj9Qs8AF3uzfYH5lYrB+mMxxfI8kq2NIzQlsW1KDaH/fWYTkkK120lPUu97lWic9Li3Z3iFR6Nh4q6PVTfBNt4xOx754Ipqtpefk+9sLZAYEPK8pnRP0QZv7CtDFv842tfUYIrVNnecRQTFfNtnGDcXnO1u2RE1kq6Tr5N3595PbPLDKczjOFnS+jy0MeKKHPJcZfYz7TUTSzTwTHYbPRRaQ8/0eihUwzpAmXRo9NYNle26qp6+SlRsjGSBcUr0rh0wSe6r/C2btnpOUd//aFvcl8plhyb++nivlhB71v+6I0UcPoSXOIVs/1QuHEMbv7Ircjb2emqHtZDpk8KSYhgV0ZdbAq9XOcux76eok//xtjbleKPAcDMY/KotIEh7QX4NLQxSJOm5gCLkn5kbrHfVx6nWlGzVVds1sDlcnSAWul5lFiI5ZkArXFKcnm+aCnStPpx5SSCpZpMUdZvt8ZA7vLx3xjMDFwv5HTuTFwB9mlZrdfqp5USC2mWC3eAAPE7GxDSfJv9epteIYywIP9NVT3Z4ng9z6jrcFfA4GFlfLrk8J71cnxZ/AWZXXUwp3ooE/Cp3jc473VpK7FZwjQ7Xz9PD8WQgMOO4xnGQhPWlxhRuoTYyQVa0xOBO9gh9Cuc6zq5FQgYQEcSB+/FBJ/YNDIc= www.zhangfangzhou.cn
EOF
sleep 1
date > /finished.stamp

# Restart a last time
reboot


#示例 2 kickstart 文件,ESXi 7 通过 DHCP 服务器获得IP地址
​vi /var/www/html/ks.cfg

# Accept the VMware End User License Agreement
vmaccepteula
 
# Set the root password for the DCUI and Tech Support Mode
rootpw P@ssw0rd
 
# The install media is in the CD-ROM drive
install --firstdisk --overwritevmfs
 
# Set the network on the first network adapter
network --bootproto=dhcp

#vmserialnum
vmserialnum --esx=5U4TK-DML1M-M8550-XK1QP-1A052
# reboot after install
reboot
 
# run the following command only on the firstboot
%firstboot --interpreter=busybox

sleep 10
# enable & start remote ESXi Shell (SSH)
vim-cmd hostsvc/enable_ssh
vim-cmd hostsvc/start_ssh
sleep 1
# enable & start ESXi Shell (TSM)
vim-cmd hostsvc/enable_esx_shell
vim-cmd hostsvc/start_esx_shell
sleep 1
# enable High Performance
esxcli system settings advanced set --option=/Power/CpuPolicy --string-value="High Performance"
sleep 1
#Disable ipv6
esxcli network ip set --ipv6-enabled=0
sleep 1
#不加入体验
esxcli system settings advanced set -o /UserVars/HostClientCEIPOptIn -i 2
sleep 1
#enable ntp
/bin/esxcli system ntp set --server=ntp.aliyun.com
sleep 1
/bin/chkconfig ntpd on
sleep 1
#Disable ShellWarning
esxcli system settings advanced set -o /UserVars/SuppressShellWarning -i 1
sleep 1
cat << EOF > /etc/ssh/keys-root/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAgEA6PP8/RDdYIqUq6DE3zj9Qs8AF3uzfYH5lYrB+mMxxfI8kq2NIzQlsW1KDaH/fWYTkkK120lPUu97lWic9Li3Z3iFR6Nh4q6PVTfBNt4xOx754Ipqtpefk+9sLZAYEPK8pnRP0QZv7CtDFv842tfUYIrVNnecRQTFfNtnGDcXnO1u2RE1kq6Tr5N3595PbPLDKczjOFnS+jy0MeKKHPJcZfYz7TUTSzTwTHYbPRRaQ8/0eihUwzpAmXRo9NYNle26qp6+SlRsjGSBcUr0rh0wSe6r/C2btnpOUd//aFvcl8plhyb++nivlhB71v+6I0UcPoSXOIVs/1QuHEMbv7Ircjb2emqHtZDpk8KSYhgV0ZdbAq9XOcux76eok//xtjbleKPAcDMY/KotIEh7QX4NLQxSJOm5gCLkn5kbrHfVx6nWlGzVVds1sDlcnSAWul5lFiI5ZkArXFKcnm+aCnStPpx5SSCpZpMUdZvt8ZA7vLx3xjMDFwv5HTuTFwB9mlZrdfqp5USC2mWC3eAAPE7GxDSfJv9epteIYywIP9NVT3Z4ng9z6jrcFfA4GFlfLrk8J71cnxZ/AWZXXUwp3ooE/Cp3jc473VpK7FZwjQ7Xz9PD8WQgMOO4xnGQhPWlxhRuoTYyQVa0xOBO9gh9Cuc6zq5FQgYQEcSB+/FBJ/YNDIc= www.zhangfangzhou.cn
EOF
sleep 1
date > /finished.stamp

# Restart a last time
reboot

7、最终boot.cfg配置文件

# vi /var/lib/tftpboot/boot.cfg
bootstate=0
title=Loading ESXi installer www.zhangfangzhou.cn
timeout=5
prefix=ESXi70u2
kernel=b.b00
kernelopt=ks=http://10.53.220.224/ks.cfg
modules=jumpstrt.gz --- useropts.gz --- features.gz --- k.b00 --- uc_intel.b00 --- uc_amd.b00 --- uc_hygon.b00 --- procfs.b00 --- vmx.v00 --- vim.v00 --- tpm.v00 --- sb.v00 --- s.v00 --- atlantic.v00 --- bnxtnet.v00 --- bnxtroce.v00 --- brcmfcoe.v00 --- brcmnvme.v00 --- elxiscsi.v00 --- elxnet.v00 --- i40enu.v00 --- iavmd.v00 --- icen.v00 --- igbn.v00 --- irdman.v00 --- iser.v00 --- ixgben.v00 --- lpfc.v00 --- lpnic.v00 --- lsi_mr3.v00 --- lsi_msgp.v00 --- lsi_msgp.v01 --- lsi_msgp.v02 --- mtip32xx.v00 --- ne1000.v00 --- nenic.v00 --- nfnic.v00 --- nhpsa.v00 --- nmlx4_co.v00 --- nmlx4_en.v00 --- nmlx4_rd.v00 --- nmlx5_co.v00 --- nmlx5_rd.v00 --- ntg3.v00 --- nvme_pci.v00 --- nvmerdma.v00 --- nvmxnet3.v00 --- nvmxnet3.v01 --- pvscsi.v00 --- qcnic.v00 --- qedentv.v00 --- qedrntv.v00 --- qfle3.v00 --- qfle3f.v00 --- qfle3i.v00 --- qflge.v00 --- rste.v00 --- sfvmk.v00 --- smartpqi.v00 --- vmkata.v00 --- vmkfcoe.v00 --- vmkusb.v00 --- vmw_ahci.v00 --- clusters.v00 --- crx.v00 --- elx_esx_.v00 --- btldr.v00 --- esx_dvfi.v00 --- esx_ui.v00 --- esxupdt.v00 --- tpmesxup.v00 --- weaselin.v00 --- loadesx.v00 --- lsuv2_hp.v00 --- lsuv2_in.v00 --- lsuv2_ls.v00 --- lsuv2_nv.v00 --- lsuv2_oe.v00 --- lsuv2_oe.v01 --- lsuv2_oe.v02 --- lsuv2_sm.v00 --- native_m.v00 --- qlnative.v00 --- vdfs.v00 --- vmware_e.v00 --- vsan.v00 --- vsanheal.v00 --- vsanmgmt.v00 --- tools.t00 --- xorg.v00 --- gc.v00 --- imgdb.tgz --- basemisc.tgz --- resvibs.tgz --- imgpayld.tgz
build=7.0.2-0.0.17867351
updated=0

到这里可以基于DHCP实现通过PXE自动安装 ESXi 7、自动设置系统密码、自动接受许可、自动添加产品密钥、自动启用SSH服务、自动启用shell服务、电源模式自动设置为高性能、禁用IPV6、不参加(CEIP)客户体验改善计划、自动配置NTP、自动添加私钥。

https://www.zhangfangzhou.cn/pxe-kickstart-install-uefi-esxi-7.html

如何在Ubuntu 22.04上安装GeForce RTX 4090驱动程序和CUDA

To install the GeForce RTX 4090 driver and cuda on Ubuntu 22.04

1. 打开终端并更新
使用sudo更新apt软件包列表并使用sudo升级apt软件包

sudo apt update && sudo apt upgrade

2. 确定可用驱动
使用sudo命令列出ubuntu驱动程序列表

sudo ubuntu-drivers list
......
nvidia-driver-565, (kernel modules provided by nvidia-dkms-565)
nvidia-driver-535, (kernel modules provided by linux-modules-nvidia-535-generic)
nvidia-driver-575, (kernel modules provided by nvidia-dkms-575)
nvidia-driver-555-open, (kernel modules provided by nvidia-dkms-555-open)
nvidia-driver-575-open, (kernel modules provided by nvidia-dkms-575-open)
nvidia-driver-570-server, (kernel modules provided by linux-modules-nvidia-570-server-generic)
nvidia-driver-555, (kernel modules provided by nvidia-dkms-555)
nvidia-driver-535-server-open, (kernel modules provided by linux-modules-nvidia-535-server-open-generic)
nvidia-driver-560, (kernel modules provided by nvidia-dkms-560)
nvidia-driver-570-server-open, (kernel modules provided by linux-modules-nvidia-570-server-open-generic)
nvidia-driver-570-open, (kernel modules provided by linux-modules-nvidia-570-open-generic)
nvidia-driver-535-server, (kernel modules provided by linux-modules-nvidia-535-server-generic)
nvidia-driver-560-open, (kernel modules provided by nvidia-dkms-560-open)
nvidia-driver-565-open, (kernel modules provided by nvidia-dkms-565-open)
nvidia-driver-550, (kernel modules provided by linux-modules-nvidia-550-generic)
nvidia-driver-550-open, (kernel modules provided by linux-modules-nvidia-550-open-generic)
nvidia-driver-570, (kernel modules provided by linux-modules-nvidia-570-generic)
nvidia-driver-535-open, (kernel modules provided by linux-modules-nvidia-535-open-generic)
open-vm-tools-desktop

3.安装驱动

sudo apt install nvidia-driver-570

4.重启系统

sudo cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nvidia.conf
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist rivafb
EOF
sudo update-initramfs -u

sudo reboot

5.验证驱动是否正常加载

root@lenovo-ThinkStation-PX:~# nvidia-smi 
Fri Jul 18 17:13:35 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:2A:00.0 Off |                  Off |
| 32%   42C    P8             34W /  450W |      87MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:3D:00.0 Off |                  Off |
| 30%   34C    P8             16W /  450W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off |   00000000:BD:00.0 Off |                  Off |
| 32%   32C    P8              6W /  450W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off |   00000000:E1:00.0 Off |                  Off |
| 32%   33C    P8             15W /  450W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2453      G   /usr/lib/xorg/Xorg                       46MiB |
|    0   N/A  N/A            2564      G   /usr/bin/gnome-shell                     13MiB |
|    1   N/A  N/A            2453      G   /usr/lib/xorg/Xorg                        4MiB |
|    2   N/A  N/A            2453      G   /usr/lib/xorg/Xorg                        4MiB |
|    3   N/A  N/A            2453      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

6.安装 CUDA

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8

用 nvcc --version 确认cuda的版本,如果显示Command nvcc not found,则编辑~/.bashrc

vim ~/.bashrc
export PATH=/usr/local/cuda-12.8/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

#更新变量
source ~/.bashrc

root@su:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

7.锁定驱动版本防止升级冲突

sudo apt-mark hold nvidia-driver-570
sudo apt-mark hold cuda-toolkit-12-8

如何在Ubuntu 24.04上安装NVIDIA L40驱动程序和CUDA

To install the NVIDIA L40 driver and cuda on Ubuntu 24.04

1. 打开终端并更新
使用sudo更新apt软件包列表并使用sudo升级apt软件包

sudo apt update && sudo apt upgrade

2. 确定可用驱动
使用sudo命令列出ubuntu驱动程序列表

sudo ubuntu-drivers list
......
nvidia-driver-565, (kernel modules provided by nvidia-dkms-565)
nvidia-driver-535, (kernel modules provided by linux-modules-nvidia-535-generic)
nvidia-driver-575, (kernel modules provided by nvidia-dkms-575)
nvidia-driver-555-open, (kernel modules provided by nvidia-dkms-555-open)
nvidia-driver-575-open, (kernel modules provided by nvidia-dkms-575-open)
nvidia-driver-570-server, (kernel modules provided by linux-modules-nvidia-570-server-generic)
nvidia-driver-555, (kernel modules provided by nvidia-dkms-555)
nvidia-driver-535-server-open, (kernel modules provided by linux-modules-nvidia-535-server-open-generic)
nvidia-driver-560, (kernel modules provided by nvidia-dkms-560)
nvidia-driver-570-server-open, (kernel modules provided by linux-modules-nvidia-570-server-open-generic)
nvidia-driver-570-open, (kernel modules provided by linux-modules-nvidia-570-open-generic)
nvidia-driver-535-server, (kernel modules provided by linux-modules-nvidia-535-server-generic)
nvidia-driver-560-open, (kernel modules provided by nvidia-dkms-560-open)
nvidia-driver-565-open, (kernel modules provided by nvidia-dkms-565-open)
nvidia-driver-550, (kernel modules provided by linux-modules-nvidia-550-generic)
nvidia-driver-550-open, (kernel modules provided by linux-modules-nvidia-550-open-generic)
nvidia-driver-570, (kernel modules provided by linux-modules-nvidia-570-generic)
nvidia-driver-535-open, (kernel modules provided by linux-modules-nvidia-535-open-generic)
open-vm-tools-desktop

3.安装驱动

sudo apt install nvidia-driver-570-server    #数据中心卡(如L40)专用驱动,稳定性更好,支持 MIG、多用户多实例等特性

4.重启系统

sudo reboot

5.验证驱动是否正常加载

root@su:~# nvidia-smi
Thu Jun 12 06:14:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40                     Off |   00000000:13:00.0 Off |                    0 |
| N/A   29C    P0             79W /  300W |       0MiB /  46068MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

6.安装 CUDA

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8

用 nvcc --version 确认cuda的版本,如果显示Command nvcc not found,则编辑~/.bashrc

vim ~/.bashrc
export PATH=/usr/local/cuda-12.8/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

#更新变量
source ~/.bashrc

root@su:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

7.锁定驱动版本防止升级冲突

sudo apt-mark hold nvidia-driver-570-server
sudo apt-mark hold cuda-toolkit-12-8


基于VMware 组织唯一标识符(OUI)和vCenter Server ID的虚拟机MAC地址规划

VMware vCenter Server MAC 地址分配与冲突避免

1. vCenter Server MAC 地址分配机制

在企业级 VMware 虚拟化环境中,多个 vCenter Server 可能共存。默认情况下,vCenter Server 使用 VMware 组织唯一标识符 (OUI) 和 vCenter Server ID 生成虚拟机 MAC 地址。

MAC 地址格式

VMware OUI:00:50:56
完整格式:00:50:56:XX:YY:ZZ

  • XX 计算方式80 + vCenter Server ID(十六进制表示)

  • YY、ZZ:随机分配的十六进制值

地址分配范围

  • 适用于 0-63 之间的 vCenter Server 实例 ID

  • 每个 vCenter Server 最多可分配 64,000 个唯一 MAC 地址

  • 地址范围:

    • 00:50:56:80:YY:ZZ(vCenter ID = 0)

    • 00:50:56:81:YY:ZZ(vCenter ID = 1)

    • ...

    • 00:50:56:BF:YY:ZZ(vCenter ID = 63)

分配方式 vCenter Server

ID

MAC地址

范围

OUI 0 00:50:56:80:YY:ZZ
OUI 1 00:50:56:81:YY:ZZ
OUI 2 00:50:56:82:YY:ZZ
OUI 3 00:50:56:83:YY:ZZ
OUI 4 00:50:56:84:YY:ZZ
OUI 5 00:50:56:85:YY:ZZ
OUI 6 00:50:56:86:YY:ZZ
OUI 7 00:50:56:87:YY:ZZ
OUI 8 00:50:56:88:YY:ZZ
OUI 9 00:50:56:89:YY:ZZ
OUI 10 00:50:56:8A:YY:ZZ
OUI 11 00:50:56:8B:YY:ZZ
OUI 12 00:50:56:8C:YY:ZZ
OUI 13 00:50:56:8D:YY:ZZ
OUI 14 00:50:56:8E:YY:ZZ
OUI 15 00:50:56:8F:YY:ZZ
OUI 16 00:50:56:90:YY:ZZ
OUI 17 00:50:56:91:YY:ZZ
OUI 18 00:50:56:92:YY:ZZ
OUI 19 00:50:56:93:YY:ZZ
OUI 20 00:50:56:94:YY:ZZ
OUI 21 00:50:56:95:YY:ZZ
OUI 22 00:50:56:96:YY:ZZ
OUI 23 00:50:56:97:YY:ZZ
OUI 24 00:50:56:98:YY:ZZ
OUI 25 00:50:56:99:YY:ZZ
OUI 26 00:50:56:9A:YY:ZZ
OUI 27 00:50:56:9B:YY:ZZ
OUI 28 00:50:56:9C:YY:ZZ
OUI 29 00:50:56:9D:YY:ZZ
OUI 30 00:50:56:9E:YY:ZZ
OUI 31 00:50:56:9F:YY:ZZ
OUI 32 00:50:56:A0:YY:ZZ
OUI 33 00:50:56:A1:YY:ZZ
OUI 34 00:50:56:A2:YY:ZZ
OUI 35 00:50:56:A3:YY:ZZ
OUI 36 00:50:56:A4:YY:ZZ
OUI 37 00:50:56:A5:YY:ZZ
OUI 38 00:50:56:A6:YY:ZZ
OUI 39 00:50:56:A7:YY:ZZ
OUI 40 00:50:56:A8:YY:ZZ
OUI 41 00:50:56:A9:YY:ZZ
OUI 42 00:50:56:AA:YY:ZZ
OUI 43 00:50:56:AB:YY:ZZ
OUI 44 00:50:56:AC:YY:ZZ
OUI 45 00:50:56:AD:YY:ZZ
OUI 46 00:50:56:AE:YY:ZZ
OUI 47 00:50:56:AF:YY:ZZ
OUI 48 00:50:56:B0:YY:ZZ
OUI 49 00:50:56:B1:YY:ZZ
OUI 50 00:50:56:B2:YY:ZZ
OUI 51 00:50:56:B3:YY:ZZ
OUI 52 00:50:56:B4:YY:ZZ
OUI 53 00:50:56:B5:YY:ZZ
OUI 54 00:50:56:B6:YY:ZZ
OUI 55 00:50:56:B7:YY:ZZ
OUI 56 00:50:56:B8:YY:ZZ
OUI 57 00:50:56:B9:YY:ZZ
OUI 58 00:50:56:BA:YY:ZZ
OUI 59 00:50:56:BB:YY:ZZ
OUI 60 00:50:56:BC:YY:ZZ
OUI 61 00:50:56:BD:YY:ZZ
OUI 62 00:50:56:BE:YY:ZZ
OUI 63 00:50:56:BF:YY:ZZ

潜在冲突风险

如果两个 vCenter Server 具有相同的实例 ID,可能会生成相同的 MAC 地址,导致:

  • 网络冲突(数据包丢失、通信异常)

  • IP 绑定错误(DHCP 服务器误分配 IP)


2. 修改 vCenter Server 实例 ID 以避免冲突

修改步骤(vSphere Client)

  1. 导航vCenter Server > 配置

  2. 选择 常规 > 编辑

  3. 调整 运行时设置

    • vCenter Server 唯一 ID 中输入 0-63 范围内的唯一值

    • 设置 vCenter Server 受管地址(IPv4/IPv6/FQDN)

    • 确保 vCenter Server 名称 正确

  4. 保存更改并重启 vCenter Server

重要提示

  • 修改实例 ID 不会影响已分配的 MAC 地址

  • 若需重新分配 MAC 地址,可将虚拟机网卡设置为手动 MAC,然后改回自动分配

www.zhangfangzhou.cn

MySQL主从数据库Last_Error: Error ‘Row size too large. The maximum row size for the used table type, not counting BLOBs, is 65535. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs’ on query

MySQL 版本: MySQL 5.7.31
MySQL 架构: 主从数据库架构
告警节点: 从数据库
运行场景: 某大型制造企业

             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1118
                   Last_Error: Error 'Row size too large. The maximum row size for the used table type, not counting BLOBs, is 65535. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs' on query. Default database: 'ekp'. Query: 'alter table ekp_ff4cf69105f28480891e add column fd_3ddc5be2f5270e varchar(4000)'

解决办法,修改MySQL配置参数,从新进行数据同步。

[mysqld]
port=3360
bind_address=0.0.0.0
server-id = 128
log_bin = mysql-bin
binlog_cache_size = 4M
binlog_format = mixed
expire_logs_days = 99
relay-log=mysqld-relay-bin
init-connect = 'SET NAMES utf8mb4'
character-set-server = utf8mb4
datadir=/home/mysql
socket=/home/mysql/mysql.sock
symbolic-links=0
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
open_files_limit = 5000
key_buffer_size = 256M
query_cache_size = 64M
thread_cache_size = 64
lower_case_table_names = 1
default_storage_engine = InnoDB
innodb_file_format = Barracuda   # 启用 Barracuda,支持大行
innodb_file_per_table = 1             # 独立表空间,便于管理
innodb_default_row_format = DYNAMIC   # 默认 DYNAMIC 行格式,避免行大小限制
innodb_buffer_pool_size = 2048M       # 增加缓冲池(根据内存调整)
innodb_log_file_size = 512M                # 增加日志文件大小,支持大表变更
innodb_log_buffer_size = 16M             # 日志缓冲区,提升写入性能
innodb_large_prefix = 1                        # MySQL 5.7 需启用,支持大索引前缀
performance_schema = 0
explicit_defaults_for_timestamp
skip-external-locking
skip-name-resolve
max_allowed_packet=2560M
wait_timeout=60000
[client]
port = 3360
socket=/home/mysql/mysql.sock

Ubuntu 24.04 LTS如何安装Nvidia显卡驱动、CUDA、NVIDIA Container Toolkit套件

Ubuntu 24.04 LTS如何安装Nvidia显卡驱动、CUDA、NVIDIA Container Toolkit套件

1、安装Nvidia显卡驱动
若有Nvidia显卡,Ubuntu系统会安装开源的nouveau驱动,用指令sudo lshw -C display确认,driver区域会显示"nouveau"。

#卸载自带的驱动
sudo apt update
sudo apt upgrade
sudo apt purge *nvidia*

使用ubuntu-drivers list指令列出目前Nvidia显示卡可用的驱动版本

# 让Ubuntu自动挑选推荐的驱动版本

sudo ubuntu-drivers install

# 或者手动指定版本,填入要安装的Nvidia驱动版本号。
sudo ubuntu-drivers install nvidia:570
安装后nouveau应会自动加入黑名单禁止加载。接着重新启动,用sudo lshw -C display确认是否安装成功,driver区域应会显示"nvidia"。

www.zhangfangzhou.cn

2、双GPU显卡笔记本电脑
像Intel+Nvidia这种的双GPU笔记本电脑,即使装了Nvidia驱动也可能继续用Intel的GPU渲染3D,导致3D性能低下。

此时可以使用prime-select指令,指定用Nvidia显示卡负责渲染。

sudo prime-select nvidia
重开机后再使用指令:vulkaninfo --summary查看主显示卡为何。

3、Ubuntu安装cuda,CUDA Toolkit Installer。

Installation Instructions:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8

用 nvcc --version 确认cuda的版本,如果显示Command nvcc not found,则编辑~/.bashrc

vim ~/.bashrc
export PATH=/usr/local/cuda-12.8/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

#更新变量
source ~/.bashrc

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

www.zhangfangzhou.cn

4、安装NVIDIA Container Toolkit,这是设计给Docker和Podman容器用的Nvidia工具,使容器可以使用CUDA计算。

即使宿主机没有安装CUDA,容器內照样可以使用CUDA计算,方便你在容器里面跑不同版本的CUDA,不会受到宿主机的CUDA版本影响。

必须先安装Nvidia专有驱动才可以安装NVIDIA Container Toolkit。

(1)在Ubuntu安装Docker
(2)加入NVIDIA Container Toolkit的套件库

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    
安装NVIDIA Container Toolkit
sudo apt update
sudo apt install nvidia-container-toolkit

向Docker注册Nvidia
sudo nvidia-ctk runtime configure --runtime=docker

重新启动Docker
sudo systemctl restart docker

执行Ubuntu容器,测试能否出现Nvidia显卡的信息
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

www.zhangfangzhou.cn
5、安装TensorRT,TensorRT是Nvidia推出的深度学习推理平台。

必须先安装CUDA才能安装TensorRT。 https://developer.nvidia.com/nvidia-tensorrt-download
安装TensorRT的deb档,加入套件库

# 指定系统版本
os="ubuntu2204"

# 指定TensorRT版本
tag="10.5.0.x-1+cuda12.6"

sudo dpkg -i nv-tensorrt-local-repo-${os}-${tag}_1.0-1_amd64.deb
sudo cp /var/nv-tensorrt-local-repo-${os}-${tag}/*-keyring.gpg /usr/share/keyrings/

sudo apt update
安装TensorRT
sudo apt install tensorrt

如何在 Ubuntu 24.04 中使用 Ollama LLM 本地安装 DeepSeek

如何在 Ubuntu 24.04 中使用 Ollama LLM 本地安装 DeepSeek
How to Install DeepSeek Locally with Ollama LLM in Ubuntu 24.04

1、安装 Python 和 Git
在安装任何东西之前,最好先更新系统以确保所有现有软件包都是最新的。

sudo apt update && sudo apt upgrade -y

Ubuntu 可能预装了 Python,但务必确保安装正确的版本(Python 3.8 或更高版本)。

sudo apt install python3
python3 --version
pip 是 Python 的软件包管理器,需要安装 DeepSeek 和 Ollama 的依赖项。

sudo apt install python3-pip
pip3 --version

sudo apt install git
git --version

2、为 DeepSeek 安装 Ollama

现在 Python 和 Git 已安装完毕,可以安装 Ollama 来管理 DeepSeek。

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

接下来,启动并启用 Ollama,使其在系统启动时自动启动。

sudo systemctl start ollama
sudo systemctl enable ollama
现在 Ollama 已安装完毕,我们可以继续安装 DeepSeek。

3、下载并运行 DeepSeek 模型
现在 Ollama 已安装完毕,您可以下载 DeepSeek 模型。

ollama run deepseek-r1:7b

下载完成后,您可以通过运行以下命令来验证模型是否可用:

ollama list

4、在 Web UI 中运行 DeepSeek
虽然 Ollama 允许您通过命令行与 DeepSeek 交互,但您可能更喜欢更用户友好的 Web 界面。为此,我们将使用 Ollama Web UI,这是一个用于与 Ollama 模型交互的简单 Web 界面。

首先,创建一个虚拟环境,将您的 Python 依赖项与系统范围的 Python 安装隔离开来。

sudo apt install python3-venv
python3 -m venv ~/open-webui-venv
source ~/open-webui-venv/bin/activate

PIP设置阿里云镜像源,以加速包的安装。
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
Writing to /root/.config/pip/pip.conf

现在您的虚拟环境已激活,您可以使用 pip 安装 Open WebUI。
pip install open-webui

安装后,使用以下命令启动服务器。
open-webui serve
打开您的 Web 浏览器并导航到 http://localhost:8080 – 您应该会看到 Ollama Web UI 界面。

www.zhangfangzhou.cn
5、在系统启动时启用 Open-WebUI
要使 Open-WebUI 在启动时启动,您可以创建一个 systemd 服务,在系统启动时自动启动 Open-WebUI 服务器。

vim nano /etc/systemd/system/open-webui.service

[Unit]
Description=Open WebUI Service
After=network.target

[Service]
User=your_username
WorkingDirectory=/home/your_username/open-webui-venv
ExecStart=/home/your_username/open-webui-venv/bin/open-webui serve
Restart=always
Environment="PATH=/home/your_username/open-webui-venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

[Install]
WantedBy=multi-user.target
Replace your_username with your actual username.

现在重新加载 systemd 守护程序以识别新服务:

sudo systemctl daemon-reload
最后,启用并启动服务以在启动时启动:

sudo systemctl enable open-webui.service
sudo systemctl start open-webui.service

检查服务的状态以确保其正常运行:
sudo systemctl status open-webui.service

6、更新 open-webui 包,可以使用以下命令:

pip install --upgrade open-webui

7、open-webui无法链接ollama 报错ERROR:apps.ollama.main:Connection error: Cannot connect
修改启动配置

默认ollama绑定在127.0.0.1的11434端口,修改/etc/systemd/system/ollama.service,在[Service]下添加如下内容,使ollama绑定到0.0.0.0的11434端口
Environment="OLLAMA_HOST=0.0.0.0"

sudo systemctl daemon-reload
sudo systemctl restart ollama

www.zhangfangzhou.cn

8、部署完成后,发下每次登录都会先出现白屏,需要取消api.openai.com
INFO [open_webui.routers.openai] get_all_models()
ERROR [open_webui.routers.openai] Connection error: Cannot connect to host api.openai.com:443 ssl:default [None]
INFO [open_webui.routers.ollama] get_all_models()
登录 http://10.53.122.243:8080/admin/settings
取消 管理OpenAI API连接

9、错误提示
/home/$user/AnythingLLMDesktop/start
su@su:~$ /home/su/AnythingLLMDesktop/start
[10216:0227/083834.604266:FATAL:setuid_sandbox_host.cc(158)] The SUID sandbox helper binary was found, but is not configured correctly. Rather than run without sandboxing I'm aborting now. You need to make sure that /home/su/AnythingLLMDesktop/anythingllm-desktop/chrome-sandbox is owned by root and has mode 4755.
Trace/breakpoint trap (core dumped) www.zhangfangzhou.cn

解决办法

sudo chmod 4755 /home/su/AnythingLLMDesktop/anythingllm-desktop/chrome-sandbox

www.zhangfangzhou.cn

在Proxmox VE 8下开启vGPU – Tesla P4为例

1、对于 Proxmox VE 8,需要使用16.0+版本的 vGPU 驱动程序,低版本的驱动程序不支持 Linux 6.x 内核。

vGPU
vGPU 技术通过将硬件 GPU 分割成多个虚拟 GPU 以支持多个虚拟机。每个虚拟 GPU 可以被分配给不同的虚拟机,从而使多个虚拟机拥有“独显”。

2、关于 PVE 使用 GPU 有三种方法:
显卡直通
普通显卡 + vGPU-unlock
数据中心显卡 vGPU

3、NVIDIA® 虚拟 GPU 软件支持的 GPU(NVIDIA® Virtual GPU Software Supported GPUs)

4、物理宿主机
主板 BIOS 需要开启 IOMMU / VT-d、Above 4G Decoding、SR-IOV。

IOMMU 是一种地址映射技术,而 VT-d 是 Intel 对该技术的别称;Above 4G Decoding 关系到 PCI-E 设备 RAM 的 64 位寻址能力,通常用于需要让 CPU 访问全部显存的场景,使用 vGPU 时推荐开启;SR-IOV 允许一个 PCI-E 设备被多个虚拟机使用,常用于网卡等设备共享。

PVE 物理宿主机配置

# 把下面几行追加到 /etc/modules 里,用于加载所需的内核模块
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
 
# 屏蔽开源驱动
echo "blacklist nouveau" >> /etc/modprobe.d/pve-blacklist.conf
 
# 开启IOMMU
#编辑grub,请不要盲目改。根据自己的环境,选择设置
vi /etc/default/grub
#在里面找到:
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
#然后修改为:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
#如果是amd cpu请改为:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
#如果是需要显卡直通,建议在cmdline再加一句video=vesafb:off video=efifb:off video=simplefb:off,加了之后,pve重启进内核后停留在一个画面,这是正常情况
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt video=vesafb:off video=efifb:off video=simplefb:off"

修改完成之后,直接更新grub 
update-grub

注意,如果此方法还不能开启iommu,请修改 
 /etc/kernel/cmdline文件
并且使用proxmox-boot-tool refresh 更新启动项
 
# 安装依赖包
apt install build-essential dkms mdevctl pve-headers-$(uname -r)
 
# 重启机器
reboot

重启之后,在终端输入
dmesg | grep iommu
出现如下例子。则代表成功

[ 1.341100] pci 0000:00:00.0: Adding to iommu group 0
[ 1.341116] pci 0000:00:01.0: Adding to iommu group 1
[ 1.341126] pci 0000:00:02.0: Adding to iommu group 2
[ 1.341137] pci 0000:00:14.0: Adding to iommu group 3
[ 1.341146] pci 0000:00:17.0: Adding to iommu group 4

5、安装 NVIDIA Host 驱动
NVIDIA 官网的驱动是非公开的,需要注册 NVIDIA 商业账户才可访问;
vGPU16.7 KVM 驱动 NVIDIA-GRID-Linux-KVM-535.183.04-53...
链接:https://pan.baidu.com/s/181yNg67-nr1u5kF6unjrIQ?pwd=w3lh
提取码:w3lh

# 本次安装环境:PVE版本7.4-3,Linux内核版本5.15.107-2,驱动版本15.1
chmod +x NVIDIA-Linux-x86_64-430.67-vgpu-kvm.run
./NVIDIA-Linux-x86_64-430.67-vgpu-kvm.run
 
# 安装过程完成后可以使用以下命令验证
$ dkms status
nvidia, 525.85.07, 5.15.107-2-pve, x86_64: installed
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.07    Driver Version: 525.85.07    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 00000000:81:00.0 Off |                    0 |
| N/A   53C    P8    12W /  75W |   1899MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
 

6、划分 PCI-E 设备给虚拟机
通过 mdevctl types | grep '^[^ ]*$' 查看所有生成了 mdev 的 PCI 设备 ID,然后通过 PVE 网页控制台给虚拟机分配 PCI 设备,设备选择刚才查到的设备 ID。该处的 MDev 字段,后面半段的数字为显存容量,字母为 vGPU 类型,这里选 Q 全能型即可,显存按需分配。

7、Debian 系统安装vGPU。

# 禁用 nouveau 并安装必须软件包
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
apt install build-essential gcc-multilib dkms mdevctl
update-initramfs -k all -u
reboot
 
# 重启完成后安装 Guest 驱动
chmod +x NVIDIA-Linux-x86_64-430.63-grid.run
./NVIDIA-Linux-x86_64-430.63-grid.run
 
# 验证安装是否成功
dkms status
nvidia-smi
nvidia-smi -q | grep License # 查看授权,Unlicensed (Unrestricted) 表示未授权

8、安装授权
Windows
使用管理员身份启动 PowerShell 执行如下命令
# <ls-hostname-or-ip>为你的授权服务IP端口
curl.exe --insecure -L -X GET https://<dls-hostname-or-ip>/-/client-token -o "C:\Program Files\NVIDIA Corporation\vGPU Licensing\ClientConfigToken\client_configuration_token_$($(Get-Date).tostring('dd-MM-yy-hh-mm-ss')).tok"
 
Restart-Service NVDisplay.ContainerLocalSystem

Linux
root 执行如下命令
# <ls-hostname-or-ip>为你的授权服务IP端口
curl --insecure -L -X GET https://<dls-hostname-or-ip>/-/client-token -o /etc/nvidia/ClientConfigToken/client_configuration_token_$(date '+%d-%m-%Y-%H-%M-%S').tok
 
service nvidia-gridd restart

9、查看授权
Windows
使用管理员身份启动 PowerShell 执行如下命令
nvidia-smi.exe -q | Select-String License # 查看授权状态

Linux
root 执行如下命令
nvidia-smi -q | grep License # 查看授权状态

在VMware ESXi上对三星PCIe 4.0 NVMe SSD 980 PRO硬盘进行直通DirectPath和虚拟化Virtualization的读写测试

在VMware ESXi上对三星PCIe 4.0 NVMe SSD 980 PRO硬盘进行直通DirectPath和虚拟化Virtualization的读写测试
存储虚拟化后,读性能损失约21.5%,写性能损失约18.4%

Read and Write Tests of Samsung PCIe 4.0 NVMe SSD 980 PRO Hard Drive with DirectPath and Virtualization on VMware ESXi

物理服务器 NF5280M5
硬盘 PCIEx4转NMVe
虚拟化软件 ESXi 7.03
虚拟机配置 8核心 16G内存 120G系统盘
操作系统 CentOS7.9
数据盘 三星 PCIe 4.0 NVMe SSD 980 PRO
测试软件 fio

1、软件安装

yum -y install epel-release
yum -y install fio

2、挂载数据目录

sudo vgcreate lvmgroup /dev/vdb
sudo lvcreate -l 100%FREE -n data lvmgroup
sudo mkfs.xfs /dev/lvmgroup/data
sudo mkdir /data
sudo echo "/dev/lvmgroup/data /data xfs defaults 0 0" >>/etc/fstab
sudo mount -av

3、编写 fio 测试配置文件:创建一个用于 fio 测试的配置文件。您可以使用文本编辑器创建一个新文件,例如 fio_test_config.fio

[global]
ioengine=libaio
direct=1
thread=1
rw=randrw
bs=4k
numjobs=4
size=32G
runtime=300
group_reporting

[randwrite]

[global]:这部分指定了全局配置,适用于所有作业。
ioengine=libaio:指定I/O引擎为libaio,表示使用异步I/O。
direct=1:启用直接I/O,绕过操作系统的文件系统缓存。
thread=1:每个作业使用一个线程。
rw=randrw:读写模式为随机读写。
bs=4k:块大小为4KB。
numjobs=4:作业数量为4,表示并发的作业数。
size=32G:每个作业的测试文件大小为32GB。
runtime=300:测试运行时间为300秒。
group_reporting:指定所有线程的统计信息应汇总在一起报告。

直通DirectPath

[root@10-53-220-58 ~]# cd /data/
[root@10-53-220-58 data]# ls
[root@10-53-220-58 data]# sudo dd if=/dev/zero of=/data/fio_test_file bs=1G count=32
32+0 records in
32+0 records out
34359738368 bytes (34 GB) copied, 36.9548 s, 930 MB/s
[root@10-53-220-58 data]# vi fio_test_config.fio
[root@10-53-220-58 data]# sudo fio fio_test_config.fio
randwrite: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 4 threads
randwrite: Laying out IO file (1 file / 32768MiB)
randwrite: Laying out IO file (1 file / 32768MiB)
randwrite: Laying out IO file (1 file / 32768MiB)
randwrite: Laying out IO file (1 file / 32768MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=102MiB/s,w=103MiB/s][r=26.2k,w=26.3k IOPS][eta 00m:00s] 
randwrite: (groupid=0, jobs=4): err= 0: pid=9048: Fri Apr 12 13:55:32 2024
   read: IOPS=26.0k, BW=102MiB/s (107MB/s)(29.8GiB/300001msec)
    slat (nsec): min=3381, max=69136, avg=4802.53, stdev=1677.41
    clat (usec): min=21, max=19215, avg=113.61, stdev=218.66
     lat (usec): min=34, max=19220, avg=118.47, stdev=218.66
    clat percentiles (usec):
     |  1.00th=[   48],  5.00th=[   51], 10.00th=[   53], 20.00th=[   55],
     | 30.00th=[   57], 40.00th=[   59], 50.00th=[   65], 60.00th=[   74],
     | 70.00th=[   77], 80.00th=[   85], 90.00th=[   93], 95.00th=[  343],
     | 99.00th=[ 1270], 99.50th=[ 1369], 99.90th=[ 1516], 99.95th=[ 1565],
     | 99.99th=[ 1958]
   bw (  KiB/s): min=20680, max=36128, per=24.99%, avg=26004.68, stdev=2194.12, samples=2397
   iops        : min= 5170, max= 9032, avg=6501.10, stdev=548.54, samples=2397
  write: IOPS=25.0k, BW=102MiB/s (106MB/s)(29.7GiB/300001msec)
    slat (usec): min=4, max=135, avg= 6.16, stdev= 1.90
    clat (nsec): min=1292, max=1175.4k, avg=27396.71, stdev=3962.20
     lat (usec): min=19, max=1181, avg=33.62, stdev= 4.09
    clat percentiles (nsec):
     |  1.00th=[18304],  5.00th=[21888], 10.00th=[23680], 20.00th=[24960],
     | 30.00th=[26496], 40.00th=[26752], 50.00th=[27264], 60.00th=[27520],
     | 70.00th=[28288], 80.00th=[29312], 90.00th=[31616], 95.00th=[34048],
     | 99.00th=[39168], 99.50th=[41216], 99.90th=[43776], 99.95th=[45312],
     | 99.99th=[50944]
   bw (  KiB/s): min=20280, max=36008, per=24.99%, avg=25980.38, stdev=2247.11, samples=2397
   iops        : min= 5070, max= 9002, avg=6495.02, stdev=561.78, samples=2397
  lat (usec)   : 2=0.04%, 4=0.01%, 10=0.01%, 20=0.90%, 50=50.88%
  lat (usec)   : 100=44.43%, 250=1.06%, 500=0.46%, 750=0.46%, 1000=0.53%
  lat (msec)   : 2=1.22%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=4.49%, sys=11.07%, ctx=15587579, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, &amp;amp;gt;=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &amp;amp;gt;=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &amp;amp;gt;=64=0.0%
     issued rwts: total=7804560,7797520,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=102MiB/s (107MB/s), 102MiB/s-102MiB/s (107MB/s-107MB/s), io=29.8GiB (31.0GB), run=300001-300001msec
  WRITE: bw=102MiB/s (106MB/s), 102MiB/s-102MiB/s (106MB/s-106MB/s), io=29.7GiB (31.9GB), run=300001-300001msec

Disk stats (read/write):
    dm-2: ios=7801557/7794611, merge=0/0, ticks=862107/186550, in_queue=1048584, util=100.00%, aggrios=7804560/7797562, aggrmerge=0/1, aggrticks=857542/181607, aggrin_queue=1039149, aggrutil=100.00%
  nvme0n1: ios=7804560/7797562, merge=0/1, ticks=857542/181607, in_queue=1039149, util=100.00%
www.zhangfangzhou.cn

虚拟化Virtualization

[root@10-53-220-44 data]# sudo dd if=/dev/zero of=/data/fio_test_file bs=1G count=32
32+0 records in
32+0 records out
34359738368 bytes (34 GB) copied, 71.8294 s, 478 MB/s
[root@10-53-220-44 data]# vi fio_test_config.fio
[root@10-53-220-44 data]# sudo fio fio_test_config.fio
randwrite: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 4 threads
randwrite: Laying out IO file (1 file / 32768MiB)
randwrite: Laying out IO file (1 file / 32768MiB)
randwrite: Laying out IO file (1 file / 32768MiB)
randwrite: Laying out IO file (1 file / 32768MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=78.2MiB/s,w=77.8MiB/s][r=20.0k,w=19.9k IOPS][eta 00m:00s]
randwrite: (groupid=0, jobs=4): err= 0: pid=9016: Fri Apr 12 14:43:21 2024
   read: IOPS=20.4k, BW=79.7MiB/s (83.6MB/s)(23.4GiB/300001msec)
    slat (usec): min=4, max=114, avg= 6.78, stdev= 1.98
    clat (usec): min=42, max=38534, avg=134.39, stdev=190.16
     lat (usec): min=53, max=38545, avg=141.24, stdev=190.14
    clat percentiles (usec):
     |  1.00th=[   79],  5.00th=[   87], 10.00th=[   89], 20.00th=[   92],
     | 30.00th=[   93], 40.00th=[   96], 50.00th=[   98], 60.00th=[  101],
     | 70.00th=[  104], 80.00th=[  106], 90.00th=[  113], 95.00th=[  161],
     | 99.00th=[ 1205], 99.50th=[ 1319], 99.90th=[ 1467], 99.95th=[ 1516],
     | 99.99th=[ 2057]
   bw (  KiB/s): min=   73, max=24856, per=16.35%, avg=13347.32, stdev=9723.44, samples=2397
   iops        : min=   18, max= 6214, avg=3336.67, stdev=2431.02, samples=2397
  write: IOPS=20.4k, BW=79.6MiB/s (83.5MB/s)(23.3GiB/300001msec)
    slat (usec): min=5, max=115, avg= 8.06, stdev= 1.98
    clat (nsec): min=1403, max=38544k, avg=44914.21, stdev=37185.93
     lat (usec): min=32, max=38560, avg=53.04, stdev=37.23
    clat percentiles (usec):
     |  1.00th=[   36],  5.00th=[   39], 10.00th=[   40], 20.00th=[   42],
     | 30.00th=[   43], 40.00th=[   44], 50.00th=[   45], 60.00th=[   46],
     | 70.00th=[   47], 80.00th=[   48], 90.00th=[   50], 95.00th=[   52],
     | 99.00th=[   57], 99.50th=[   59], 99.90th=[   69], 99.95th=[   87],
     | 99.99th=[  135]
   bw (  KiB/s): min=   71, max=25352, per=16.36%, avg=13341.18, stdev=9726.48, samples=2397
   iops        : min=   17, max= 6338, avg=3335.13, stdev=2431.78, samples=2397
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=45.37%, 100=33.22%
  lat (usec)   : 250=19.16%, 500=0.43%, 750=0.43%, 1000=0.46%
  lat (msec)   : 2=0.91%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=3.53%, sys=11.43%, ctx=12237590, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, &amp;amp;gt;=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &amp;amp;gt;=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &amp;amp;gt;=64=0.0%
     issued rwts: total=6121601,6116013,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=79.7MiB/s (83.6MB/s), 79.7MiB/s-79.7MiB/s (83.6MB/s-83.6MB/s), io=23.4GiB (25.1GB), run=300001-300001msec
  WRITE: bw=79.6MiB/s (83.5MB/s), 79.6MiB/s-79.6MiB/s (83.5MB/s-83.5MB/s), io=23.3GiB (25.1GB), run=300001-300001msec

Disk stats (read/write):
    dm-2: ios=6120646/6115090, merge=0/0, ticks=806337/255779, in_queue=1062184, util=100.00%, aggrios=6121601/6116035, aggrmerge=0/1, aggrticks=801719/251319, aggrin_queue=1051124, aggrutil=100.00%
  sdb: ios=6121601/6116035, merge=0/1, ticks=801719/251319, in_queue=1051124, util=100.00%
www.zhangfangzhou.cn

4、结果
直通 (DirectPath)
read: IOPS=26.0k, BW=102MiB/s (107MB/s)(29.8GiB/300001msec)
write: IOPS=25.0k, BW=102MiB/s (106MB/s)(29.7GiB/300001msec)

虚拟化Virtualization
read: IOPS=20.4k, BW=79.7MiB/s (83.6MB/s)(23.4GiB/300001msec)
write: IOPS=20.4k, BW=79.6MiB/s (83.5MB/s)(23.3GiB/300001msec)

 

读取 (Read) 写入 (Write) 带宽 (Bandwidth)
直通 (DirectPath) 26.0k 25.0k 102MiB/s (107MB/s)
虚拟化 (Virtualization) 20.4k 20.4k 79.7MiB/s (83.6MB/s)

存储虚拟化后,读性能损失约21.5%,写性能损失约18.4%