
Kubernetes Device Plugin: Principles and Source Code Analysis

Source: original, 2025/9/12 8:24:34

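I. Background and Core Concepts

1.1 The Evolution of Device Management in Kubernetes

1.1.1 Limitations of Extended Resources

Before the Device Plugin framework existed, non-standard hardware resources such as GPUs and FPGAs were mainly exposed through the Extended Resource mechanism. Used on its own, that mechanism has the following problems:

Manual, static declaration: an administrator has to patch the Node's status.capacity field through the API server, for example: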

kubectl proxy & 
curl -X PATCH http://localhost:8001/api/v1/nodes/<node-name>/status -H "Content-Type: application/json-patch+json" \
-d '[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "4"}]'

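This approach cannot track resource state dynamically (a failed or newly added device requires another manual patch) and is error-prone.

No device discovery or lifecycle management

The cluster cannot tell whether a device really exists (for example, whether it has been physically removed);

There is no health check (for example, a card whose GPU driver has crashed is not isolated automatically);

Allocation is coarse-grained (devices cannot be selected by attribute, such as GPU model or FPGA IP core type).

Scheduling is decoupled from device allocation

The scheduler only counts resources (for example nvidia.com/gpu: 1). Actually using the device (mounting device nodes, loading drivers) depends on DaemonSet or InitContainer scripts, which introduces a risk of race conditions.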

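1.1.2 Why Device Plugin Was Introduced

To address the full lifecycle management of heterogeneous devices, the Kubernetes community introduced the Device Plugin framework (as alpha) in v1.8:

Core goal: standardize the device management interface and close the loop of dynamic registration, discovery, allocation and monitoring.
Key design ideas:
- Pluggable architecture: hardware vendors only implement the Device Plugin interface and do not modify kubelet code.
- Long-lived gRPC stream: the ListAndWatch method keeps the device list and health state in sync in real time.
- Allocation before container start: once a Pod has been bound to the node, the kubelet calls Allocate before creating the container, so device initialization (such as clearing GPU memory or programming an FPGA) is finished before the container starts.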

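1.1.3 Architecture Comparison

Capability	Extended Resource	Device Plugin
Resource discovery	Static configuration	Dynamic registration (reported by the plugin)
Health monitoring	None	Real-time status push (e.g. abnormal GPU temperature)
Device initialization	Depends on external scripts	Handled uniformly through the Allocate method
Topology awareness	Not supported	NUMA affinity through the topology field and the kubelet Topology Manager
Vendor extension cost	Vendor-specific out-of-band scripts or controllers	Only the standard gRPC interface has to be implemented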

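1.1.4 Device Management Scenarios

More and more hardware resources need to be scheduled and managed by Kubernetes, for example:

GPU: deep learning training
- Core requirements:
  - Device isolation: when several Pods share one physical GPU, memory and compute units must be isolated (e.g. NVIDIA MIG)
  - Driver compatibility: the user-space driver libraries inside the container must match the host driver (e.g. injected by the NVIDIA container toolkit, as with Docker's --gpus flag)
  - Oversubscription: vGPU techniques (such as vCUDA) let several containers share one card
- What the Device Plugin does:
  - Mounts devices and driver libraries automatically: libraries such as libcuda.so can be injected through the mounts returned by Allocate
  - Health checking: watches for GPU XID errors (for example through NVML or nvidia-smi) and marks faulty devices automatically
FPGA: dynamic reconfiguration
- Core requirements:
  - Dynamic bitstream programming: different Pods may need different FPGA logic (e.g. switching between encryption and decryption)
  - Device locking: the FPGA must be accessed exclusively while it is being programmed, so that concurrent operations cannot damage the hardware
- Custom Device Plugin logic:
  - Pre-programming check: during Allocate, call a vendor tool (e.g. Xilinx Vivado) to verify bitstream compatibility
  - Device state machine: track the FPGA as InProgramming/Ready inside the plugin and report it through ListAndWatch (which itself only carries Healthy/Unhealthy)
Other heterogeneous computing scenarios
- AI accelerators (e.g. TPU, NPU):
  - Inject a dedicated SDK into the container (e.g. the TensorFlow TPU driver)
  - Monitor accelerator temperature and power to support elastic scheduling
- SmartNICs:
  - Assign SR-IOV VFs (Virtual Functions) to Pods to accelerate packet processing
  - Manage the binding between VFs and the PF (Physical Function) through the Device Plugin
- High-performance storage devices (e.g. Optane PMem):
  - Allocate memory-mode storage per NUMA node to avoid cross-node access latency

Industry examples:

NVIDIA GPU Operator: deploys the GPU driver, the container runtime toolkit, the device plugin and monitoring components automatically, providing "one-click" GPU cluster management
AWS Inferentia: uses a Device Plugin to assign NeuronCore resources to machine learning inference containers and to inject the Neuron runtime libraries
Intel FPGA Plugin: manages FPGA regions so that multiple tenants can share one physical FPGA card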

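The rest of this article is based on kubernetes@v1.22.7.

1.2 Device Plugin Core Architecture

1.2.1 Component Interaction Overview

[Figure: component interaction overview, using a GPU device plugin as the example]

As the figure shows (with a GPU device plugin as the example), "device plugin" logically consists of two parts:

The device plugin logic inside the kubelet: manages all registered device plugins (a machine may carry several kinds of devices, so several device plugin services may be deployed) and requests resources from the device plugin service when Pod containers are created.
The device plugin service on the node (called "the plugin" below): registers with the kubelet, manages the devices on the node, and reports the device list and health state to the kubelet.

1.2.2 Device Plugin gRPC Proto Definitions and Core Methods

As described above, the kubelet and the plugin each run a gRPC server. Here are the proto definitions of both servers.

Proto location: k8s.io/kubelet/pkg/apis/deviceplugin, in two versions: v1alpha and v1beta1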

v1alpha

// kubelet gRPC server
// Registration is the service advertised by the Kubelet
// Only when Kubelet answers with a success code to a Register Request
// may Device Plugins start their service
// Registration may fail when device plugin version is not supported by
// Kubelet or the registered resourceName is already taken by another
// active device plugin. Device plugin is expected to terminate upon registration failure
service Registration {
	rpc Register(RegisterRequest) returns (Empty) {}
}

// Plugin gRPC server
// DevicePlugin is the service advertised by Device Plugins
service DevicePlugin {
	// ListAndWatch returns a stream of List of Devices
	// Whenever a Device state changes or a Device disappears, ListAndWatch
	// returns the new list
	rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

	// Allocate is called during container creation so that the Device
	// Plugin can run device specific operations and instruct Kubelet
	// of the steps to make the Device available in the container
	rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}

v1beta1

// kubelet gRPC server
// Registration is the service advertised by the Kubelet
// Only when Kubelet answers with a success code to a Register Request
// may Device Plugins start their service
// Registration may fail when device plugin version is not supported by
// Kubelet or the registered resourceName is already taken by another
// active device plugin. Device plugin is expected to terminate upon registration failure
service Registration {
	rpc Register(RegisterRequest) returns (Empty) {}
}

// Plugin gRPC server
// DevicePlugin is the service advertised by Device Plugins
service DevicePlugin {
	// GetDevicePluginOptions returns options to be communicated with Device
	// Manager
	rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

	// ListAndWatch returns a stream of List of Devices
	// Whenever a Device state change or a Device disappears, ListAndWatch
	// returns the new list
	rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

	// GetPreferredAllocation returns a preferred set of devices to allocate
	// from a list of available ones. The resulting preferred allocation is not
	// guaranteed to be the allocation ultimately performed by the
	// devicemanager. It is only designed to help the devicemanager make a more
	// informed allocation decision when possible.
	rpc GetPreferredAllocation(PreferredAllocationRequest) returns (PreferredAllocationResponse) {}

	// Allocate is called during container creation so that the Device
	// Plugin can run device specific operations and instruct Kubelet
	// of the steps to make the Device available in the container
	rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

	// PreStartContainer is called, if indicated by Device Plugin during registeration phase,
	// before each container start. Device plugin can run device specific operations
	// such as resetting the device before making devices available to the container
	rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}

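Comparing the two versions, the v1beta1 plugin service adds three methods to v1alpha: GetDevicePluginOptions, GetPreferredAllocation and PreStartContainer. The two methods that both versions share are the core of the device plugin logic:

ListAndWatch:

1) Role: device discovery and state monitoring.

Initial device list: after the plugin has registered, the kubelet calls this method to obtain details of all devices on the node (device ID, health state);

Real-time updates: the plugin keeps pushing device state changes (failure, recovery, newly added devices) to the kubelet over the gRPC stream.

2) Key behavior

Device information sync: the plugin returns a device list in which every device carries the fields below. The kubelet caches this information and updates the node's Capacity and Allocatable:

message Device {
	string ID = 1;              // unique device ID (e.g. GPU UUID)
	string health = 2;          // health state (`Healthy` or `Unhealthy`)
	TopologyInfo topology = 3;  // topology information (NUMA nodes of the device)
}

Long-lived connection: ListAndWatch establishes a persistent gRPC stream on which the plugin can send updates at any time. If the stream breaks, the kubelet does not reconnect by itself: it stops the endpoint and marks the resource's devices unhealthy (see runEndpoint in section 3.2), and the plugin has to register again and then resend the full device list.

Health management: the plugin is responsible for monitoring device health (for example GPU temperature or error state read through the driver) and pushes changes over the stream. Once a device is marked Unhealthy, the kubelet no longer assigns it to new Pods.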

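Allocate:

1) Role: resource allocation and device initialization

After a Pod has been scheduled to the node, the kubelet calls this method and asks the plugin to prepare specific devices for a container. The plugin returns the configuration the container needs to access them (mounts, environment variables, device nodes).

2) Key behavior

Allocation logic: the kubelet passes the list of device IDs to allocate. The IDs are chosen by the kubelet's Device Manager from its healthy, unallocated devices; the scheduler only counts resources and never picks device IDs. The plugin must make sure these devices are usable and run any required initialization (such as programming an FPGA bitstream):

message AllocateRequest {
	repeated ContainerAllocateRequest container_requests = 1;
}

message ContainerAllocateRequest {
	repeated string devicesIDs = 1; // device IDs to allocate (e.g. ["GPU-1234"])
}

Returned container configuration: the plugin returns what the container runtime needs, and the kubelet injects it into the container so that the device is accessible:

message AllocateResponse {
	repeated ContainerAllocateResponse container_responses = 1;
}

message ContainerAllocateResponse {
	map<string, string> envs = 1;         // environment variables (e.g. NVIDIA_VISIBLE_DEVICES)
	repeated Mount mounts = 2;            // host paths to mount (e.g. driver libraries)
	repeated DeviceSpec devices = 3;      // device nodes and permissions (e.g. /dev/nvidia0, "rw")
	map<string, string> annotations = 4;  // annotations passed to the container runtime
}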
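To make the request/response shape concrete, here is a minimal sketch of what a plugin-side Allocate handler might compute. It is not taken from any real plugin: the structs are local stand-ins for the generated v1beta1 types (a real plugin would import k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1), and the hostPaths table and the allocate function are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// DeviceSpec and ContainerAllocateResponse are simplified local stand-ins
// for the generated v1beta1 messages (assumption: only the fields used here).
type DeviceSpec struct {
	ContainerPath string
	HostPath      string
	Permissions   string
}

type ContainerAllocateResponse struct {
	Envs    map[string]string
	Devices []DeviceSpec
}

// hostPaths is a hypothetical mapping from device ID to host device node.
var hostPaths = map[string]string{
	"GPU-1234": "/dev/nvidia0",
	"GPU-5678": "/dev/nvidia1",
}

// allocate builds the per-container response for the device IDs chosen by
// the kubelet. Unknown IDs are rejected so the kubelet can fail admission.
func allocate(deviceIDs []string) (ContainerAllocateResponse, error) {
	resp := ContainerAllocateResponse{Envs: map[string]string{}}
	for _, id := range deviceIDs {
		path, ok := hostPaths[id]
		if !ok {
			return ContainerAllocateResponse{}, fmt.Errorf("unknown device %q", id)
		}
		resp.Devices = append(resp.Devices, DeviceSpec{
			ContainerPath: path,
			HostPath:      path,
			Permissions:   "rw",
		})
	}
	resp.Envs["NVIDIA_VISIBLE_DEVICES"] = strings.Join(deviceIDs, ",")
	return resp, nil
}

func main() {
	resp, err := allocate([]string{"GPU-1234"})
	fmt.Println(resp, err)
}
```

Returning an error for an unknown ID matters because the kubelet treats a failed Allocate call as an admission failure for the whole Pod.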

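3) Ordering guarantee

Allocate completes before the container is created, so the device is ready when the container starts (which avoids race conditions). If allocation fails (for example because the plugin reports the device as unusable), the kubelet rejects the Pod at admission and the Pod ends up Failed on this node; it is not a scheduling failure.

Method	When it is triggered	Core function	Data flow
ListAndWatch	Runs continuously after plugin registration	Reports the device list and keeps state in sync	plugin → kubelet (streaming push)
Allocate	After the Pod is bound to the node, before the container starts	Prepares specific devices and returns container access configuration	kubelet → plugin (request/response)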

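1.2.3 Device Registration and Resource Reporting

Device registration

1) Plugin startup and self-check

A plugin starts in the following steps:

Resource discovery: after starting, the plugin (e.g. nvidia-device-plugin) scans the physical devices on the host (for example, GPU information obtained through nvidia-smi);

Create the Unix socket: create a .sock file under /var/lib/kubelet/device-plugins/ (e.g. nvidia-gpu.sock) as the endpoint for communication with the kubelet;

Start the gRPC service: implement the Device Plugin gRPC interface (v1alpha or v1beta1), including ListAndWatch and Allocate;

2) Registering with the kubelet

Call the kubelet's Register method: through /var/lib/kubelet/device-plugins/kubelet.sock, the plugin calls the Registration gRPC service provided by the kubelet and sends a registration request with the following fields:

message RegisterRequest {
	string version = 1;               // device plugin API version (e.g. "v1beta1")
	string endpoint = 2;              // Unix socket file name (e.g. "nvidia-gpu.sock")
	string resource_name = 3;         // resource name (e.g. "nvidia.com/gpu")
	DevicePluginOptions options = 4;  // optional settings (e.g. whether PreStartContainer is required)
}

How the kubelet handles the request: in v1.22 the Register handler of the Device Manager checks that the API version is supported and that the resource name is a valid extended resource name (of the form [vendor-domain]/[resource-type], e.g. amd.com/gpu). It then dials the plugin's socket, stores the endpoint in the manager's in-memory endpoints map, and opens the long-lived ListAndWatch stream (see section 3.2).

Common registration failures: resource name conflict (the same resource name is already registered by another plugin); socket permission error (the kubelet cannot access the plugin's Unix socket); incompatible API version (the plugin's version is not among the Device Plugin API versions the kubelet supports).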

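Resource reporting flow

Device Plugin → kubelet (ListAndWatch) → kubelet cache → API Server → etcd

1) Reporting the device list (ListAndWatch)

State sync over the long-lived stream: for the initial report, the kubelet calls the plugin's ListAndWatch method and the plugin returns the current device list with metadata:

message ListAndWatchResponse {
	repeated Device devices = 1;  // device list
}

message Device {
	string ID = 1;              // unique device ID (e.g. GPU UUID)
	string health = 2;          // health state ("Healthy"/"Unhealthy")
	TopologyInfo topology = 3;  // topology information (NUMA nodes)
}

The plugin then keeps the ListAndWatch gRPC stream open. When a device's state changes (e.g. a GPU overheats), it immediately pushes a new message. Each message carries the complete device list, not a delta. On receipt, the kubelet caches the device information, and the node's Capacity and Allocatable are updated on the next node status sync.

2) Syncing resource information to the API server

Node status update: the kubelet's syncNodeStatus loop runs periodically (nodeStatusUpdateFrequency, 10s by default) and sends the node's total resources, including those reported by device plugins, to the API server via PATCH /api/v1/nodes/<node-name>/status. The scheduler (kube-scheduler) watches node resources and uses them when placing Pods that request extended resources such as nvidia.com/gpu: 2.

3) Key data structures and configuration

Device topology information: used for topology-aware allocation (e.g. keeping the container's CPUs and the device on the same NUMA node)

message TopologyInfo {
	repeated NUMANode nodes = 1;  // NUMA nodes the device belongs to
}

Reporting parameters: the kubelet has no device-plugin-specific reporting flags. How quickly device changes reach the API server is governed by the node status sync interval:

--node-status-update-frequency=10s  # how often the kubelet computes node status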

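Error handling and debugging

1) Common failure scenarios

Plugin crash: when the kubelet detects that the stream has closed, it marks the resource's devices Unhealthy; after the plugin restarts, the plugin registers again.

Delayed reporting: check whether the plugin's ListAndWatch is blocked (for example, the plugin is stuck scanning devices and never sends an update).

Resource not visible: confirm that the kubelet has updated the node status: kubectl get node <node-name> -o json | jq '.status.allocatable'

2) Debugging tips

Check the kubelet log:

journalctl -u kubelet | grep -i "device plugin"
# Key log messages (v1.22):
# "Starting Device Plugin manager": the device manager is starting
# "Got registration request from device plugin with resource": a plugin called Register
# "listAndWatch ended unexpectedly for device plugin": the ListAndWatch stream broke

Check the registered devices:

kubectl get node <node-name> -o jsonpath='{.status.allocatable}' | jq .
# Example output: { "nvidia.com/gpu": "4" }

Plugin-side debugging: enable verbose plugin logging (e.g. the --debug mode of nvidia-device-plugin); test the gRPC interface by hand with grpcurl (the v1beta1 proto package is v1beta1; list only works if the plugin enables gRPC server reflection, otherwise pass the proto file with -proto):

grpcurl -plaintext -unix /var/lib/kubelet/device-plugins/nvidia-gpu.sock list
grpcurl -plaintext -unix ... v1beta1.DevicePlugin/ListAndWatch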

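II. Preparing for the Source Code Analysis

2.1 Target version

The code below is based on kubernetes@v1.22.7 + proto v1beta1

2.2 Key code paths

kubelet device management entry point: kubernetes/pkg/kubelet/cm/devicemanager/manager.go
Device Plugin interface definition: k8s.io/kubelet/pkg/apis/deviceplugin/
Device allocator implementation: kubernetes/pkg/kubelet/cm/devicemanager/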

III. Source Code Walkthrough of the kubelet Device Management Flow

3.1 Initialization of the kubelet device plugin logic

Initialization of the device manager instance:

// NewManagerImpl creates a new manager.
func NewManagerImpl(topology []cadvisorapi.Node, topologyAffinityStore topologymanager.Store) (*ManagerImpl, error) {
	socketPath := pluginapi.KubeletSocket // /var/lib/kubelet/device-plugins/kubelet.sock
	...
	return newManagerImpl(socketPath, topology, topologyAffinityStore)
}

func newManagerImpl(socketPath string, topology []cadvisorapi.Node, topologyAffinityStore topologymanager.Store) (*ManagerImpl, error) {
	...
	manager := &ManagerImpl{
		endpoints: make(map[string]endpointInfo),

		socketname:            file,
		socketdir:             dir,
		allDevices:            NewResourceDeviceInstances(),
		healthyDevices:        make(map[string]sets.String),
		unhealthyDevices:      make(map[string]sets.String),
		allocatedDevices:      make(map[string]sets.String),
		podDevices:            newPodDevices(),
		numaNodes:             numaNodes,
		topologyAffinityStore: topologyAffinityStore,
		devicesToReuse:        make(PodReusableDevices),
	}
	...
	return manager, nil
}

The device manager is started as shown below. One detail deserves special attention: whenever the kubelet starts (including restarts), it removes the device plugin socket files under /var/lib/kubelet/device-plugins/. A plugin therefore has to watch for kubelet restarts (usually by watching for the creation of /var/lib/kubelet/device-plugins/kubelet.sock) and, when it detects one, restart itself (either the whole service or its internal serving and registration logic).

// Start starts the Device Plugin Manager and start initialization of
// podDevices and allocatedDevices information from checkpointed state and
// starts device plugin registration service.
func (m *ManagerImpl) Start(activePods ActivePodsFunc, sourcesReady config.SourcesReady) error {
	...
	// Loads in allocatedDevices information from disk.
	err := m.readCheckpoint()
	if err != nil {
		klog.InfoS("Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date", "err", err)
	}
	...
	// Remove the device plugin socket files under /var/lib/kubelet/device-plugins/.
	// Removes all stale sockets in m.socketdir. Device plugins can monitor
	// this and use it as a signal to re-register with the new Kubelet.
	if err := m.removeContents(m.socketdir); err != nil {
		klog.ErrorS(err, "Fail to clean up stale content under socket dir", "path", m.socketdir)
	}

	s, err := net.Listen("unix", socketPath)
	if err != nil {
		klog.ErrorS(err, "Failed to listen to socket while starting device plugin registry")
		return err
	}

	// Start the gRPC server that serves plugin registration.
	m.wg.Add(1)
	m.server = grpc.NewServer([]grpc.ServerOption{}...)

	pluginapi.RegisterRegistrationServer(m.server, m)
	go func() {
		defer m.wg.Done()
		m.server.Serve(s)
	}()

	klog.V(2).InfoS("Serving device plugin registration server on socket", "path", socketPath)

	return nil
}

Summary of the kubelet-side startup:

kubelet removes the socket files in the plugin directory -> kubelet starts the Registration gRPC server

3.2 Device plugin registration

As mentioned earlier, to register, a plugin first creates its own socket file under /var/lib/kubelet/device-plugins/ and starts the corresponding gRPC server, then calls the Register method of the kubelet's Registration gRPC server. The kubelet's implementation of Register:

// Register registers a device plugin.
func (m *ManagerImpl) Register(ctx context.Context, r *pluginapi.RegisterRequest) (*pluginapi.Empty, error) {
	klog.InfoS("Got registration request from device plugin with resource", "resourceName", r.ResourceName)
	metrics.DevicePluginRegistrationCount.WithLabelValues(r.ResourceName).Inc()
	var versionCompatible bool
	for _, v := range pluginapi.SupportedVersions {
		if r.Version == v {
			versionCompatible = true
			break
		}
	}
	if !versionCompatible {
		err := fmt.Errorf(errUnsupportedVersion, r.Version, pluginapi.SupportedVersions)
		klog.InfoS("Bad registration request from device plugin with resource", "resourceName", r.ResourceName, "err", err)
		return &pluginapi.Empty{}, err
	}

	if !v1helper.IsExtendedResourceName(v1.ResourceName(r.ResourceName)) {
		err := fmt.Errorf(errInvalidResourceName, r.ResourceName)
		klog.InfoS("Bad registration request from device plugin", "err", err)
		return &pluginapi.Empty{}, err
	}

	// TODO: for now, always accepts newest device plugin. Later may consider to
	// add some policies here, e.g., verify whether an old device plugin with the
	// same resource name is still alive to determine whether we want to accept
	// the new registration.
	go m.addEndpoint(r)

	return &pluginapi.Empty{}, nil
}

func (m *ManagerImpl) addEndpoint(r *pluginapi.RegisterRequest) {
	new, err := newEndpointImpl(filepath.Join(m.socketdir, r.Endpoint), r.ResourceName, m.callback)
	if err != nil {
		klog.ErrorS(err, "Failed to dial device plugin with request", "request", r)
		return
	}
	m.registerEndpoint(r.ResourceName, r.Options, new)
	go func() {
		m.runEndpoint(r.ResourceName, new)
	}()
}

func (m *ManagerImpl) runEndpoint(resourceName string, e endpoint) {
	e.run()
	e.stop()

	m.mutex.Lock()
	defer m.mutex.Unlock()

	if old, ok := m.endpoints[resourceName]; ok && old.e == e {
		m.markResourceUnhealthy(resourceName)
	}

	klog.V(2).InfoS("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", e)
}

// run initializes ListAndWatch gRPC call for the device plugin and
// blocks on receiving ListAndWatch gRPC stream updates. Each ListAndWatch
// stream update contains a new list of device states.
// It then issues a callback to pass this information to the device manager which
// will adjust the resource available information accordingly.
func (e *endpointImpl) run() {
	stream, err := e.client.ListAndWatch(context.Background(), &pluginapi.Empty{})
	if err != nil {
		klog.ErrorS(err, "listAndWatch ended unexpectedly for device plugin", "resourceName", e.resourceName)
		return
	}

	for {
		response, err := stream.Recv()
		if err != nil {
			klog.ErrorS(err, "listAndWatch ended unexpectedly for device plugin", "resourceName", e.resourceName)
			return
		}

		devs := response.Devices
		klog.V(2).InfoS("State pushed for device plugin", "resourceName", e.resourceName, "resourceCapacity", len(devs))

		var newDevs []pluginapi.Device
		for _, d := range devs {
			newDevs = append(newDevs, *d)
		}

		e.callback(e.resourceName, newDevs)
	}
}

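When a plugin calls Register, the kubelet starts a goroutine that runs addEndpoint. In addEndpoint, the kubelet first stores the plugin's information in the in-memory manager.endpoints map. It then starts another goroutine that, using the information from the registration request, calls the ListAndWatch method of the plugin's gRPC server, which establishes a long-lived gRPC stream between the kubelet and the plugin.

Summary of the registration logic:

Device plugin                         kubelet
——————————————————————————————————————————————————————————————————————————
Calls Register                    ->  Stores the plugin information in memory
Serves ListAndWatch               <-  Calls ListAndWatch to open the long-lived gRPC stream

Expanding the resource name validation function used above shows the restrictions the kubelet places on plugin resource names: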

// IsExtendedResourceName returns true if:
// 1. the resource name is not in the default namespace;
// 2. resource name does not have "requests." prefix,
// to avoid confusion with the convention in quota
// 3. it satisfies the rules in IsQualifiedName() after converted into quota resource name
func IsExtendedResourceName(name v1.ResourceName) bool {
	// v1.DefaultResourceRequestsPrefix = "requests."
	if IsNativeResource(name) || strings.HasPrefix(string(name), v1.DefaultResourceRequestsPrefix) {
		return false
	}
	// Ensure it satisfies the rules in IsQualifiedName() after converted into quota resource name
	nameForQuota := fmt.Sprintf("%s%s", v1.DefaultResourceRequestsPrefix, string(name))
	if errs := validation.IsQualifiedName(string(nameForQuota)); len(errs) != 0 {
		return false
	}
	return true
}

// IsNativeResource returns true if the resource name is in the
// *kubernetes.io/ namespace. Partially-qualified (unprefixed) names are
// implicitly in the kubernetes.io/ namespace.
func IsNativeResource(name v1.ResourceName) bool {
	return !strings.Contains(string(name), "/") ||
		IsPrefixedNativeResource(name)
}

// IsPrefixedNativeResource returns true if the resource name is in the
// *kubernetes.io/ namespace.
func IsPrefixedNativeResource(name v1.ResourceName) bool {
	// v1.ResourceDefaultNamespacePrefix = "kubernetes.io/"
	return strings.Contains(string(name), v1.ResourceDefaultNamespacePrefix)
}

// IsQualifiedName tests whether the value passed is what Kubernetes calls a
// "qualified name".  This is a format used in various places throughout the
// system.  If the value is not valid, a list of error strings is returned.
// Otherwise an empty list (or nil) is returned.
func IsQualifiedName(value string) []string {
	var errs []string
	parts := strings.Split(value, "/")
	var name string
	switch len(parts) {
	case 1:
		name = parts[0]
	case 2:
		var prefix string
		prefix, name = parts[0], parts[1]
		if len(prefix) == 0 {
			errs = append(errs, "prefix part "+EmptyError())
		} else if msgs := IsDNS1123Subdomain(prefix); len(msgs) != 0 {
			errs = append(errs, prefixEach(msgs, "prefix part ")...)
		}
	default:
		return append(errs, "a qualified name "+RegexError(qualifiedNameErrMsg, qualifiedNameFmt, "MyName", "my.name", "123-abc")+
			" with an optional DNS subdomain prefix and '/' (e.g. 'example.com/MyName')")
	}

	if len(name) == 0 {
		errs = append(errs, "name part "+EmptyError())
	} else if len(name) > qualifiedNameMaxLength {
		errs = append(errs, "name part "+MaxLenError(qualifiedNameMaxLength))
	}
	if !qualifiedNameRegexp.MatchString(name) {
		errs = append(errs, "name part "+RegexError(qualifiedNameErrMsg, qualifiedNameFmt, "MyName", "my.name", "123-abc"))
	}
	return errs
}

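From the code above, the kubelet requires a device resource name to follow these rules:

It must not look like a native resource: the name must contain "/" and must not contain "kubernetes.io/" (the check uses strings.Contains, so the string is rejected anywhere in the name, not only as a prefix);
It must not start with "requests.";
It must not contain more than one "/";
The kubelet prepends "requests." and splits the result at "/" into two parts. Part 1 (the prefix, including the prepended "requests.") must be at most 253 characters long and match ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$; part 2 must be non-empty, at most 63 characters long, and match ^([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$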
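The rules above can be condensed into a small standalone checker. This is only a sketch that mirrors the behavior of the quoted helpers with the standard library; the function name isExtendedResourceName and the inlined constants are local to this example and are not the Kubernetes API itself.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Same patterns as quoted in the rules above.
	dns1123Subdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
	qualifiedName    = regexp.MustCompile(`^([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$`)
)

// isExtendedResourceName re-implements the checks described in the text:
// not a native resource, no "requests." prefix, and a valid qualified
// name once "requests." has been prepended.
func isExtendedResourceName(name string) bool {
	if !strings.Contains(name, "/") || strings.Contains(name, "kubernetes.io/") {
		return false // native (or implicitly native) resource
	}
	if strings.HasPrefix(name, "requests.") {
		return false
	}
	parts := strings.Split("requests."+name, "/")
	if len(parts) != 2 {
		return false // more than one "/"
	}
	prefix, short := parts[0], parts[1]
	if len(prefix) > 253 || !dns1123Subdomain.MatchString(prefix) {
		return false
	}
	return len(short) > 0 && len(short) <= 63 && qualifiedName.MatchString(short)
}

func main() {
	names := []string{"nvidia.com/gpu", "gpu", "kubernetes.io/gpu", "requests.example.com/gpu", "a/b/c"}
	for _, n := range names {
		fmt.Printf("%-26s %v\n", n, isExtendedResourceName(n))
	}
}
```

Only the first name in main passes; each of the others violates exactly one of the rules listed above.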

3.3 Resource reporting and state sync

The plugin reports resources and syncs state over the long-lived gRPC stream that the kubelet opens by calling ListAndWatch. This section looks at how the kubelet handles the data it receives on that stream:

func (m *ManagerImpl) genericDeviceUpdateCallback(resourceName string, devices []pluginapi.Device) {
	m.mutex.Lock()
	m.healthyDevices[resourceName] = sets.NewString()
	m.unhealthyDevices[resourceName] = sets.NewString()
	m.allDevices[resourceName] = make(map[string]pluginapi.Device)
	for _, dev := range devices {
		m.allDevices[resourceName][dev.ID] = dev
		if dev.Health == pluginapi.Healthy {
			m.healthyDevices[resourceName].Insert(dev.ID)
		} else {
			m.unhealthyDevices[resourceName].Insert(dev.ID)
		}
	}
	m.mutex.Unlock()
	if err := m.writeCheckpoint(); err != nil {
		klog.ErrorS(err, "Writing checkpoint encountered")
	}
}

// Checkpoints device to container allocation information to disk.
func (m *ManagerImpl) writeCheckpoint() error {
	m.mutex.Lock()
	registeredDevs := make(map[string][]string)
	for resource, devices := range m.healthyDevices {
		registeredDevs[resource] = devices.UnsortedList()
	}
	data := checkpoint.New(m.podDevices.toCheckpointData(),
		registeredDevs)
	m.mutex.Unlock()
	err := m.checkpointManager.CreateCheckpoint(kubeletDeviceManagerCheckpoint, data)
	if err != nil {
		err2 := fmt.Errorf("failed to write checkpoint file %q: %v", kubeletDeviceManagerCheckpoint, err)
		klog.InfoS("Failed to write checkpoint file", "err", err)
		return err2
	}
	return nil
}

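After receiving data reported by the plugin over the ListAndWatch stream, the kubelet does two main things:

It replaces the in-memory device information for that resource, mainly the device IDs and the health of each device;
It writes the current in-memory device information to /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint so that it is not lost when the kubelet restarts.

At this point the plugin has reported its devices to the kubelet. How does the kubelet get this information into the capacity and allocatable fields of the Node object? The update is not done synchronously; it happens in a separate goroutine of the kubelet: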

// kubernetes/pkg/kubelet/kubelet.go
// Run starts the kubelet reacting to config updates
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	...
	if kl.kubeClient != nil {
		// Start syncing node status immediately, this may set up things the runtime needs to run.
		go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
		...
	}
	...
}

// kubernetes/pkg/kubelet/kubelet_node_status.go
// syncNodeStatus should be called periodically from a goroutine.
// It synchronizes node status to master if there is any change or enough time
// passed from the last sync, registering the kubelet first if necessary.
func (kl *Kubelet) syncNodeStatus() {
	...
	if err := kl.updateNodeStatus(); err != nil {
		klog.ErrorS(err, "Unable to update node status")
	}
}

// kubernetes/pkg/kubelet/kubelet_node_status.go
// updateNodeStatus updates node status to master with retries if there is any
// change or enough time passed from the last sync.
func (kl *Kubelet) updateNodeStatus() error {
	klog.V(5).InfoS("Updating node status")
	for i := 0; i < nodeStatusUpdateRetry; i++ {
		if err := kl.tryUpdateNodeStatus(i); err != nil {
			if i > 0 && kl.onRepeatedHeartbeatFailure != nil {
				kl.onRepeatedHeartbeatFailure()
			}
			klog.ErrorS(err, "Error updating node status, will retry")
		} else {
			return nil
		}
	}
	return fmt.Errorf("update node status exceeds retry count")
}

// kubernetes/pkg/kubelet/kubelet_node_status.go
// tryUpdateNodeStatus tries to update node status to master if there is any
// change or enough time passed from the last sync.
func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {
	...
	kl.setNodeStatus(node)
	...
}

// kubernetes/pkg/kubelet/kubelet_node_status.go
// setNodeStatus fills in the Status fields of the given Node, overwriting
// any fields that are currently set.
// TODO(madhusudancs): Simplify the logic for setting node conditions and
// refactor the node status condition code out to a different file.
func (kl *Kubelet) setNodeStatus(node *v1.Node) {
	for i, f := range kl.setNodeStatusFuncs {
		klog.V(5).InfoS("Setting node status condition code", "position", i, "node", klog.KObj(node))
		if err := f(node); err != nil {
			klog.ErrorS(err, "Failed to set some node status fields", "node", klog.KObj(node))
		}
	}
}

// Initialization of kl.setNodeStatusFuncs:
// kubernetes/pkg/kubelet/kubelet.go
func NewMainKubelet(...) (*Kubelet, error) {
	...
	klet.setNodeStatusFuncs = klet.defaultNodeStatusFuncs()
	...
}

// kubernetes/pkg/kubelet/kubelet_node_status.go
// defaultNodeStatusFuncs is a factory that generates the default set of
// setNodeStatus funcs
func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) error {
	...
	setters = append(setters,
		...
		nodestatus.MachineInfo(..., kl.containerManager.GetDevicePluginResourceCapacity, ...),
		...
	)
}

// kubernetes/pkg/kubelet/cm/container_manager_linux.go
func (cm *containerManagerImpl) GetDevicePluginResourceCapacity() (v1.ResourceList, v1.ResourceList, []string) {
	return cm.deviceManager.GetCapacity()
}

// The call chain finally reaches the device manager:
// kubernetes/pkg/kubelet/cm/devicemanager/manager.go
// GetCapacity is expected to be called when Kubelet updates its node status.
// The first returned variable contains the registered device plugin resource capacity.
// The second returned variable contains the registered device plugin resource allocatable.
// The third returned variable contains previously registered resources that are no longer active.
// Kubelet uses this information to update resource capacity/allocatable in its node status.
// After the call, device plugin can remove the inactive resources from its internal list as the
// change is already reflected in Kubelet node status.
// Note in the special case after Kubelet restarts, device plugin resource capacities can
// temporarily drop to zero till corresponding device plugins re-register. This is OK because
// cm.UpdatePluginResource() run during predicate Admit guarantees we adjust nodeinfo
// capacity for already allocated pods so that they can continue to run. However, new pods
// requiring device plugin resources will not be scheduled till device plugin re-registers.
func (m *ManagerImpl) GetCapacity() (v1.ResourceList, v1.ResourceList, []string) {
	needsUpdateCheckpoint := false
	var capacity = v1.ResourceList{}
	var allocatable = v1.ResourceList{}
	deletedResources := sets.NewString()
	m.mutex.Lock()
	for resourceName, devices := range m.healthyDevices {
		eI, ok := m.endpoints[resourceName]
		if (ok && eI.e.stopGracePeriodExpired()) || !ok {
			// The resources contained in endpoints and (un)healthyDevices
			// should always be consistent. Otherwise, we run with the risk
			// of failing to garbage collect non-existing resources or devices.
			if !ok {
				klog.ErrorS(nil, "Unexpected: healthyDevices and endpoints are out of sync")
			}
			delete(m.endpoints, resourceName)
			delete(m.healthyDevices, resourceName)
			deletedResources.Insert(resourceName)
			needsUpdateCheckpoint = true
		} else {
			capacity[v1.ResourceName(resourceName)] = *resource.NewQuantity(int64(devices.Len()), resource.DecimalSI)
			allocatable[v1.ResourceName(resourceName)] = *resource.NewQuantity(int64(devices.Len()), resource.DecimalSI)
		}
	}
	for resourceName, devices := range m.unhealthyDevices {
		eI, ok := m.endpoints[resourceName]
		if (ok && eI.e.stopGracePeriodExpired()) || !ok {
			if !ok {
				klog.ErrorS(nil, "Unexpected: unhealthyDevices and endpoints are out of sync")
			}
			delete(m.endpoints, resourceName)
			delete(m.unhealthyDevices, resourceName)
			deletedResources.Insert(resourceName)
			needsUpdateCheckpoint = true
		} else {
			capacityCount := capacity[v1.ResourceName(resourceName)]
			unhealthyCount := *resource.NewQuantity(int64(devices.Len()), resource.DecimalSI)
			capacityCount.Add(unhealthyCount)
			capacity[v1.ResourceName(resourceName)] = capacityCount
		}
	}
	m.mutex.Unlock()
	if needsUpdateCheckpoint {
		if err := m.writeCheckpoint(); err != nil {
			klog.ErrorS(err, "Error on writing checkpoint")
		}
	}
	return capacity, allocatable, deletedResources.UnsortedList()
}

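3.4 The critical path of device allocation

Take a Pod that asks for a device resource in resources.limits (e.g. nvidia.com/gpu: 1; for extended resources, requests, if set, must equal limits). From creation to running, the key steps are:

k8s scheduler

The scheduler filters resources and binds the Pod to a node, in three main steps:

1) Filter: Filter plugins check whether each node's Allocatable resources satisfy the request, producing a list of feasible nodes;

2) Score: the scheduler scores those nodes;

3) Bind: the Pod is bound to the node with the highest score.

kubelet

The kubelet watches the API server, discovers the Pods it has to handle, puts them into a local work queue, and then runs the admission checks.

1) Admission checks

Before starting a Pod, the kubelet runs a series of admission checks to make sure the node can run it. For extended resources such as GPUs, the key steps are:

Extended resource check: the kubelet checks whether the Pod's containers ask for extended resources (e.g. nvidia.com/gpu) and compares the request with the node's Allocatable (visible through kubectl describe node) to make sure enough resources remain.

Device plugin health check: the kubelet confirms that a plugin for the resource has registered on this node over gRPC and is healthy (as described earlier, the plugin reports its devices over ListAndWatch whenever something changes), then checks that enough free devices remain for the Pod.

Device reservation and allocation: the Device Manager picks concrete devices from its local pool and records them as allocated so that other Pods cannot take them. For example, if the Pod asks for nvidia.com/gpu: 1, the Device Manager selects one GPU (say gpu0) from the available devices. Still inside the admission handler, it calls the plugin's Allocate method over gRPC with the selected IDs (e.g. ["gpu0"]) and caches the response, as the code below shows.

2) Handling admission failure

If admission fails, the kubelet rejects the Pod and records an event; the Pod becomes Failed. The scheduler does not reschedule the rejected Pod itself; a replacement Pod appears only if a controller (Deployment, Job, and so on) creates one. Common cases:

Insufficient resources: if the node does not have enough free GPUs, the kubelet rejects the Pod with an admission error. (FailedScheduling events, by contrast, come from the scheduler when no node fits.)

Plugin failure: if the plugin has not registered or reports abnormal device state, the kubelet rejects the Pod and logs an error.

3) After admission

If admission succeeds, the kubelet continues with:

Generating the container configuration: the configuration returned by the plugin (environment variables, volume mounts, device paths) is injected into the container definition.

Starting the container: the kubelet calls the container runtime (e.g. containerd, Docker) through CRI (Container Runtime Interface) to create the container.

The kubelet code that calls the plugin's Allocate method to allocate extended resources: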

// kubernetes/pkg/kubelet/cm/container_manager_linux.go
func (m *resourceAllocator) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	pod := attrs.Pod
	// Concatenate the Pod's init containers and regular containers, and call
	// the device manager's Allocate method for each of them.
	for _, container := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
		err := m.deviceManager.Allocate(pod, &container)
		...
	}
	...
}

// kubernetes/pkg/kubelet/cm/devicemanager/manager.go
// Allocate is the call that you can use to allocate a set of devices
// from the registered device plugins.
func (m *ManagerImpl) Allocate(pod *v1.Pod, container *v1.Container) error {
	...
	// Allocate resources for an init container.
	for _, initContainer := range pod.Spec.InitContainers {
		if container.Name == initContainer.Name {
			if err := m.allocateContainerResources(pod, container, m.devicesToReuse[string(pod.UID)]); err != nil {
				return err
			}
			...
		}
	}
	// Allocate resources for a regular container.
	if err := m.allocateContainerResources(pod, container, m.devicesToReuse[string(pod.UID)]); err != nil {
		return err
	}
	...
}

// kubernetes/pkg/kubelet/cm/devicemanager/manager.go
// allocateContainerResources attempts to allocate all of required device
// plugin resources for the input container, issues an Allocate rpc request
// for each new device resource requirement, processes their AllocateResponses,
// and updates the cached containerDevices on success.
func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
	...
	for k, v := range container.Resources.Limits {
		...
		resp, err := eI.e.allocate(devs)
		...
	}
	...
}

// kubernetes/pkg/kubelet/cm/devicemanager/endpoint.go
// allocate issues Allocate gRPC call to the device plugin.
func (e *endpointImpl) allocate(devs []string) (*pluginapi.AllocateResponse, error) {
	...
	// e.client.Allocate is the gRPC call to the plugin's Allocate method.
	return e.client.Allocate(context.Background(), &pluginapi.AllocateRequest{
		ContainerRequests: []*pluginapi.ContainerAllocateRequest{
			{DevicesIDs: devs},
		},
	})
}

3.5 Failure and error handling

Device failure and device plugin service failure

1) Device failure

The plugin monitors the health of the hardware it manages (GPUs, FPGAs) and pushes device state to the kubelet in real time through the ListAndWatch gRPC method. The kubelet marks the device unhealthy in memory and no longer assigns it to newly created Pods. A running Pod that already uses the unhealthy device is not evicted by the kubelet automatically; see the reclamation flow below.

2) Plugin process failure

Because the kubelet holds a gRPC stream to the plugin, it notices when the plugin fails. It then marks all devices of that resource as unavailable, which prevents new Pods from being assigned these devices. The relevant code:

// kubernetes/pkg/kubelet/cm/devicemanager/manager.go
func (m *ManagerImpl) runEndpoint(resourceName string, e endpoint) {
	// e.run normally blocks. When the plugin fails, e.run returns and
	// e.stop, m.markResourceUnhealthy and the rest are executed.
	e.run()
	e.stop()

	m.mutex.Lock()
	defer m.mutex.Unlock()

	if old, ok := m.endpoints[resourceName]; ok && old.e == e {
		m.markResourceUnhealthy(resourceName)
	}

	klog.V(2).InfoS("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", e)
}

func (m *ManagerImpl) markResourceUnhealthy(resourceName string) {
	klog.V(2).InfoS("Mark all resources Unhealthy for resource", "resourceName", resourceName)
	healthyDevices := sets.NewString()
	if _, ok := m.healthyDevices[resourceName]; ok {
		healthyDevices = m.healthyDevices[resourceName]
		m.healthyDevices[resourceName] = sets.NewString()
	}
	if _, ok := m.unhealthyDevices[resourceName]; !ok {
		m.unhealthyDevices[resourceName] = sets.NewString()
	}
	m.unhealthyDevices[resourceName] = m.unhealthyDevices[resourceName].Union(healthyDevices)
}

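Resource reclamation after a device failure

For Pods that already hold a failed device, the kubelet's default behavior is not to evict them. Reclamation can be triggered through the following mechanisms:

1) Pod health checks (recommended): configure a Liveness Probe or Readiness Probe that detects when the application is unusable because of the device failure. When the probe fails, the kubelet restarts the container (according to restartPolicy) or marks the Pod not ready. The restarted container stays on the same node; moving the workload elsewhere requires the Pod to be deleted and recreated.

2) Node-pressure eviction (requires configuration): if the failed device leads to pressure on a resource the kubelet monitors (memory, disk, PIDs), the kubelet may evict Pods according to its eviction policy (--eviction-hard or --eviction-soft). A Pod recreated by its controller after eviction can be scheduled to another node.

3) Manual intervention: delete the Pod by hand with kubectl delete pod <pod-name> --force. If the device failure is permanent, an administrator has to repair the device and then restart the Device Plugin service.

kubelet recovery strategy when a plugin crashes

By design, Kubernetes does not manage the lifecycle of plugins. An external mechanism (such as systemd or a DaemonSet) has to keep the plugin available.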

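IV. Outlook: Next-Generation Device Management

4.1 Limitations of the Device Plugin architecture

The Device Plugin architecture gives hardware device management a standard interface, but large production environments have exposed a number of limitations:

Complex device dependencies

Problem: the Allocate response can only express device nodes, mounts, environment variables and annotations. Complex initialization dependencies (such as FPGA firmware loading or GPU driver version matching) have to be implemented inside each plugin and cannot be described declaratively.

Example: an FPGA needs specific firmware loaded before the container starts. If the plugin cannot guarantee the loading order itself, an external Init Container or a custom admission controller is needed.

Inflexible allocation granularity

Problem: the allocation model is based on integer resources (e.g. nvidia.com/gpu: 1) and cannot directly express fractional resources (GPU compute slices, FPGA partial reconfiguration regions).

Limits of the workaround: slices can be exposed under custom resource names (e.g. nvidia.com/mig-1g.5gb: 1), but every slice profile becomes a separate resource name that users and quota policies must know about, which raises maintenance cost.

Weak support for multi-device coordination

Problem: when several devices must work together (e.g. GPU plus a high-speed NIC for RDMA), each resource is allocated by its own plugin. NUMA alignment can be requested through the Topology Manager, but finer relationships (such as PCIe switch grouping across device types) cannot be guaranteed.

Consequence: allocations may cross NUMA nodes or PCIe switches and reduce performance significantly.

Upgrade and maintenance cost

Problem: the plugin and the kubelet communicate over a long-lived gRPC stream. Upgrading the plugin breaks the stream and requires re-registration (and a kubelet restart forces every plugin to re-register); during that window the resource is reported as unhealthy and new allocations fail.

Example: when upgrading the NVIDIA GPU plugin, new GPU Pods may be rejected on the node until the plugin has registered again, so upgrades are usually planned around running GPU jobs.

Ecosystem fragmentation

Problem: Device Plugin implementations differ widely between vendors (resource naming, health check logic), which makes operational standardization difficult.

Example: the Intel FPGA plugin and the Habana Gaudi plugin discover devices in completely different ways, which increases cluster management complexity.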

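4.2 Comparison with CDI (Container Device Interface)

4.2.1 Introduction to CDI

CDI is a device injection standard proposed by the container runtime community (e.g. containerd, CRI-O). By decoupling device injection from Kubernetes core components, it aims to address the limitations of Device Plugin. Its core design ideas:

Declarative device injection: a spec file (JSON or YAML) describes how a device is exposed, and the container runtime applies it when the container is created.
Decoupled device management: hardware vendors provide CDI spec files and do not need to implement a particular API (such as gRPC) for injection, which reduces coupling with Kubernetes.
Richer device descriptions: one CDI device can bundle several device nodes, mounts, environment variables and hooks.

4.2.2 CDI versus Device Plugin

Dimension	Device Plugin	CDI
Architecture layer	Kubernetes node layer (integrated with the kubelet)	Container runtime layer (integrated with containerd/CRI-O)
Communication	Long-lived gRPC stream	Static spec files (JSON/YAML)
Allocation granularity	Integer resources (e.g. nvidia.com/gpu: 1)	Flexible (any combination of device nodes, environment variables, mounts)
Multi-device coordination	Depends on kubelet extensions (e.g. the Topology Manager)	Several device nodes can be grouped in one spec entry
Upgrade and maintenance	gRPC reconnection must be handled	Spec files can be changed without restarting components
Ecosystem support	Broad (mature community)	Maturing (supported by containerd/CRI-O)
Performance overhead	gRPC call latency	Spec file parsing (negligible)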

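4.2.3 Core advantages of CDI

A flexible device injection model

Example spec (in the CDI specification, env is a list of KEY=VALUE strings):

{
  "cdiVersion": "0.4.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "deviceNodes": [{"path": "/dev/nvidia0"}],
        "env": ["NVIDIA_VISIBLE_DEVICES=0"]
      }
    }
  ]
}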

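Extensibility: a spec can inject device nodes, environment variables and mounts, and can run OCI hooks (which is how steps such as updating the linker cache are performed).

Decoupling hardware from Kubernetes

For device injection, vendors only provide CDI spec files; runtimes such as containerd and CRI-O consume them, so the same files work across heterogeneous runtimes. (Resource counting and health reporting still need a component such as a Device Plugin or a DRA driver.)

Device combinations

Several device nodes can be grouped into one CDI device so that they are always injected together. The CDI specification has no dedicated field for declaring dependencies between devices; the grouping is expressed by listing all nodes in containerEdits:

{
  "cdiVersion": "0.4.0",
  "kind": "qat.intel.com/group",
  "devices": [
    {
      "name": "qat-group0",
      "containerEdits": {
        "deviceNodes": [
          {"path": "/dev/qat0"},
          {"path": "/dev/usdm0"}
        ]
      }
    }
  ]
}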

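4.2.4 Current challenges for CDI

Ecosystem maturity: the CDI standard is still evolving, and some advanced features (such as dynamically generated specs) are not supported by every runtime; hardware vendors have to adopt the CDI specification and migrate existing Device Plugin logic.
Kubernetes integration: CDI itself does not handle resource scheduling and has to be combined with Kubernetes extensions (such as the Scheduling Framework) for quota management; features such as resource reservation (kube-reserved) need additional development.
Security boundary: CDI spec files must be reviewed strictly to prevent malicious device injection (such as mounting privileged devices).

4.2.5 Scenario comparison

Scenario 1: FPGA partial reconfiguration

Device Plugin: requires a custom resource name per region (e.g. intel.com/fpga-region-1: 1), and the plugin has to manage the regions itself.

CDI: the reconfigurable region is described in a spec file and injected as an independent device, without changes to Kubernetes core components.

Scenario 2: GPU working together with an RDMA NIC

Device Plugin: relies on the Topology Manager for NUMA affinity; anything finer requires custom policies.

CDI: a device group defined in the spec file binds the GPU and NIC device nodes together, and the runtime injects them at the same time.

4.2.6 Future directions

Hybrid mode: as a short-term transition, CDI can coexist with Device Plugin, with CDI handling device injection and Device Plugin handling resource accounting.
Native Kubernetes support: KEP-3063 (Dynamic Resource Allocation) uses CDI for device injection, and KEP-4009 adds CDI device references to the Device Plugin API.
Unified device abstraction: combining CDI with Dynamic Resource Allocation (DRA) to achieve standardized device management across hardware vendors.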