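Kubernetes v1.36 Deep Dive: Dynamic Resource Allocation (DRA) Reaches GA, a Revolution for GPU Scheduling
DRA Goes GA: A New Era for GPU Scheduling on Kubernetes
Kubernetes v1.36 was released on April 22, 2026, with Dynamic Resource Allocation (DRA) reaching general availability (GA). This fundamentally changes how Kubernetes handles hardware accelerators such as GPUs and FPGAs.
1. Three pain points of the legacy device plugin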
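# Legacy approach: GPUs can only be requested by count; the request cannot describe what is actually needed
resources:
  limits:
    nvidia.com/gpu: "2"  # no way to specify model, memory size, etc.
Limitations: allocation by count only | no GPU sharing or time-slicing | no cross-node topology awareness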
2. Core DRA configuration
DeviceClass - device class definition
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-nvidia-a100
spec:
  selectors:
  - cel:
      # Only devices published by this driver belong to the class
      expression: device.driver == "gpu.nvidia.com"
  config:
  - opaque:
      driver: gpu.nvidia.com
      # Driver-specific parameters, handed to the driver as-is;
      # the accepted fields depend on the installed driver
      parameters:
        apiVersion: gpu.nvidia.com/v1
        kind: NvidiaGPUConfig
        spec:
          memory: "40Gi"  # requires 40 GiB of GPU memory
          computeCapability: "8.0"
          mig: false
ResourceClaim + Pod usage
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: my-gpu-claim
  namespace: ml-training
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu-nvidia-a100
        # The device is allocated when the first consuming Pod is scheduled
        allocationMode: ExactCount
        count: 1
---
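apiVersion: v1
kind: Pod
metadata:
  name: trainer
  namespace: ml-training  # must be the namespace of the ResourceClaim
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.3
    resources:
      claims:
      - name: gpu-resource  # refers to the entry in spec.resourceClaims
  resourceClaims:
  - name: gpu-resource
    resourceClaimName: my-gpu-claim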
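A pre-created ResourceClaim such as my-gpu-claim is shared by every Pod that references it. When each Pod should get its own device, Kubernetes can generate one claim per Pod from a ResourceClaimTemplate. The following is a minimal sketch, assuming the resource.k8s.io/v1 API and a device class named gpu-nvidia-a100. The names `single-gpu` and `trainer-own-claim` are hypothetical, and the capacity key `memory` under `gpu.nvidia.com` is an assumption: the attributes and capacities a driver publishes vary by driver and version.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu              # hypothetical name
  namespace: ml-training
spec:
  spec:                         # the ResourceClaim spec stamped out for each Pod
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu-nvidia-a100
          selectors:
          - cel:
              # Assumed capacity name; inspect the driver's ResourceSlices for the real one
              expression: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-own-claim       # hypothetical name
  namespace: ml-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.3
    resources:
      claims:
      - name: gpu-resource
  resourceClaims:
  - name: gpu-resource
    # A claim is created from the template for this Pod and deleted with it
    resourceClaimTemplateName: single-gpu
```

The template form tends to suit Deployments and Jobs, where replicas should not contend for one pre-created claim.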
3. GPU time-sharing (a new DRA capability)
apiVersion: gpu.nvidia.com/v1
kind: NvidiaGPUConfig
metadata:
  name: shared-gpu-config
spec:
  sharing:
    strategy: TimeSlicing
    timeSlicingConfig:
      interval: "5ms"
    maxSharableGPU: 4  # 4 Pods share one GPU (suited to inference services)
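To show how a sharing configuration could reach running workloads, here is a hedged sketch in which several replicas reference one ResourceClaim and therefore share the device allocated to it. It assumes the resource.k8s.io/v1 API and a device class named gpu-nvidia-a100. The claim and Deployment names are hypothetical, and the opaque parameters simply mirror the NvidiaGPUConfig fields from this article; the schema a real driver accepts may differ.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu-claim        # hypothetical name
  namespace: ml-training
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu-nvidia-a100
    config:
    - requests: ["gpu"]         # apply this config to the request named "gpu"
      opaque:
        driver: gpu.nvidia.com
        parameters:             # driver-specific; fields taken from the article, not verified
          apiVersion: gpu.nvidia.com/v1
          kind: NvidiaGPUConfig
          spec:
            sharing:
              strategy: TimeSlicing
              timeSlicingConfig:
                interval: "5ms"
              maxSharableGPU: 4
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference               # hypothetical name
  namespace: ml-training
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: server
        image: pytorch/pytorch:2.3
        resources:
          claims:
          - name: shared-gpu
      resourceClaims:
      - name: shared-gpu
        # Every replica references the same claim, so all of them use the same GPU
        resourceClaimName: shared-gpu-claim
```

Because all replicas depend on one allocated device, they are scheduled onto the node that holds it; this trades placement flexibility for higher GPU utilization.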
4. Migrating from the device plugin to DRA
# Install the NVIDIA GPU Operator with DRA support
helm upgrade --install nvidia-gpu-operator nvidia/gpu-operator \
  --set devicePlugin.enabled=false \
  --set dra.enabled=true \
  -n gpu-operator
# Verify
kubectl get deviceclasses
# A gpu.nvidia.com DeviceClass should be listed
# Test Pod scheduling
kubectl apply -f test-gpu-pod.yaml
kubectl describe pod test-gpu-pod | grep -A 5 "Events:"
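The commands above apply test-gpu-pod.yaml without showing it. A minimal sketch of what such a file might contain follows, assuming the resource.k8s.io/v1 API and that the driver installs a device class named gpu.nvidia.com. The template name `test-gpu` and the CUDA image tag are assumptions; substitute an image available in your registry.

```yaml
# test-gpu-pod.yaml (illustrative content)
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: test-gpu                # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com   # assumed class name created by the driver
---
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-pod
spec:
  restartPolicy: Never          # run once and exit
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag
    command: ["nvidia-smi", "-L"]                # list the GPUs visible in the container
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: test-gpu
```

If allocation succeeds, the Pod should run to completion and its logs should list the allocated GPU; if it stays Pending, the Events section printed by the describe command is the first place to look.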
5. DRA monitoring metrics
# New Prometheus metrics
kube_resourceclaim_allocated{namespace,name,resource_class}
kube_resourceclaim_pending{namespace,name}
node_dra_resource_available{resource_class,attribute}
Alerting rule:
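- alert: DRAResourceExhausted
  # kube_resourceclaim_pending is one series per claim, so sum across claims before comparing
  expr: sum(kube_resourceclaim_pending) > 10
  for: 5m
  annotations:
    summary: "DRA resources exhausted: {{ $value }} ResourceClaims are waiting for allocation"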
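Prometheus loads alerting rules from rule files organized into groups, so a single rule needs a wrapper before it can be deployed. The sketch below shows that wrapper with two complementary rules. It reuses the metric names listed in this article and assumes that kube_resourceclaim_pending is a gauge with value 1 for each pending claim; the group name, alert names, thresholds, and severity labels are hypothetical.

```yaml
groups:
- name: dra-resourceclaims          # hypothetical group name
  rules:
  # Many claims waiting within one namespace
  - alert: DRAClaimsPendingByNamespace
    expr: sum by (namespace) (kube_resourceclaim_pending) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }} ResourceClaims are pending in namespace {{ $labels.namespace }}"
  # A single claim stuck for a long time
  - alert: DRAClaimPendingTooLong
    expr: kube_resourceclaim_pending == 1
    for: 30m
    labels:
      severity: info
    annotations:
      summary: "ResourceClaim {{ $labels.namespace }}/{{ $labels.name }} has been pending for 30 minutes"
```

Aggregating by namespace keeps that label available to the annotation, while the per-claim rule keeps both `namespace` and `name` for pinpointing the stuck claim.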
DRA reaching GA is a milestone for managing AI workloads on Kubernetes, and an operations upgrade worth investing in during 2026.