ResourceQuota
概述
ResourceQuota 是 Kubernetes 中用于限制命名空间资源总量的对象。它既能限制命名空间中可创建的对象数量,也能限制这些对象可消耗的计算与存储资源总量,为多租户环境提供资源隔离和公平分配能力。
| 关键点 | 内容 |
| --- | --- |
| 核心作用 | 限制命名空间内资源的总体使用量 |
| 作用范围 | 命名空间级别 |
| 限制类型 | 计算资源、存储资源、对象数量 |
| 执行机制 | 准入控制器实时检查 |
| 配额状态 | 实时跟踪已使用和剩余配额 |
ResourceQuota 的本质
设计理念
- 资源总量控制:限制命名空间内所有资源的总体使用量
- 多租户隔离:为不同租户提供资源隔离和公平分配
- 预算管理:类似云计算的资源预算概念
- 防止资源耗尽:避免单个命名空间消耗过多集群资源
工作原理
资源创建请求(用户提交) → ResourceQuota 检查(准入控制器) → 配额验证(检查剩余配额) → 更新使用量(更新计数器) → 资源创建(成功/失败)
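当请求超过剩余配额时,准入控制器会直接拒绝该请求,API 返回 403 Forbidden。下面是一段示意输出,假设 development 命名空间中已经存在后文的 compute-quota 配额(具体字段与数值以实际集群为准):
bash
# 提交一个 CPU 请求超出剩余配额的 Pod(示意)
$ kubectl apply -f big-pod.yaml
Error from server (Forbidden): error when creating "big-pod.yaml": pods "big-pod" is forbidden:
  exceeded quota: compute-quota, requested: requests.cpu=12, used: requests.cpu=1500m, limited: requests.cpu=10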
ResourceQuota vs LimitRange
| 特性 | ResourceQuota | LimitRange |
| --- | --- | --- |
| 控制范围 | 命名空间总量 | 单个资源对象 |
| 限制类型 | 总量限制 | 单体限制 |
| 对象数量 | 支持 | 不支持 |
| 默认值 | 不支持 | 支持 |
| 使用场景 | 多租户资源分配 | 资源规范化 |
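两者通常配合使用:当命名空间启用了计算资源配额后,未显式声明 requests/limits 的 Pod 会被配额机制拒绝,因此一般会同时配置 LimitRange 为容器补充默认值。下面是一个示意配置(名称与数值均为示例):
yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits            # 示例名称
  namespace: development
spec:
  limits:
  - type: Container
    defaultRequest:               # 容器未声明 requests 时注入的默认值
      cpu: 100m
      memory: 128Mi
    default:                      # 容器未声明 limits 时注入的默认值
      cpu: 500m
      memory: 512Mi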
基本配置
1. 计算资源配额
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
namespace: development
labels:
quota-type: compute
environment: development
spec:
hard:
# CPU 配额
requests.cpu: "10" # CPU 请求总量限制
limits.cpu: "20" # CPU 限制总量限制
# 内存配额
requests.memory: 20Gi # 内存请求总量限制
limits.memory: 40Gi # 内存限制总量限制
# 临时存储配额
requests.ephemeral-storage: 50Gi # 临时存储请求总量
limits.ephemeral-storage: 100Gi # 临时存储限制总量
---
# 测试 Pod - 消耗配额
apiVersion: v1
kind: Pod
metadata:
name: quota-test-1
namespace: development
spec:
containers:
- name: app
image: nginx:1.20
resources:
requests:
cpu: "1" # 消耗 1 CPU 请求配额
memory: 2Gi # 消耗 2Gi 内存请求配额
limits:
cpu: "2" # 消耗 2 CPU 限制配额
memory: 4Gi # 消耗 4Gi 内存限制配额
---
# 第二个 Pod
apiVersion: v1
kind: Pod
metadata:
name: quota-test-2
namespace: development
spec:
containers:
- name: app
image: busybox:1.35
resources:
requests:
cpu: "500m" # 消耗 0.5 CPU 请求配额
memory: 1Gi # 消耗 1Gi 内存请求配额
limits:
cpu: "1" # 消耗 1 CPU 限制配额
memory: 2Gi # 消耗 2Gi 内存限制配额
# 当前配额使用情况:
# CPU requests: 1.5/10, limits: 3/20
# Memory requests: 3Gi/20Gi, limits: 6Gi/40Gi
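创建上述两个 Pod 后,可以用 kubectl describe 核对配额的实时使用情况,输出大致如下(示意,格式以实际 kubectl 版本为准):
bash
$ kubectl describe resourcequota compute-quota -n development
Name:            compute-quota
Namespace:       development
Resource         Used   Hard
--------         ----   ----
limits.cpu       3      20
limits.memory    6Gi    40Gi
requests.cpu     1500m  10
requests.memory  3Gi    20Gi
# 其余条目省略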
2. 对象数量配额
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: object-count-quota
namespace: production
labels:
quota-type: object-count
environment: production
spec:
hard:
# Pod 相关
pods: "50" # 最多 50 个 Pod
replicationcontrollers: "10" # 最多 10 个 RC
# 工作负载控制器(非内置计数对象需使用 count/<resource>.<group> 语法)
count/deployments.apps: "20" # 最多 20 个 Deployment
count/replicasets.apps: "30" # 最多 30 个 ReplicaSet
count/statefulsets.apps: "5" # 最多 5 个 StatefulSet
count/daemonsets.apps: "3" # 最多 3 个 DaemonSet
count/jobs.batch: "10" # 最多 10 个 Job
count/cronjobs.batch: "5" # 最多 5 个 CronJob
# 服务和网络
services: "20" # 最多 20 个 Service
services.loadbalancers: "3" # 最多 3 个 LoadBalancer 服务
services.nodeports: "5" # 最多 5 个 NodePort 服务
ingresses.networking.k8s.io: "10" # 最多 10 个 Ingress
# 配置和存储
configmaps: "30" # 最多 30 个 ConfigMap
secrets: "20" # 最多 20 个 Secret
persistentvolumeclaims: "15" # 最多 15 个 PVC
# RBAC
count/roles.rbac.authorization.k8s.io: "10" # 最多 10 个 Role
count/rolebindings.rbac.authorization.k8s.io: "15" # 最多 15 个 RoleBinding
---
# 测试对象创建
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-deployment-1
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: test-app-1
template:
metadata:
labels:
app: test-app-1
spec:
containers:
- name: app
image: nginx:1.20
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
# 这个 Deployment 会:
# - 消耗 1 个 deployments.apps 配额
# - 创建 1 个 ReplicaSet(消耗 replicasets.apps 配额)
# - 创建 3 个 Pod(消耗 pods 配额)
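应用之后,可以通过 describe 查看对象数量配额的消耗情况,输出大致如下(示意,数值取决于命名空间中已有的对象):
bash
$ kubectl describe resourcequota object-count-quota -n production
Name:                    object-count-quota
Namespace:               production
Resource                 Used  Hard
--------                 ----  ----
count/deployments.apps   1     20
count/replicasets.apps   1     30
pods                     3     50
services                 0     20
# 其余条目省略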
3. 存储配额
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: storage-quota
namespace: data-processing
labels:
quota-type: storage
team: data-team
spec:
hard:
# 通用存储配额(统计所有 StorageClass 的 PVC 请求之和,需覆盖下方各类存储的预期用量)
requests.storage: "10Ti" # 总存储请求量限制
persistentvolumeclaims: "20" # PVC 数量限制
# 按存储类别的配额
fast-ssd.storageclass.storage.k8s.io/requests.storage: "500Gi" # 高速 SSD 存储配额
standard.storageclass.storage.k8s.io/requests.storage: "2Ti" # 标准存储配额
backup.storageclass.storage.k8s.io/requests.storage: "5Ti" # 备份存储配额
# 按存储类别的 PVC 数量
fast-ssd.storageclass.storage.k8s.io/persistentvolumeclaims: "5" # 高速 SSD PVC 数量
standard.storageclass.storage.k8s.io/persistentvolumeclaims: "10" # 标准 PVC 数量
backup.storageclass.storage.k8s.io/persistentvolumeclaims: "20" # 备份 PVC 数量
---
# 高速存储 PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: fast-data-pvc
namespace: data-processing
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi # 消耗 fast-ssd 存储配额
---
# 标准存储 PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: standard-data-pvc
namespace: data-processing
spec:
accessModes:
- ReadWriteOnce
storageClassName: standard
resources:
requests:
storage: 500Gi # 消耗 standard 存储配额
---
# 备份存储 PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: backup-pvc
namespace: data-processing
spec:
accessModes:
- ReadWriteMany
storageClassName: backup
resources:
requests:
storage: 1Ti # 消耗 backup 存储配额
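创建上述 PVC 后,可以检查各 StorageClass 维度的配额消耗,确认总量与分类条目同步更新:
bash
# 查看存储配额的 Used/Hard 两列
kubectl describe resourcequota storage-quota -n data-processing
# 以 YAML 形式查看 status.used,按配额条目逐项给出当前用量
kubectl get resourcequota storage-quota -n data-processing -o yaml | grep -A 20 "status:"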
高级配置
1. 优先级类配额
yaml
# 优先级类定义
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000
globalDefault: false
description: "高优先级工作负载"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: low-priority
value: 100
globalDefault: false
description: "低优先级工作负载"
---
# 按优先级的资源配额:每个优先级使用一个带 PriorityClass 作用域(scopeSelector)的 ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota
  namespace: mixed-workloads
spec:
  hard:
    requests.cpu: "8"        # 高优先级 CPU 请求
    requests.memory: 16Gi    # 高优先级内存请求
    pods: "20"               # 高优先级 Pod 数量
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high-priority"]
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: low-priority-quota
  namespace: mixed-workloads
spec:
  hard:
    requests.cpu: "4"        # 低优先级 CPU 请求
    requests.memory: 8Gi     # 低优先级内存请求
    pods: "50"               # 低优先级 Pod 数量
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["low-priority"]
---
# 总体配额(不区分优先级,约束命名空间内所有 Pod)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: overall-quota
  namespace: mixed-workloads
spec:
  hard:
    requests.cpu: "15"       # 总 CPU 请求(高优先级 + 低优先级 + 默认)
    requests.memory: 30Gi    # 总内存请求
    pods: "100"              # 总 Pod 数量
---
# 高优先级 Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-app
namespace: mixed-workloads
spec:
replicas: 3
selector:
matchLabels:
app: critical-app
template:
metadata:
labels:
app: critical-app
spec:
priorityClassName: high-priority # 使用高优先级
containers:
- name: app
image: critical-service:v1.0
resources:
requests:
cpu: "1" # 消耗高优先级 CPU 配额
memory: 2Gi # 消耗高优先级内存配额
limits:
cpu: "2"
memory: 4Gi
---
# 低优先级 Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: batch-job
namespace: mixed-workloads
spec:
replicas: 5
selector:
matchLabels:
app: batch-job
template:
metadata:
labels:
app: batch-job
spec:
priorityClassName: low-priority # 使用低优先级
containers:
- name: worker
image: batch-worker:v1.0
resources:
requests:
cpu: "200m" # 消耗低优先级 CPU 配额
memory: 512Mi # 消耗低优先级内存配额
limits:
cpu: "500m"
memory: 1Gi
2. 作用域配额
yaml
# 终止状态 Pod 配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: terminating-quota
namespace: batch-processing
labels:
quota-type: terminating
spec:
# 作用域:只作用于设置了 activeDeadlineSeconds 的 Pod(即会自行终止的 Pod,例如部分批处理任务)
scopes:
- Terminating
hard:
requests.cpu: "5" # 此类 Pod 的 CPU 请求总量
requests.memory: 10Gi # 此类 Pod 的内存请求总量
pods: "20" # 此类 Pod 的数量
---
# 非终止状态 Pod 配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: not-terminating-quota
namespace: batch-processing
spec:
# 作用域:只作用于未设置 activeDeadlineSeconds 的 Pod(长期运行的服务型工作负载)
scopes:
- NotTerminating
hard:
requests.cpu: "20" # 长期运行 Pod 的 CPU 请求总量
requests.memory: 40Gi # 长期运行 Pod 的内存请求总量
pods: "50" # 长期运行 Pod 的数量
---
# BestEffort QoS 类配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: besteffort-quota
namespace: experimental
spec:
# 作用域:只对 BestEffort QoS 类的 Pod 生效
scopes:
- BestEffort
hard:
pods: "10" # BestEffort Pod 数量限制
---
# NotBestEffort QoS 类配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: not-besteffort-quota
namespace: experimental
spec:
# 作用域:只对非 BestEffort QoS 类的 Pod 生效
scopes:
- NotBestEffort
hard:
requests.cpu: "10" # 非 BestEffort Pod 的 CPU 请求
requests.memory: 20Gi # 非 BestEffort Pod 的内存请求
pods: "30" # 非 BestEffort Pod 数量
3. 多配额组合策略
yaml
# 基础计算资源配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: base-compute-quota
namespace: enterprise-app
labels:
quota-category: compute
priority: high
spec:
hard:
# 基础计算资源
requests.cpu: "50" # 总 CPU 请求
limits.cpu: "100" # 总 CPU 限制
requests.memory: 100Gi # 总内存请求
limits.memory: 200Gi # 总内存限制
requests.ephemeral-storage: 500Gi # 总临时存储请求
limits.ephemeral-storage: 1Ti # 总临时存储限制
---
# 对象数量配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: object-quota
namespace: enterprise-app
labels:
quota-category: objects
priority: medium
spec:
hard:
# 核心对象
pods: "200" # Pod 总数
deployments.apps: "50" # Deployment 数量
services: "30" # Service 数量
configmaps: "100" # ConfigMap 数量
secrets: "50" # Secret 数量
# 存储对象
persistentvolumeclaims: "30" # PVC 数量
# 网络对象
ingresses.networking.k8s.io: "20" # Ingress 数量
networkpolicies.networking.k8s.io: "10" # NetworkPolicy 数量
---
# 存储专用配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: storage-quota
namespace: enterprise-app
labels:
quota-category: storage
priority: high
spec:
hard:
# 总存储配额
requests.storage: "10Ti" # 总存储请求
# 按存储类型分配
fast-ssd.storageclass.storage.k8s.io/requests.storage: "2Ti" # 高速存储
standard.storageclass.storage.k8s.io/requests.storage: "5Ti" # 标准存储
archive.storageclass.storage.k8s.io/requests.storage: "20Ti" # 归档存储
# 按存储类型的 PVC 数量
fast-ssd.storageclass.storage.k8s.io/persistentvolumeclaims: "10"
standard.storageclass.storage.k8s.io/persistentvolumeclaims: "15"
archive.storageclass.storage.k8s.io/persistentvolumeclaims: "5"
---
# GPU 资源配额(如果集群支持)
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-quota
namespace: enterprise-app
labels:
quota-category: gpu
priority: critical
spec:
hard:
# GPU 资源
requests.nvidia.com/gpu: "20" # GPU 请求总数
limits.nvidia.com/gpu: "20" # GPU 限制总数
# GPU 内存(如果支持)
requests.nvidia.com/gpu-memory: "160Gi" # GPU 内存请求
limits.nvidia.com/gpu-memory: "160Gi" # GPU 内存限制
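下面是一个消耗上述 GPU 配额的 Pod 示意(镜像与数值仅为示例,扩展资源名以设备插件实际注册的为准):
yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker                # 示例名称
  namespace: enterprise-app
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # 示例镜像
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 2         # 扩展资源只需写 limits,requests 默认等于 limits,计入 requests.nvidia.com/gpu 配额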
实际应用场景
1. 多租户 SaaS 平台
yaml
# 企业级租户配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: enterprise-tenant-quota
namespace: tenant-enterprise-001
labels:
tenant-id: enterprise-001
plan: enterprise
billing-tier: premium
annotations:
tenant-name: "Acme Corporation"
contract-start: "2024-01-01"
contract-end: "2024-12-31"
contact: "admin@acme.com"
spec:
hard:
# 计算资源 - 企业级配额
requests.cpu: "100" # 100 核 CPU 请求
limits.cpu: "200" # 200 核 CPU 限制
requests.memory: 400Gi # 400GB 内存请求
limits.memory: 800Gi # 800GB 内存限制
# 对象数量 - 企业级限制
pods: "500" # 500 个 Pod
deployments.apps: "100" # 100 个 Deployment
services: "50" # 50 个 Service
services.loadbalancers: "10" # 10 个 LoadBalancer
ingresses.networking.k8s.io: "20" # 20 个 Ingress
# 存储资源 - 企业级存储
requests.storage: "50Ti" # 50TB 存储
persistentvolumeclaims: "100" # 100 个 PVC
# 配置和密钥
configmaps: "200" # 200 个 ConfigMap
secrets: "100" # 100 个 Secret
---
# 标准租户配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: standard-tenant-quota
namespace: tenant-standard-001
labels:
tenant-id: standard-001
plan: standard
billing-tier: medium
spec:
hard:
# 计算资源 - 标准配额
requests.cpu: "20" # 20 核 CPU 请求
limits.cpu: "40" # 40 核 CPU 限制
requests.memory: 80Gi # 80GB 内存请求
limits.memory: 160Gi # 160GB 内存限制
# 对象数量 - 标准限制
pods: "100" # 100 个 Pod
deployments.apps: "20" # 20 个 Deployment
services: "15" # 15 个 Service
services.loadbalancers: "2" # 2 个 LoadBalancer
ingresses.networking.k8s.io: "5" # 5 个 Ingress
# 存储资源 - 标准存储
requests.storage: "10Ti" # 10TB 存储
persistentvolumeclaims: "30" # 30 个 PVC
# 配置和密钥
configmaps: "50" # 50 个 ConfigMap
secrets: "30" # 30 个 Secret
---
# 基础租户配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: basic-tenant-quota
namespace: tenant-basic-001
labels:
tenant-id: basic-001
plan: basic
billing-tier: low
spec:
hard:
# 计算资源 - 基础配额
requests.cpu: "5" # 5 核 CPU 请求
limits.cpu: "10" # 10 核 CPU 限制
requests.memory: 20Gi # 20GB 内存请求
limits.memory: 40Gi # 40GB 内存限制
# 对象数量 - 基础限制
pods: "30" # 30 个 Pod
deployments.apps: "10" # 10 个 Deployment
services: "5" # 5 个 Service
services.loadbalancers: "0" # 不允许 LoadBalancer
ingresses.networking.k8s.io: "2" # 2 个 Ingress
# 存储资源 - 基础存储
requests.storage: "1Ti" # 1TB 存储
persistentvolumeclaims: "10" # 10 个 PVC
# 配置和密钥
configmaps: "20" # 20 个 ConfigMap
secrets: "10" # 10 个 Secret
2. 开发环境资源管理
yaml
# 开发团队 A 配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: dev-team-a-quota
namespace: dev-team-a
labels:
team: team-a
environment: development
cost-center: "engineering"
spec:
hard:
# 开发环境计算资源
requests.cpu: "10" # 10 核 CPU
limits.cpu: "20" # 20 核 CPU(允许突发)
requests.memory: 40Gi # 40GB 内存
limits.memory: 80Gi # 80GB 内存
# 开发环境对象限制
pods: "100" # 100 个 Pod
deployments.apps: "30" # 30 个 Deployment
services: "20" # 20 个 Service
# 开发环境存储
requests.storage: "5Ti" # 5TB 存储(包含测试数据)
persistentvolumeclaims: "50" # 50 个 PVC
# 开发环境配置
configmaps: "100" # 100 个 ConfigMap(各种配置)
secrets: "50" # 50 个 Secret
---
# 测试环境配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: test-environment-quota
namespace: testing
labels:
environment: testing
purpose: integration-testing
spec:
hard:
# 测试环境需要稳定资源
requests.cpu: "15" # 15 核 CPU
limits.cpu: "25" # 25 核 CPU
requests.memory: 60Gi # 60GB 内存
limits.memory: 100Gi # 100GB 内存
# 测试环境对象
pods: "150" # 150 个 Pod(并行测试)
deployments.apps: "40" # 40 个 Deployment
services: "30" # 30 个 Service
jobs.batch: "50" # 50 个测试 Job
# 测试数据存储
requests.storage: "10Ti" # 10TB 存储(测试数据)
persistentvolumeclaims: "30" # 30 个 PVC
---
# 预发布环境配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: staging-environment-quota
namespace: staging
labels:
environment: staging
purpose: pre-production
spec:
hard:
# 预发布环境接近生产配置
requests.cpu: "30" # 30 核 CPU
limits.cpu: "50" # 50 核 CPU
requests.memory: 120Gi # 120GB 内存
limits.memory: 200Gi # 200GB 内存
# 预发布环境对象
pods: "200" # 200 个 Pod
deployments.apps: "50" # 50 个 Deployment
services: "40" # 40 个 Service
ingresses.networking.k8s.io: "15" # 15 个 Ingress
# 预发布存储
requests.storage: "20Ti" # 20TB 存储
persistentvolumeclaims: "40" # 40 个 PVC
3. 机器学习平台资源配额
yaml
# ML 训练环境配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: ml-training-quota
namespace: ml-training
labels:
workload-type: ml-training
resource-intensive: "true"
gpu-enabled: "true"
spec:
hard:
# 高计算需求
requests.cpu: "200" # 200 核 CPU(大规模训练)
limits.cpu: "400" # 400 核 CPU
requests.memory: 1Ti # 1TB 内存(大模型训练)
limits.memory: 2Ti # 2TB 内存
# GPU 资源(扩展资源配额只支持 requests. 前缀)
requests.nvidia.com/gpu: "50" # 50 个 GPU
# 训练任务对象
pods: "100" # 100 个训练 Pod
jobs.batch: "200" # 200 个训练 Job
# 大量数据存储
requests.storage: "500Ti" # 500TB 存储(训练数据集)
persistentvolumeclaims: "100" # 100 个 PVC
# 模型和配置
configmaps: "200" # 200 个 ConfigMap(训练配置)
secrets: "100" # 100 个 Secret(API 密钥等)
---
# ML 推理环境配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: ml-inference-quota
namespace: ml-inference
labels:
workload-type: ml-inference
latency-sensitive: "true"
spec:
hard:
# 推理服务资源
requests.cpu: "50" # 50 核 CPU
limits.cpu: "100" # 100 核 CPU
requests.memory: 200Gi # 200GB 内存
limits.memory: 400Gi # 400GB 内存
# 推理 GPU(只支持 requests. 前缀)
requests.nvidia.com/gpu: "20" # 20 个 GPU
# 推理服务对象
pods: "200" # 200 个推理 Pod
deployments.apps: "50" # 50 个推理 Deployment
services: "50" # 50 个推理 Service
ingresses.networking.k8s.io: "20" # 20 个 Ingress
# 模型存储
requests.storage: "50Ti" # 50TB 存储(模型文件)
persistentvolumeclaims: "50" # 50 个 PVC
---
# ML 实验环境配额
apiVersion: v1
kind: ResourceQuota
metadata:
name: ml-experiment-quota
namespace: ml-experiments
labels:
workload-type: ml-experiments
temporary: "true"
spec:
hard:
# 实验环境资源
requests.cpu: "30" # 30 核 CPU
limits.cpu: "60" # 60 核 CPU
requests.memory: 120Gi # 120GB 内存
limits.memory: 240Gi # 240GB 内存
# 实验 GPU(只支持 requests. 前缀)
requests.nvidia.com/gpu: "10" # 10 个 GPU
# 实验对象(较多短期任务)
pods: "300" # 300 个实验 Pod
jobs.batch: "500" # 500 个实验 Job
# 实验数据存储
requests.storage: "100Ti" # 100TB 存储(实验数据)
persistentvolumeclaims: "200" # 200 个 PVC
命令行操作
基本操作
bash
# 查看 ResourceQuota
kubectl get resourcequotas
kubectl get resourcequota -n production
kubectl get quota # 简写形式
# 查看详细信息
kubectl describe resourcequota compute-quota -n development
kubectl get resourcequota compute-quota -o yaml
# 查看所有命名空间的配额
kubectl get resourcequotas --all-namespaces
# 查看配额使用情况
kubectl get resourcequota -o wide
配额状态监控
bash
# 查看配额使用详情
kubectl describe resourcequota -n production
# 查看特定配额的使用情况
kubectl get resourcequota compute-quota -n development -o json | jq .status
# 监控配额使用变化
watch kubectl get resourcequota -n production
# 查看配额相关事件
kubectl get events --field-selector reason=FailedCreate -n production
创建和管理
bash
# 从文件创建 ResourceQuota
kubectl apply -f resourcequota.yaml
# 更新 ResourceQuota
kubectl apply -f updated-resourcequota.yaml
# 删除 ResourceQuota
kubectl delete resourcequota compute-quota -n development
# 批量删除
kubectl delete resourcequotas --all -n test-namespace
配额验证和测试
bash
# 测试配额限制
# 1. 尝试创建超出配额的资源(较新版本的 kubectl run 已移除 --requests/--limits 参数,改用 --overrides 指定资源)
kubectl run quota-test --image=nginx -n development \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"quota-test","image":"nginx","resources":{"requests":{"cpu":"100","memory":"1Gi"},"limits":{"cpu":"100","memory":"1Gi"}}}]}}'
# 应该失败并提示 exceeded quota
# 2. 查看当前配额使用情况
kubectl describe resourcequota -n development
# 3. 创建符合配额的资源
kubectl run quota-test --image=nginx -n development \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"quota-test","image":"nginx","resources":{"requests":{"cpu":"1","memory":"1Gi"},"limits":{"cpu":"1","memory":"2Gi"}}}]}}'
# 4. 再次查看配额使用情况
kubectl describe resourcequota -n development
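如果不想在命令行里拼 JSON,也可以用一个简单的清单文件完成同样的验证(名称与数值仅为示例):
yaml
apiVersion: v1
kind: Pod
metadata:
  name: quota-test
  namespace: development
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:                     # 命名空间存在 limits.* 配额时必须同时声明 limits
        cpu: "1"
        memory: 2Gi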
配额分析脚本
bash
#!/bin/bash
# 配额使用分析脚本
NAMESPACE=${1:-default}
echo "=== ResourceQuota 使用分析 - $NAMESPACE ==="
# 检查是否存在 ResourceQuota
if [ -z "$(kubectl get resourcequota -n $NAMESPACE -o name 2>/dev/null)" ]; then
echo "命名空间 $NAMESPACE 中没有 ResourceQuota"
exit 0
fi
# 获取所有 ResourceQuota
echo "1. ResourceQuota 列表:"
kubectl get resourcequota -n $NAMESPACE
echo
# 详细使用情况
echo "2. 详细使用情况:"
for quota in $(kubectl get resourcequota -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}'); do
echo "--- $quota ---"
kubectl describe resourcequota $quota -n $NAMESPACE | grep -A 20 "Resource\|Used\|Hard"
echo
done
# 计算使用率
echo "3. 使用率分析:"
kubectl get resourcequota -n $NAMESPACE -o json | jq -r '
.items[] |
.metadata.name as $name |
(.status.hard // {}) as $hard |
(.status.used // {}) as $used |
$hard | keys[] as $resource |
($used[$resource] // "0") as $u |
($u | tonumber? // null) as $un |
($hard[$resource] | tonumber? // null) as $hn |
if $un != null and $hn != null and $hn > 0 then
"  \($name)/\($resource): \($u)/\($hard[$resource]) (\(($un / $hn * 100) | floor)%)"
else
"  \($name)/\($resource): \($u)/\($hard[$resource]) (带单位的值不计算百分比)"
end
'
echo "=== 分析完成 ==="
故障排查
常见问题
| 问题 | 可能原因 | 解决方案 |
| --- | --- | --- |
| 资源创建失败 | 超出 ResourceQuota 限制 | 检查配额使用情况,调整配额或删除不需要的资源 |
| 配额未生效 | 命名空间错误或配额配置错误 | 检查命名空间和配额配置 |
| 配额计算错误 | 资源单位不匹配 | 检查资源单位(CPU、内存、存储) |
| 无法删除资源 | 配额控制器问题 | 检查 kube-controller-manager 日志 |
| 配额状态不更新 | 控制器同步问题 | 重启相关控制器或等待同步 |
诊断步骤
- 检查配额配置
bash
# 确认 ResourceQuota 存在且配置正确
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota <name> -n <namespace>
- 检查配额使用情况
bash
# 查看当前使用情况
kubectl describe resourcequota -n <namespace>
# 查看配额状态
kubectl get resourcequota -o yaml -n <namespace>
- 检查失败的资源创建
bash
# 查看相关事件
kubectl get events --field-selector reason=FailedCreate -n <namespace>
# 查看具体错误信息
kubectl describe pod <pod-name> -n <namespace>
- 验证资源计算
bash
# 手动计算资源使用
kubectl get pods -n <namespace> -o json | jq '.items[].spec.containers[].resources.requests'
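如果想把各容器的 CPU 请求加总后与配额对比,可以用下面这类简单统计做粗略核对(只处理纯数字和以 m 结尾的写法,其余情况仍以配额 status 为准):
bash
# 粗略汇总命名空间内所有容器的 CPU 请求(换算为 millicore)
kubectl get pods -n <namespace> -o jsonpath='{.items[*].spec.containers[*].resources.requests.cpu}' \
  | tr ' ' '\n' | awk '/m$/ {sub("m",""); total+=$1; next} /^[0-9.]+$/ {total+=$1*1000} END {printf "%.0f m\n", total}'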
常见错误和解决方案
yaml
# 错误1:十进制与二进制单位混用
# 容易引起误解的配置
spec:
hard:
requests.memory: "10G" # 合法但易混淆:G 是十进制单位(10^9 字节),与二进制的 Gi(2^30 字节)不同
requests.cpu: "10000m" # 合法,但建议直接写 "10"
# 推荐的配置
spec:
hard:
requests.memory: "10Gi" # 推荐:统一使用 Gi
requests.cpu: "10" # 推荐:直接使用核数
---
# 错误2:配额范围配置错误
# 错误的配置
spec:
scopes:
- terminating # 错误:应该是 Terminating
hard:
pods: "10"
# 正确的配置
spec:
scopes:
- Terminating # 正确:首字母大写
hard:
pods: "10"
---
# 错误3:存储类配额配置错误
# 错误的配置
spec:
hard:
fast-ssd.storageclass.storage.k8s.io/request.storage: "100Gi" # 错误:request 应该是 requests
# 正确的配置
spec:
hard:
fast-ssd.storageclass.storage.k8s.io/requests.storage: "100Gi" # 正确:使用 requests
最佳实践
1. 配额设计原则
yaml
# 1. 分层配额设计
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-base-quota
namespace: production
labels:
quota-tier: base
environment: production
review-cycle: quarterly
annotations:
description: "生产环境基础资源配额"
owner: "platform-team@company.com"
last-review: "2024-01-15"
next-review: "2024-04-15"
escalation-contact: "sre-team@company.com"
spec:
hard:
# 保守的基础配额
requests.cpu: "50" # 基础 CPU 配额
limits.cpu: "100" # 基础 CPU 限制
requests.memory: 200Gi # 基础内存配额
limits.memory: 400Gi # 基础内存限制
# 基础对象数量
pods: "200" # 基础 Pod 数量
deployments.apps: "50" # 基础 Deployment 数量
services: "30" # 基础 Service 数量
# 基础存储
requests.storage: "50Ti" # 基础存储配额
persistentvolumeclaims: "100" # 基础 PVC 数量
---
# 2. 渐进式配额增长
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-extended-quota
namespace: production
labels:
quota-tier: extended
environment: production
approval-required: "true"
spec:
hard:
# 扩展配额(需要审批)
requests.cpu: "100" # 扩展 CPU 配额
limits.cpu: "200" # 扩展 CPU 限制
requests.memory: 500Gi # 扩展内存配额
limits.memory: 1Ti # 扩展内存限制
# 扩展对象数量
pods: "500" # 扩展 Pod 数量
deployments.apps: "100" # 扩展 Deployment 数量
services: "80" # 扩展 Service 数量
# 扩展存储
requests.storage: "200Ti" # 扩展存储配额
persistentvolumeclaims: "300" # 扩展 PVC 数量
2. 监控和告警
yaml
# Prometheus 监控规则
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: resourcequota-monitoring
namespace: monitoring
spec:
groups:
- name: resourcequota.rules
rules:
# 配额使用率监控
- record: kubernetes:resourcequota:usage_ratio
expr: |
(
kube_resourcequota{type="used"}
/ ignoring(type)
kube_resourcequota{type="hard"}
)
# CPU 配额使用率告警
- alert: ResourceQuotaCPUUsageHigh
expr: |
(
kube_resourcequota{resource="requests.cpu", type="used"}
/ ignoring(type)
kube_resourcequota{resource="requests.cpu", type="hard"}
) > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "命名空间 CPU 配额使用率过高"
description: "命名空间 {{ $labels.namespace }} 的 CPU 配额使用率为 {{ $value | humanizePercentage }}"
# 内存配额使用率告警
- alert: ResourceQuotaMemoryUsageHigh
expr: |
(
kube_resourcequota{resource="requests.memory", type="used"}
/ ignoring(type)
kube_resourcequota{resource="requests.memory", type="hard"}
) > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "命名空间内存配额使用率过高"
description: "命名空间 {{ $labels.namespace }} 的内存配额使用率为 {{ $value | humanizePercentage }}"
# 配额即将耗尽告警
- alert: ResourceQuotaNearExhaustion
expr: |
(
kube_resourcequota{type="used"}
/ ignoring(type)
kube_resourcequota{type="hard"}
) > 0.95
for: 2m
labels:
severity: critical
annotations:
summary: "资源配额即将耗尽"
description: "命名空间 {{ $labels.namespace }} 的 {{ $labels.resource }} 配额使用率为 {{ $value | humanizePercentage }},即将耗尽"
# Pod 数量配额告警
- alert: ResourceQuotaPodCountHigh
expr: |
(
kube_resourcequota{resource="pods", type="used"}
/ ignoring(type)
kube_resourcequota{resource="pods", type="hard"}
) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Pod 数量配额使用率过高"
description: "命名空间 {{ $labels.namespace }} 的 Pod 数量配额使用率为 {{ $value | humanizePercentage }}"
3. 自动化配额管理
bash
#!/bin/bash
# 自动配额管理脚本
set -e
# 配置
CONFIG_FILE="/etc/kubernetes/quota-config.yaml"
LOG_FILE="/var/log/quota-manager.log"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"
# 日志函数
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a $LOG_FILE
}
# 发送 Slack 通知
send_slack_notification() {
local message="$1"
local color="$2"
if [[ -n "$SLACK_WEBHOOK" ]]; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"attachments\":[{\"color\":\"$color\",\"text\":\"$message\"}]}" \
$SLACK_WEBHOOK
fi
}
# 检查配额使用率
check_quota_usage() {
local namespace="$1"
local threshold="${2:-80}"
log "检查命名空间 $namespace 的配额使用情况"
# 获取配额使用情况
local quota_data=$(kubectl get resourcequota -n $namespace -o json 2>/dev/null)
if [[ -z "$quota_data" ]]; then
log "命名空间 $namespace 中没有 ResourceQuota"
return 0
fi
# 分析每个配额项
echo "$quota_data" | jq -r '
.items[] |
.metadata.name as $quota_name |
.status.hard as $hard |
.status.used as $used |
$hard | keys[] as $resource |
if $used[$resource] then
($used[$resource] | tonumber) / ($hard[$resource] | tonumber) * 100 as $percentage |
if $percentage > '$threshold' then
"WARNING: \($quota_name)/\($resource): \($percentage | floor)% (\($used[$resource])/\($hard[$resource]))"
else
"OK: \($quota_name)/\($resource): \($percentage | floor)% (\($used[$resource])/\($hard[$resource]))"
end
else
"OK: \($quota_name)/\($resource): 0% (0/\($hard[$resource]))"
end
' | while read line; do
if [[ $line == WARNING:* ]]; then
log "$line"
send_slack_notification "🚨 配额告警: $namespace - $line" "warning"
else
log "$line"
fi
done
}
# 自动扩展配额
auto_scale_quota() {
local namespace="$1"
local resource="$2"
local current_usage="$3"
local current_limit="$4"
local usage_percentage="$5"
# 如果使用率超过 90%,自动扩展 20%
if (( $(echo "$usage_percentage > 90" | bc -l) )); then
local new_limit=$(echo "$current_limit * 1.2" | bc -l | cut -d. -f1)
log "自动扩展配额: $namespace/$resource 从 $current_limit 扩展到 $new_limit"
# 这里应该调用配额更新 API 或生成配额更新请求
# kubectl patch resourcequota ... (需要具体实现)
send_slack_notification "📈 自动配额扩展: $namespace/$resource $current_limit → $new_limit" "good"
fi
}
# 生成配额报告
generate_quota_report() {
local output_file="/tmp/quota-report-$(date +%Y%m%d).html"
log "生成配额使用报告: $output_file"
# 注意:这里的 EOF 不加引号,以便下方 HTML 中的 $(date) 在生成报告时展开
cat > $output_file << EOF
<!DOCTYPE html>
<html>
<head>
<title>Kubernetes ResourceQuota 使用报告</title>
<style>
body { font-family: Arial, sans-serif; margin: 20px; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
.warning { background-color: #fff3cd; }
.critical { background-color: #f8d7da; }
.ok { background-color: #d4edda; }
</style>
</head>
<body>
<h1>Kubernetes ResourceQuota 使用报告</h1>
<p>生成时间: $(date)</p>
<table>
<tr>
<th>命名空间</th>
<th>配额名称</th>
<th>资源类型</th>
<th>已使用</th>
<th>总配额</th>
<th>使用率</th>
<th>状态</th>
</tr>
EOF
# 获取所有命名空间的配额信息
for namespace in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
kubectl get resourcequota -n $namespace -o json 2>/dev/null | jq -r '
.items[] |
.metadata.name as $quota_name |
.metadata.namespace as $ns |
(.status.hard // {}) as $hard |
(.status.used // {}) as $used |
$hard | keys[] as $resource |
($used[$resource] // "0") as $u |
($u | tonumber? // null) as $un |
($hard[$resource] | tonumber? // null) as $hn |
(if $un != null and $hn != null and $hn > 0 then (($un / $hn * 100) | floor) else null end) as $percentage |
(if $percentage == null then ["ok", "正常", "-"]
 elif $percentage > 90 then ["critical", "危险", "\($percentage)%"]
 elif $percentage > 80 then ["warning", "警告", "\($percentage)%"]
 else ["ok", "正常", "\($percentage)%"] end) as $state |
"    <tr class=\"\($state[0])\"><td>\($ns)</td><td>\($quota_name)</td><td>\($resource)</td><td>\($u)</td><td>\($hard[$resource])</td><td>\($state[2])</td><td>\($state[1])</td></tr>"
' >> $output_file
done
cat >> $output_file << 'EOF'
</table>
</body>
</html>
EOF
log "配额报告已生成: $output_file"
}
# 主函数
main() {
log "开始配额管理任务"
# 检查所有命名空间的配额使用情况
for namespace in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
check_quota_usage $namespace 80
done
# 生成每日报告
if [[ $(date +%H) == "09" ]]; then # 每天上午 9 点生成报告
generate_quota_report
fi
log "配额管理任务完成"
}
# 执行主函数
main "$@"
总结
ResourceQuota 是 Kubernetes 中重要的资源治理工具,它提供了命名空间级别的资源总量控制,是多租户环境和资源管理的核心组件。
关键要点:
- ResourceQuota 控制命名空间内资源的总体使用量,包括计算资源、存储资源和对象数量
- 支持多种配额类型和作用域,可以精确控制不同类型资源的使用
- 通过准入控制器实时检查和更新配额使用情况
- 是多租户环境资源隔离和公平分配的重要保障
- 需要结合监控告警和自动化管理来确保配额策略的有效执行
- 应该根据业务需求和环境特点制定合适的配额策略