

Smart Scaling Playbook! Cloud-Native Elastic Configuration for LLM Inference Services, Handling Traffic Peaks with Ease (2)

Annie出海

2025-11-15


Component 3:

CPU/memory autoscaling with HPA

Kubernetes' native Horizontal Pod Autoscaler (HPA) supports automatic scaling based on CPU and memory utilization.


HPA configuration example

# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa-cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-service
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

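Problem: CPU utilization ≠ inference load

The bottleneck in LLM inference is usually the GPU, not the CPU. Relying on CPU utilization alone can lead to:

the GPU being saturated while the CPU sits at only 30%, so scale-out is never triggered;
or the CPU running hot on data preprocessing while the GPU is idle, causing unnecessary scale-out.

Custom metrics are therefore required.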
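To make the mismatch concrete, here is a minimal Java sketch of the replica formula documented for the Kubernetes HPA, desired = ceil(currentReplicas × currentValue / targetValue), with the default 0.1 tolerance. The class name `HpaMath`, the 95% GPU figure, and the idea of a GPU utilization target are assumptions for illustration only; the real controller also applies stabilization windows and min/max bounds that are left out here.

```java
// Simplified sketch of the HPA replica calculation (not the controller's actual code).
class HpaMath {
    // Default HPA tolerance: ratios within 10% of the target cause no change.
    static final double TOLERANCE = 0.1;

    static int desiredReplicas(int currentReplicas, double currentValue, double targetValue) {
        double ratio = currentValue / targetValue;
        if (Math.abs(ratio - 1.0) <= TOLERANCE) {
            return currentReplicas; // close enough to the target: keep the replica count
        }
        return (int) Math.ceil(currentReplicas * ratio);
    }

    public static void main(String[] args) {
        int replicas = 3;
        // CPU at 30% against the 70% target from hpa-cpu.yaml.
        System.out.println("CPU-based: " + desiredReplicas(replicas, 30, 70));
        // Hypothetical GPU utilization of 95% against an assumed 70% target.
        System.out.println("GPU-based: " + desiredReplicas(replicas, 95, 70));
    }
}
```

With these assumed numbers, the CPU metric recommends 2 replicas (a scale-in) while a GPU-style metric would recommend 5, which is exactly the gap a CPU-only HPA cannot see.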

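Component 4:

Event-driven autoscaling with KEDA

KEDA is a CNCF graduated project that supports autoscaling driven by external event sources (such as Kafka, RabbitMQ, and Prometheus metrics).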


Installing KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

KEDA ScaledObject configuration

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: llm-inference-service
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.default.svc.cluster.local:9090
      metricName: llm_request_count
      threshold: "10"  # scale out when requests per second exceed about 10 per replica
      # The query should return a single value, so aggregate across all series
      query: |
        sum(rate(llm_request_count[2m]))
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.default.svc.cluster.local:9090
      metricName: llm_pending_requests
      threshold: "5"   # scale out when the queue exceeds about 5 pending requests per replica
      # Sum the gauge across Pods so the query yields one value
      query: |
        sum(llm_pending_requests)
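How the two thresholds turn into a replica count can be sketched in a few lines of Java. KEDA hands each trigger to an HPA as an average-value target, so each trigger proposes roughly ceil(metricValue / threshold) replicas, the largest proposal wins, and the result is clamped to minReplicaCount and maxReplicaCount. The class name `KedaReplicaSketch` and the sample metric values are assumptions; tolerance, polling intervals, and cooldown behavior are deliberately left out.

```java
// Simplified sketch of how multiple triggers combine (not KEDA's actual code).
class KedaReplicaSketch {

    static int desiredReplicas(double[] metricValues, double[] thresholds,
                               int minReplicas, int maxReplicas) {
        int desired = 0;
        for (int i = 0; i < metricValues.length; i++) {
            // Each trigger proposes enough replicas to bring the per-replica
            // average back down to its threshold.
            int perTrigger = (int) Math.ceil(metricValues[i] / thresholds[i]);
            desired = Math.max(desired, perTrigger); // the most demanding trigger wins
        }
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        double[] thresholds = {10, 5};   // request rate and pending requests, as in the ScaledObject
        double[] observed = {35, 12};    // hypothetical values: 35 req/s, 12 queued requests
        System.out.println(desiredReplicas(observed, thresholds, 1, 20));
    }
}
```

With the assumed readings of 35 requests per second and 12 queued requests, the rate trigger asks for 4 replicas and the queue trigger for 3, so the sketch settles on 4.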

Java code: simulating a request queue

// src/main/java/com/ai/inference/service/InferenceQueue.java
package com.ai.inference.service;

import com.ai.inference.metrics.LLMMetrics;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

@Service
public class InferenceQueue {

    // Bounded queue: at most 100 prompts can wait at any time
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
    private final LLMMetrics llmMetrics;

    @Autowired
    public InferenceQueue(LLMMetrics llmMetrics) {
        this.llmMetrics = llmMetrics;
    }

    // Non-blocking submit: returns false when the queue is full
    public boolean submit(String prompt) {
        boolean offered = queue.offer(prompt);
        if (offered) {
            // Update the pending-request count
            llmMetrics.recordPendingRequests(queue.size());
        }
        return offered;
    }

    // Requires scheduling to be enabled (for example, @EnableScheduling on a configuration class)
    @Scheduled(fixedDelay = 100)
    public void process() {
        String prompt = queue.poll();
        if (prompt != null) {
            // Call the inference service
            simulateInference(prompt);
            llmMetrics.recordPendingRequests(queue.size());
        }
    }

    private void simulateInference(String prompt) {
        try {
            Thread.sleep(500 + (long) (Math.random() * 1000)); // simulate latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
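The behavior of the bounded queue under overload is easy to miss in the Spring version, so here is a stripped-down sketch without Spring. It uses a capacity of 2 instead of 100 to keep the demonstration short, and a plain `AtomicInteger` as a hypothetical stand-in for the `llm_pending_requests` gauge; the class name `BoundedQueueSketch` and the method `takeNext` are made up for this example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Spring-free sketch of the queue's submit/poll behavior.
class BoundedQueueSketch {
    private final BlockingQueue<String> queue;
    // Hypothetical stand-in for the pending-requests gauge.
    final AtomicInteger pendingGauge = new AtomicInteger();

    BoundedQueueSketch(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    // offer() never blocks: it returns false immediately when the queue is full.
    boolean submit(String prompt) {
        boolean offered = queue.offer(prompt);
        if (offered) {
            pendingGauge.set(queue.size());
        }
        return offered;
    }

    // poll() returns null when the queue is empty.
    String takeNext() {
        String prompt = queue.poll();
        if (prompt != null) {
            pendingGauge.set(queue.size());
        }
        return prompt;
    }

    public static void main(String[] args) {
        BoundedQueueSketch q = new BoundedQueueSketch(2);
        System.out.println(q.submit("a")); // accepted
        System.out.println(q.submit("b")); // accepted
        System.out.println(q.submit("c")); // rejected: queue is full
        System.out.println(q.pendingGauge.get());
    }
}
```

A rejected submit leaves the gauge untouched, so the pending count can never exceed the queue capacity. Callers need to decide what a `false` return means for the client, for example an HTTP 429 response.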

Component 5:

GPU-aware scheduling and resource optimization

Kubernetes supports scheduling GPU resources, but it must be configured correctly.

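Node labels, taints, and tolerations

# Manual labeling and tainting (labels are often applied automatically by cluster tooling such as GPU feature discovery)
kubectl label nodes gpu-node-1 accelerator=nvidia-tesla-t4
kubectl taint nodes gpu-node-1 accelerator=nvidia-tesla-t4:NoSchedule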

Inference service Pod configuration

# llm-deployment-gpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: inference
        image: your-llm-service:1.0
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "2"
        env:
        - name: MODEL_NAME
          value: "llama-3-8b"
      tolerations:
      - key: accelerator
        operator: Equal
        value: nvidia-tesla-t4
        effect: NoSchedule

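Monitoring GPU usage

Use dcgm-exporter to expose GPU metrics to Prometheus:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter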

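Query GPU utilization (the container label must match the container name in the Deployment, which is inference in this article):

DCGM_FI_DEV_GPU_UTIL{container="inference"}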
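If you want to read this metric from application code rather than the Prometheus UI, the query goes to Prometheus's instant-query endpoint, `/api/v1/query`, as a URL-encoded `query` parameter. The sketch below only builds the request URI and does not send it. The class name `GpuUtilQuery`, the `avg(...)` wrapper, and the container name passed in are assumptions for illustration; the server address is the one used in the ScaledObject earlier in this article.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds (but does not send) a Prometheus instant-query URI for GPU utilization.
class GpuUtilQuery {

    // Average GPU utilization across all GPUs used by the given container (assumed aggregation).
    static String promQl(String container) {
        return "avg(DCGM_FI_DEV_GPU_UTIL{container=\"" + container + "\"})";
    }

    // PromQL contains braces, quotes, and parentheses, so it must be URL-encoded.
    static URI instantQueryUri(String prometheusBase, String promQl) {
        String encoded = URLEncoder.encode(promQl, StandardCharsets.UTF_8);
        return URI.create(prometheusBase + "/api/v1/query?query=" + encoded);
    }

    public static void main(String[] args) {
        String base = "http://prometheus-server.default.svc.cluster.local:9090";
        // "inference" is the container name from the Deployment in this article.
        System.out.println(instantQueryUri(base, promQl("inference")));
    }
}
```

The resulting URI can be passed to any HTTP client, such as `java.net.http.HttpClient`. Wrapping the metric in `avg(...)` also makes the expression return a single value, which is what a KEDA Prometheus trigger expects if you later scale on it.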

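Advanced strategies:

Predictive autoscaling and cost optimization

Time-based predictive scaling

Use KEDA's cron trigger to pre-warm instances ahead of known peaks.

triggers:
- type: cron
  metadata:
    timezone: Asia/Shanghai
    start: 0 8 * * 1-5  # weekdays 08:00
    end: 0 18 * * 1-5   # weekdays 18:00
    desiredReplicas: "5"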
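The effect of this trigger can be sketched as a floor on the replica count: inside the weekday 08:00 to 18:00 window the cron trigger asks for 5 replicas, outside it the count can fall back to the minimum, and a metric-based trigger can still push the number higher at any time. The class name `CronWindowSketch`, the parameter names, and the sample values are assumptions; times are taken to be already in Asia/Shanghai, and KEDA's actual evaluation is more involved than this.

```java
import java.time.DayOfWeek;
import java.time.LocalTime;

// Simplified sketch of a weekday cron window acting as a replica floor.
class CronWindowSketch {
    static final LocalTime START = LocalTime.of(8, 0);   // start: 0 8 * * 1-5
    static final LocalTime END = LocalTime.of(18, 0);    // end:   0 18 * * 1-5

    // True from 08:00 (inclusive) to 18:00 (exclusive), Monday through Friday.
    static boolean inWindow(DayOfWeek day, LocalTime time) {
        boolean weekday = day.getValue() <= DayOfWeek.FRIDAY.getValue();
        return weekday && !time.isBefore(START) && time.isBefore(END);
    }

    // The cron trigger sets a floor; metric-driven demand can exceed it.
    static int replicas(DayOfWeek day, LocalTime time,
                        int metricReplicas, int cronReplicas, int minReplicas) {
        int floor = inWindow(day, time) ? cronReplicas : minReplicas;
        return Math.max(floor, metricReplicas);
    }

    public static void main(String[] args) {
        // Monday 09:00 with low traffic: the cron floor of 5 applies.
        System.out.println(replicas(DayOfWeek.MONDAY, LocalTime.of(9, 0), 2, 5, 1));
        // Sunday 09:00 with no traffic: falls back to the minimum.
        System.out.println(replicas(DayOfWeek.SUNDAY, LocalTime.of(9, 0), 0, 5, 1));
    }
}
```

Because the window opens exactly at 08:00, instances start warming up at the moment the peak begins. If model loading takes several minutes, moving the start earlier gives the new Pods time to become ready.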

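Tiered deployment: CPU vs GPU instances

GPU instances: handle real-time inference;
CPU instances: handle asynchronous tasks, batch inference, and cache warm-up.

// Route by workload type
public String routeInference(String prompt, boolean isRealTime) {
    if (isRealTime) {
        return gpuClient.infer(prompt);
    }
    // Assumes asyncQueue is the InferenceQueue above, whose submit() returns a boolean;
    // the status strings are illustrative
    return asyncQueue.submit(prompt) ? "QUEUED" : "REJECTED";
}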

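Cost monitoring

Use Kubecost to monitor the cost of GPU resources.

helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer --namespace kubecost --create-namespace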

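Service mesh integration:

Istio + traffic management

When multiple versions of the inference service run side by side (for example, in A/B testing), Istio enables fine-grained traffic control.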


Istio configuration

# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-routing
spec:
  hosts:
  - llm.example.com
  http:
  - route:
    - destination:
        host: llm-inference-service-v1
      weight: 90
    - destination:
        host: llm-inference-service-v2
      weight: 10
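What a 90/10 weight split means for individual requests can be illustrated with a small Java sketch. It maps a number in [0, 100) to a destination by walking cumulative weights. This is only a model of the proportions; how the Istio sidecar actually selects a destination is internal to the proxy, and the class name `WeightedSplitSketch` is made up for this example.

```java
// Illustrative model of a weighted traffic split (not Istio's actual algorithm).
class WeightedSplitSketch {
    static final String[] HOSTS = {"llm-inference-service-v1", "llm-inference-service-v2"};
    static final int[] WEIGHTS = {90, 10}; // must sum to 100

    // roll is expected to be uniformly distributed in [0, 100).
    static String pick(int roll) {
        if (roll < 0) {
            throw new IllegalArgumentException("roll must be in [0, 100)");
        }
        int cumulative = 0;
        for (int i = 0; i < HOSTS.length; i++) {
            cumulative += WEIGHTS[i];
            if (roll < cumulative) {
                return HOSTS[i];
            }
        }
        throw new IllegalArgumentException("roll must be in [0, 100)");
    }

    public static void main(String[] args) {
        int v2 = 0;
        for (int roll = 0; roll < 100; roll++) {
            if (pick(roll).equals(HOSTS[1])) {
                v2++;
            }
        }
        System.out.println("requests routed to v2 out of 100: " + v2);
    }
}
```

Over many requests, about one in ten reaches v2, which keeps the blast radius of a bad model version small while still producing enough traffic to compare latency and error rates between the two versions.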