轩辕李的博客 轩辕李的博客
首页
  • Java
  • Spring
  • 其他语言
  • 工具
  • JavaScript
  • TypeScript
  • Node.js
  • Vue.js
  • 前端工程化
  • 浏览器与Web API
  • 架构设计与模式
  • 代码质量管理
  • 基础
  • 操作系统
  • 计算机网络
  • 编程范式
  • 安全
  • 中间件
  • 心得
关于
  • 分类
  • 标签
  • 归档
GitHub (opens new window)

轩辕李

勇猛精进,星辰大海
首页
  • Java
  • Spring
  • 其他语言
  • 工具
  • JavaScript
  • TypeScript
  • Node.js
  • Vue.js
  • 前端工程化
  • 浏览器与Web API
  • 架构设计与模式
  • 代码质量管理
  • 基础
  • 操作系统
  • 计算机网络
  • 编程范式
  • 安全
  • 中间件
  • 心得
关于
  • 分类
  • 标签
  • 归档
GitHub (opens new window)
  • 架构设计与模式

    • 高可用-分布式基础之CAP理论
    • 高可用-服务容错与降级策略
      • 一、为什么需要容错与降级
        • 1、分布式系统的脆弱性
        • 2、容错与降级的价值
        • 3、容错降级的设计原则
      • 二、核心容错策略
        • 1、超时控制(Timeout)
        • 1.1、超时时间的设定
        • 1.2、超时时间的计算
        • 2、重试机制(Retry)
        • 2.1、重试策略
        • 2.2、重试的注意事项
        • 3、限流(Rate Limiting)
        • 3.1、常见限流算法
        • 3.1.1、固定窗口计数器
        • 3.1.2、滑动窗口
        • 3.1.3、令牌桶(Token Bucket)
        • 3.1.4、漏桶(Leaky Bucket)
        • 3.2、分布式限流
        • 4、熔断(Circuit Breaker)
        • 4.1、熔断器的三种状态
        • 4.2、熔断器实现
        • 4.3、使用熔断器
        • 5、隔离(Isolation)
        • 5.1、线程池隔离
        • 5.2、信号量隔离
      • 三、服务降级策略
        • 1、降级的分类
        • 1.1、自动降级
        • 1.2、手动降级
        • 2、降级的级别
        • 2.1、功能降级
        • 2.2、读降级
        • 2.3、写降级
        • 3、降级的实现方式
        • 3.1、返回默认值
        • 3.2、返回缓存数据
        • 3.3、返回兜底数据
      • 四、Sentinel 实战
        • 1、Sentinel 简介
        • 2、快速开始
        • 2.1、引入依赖
        • 2.2、配置 Sentinel Dashboard
        • 2.3、定义资源
        • 3、流量控制
        • 3.1、QPS 限流
        • 3.2、并发线程数限流
        • 3.3、关联流控
        • 3.4、链路流控
        • 4、熔断降级
        • 4.1、慢调用比例
        • 4.2、异常比例
        • 4.3、异常数
        • 5、系统自适应保护
        • 6、热点参数限流
      • 五、Netflix Hystrix
        • 1、基本使用
        • 1.1、引入依赖
        • 1.2、启用 Hystrix
        • 1.3、定义 Fallback
        • 2、线程池隔离
        • 3、请求合并
        • 4、请求缓存
      • 六、容错降级最佳实践
        • 1、容错策略组合
        • 2、超时时间设计
        • 3、熔断阈值设计
        • 4、降级优先级
        • 5、监控与告警
        • 6、优雅的降级响应
      • 七、总结
    • 高性能-缓存架构设计
    • 高性能-性能优化方法论
  • 代码质量管理

  • 基础

  • 操作系统

  • 计算机网络

  • AI

  • 编程范式

  • 安全

  • 中间件

  • 心得

  • 架构
  • 架构设计与模式
轩辕李
2024-06-11
目录

高可用-服务容错与降级策略

在分布式系统中,服务故障是常态而非异常。一个健壮的系统必须具备自我保护能力,当部分服务出现问题时,能够通过容错、降级、熔断等策略保证整体系统的可用性。本文将深入探讨服务容错与降级的核心原理、实现方式和最佳实践。

# 一、为什么需要容错与降级

# 1、分布式系统的脆弱性

在微服务架构中,一个业务请求可能需要经过多个服务的协作:

用户请求 → API网关 → 订单服务 → 库存服务 → 支付服务 → 物流服务

如果任何一个环节出现问题,整个链路都可能受到影响。常见的问题包括:

  • 服务超时:下游服务响应缓慢,导致上游服务线程阻塞
  • 服务异常:某个服务抛出异常,导致整个调用链失败
  • 资源耗尽:数据库连接池、线程池被耗尽
  • 雪崩效应:一个服务的故障引发连锁反应,导致整个系统崩溃

# 2、容错与降级的价值

容错是指系统能够在部分组件失效时继续运行,不至于完全崩溃。

降级是指在系统压力过大或部分功能不可用时,主动关闭或简化部分非核心功能,保证核心业务正常运行。

两者的核心目标都是:牺牲部分功能,保证整体可用。

# 3、容错降级的设计原则

  1. 快速失败:不要长时间等待无响应的服务
  2. 优雅降级:提供有损但可用的服务,而不是直接报错
  3. 故障隔离:防止故障扩散到其他服务
  4. 自动恢复:系统能够在故障消除后自动恢复
  5. 监控告警:及时发现问题并通知运维人员

# 二、核心容错策略

# 1、超时控制(Timeout)

问题:如果不设置超时,一个慢服务会拖垮整个系统。

解决方案:为每个远程调用设置合理的超时时间。

# 1.1、超时时间的设定

@RestController
public class OrderController {
    
    @Autowired
    private RestTemplate restTemplate;
    
    @GetMapping("/order/{id}")
    public Order getOrder(@PathVariable Long id) {
        HttpComponentsClientHttpRequestFactory factory = 
            new HttpComponentsClientHttpRequestFactory();
        factory.setConnectTimeout(1000);
        factory.setReadTimeout(3000);
        
        restTemplate.setRequestFactory(factory);
        
        return restTemplate.getForObject(
            "http://inventory-service/inventory/" + id, 
            Order.class
        );
    }
}

# 1.2、超时时间的计算

合理的超时时间应该考虑:

  • 正常响应时间的 P99 值:99% 的请求能在这个时间内完成
  • 业务容忍度:用户能接受的最长等待时间
  • 重试次数:如果有重试机制,要预留重试时间

公式:

超时时间 = P99响应时间 × 1.5 + 网络延迟 + 容错时间

# 2、重试机制(Retry)

问题:网络抖动、临时故障可能导致偶发失败。

解决方案:对失败的请求进行重试,提高成功率。

# 2.1、重试策略

@Service
public class InventoryService {
    
    private static final int MAX_RETRY = 3;
    private static final long RETRY_DELAY = 100;
    
    public Inventory getInventory(Long productId) {
        int attempts = 0;
        Exception lastException = null;
        
        while (attempts < MAX_RETRY) {
            try {
                return inventoryClient.getInventory(productId);
            } catch (Exception e) {
                attempts++;
                lastException = e;
                
                if (attempts < MAX_RETRY) {
                    try {
                        Thread.sleep(RETRY_DELAY * attempts);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        }
        
        throw new ServiceException("Failed after " + MAX_RETRY + " retries", 
                                   lastException);
    }
}

# 2.2、重试的注意事项

  1. 幂等性保证:重试的操作必须是幂等的,否则可能导致数据重复
  2. 退避策略:使用指数退避(Exponential Backoff)避免重试风暴
  3. 重试次数限制:避免无限重试导致资源耗尽
  4. 区分错误类型:网络错误可以重试,业务错误不应重试

指数退避示例:

第1次重试: 延迟100ms
第2次重试: 延迟200ms
第3次重试: 延迟400ms
第4次重试: 延迟800ms

# 3、限流(Rate Limiting)

问题:突发流量可能压垮服务。

解决方案:限制单位时间内的请求数量,超过阈值则拒绝服务。

# 3.1、常见限流算法

# 3.1.1、固定窗口计数器
public class FixedWindowRateLimiter {
    
    private final int maxRequests;
    private final long windowSizeInMs;
    
    private final AtomicInteger counter = new AtomicInteger(0);
    private volatile long windowStart = System.currentTimeMillis();
    
    public FixedWindowRateLimiter(int maxRequests, long windowSizeInMs) {
        this.maxRequests = maxRequests;
        this.windowSizeInMs = windowSizeInMs;
    }
    
    public boolean allowRequest() {
        long now = System.currentTimeMillis();
        
        if (now - windowStart >= windowSizeInMs) {
            synchronized (this) {
                if (now - windowStart >= windowSizeInMs) {
                    counter.set(0);
                    windowStart = now;
                }
            }
        }
        
        return counter.incrementAndGet() <= maxRequests;
    }
}

优点:实现简单,内存占用小。

缺点:窗口临界点流量不均匀,可能出现突刺。

# 3.1.2、滑动窗口
public class SlidingWindowRateLimiter {
    
    private final int maxRequests;
    private final long windowSizeInMs;
    
    private final Queue<Long> requestTimestamps = new ConcurrentLinkedQueue<>();
    
    public SlidingWindowRateLimiter(int maxRequests, long windowSizeInMs) {
        this.maxRequests = maxRequests;
        this.windowSizeInMs = windowSizeInMs;
    }
    
    public boolean allowRequest() {
        long now = System.currentTimeMillis();
        long threshold = now - windowSizeInMs;
        
        while (!requestTimestamps.isEmpty() && 
               requestTimestamps.peek() < threshold) {
            requestTimestamps.poll();
        }
        
        if (requestTimestamps.size() < maxRequests) {
            requestTimestamps.offer(now);
            return true;
        }
        
        return false;
    }
}

优点:流量控制更平滑。

缺点:内存占用较大,需要存储每次请求的时间戳。

# 3.1.3、令牌桶(Token Bucket)
public class TokenBucketRateLimiter {
    
    private final long capacity;
    private final long refillRate;
    
    private double tokens;
    private long lastRefillTime;
    
    public TokenBucketRateLimiter(long capacity, long refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = capacity;
        this.lastRefillTime = System.currentTimeMillis();
    }
    
    public synchronized boolean allowRequest() {
        refillTokens();
        
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        
        return false;
    }
    
    private void refillTokens() {
        long now = System.currentTimeMillis();
        long elapsedTime = now - lastRefillTime;
        
        double tokensToAdd = (elapsedTime / 1000.0) * refillRate;
        tokens = Math.min(capacity, tokens + tokensToAdd);
        
        lastRefillTime = now;
    }
}

优点:允许一定程度的突发流量。

缺点:实现相对复杂。

# 3.1.4、漏桶(Leaky Bucket)
public class LeakyBucketRateLimiter {
    
    private final long capacity;
    private final long leakRate;
    
    private long water = 0;
    private long lastLeakTime;
    
    public LeakyBucketRateLimiter(long capacity, long leakRate) {
        this.capacity = capacity;
        this.leakRate = leakRate;
        this.lastLeakTime = System.currentTimeMillis();
    }
    
    public synchronized boolean allowRequest() {
        leak();
        
        if (water < capacity) {
            water++;
            return true;
        }
        
        return false;
    }
    
    private void leak() {
        long now = System.currentTimeMillis();
        long elapsedTime = now - lastLeakTime;
        
        long leaked = (elapsedTime / 1000) * leakRate;
        water = Math.max(0, water - leaked);
        
        lastLeakTime = now;
    }
}

优点:流量绝对均匀,适合保护下游服务。

缺点:不支持突发流量。

# 3.2、分布式限流

单机限流在分布式环境下不够用,需要使用 Redis 实现分布式限流:

@Service
public class RedisRateLimiter {
    
    @Autowired
    private StringRedisTemplate redisTemplate;
    
    public boolean allowRequest(String key, int maxRequests, long windowInSeconds) {
        String redisKey = "rate_limit:" + key;
        long now = System.currentTimeMillis();
        long windowStart = now - windowInSeconds * 1000;
        
        redisTemplate.opsForZSet().removeRangeByScore(redisKey, 0, windowStart);
        
        Long count = redisTemplate.opsForZSet().zCard(redisKey);
        
        if (count != null && count < maxRequests) {
            redisTemplate.opsForZSet().add(redisKey, String.valueOf(now), now);
            redisTemplate.expire(redisKey, windowInSeconds, TimeUnit.SECONDS);
            return true;
        }
        
        return false;
    }
}

# 4、熔断(Circuit Breaker)

问题:当下游服务持续故障时,继续调用只会浪费资源。

解决方案:检测故障率,达到阈值后自动熔断,快速失败。

# 4.1、熔断器的三种状态

        +----------------+
        |     关闭        |  <-- 正常状态,请求正常通过
        | (Closed)       |
        +--------+-------+
                 |
                 | 失败率超过阈值
                 v
        +--------+-------+
        |     打开        |  <-- 拒绝所有请求,快速失败
        | (Open)         |
        +--------+-------+
                 |
                 | 经过冷却时间
                 v
        +--------+-------+
        |    半开         |  <-- 尝试部分请求,测试恢复
        | (Half-Open)    |
        +----------------+
                 |
     +-----------+-----------+
     |                       |
失败率仍然高              失败率降低
     |                       |
     v                       v
   打开                    关闭

# 4.2、熔断器实现

public class CircuitBreaker {
    
    private enum State {
        CLOSED, OPEN, HALF_OPEN
    }
    
    private State state = State.CLOSED;
    
    private final int failureThreshold;
    
    private final long retryTimeout;
    
    private int failureCount = 0;
    private int successCount = 0;
    private long lastFailureTime = 0;
    
    public CircuitBreaker(int failureThreshold, long retryTimeout) {
        this.failureThreshold = failureThreshold;
        this.retryTimeout = retryTimeout;
    }
    
    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - lastFailureTime >= retryTimeout) {
                state = State.HALF_OPEN;
                failureCount = 0;
                successCount = 0;
                return true;
            }
            return false;
        }
        
        return true;
    }
    
    public synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) {
            successCount++;
            if (successCount >= 3) {
                state = State.CLOSED;
                failureCount = 0;
            }
        } else if (state == State.CLOSED) {
            failureCount = 0;
        }
    }
    
    public synchronized void recordFailure() {
        failureCount++;
        lastFailureTime = System.currentTimeMillis();
        
        if (state == State.HALF_OPEN) {
            state = State.OPEN;
        } else if (failureCount >= failureThreshold) {
            state = State.OPEN;
        }
    }
    
    public State getState() {
        return state;
    }
}

# 4.3、使用熔断器

@Service
public class PaymentService {
    
    private final CircuitBreaker circuitBreaker = 
        new CircuitBreaker(5, 60000);
    
    public PaymentResult makePayment(PaymentRequest request) {
        if (!circuitBreaker.allowRequest()) {
            return PaymentResult.fallback("服务暂时不可用,请稍后重试");
        }
        
        try {
            PaymentResult result = paymentClient.pay(request);
            circuitBreaker.recordSuccess();
            return result;
        } catch (Exception e) {
            circuitBreaker.recordFailure();
            return PaymentResult.fallback("支付失败,请稍后重试");
        }
    }
}

# 5、隔离(Isolation)

问题:一个服务的故障可能占用所有资源,影响其他服务。

解决方案:为不同的服务分配独立的资源池,实现故障隔离。

# 5.1、线程池隔离

@Configuration
public class ThreadPoolConfig {
    
    @Bean("orderThreadPool")
    public ThreadPoolExecutor orderThreadPool() {
        return new ThreadPoolExecutor(
            10,
            20,
            60L,
            TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.CallerRunsPolicy()
        );
    }
    
    @Bean("paymentThreadPool")
    public ThreadPoolExecutor paymentThreadPool() {
        return new ThreadPoolExecutor(
            5,
            10,
            60L,
            TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(50),
            new ThreadPoolExecutor.CallerRunsPolicy()
        );
    }
}

@Service
public class OrderService {
    
    @Autowired
    @Qualifier("orderThreadPool")
    private ThreadPoolExecutor orderThreadPool;
    
    public CompletableFuture<Order> createOrderAsync(OrderRequest request) {
        return CompletableFuture.supplyAsync(
            () -> createOrder(request),
            orderThreadPool
        );
    }
    
    private Order createOrder(OrderRequest request) {
        return null;
    }
}

优点:

  • 故障隔离:某个服务的线程池耗尽不影响其他服务
  • 资源控制:可以精确控制每个服务的并发数

缺点:

  • 资源开销:每个线程池都需要占用内存
  • 上下文切换:频繁的线程切换可能影响性能

# 5.2、信号量隔离

public class SemaphoreIsolation {
    
    private final Semaphore semaphore;
    
    public SemaphoreIsolation(int permits) {
        this.semaphore = new Semaphore(permits);
    }
    
    public <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (semaphore.tryAcquire()) {
            try {
                return command.get();
            } finally {
                semaphore.release();
            }
        } else {
            return fallback.get();
        }
    }
}

@Service
public class InventoryService {
    
    private final SemaphoreIsolation isolation = new SemaphoreIsolation(10);
    
    public Inventory getInventory(Long productId) {
        return isolation.execute(
            () -> inventoryClient.getInventory(productId),
            () -> Inventory.unavailable()
        );
    }
}

优点:

  • 轻量级:不需要创建线程,开销小
  • 性能好:没有线程切换的开销

缺点:

  • 不支持超时:无法主动中断执行
  • 隔离性较弱:共享同一个线程池

# 三、服务降级策略

# 1、降级的分类

# 1.1、自动降级

系统根据监控指标自动触发降级:

@Service
public class RecommendationService {
    
    @Autowired
    private MetricsCollector metrics;
    
    public List<Product> getRecommendations(Long userId) {
        if (metrics.getCpuUsage() > 80 || metrics.getMemoryUsage() > 85) {
            return getCachedRecommendations(userId);
        }
        
        if (metrics.getResponseTime("recommendation") > 1000) {
            return getDefaultRecommendations();
        }
        
        return generateRecommendations(userId);
    }
    
    private List<Product> getCachedRecommendations(Long userId) {
        return null;
    }
    
    private List<Product> getDefaultRecommendations() {
        return null;
    }
    
    private List<Product> generateRecommendations(Long userId) {
        return null;
    }
}

# 1.2、手动降级

运维人员通过开关手动触发降级:

@RestController
public class FeatureToggleController {
    
    @Autowired
    private FeatureToggleService toggleService;
    
    @PostMapping("/admin/toggle/{feature}")
    public void toggleFeature(@PathVariable String feature, 
                              @RequestParam boolean enabled) {
        toggleService.setFeatureEnabled(feature, enabled);
    }
}

@Service
public class CommentService {
    
    @Autowired
    private FeatureToggleService toggleService;
    
    public List<Comment> getComments(Long articleId) {
        if (!toggleService.isFeatureEnabled("comment")) {
            return Collections.emptyList();
        }
        
        return commentRepository.findByArticleId(articleId);
    }
}

# 2、降级的级别

# 2.1、功能降级

关闭非核心功能:

@Service
public class ProductDetailService {
    
    @Autowired
    private FeatureToggleService toggleService;
    
    public ProductDetail getProductDetail(Long productId) {
        ProductDetail detail = new ProductDetail();
        
        detail.setBasicInfo(getBasicInfo(productId));
        
        if (toggleService.isFeatureEnabled("product.reviews")) {
            detail.setReviews(getReviews(productId));
        }
        
        if (toggleService.isFeatureEnabled("product.qa")) {
            detail.setQaList(getQAList(productId));
        }
        
        if (toggleService.isFeatureEnabled("product.recommendations")) {
            detail.setRecommendations(getRecommendations(productId));
        }
        
        return detail;
    }
    
    private ProductBasicInfo getBasicInfo(Long productId) {
        return null;
    }
    
    private List<Review> getReviews(Long productId) {
        return null;
    }
    
    private List<QA> getQAList(Long productId) {
        return null;
    }
    
    private List<Product> getRecommendations(Long productId) {
        return null;
    }
}

# 2.2、读降级

使用缓存或默认值:

@Service
public class UserService {
    
    @Autowired
    private RedisTemplate<String, User> redisTemplate;
    
    @Autowired
    private UserRepository userRepository;
    
    public User getUser(Long userId) {
        String cacheKey = "user:" + userId;
        
        User cachedUser = redisTemplate.opsForValue().get(cacheKey);
        if (cachedUser != null) {
            return cachedUser;
        }
        
        try {
            User user = userRepository.findById(userId).orElse(null);
            if (user != null) {
                redisTemplate.opsForValue().set(cacheKey, user, 10, TimeUnit.MINUTES);
            }
            return user;
        } catch (Exception e) {
            return getDefaultUser(userId);
        }
    }
    
    private User getDefaultUser(Long userId) {
        User user = new User();
        user.setId(userId);
        user.setNickname("用户" + userId);
        return user;
    }
}

# 2.3、写降级

异步处理或延迟写入:

@Service
public class LogService {
    
    @Autowired
    private KafkaTemplate<String, LogEntry> kafkaTemplate;
    
    @Autowired
    private LogRepository logRepository;
    
    private final Queue<LogEntry> logQueue = new ConcurrentLinkedQueue<>();
    
    public void log(LogEntry entry) {
        try {
            kafkaTemplate.send("logs", entry);
        } catch (Exception e) {
            logQueue.offer(entry);
            
            if (logQueue.size() > 1000) {
                flushLogs();
            }
        }
    }
    
    @Scheduled(fixedDelay = 5000)
    public void flushLogs() {
        List<LogEntry> logs = new ArrayList<>();
        LogEntry entry;
        while ((entry = logQueue.poll()) != null) {
            logs.add(entry);
        }
        
        if (!logs.isEmpty()) {
            logRepository.saveAll(logs);
        }
    }
}

# 3、降级的实现方式

# 3.1、返回默认值

@Service
public class WeatherService {
    
    public Weather getWeather(String city) {
        try {
            return weatherClient.getWeather(city);
        } catch (Exception e) {
            return Weather.defaultWeather(city);
        }
    }
}

# 3.2、返回缓存数据

@Service
public class NewsService {
    
    @Autowired
    private RedisTemplate<String, List<News>> redisTemplate;
    
    @Cacheable(value = "hot_news", unless = "#result == null")
    public List<News> getHotNews() {
        return newsRepository.findHotNews();
    }
    
    public List<News> getHotNewsWithFallback() {
        try {
            return getHotNews();
        } catch (Exception e) {
            List<News> cachedNews = redisTemplate.opsForValue().get("hot_news");
            if (cachedNews != null) {
                return cachedNews;
            }
            return Collections.emptyList();
        }
    }
}

# 3.3、返回兜底数据

@Service
public class BannerService {
    
    private static final List<Banner> FALLBACK_BANNERS = Arrays.asList(
        new Banner("默认广告1", "/images/default1.jpg"),
        new Banner("默认广告2", "/images/default2.jpg")
    );
    
    public List<Banner> getBanners() {
        try {
            return bannerRepository.findActiveBanners();
        } catch (Exception e) {
            return FALLBACK_BANNERS;
        }
    }
}

# 四、Sentinel 实战

# 1、Sentinel 简介

Sentinel 是阿里开源的流量控制和熔断降级组件,提供:

  • 流量控制:QPS 限流、并发线程数限流
  • 熔断降级:慢调用比例、异常比例、异常数
  • 系统自适应保护:CPU、内存、入口 QPS、线程数
  • 热点参数限流:对特定参数进行限流

# 2、快速开始

# 2.1、引入依赖

<dependency>
    <groupId>com.alibaba.cloud</groupId>
    <artifactId>spring-cloud-starter-alibaba-sentinel</artifactId>
</dependency>

# 2.2、配置 Sentinel Dashboard

spring:
  cloud:
    sentinel:
      transport:
        dashboard: localhost:8080
        port: 8719
      eager: true

# 2.3、定义资源

@RestController
public class OrderController {
    
    @GetMapping("/order/{id}")
    @SentinelResource(
        value = "getOrder",
        blockHandler = "handleBlock",
        fallback = "handleFallback"
    )
    public Order getOrder(@PathVariable Long id) {
        return orderService.getOrder(id);
    }
    
    public Order handleBlock(Long id, BlockException ex) {
        return Order.blocked("请求被限流或降级");
    }
    
    public Order handleFallback(Long id, Throwable ex) {
        return Order.error("系统异常,请稍后重试");
    }
}

# 3、流量控制

# 3.1、QPS 限流

@PostConstruct
public void initFlowRules() {
    List<FlowRule> rules = new ArrayList<>();
    
    FlowRule rule = new FlowRule();
    rule.setResource("getOrder");
    rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
    rule.setCount(100);
    
    rules.add(rule);
    FlowRuleManager.loadRules(rules);
}

# 3.2、并发线程数限流

@PostConstruct
public void initFlowRules() {
    List<FlowRule> rules = new ArrayList<>();
    
    FlowRule rule = new FlowRule();
    rule.setResource("createOrder");
    rule.setGrade(RuleConstant.FLOW_GRADE_THREAD);
    rule.setCount(20);
    
    rules.add(rule);
    FlowRuleManager.loadRules(rules);
}

# 3.3、关联流控

@PostConstruct
public void initFlowRules() {
    List<FlowRule> rules = new ArrayList<>();
    
    FlowRule rule = new FlowRule();
    rule.setResource("readOrder");
    rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
    rule.setCount(100);
    rule.setStrategy(RuleConstant.STRATEGY_RELATE);
    rule.setRefResource("writeOrder");
    
    rules.add(rule);
    FlowRuleManager.loadRules(rules);
}

当 writeOrder 的 QPS 超过阈值时,限制 readOrder 的访问。

# 3.4、链路流控

@PostConstruct
public void initFlowRules() {
    List<FlowRule> rules = new ArrayList<>();
    
    FlowRule rule = new FlowRule();
    rule.setResource("getUser");
    rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
    rule.setCount(50);
    rule.setStrategy(RuleConstant.STRATEGY_CHAIN);
    rule.setRefResource("orderService");
    
    rules.add(rule);
    FlowRuleManager.loadRules(rules);
}

只统计从 orderService 调用 getUser 的流量。

# 4、熔断降级

# 4.1、慢调用比例

@PostConstruct
public void initDegradeRules() {
    List<DegradeRule> rules = new ArrayList<>();
    
    DegradeRule rule = new DegradeRule();
    rule.setResource("getPaymentInfo");
    rule.setGrade(RuleConstant.DEGRADE_GRADE_RT);
    rule.setCount(500);
    rule.setTimeWindow(10);
    rule.setSlowRatioThreshold(0.5);
    rule.setMinRequestAmount(5);
    rule.setStatIntervalMs(1000);
    
    rules.add(rule);
    DegradeRuleManager.loadRules(rules);
}

含义:

  • 响应时间超过 500ms 的请求视为慢调用
  • 1 秒内至少有 5 个请求
  • 慢调用比例超过 50%
  • 触发熔断,熔断时长 10 秒

# 4.2、异常比例

@PostConstruct
public void initDegradeRules() {
    List<DegradeRule> rules = new ArrayList<>();
    
    DegradeRule rule = new DegradeRule();
    rule.setResource("sendSms");
    rule.setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_RATIO);
    rule.setCount(0.3);
    rule.setTimeWindow(10);
    rule.setMinRequestAmount(5);
    rule.setStatIntervalMs(1000);
    
    rules.add(rule);
    DegradeRuleManager.loadRules(rules);
}

含义:

  • 1 秒内至少有 5 个请求
  • 异常比例超过 30%
  • 触发熔断,熔断时长 10 秒

# 4.3、异常数

@PostConstruct
public void initDegradeRules() {
    List<DegradeRule> rules = new ArrayList<>();
    
    DegradeRule rule = new DegradeRule();
    rule.setResource("uploadFile");
    rule.setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_COUNT);
    rule.setCount(10);
    rule.setTimeWindow(60);
    rule.setMinRequestAmount(5);
    rule.setStatIntervalMs(60000);
    
    rules.add(rule);
    DegradeRuleManager.loadRules(rules);
}

含义:

  • 1 分钟内至少有 5 个请求
  • 异常数超过 10
  • 触发熔断,熔断时长 60 秒

# 5、系统自适应保护

@PostConstruct
public void initSystemRules() {
    List<SystemRule> rules = new ArrayList<>();
    
    SystemRule rule = new SystemRule();
    rule.setHighestSystemLoad(8.0);
    rule.setHighestCpuUsage(0.8);
    rule.setQps(10000);
    rule.setAvgRt(100);
    rule.setMaxThread(100);
    
    rules.add(rule);
    SystemRuleManager.loadRules(rules);
}

# 6、热点参数限流

@PostConstruct
public void initParamFlowRules() {
    ParamFlowRule rule = new ParamFlowRule();
    rule.setResource("getProduct");
    rule.setParamIdx(0);
    rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
    rule.setCount(100);
    
    ParamFlowItem item = new ParamFlowItem();
    item.setObject("1001");
    item.setClassType("long");
    item.setCount(200);
    rule.setParamFlowItemList(Collections.singletonList(item));
    
    ParamFlowRuleManager.loadRules(Collections.singletonList(rule));
}

@GetMapping("/product/{id}")
@SentinelResource(value = "getProduct")
public Product getProduct(@PathVariable Long id) {
    return productService.getProduct(id);
}

含义:

  • 对第 0 个参数(productId)进行限流
  • 普通商品 QPS 限制为 100
  • 商品 ID 为 1001 的热点商品 QPS 限制为 200

# 五、Netflix Hystrix

虽然 Hystrix 已经停止维护,但其设计思想仍然值得学习。

# 1、基本使用

# 1.1、引入依赖

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
</dependency>

# 1.2、启用 Hystrix

@SpringBootApplication
@EnableCircuitBreaker
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

# 1.3、定义 Fallback

@Service
public class UserService {
    
    @HystrixCommand(
        fallbackMethod = "getUserFallback",
        commandProperties = {
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", 
                           value = "1000"),
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", 
                           value = "10"),
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", 
                           value = "50"),
            @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", 
                           value = "5000")
        }
    )
    public User getUser(Long userId) {
        return userClient.getUser(userId);
    }
    
    public User getUserFallback(Long userId, Throwable e) {
        User fallbackUser = new User();
        fallbackUser.setId(userId);
        fallbackUser.setNickname("默认用户");
        return fallbackUser;
    }
}

# 2、线程池隔离

@HystrixCommand(
    fallbackMethod = "getInventoryFallback",
    threadPoolKey = "inventoryThreadPool",
    threadPoolProperties = {
        @HystrixProperty(name = "coreSize", value = "10"),
        @HystrixProperty(name = "maxQueueSize", value = "100"),
        @HystrixProperty(name = "queueSizeRejectionThreshold", value = "80")
    }
)
public Inventory getInventory(Long productId) {
    return inventoryClient.getInventory(productId);
}

# 3、请求合并

@HystrixCollapser(
    batchMethod = "getBatchUsers",
    collapserProperties = {
        @HystrixProperty(name = "timerDelayInMilliseconds", value = "100"),
        @HystrixProperty(name = "maxRequestsInBatch", value = "200")
    }
)
public Future<User> getUser(Long userId) {
    return null;
}

@HystrixCommand
public List<User> getBatchUsers(List<Long> userIds) {
    return userClient.getBatchUsers(userIds);
}

# 4、请求缓存

@HystrixCommand
@CacheResult
public User getUserWithCache(Long userId) {
    return userClient.getUser(userId);
}

@CacheRemove(commandKey = "getUserWithCache")
public void updateUser(User user) {
    userClient.updateUser(user);
}

# 六、容错降级最佳实践

# 1、容错策略组合

不同的策略应该组合使用:

请求 → 限流 → 熔断 → 隔离 → 超时 → 重试 → 降级 → 返回
@Service
public class OrderService {
    
    @SentinelResource(
        value = "createOrder",
        blockHandler = "handleBlock",
        fallback = "handleFallback"
    )
    @HystrixCommand(
        fallbackMethod = "createOrderFallback",
        commandProperties = {
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", 
                           value = "3000")
        },
        threadPoolKey = "orderThreadPool"
    )
    @Retryable(
        value = {TimeoutException.class, RemoteException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100, multiplier = 2)
    )
    public Order createOrder(OrderRequest request) {
        return orderClient.createOrder(request);
    }
    
    public Order handleBlock(OrderRequest request, BlockException ex) {
        return Order.blocked("系统繁忙,请稍后重试");
    }
    
    public Order handleFallback(OrderRequest request, Throwable ex) {
        return Order.error("创建订单失败:" + ex.getMessage());
    }
    
    public Order createOrderFallback(OrderRequest request, Throwable ex) {
        return orderQueue.enqueue(request);
    }
}

# 2、超时时间设计

超时时间应该层层递减:

API Gateway: 5000ms
  └─ Order Service: 4000ms
      └─ Inventory Service: 3000ms
          └─ Database: 2000ms
hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 4000
    getInventory:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 3000

# 3、熔断阈值设计

熔断阈值应该根据业务特点设置:

服务类型 失败率阈值 最小请求数 熔断时间 说明
核心服务 50% 20 10s 宽松阈值,避免误熔断
一般服务 40% 10 30s 标准配置
边缘服务 30% 5 60s 严格阈值,快速熔断
第三方服务 20% 3 120s 非常严格,长时间熔断

# 4、降级优先级

降级应该按优先级进行:

@Service
public class ProductDetailService {
    
    private enum FeaturePriority {
        BASIC_INFO(1),
        PRICE(2),
        INVENTORY(3),
        IMAGES(4),
        REVIEWS(5),
        RECOMMENDATIONS(6),
        QA(7),
        RELATED_PRODUCTS(8);
        
        private final int priority;
        
        FeaturePriority(int priority) {
            this.priority = priority;
        }
    }
    
    public ProductDetail getProductDetail(Long productId, int loadLevel) {
        ProductDetail detail = new ProductDetail();
        
        if (loadLevel >= 1) {
            detail.setBasicInfo(getBasicInfo(productId));
        }
        
        if (loadLevel >= 2) {
            detail.setPrice(getPrice(productId));
        }
        
        if (loadLevel >= 3) {
            detail.setInventory(getInventory(productId));
        }
        
        if (loadLevel >= 4) {
            detail.setImages(getImages(productId));
        }
        
        if (loadLevel >= 5) {
            detail.setReviews(getReviews(productId));
        }
        
        if (loadLevel >= 6) {
            detail.setRecommendations(getRecommendations(productId));
        }
        
        if (loadLevel >= 7) {
            detail.setQaList(getQAList(productId));
        }
        
        if (loadLevel >= 8) {
            detail.setRelatedProducts(getRelatedProducts(productId));
        }
        
        return detail;
    }
    
    private ProductBasicInfo getBasicInfo(Long productId) {
        return null;
    }
    
    private Price getPrice(Long productId) {
        return null;
    }
    
    private Inventory getInventory(Long productId) {
        return null;
    }
    
    private List<String> getImages(Long productId) {
        return null;
    }
    
    private List<Review> getReviews(Long productId) {
        return null;
    }
    
    private List<Product> getRecommendations(Long productId) {
        return null;
    }
    
    private List<QA> getQAList(Long productId) {
        return null;
    }
    
    private List<Product> getRelatedProducts(Long productId) {
        return null;
    }
}

# 5、监控与告警

完善的监控是容错降级的基础:

@Component
public class CircuitBreakerMetrics {
    
    @Autowired
    private MeterRegistry meterRegistry;
    
    public void recordCircuitBreakerState(String service, String state) {
        meterRegistry.counter("circuit_breaker_state", 
            "service", service, 
            "state", state
        ).increment();
    }
    
    public void recordFallback(String service) {
        meterRegistry.counter("fallback_count", 
            "service", service
        ).increment();
    }
    
    public void recordRateLimitRejection(String service) {
        meterRegistry.counter("rate_limit_rejection", 
            "service", service
        ).increment();
    }
}

# 6、优雅的降级响应

降级时应该给用户友好的提示:

public class FallbackResponse<T> {
    
    private boolean success;
    
    private String code;
    
    private String message;
    
    private T data;
    
    private FallbackReason reason;
    
    public static <T> FallbackResponse<T> rateLimited() {
        FallbackResponse<T> response = new FallbackResponse<>();
        response.setSuccess(false);
        response.setCode("RATE_LIMITED");
        response.setMessage("请求过于频繁,请稍后重试");
        response.setReason(FallbackReason.RATE_LIMITED);
        return response;
    }
    
    public static <T> FallbackResponse<T> circuitOpen() {
        FallbackResponse<T> response = new FallbackResponse<>();
        response.setSuccess(false);
        response.setCode("SERVICE_UNAVAILABLE");
        response.setMessage("服务暂时不可用,请稍后重试");
        response.setReason(FallbackReason.CIRCUIT_OPEN);
        return response;
    }
    
    public static <T> FallbackResponse<T> timeout() {
        FallbackResponse<T> response = new FallbackResponse<>();
        response.setSuccess(false);
        response.setCode("TIMEOUT");
        response.setMessage("请求超时,请稍后重试");
        response.setReason(FallbackReason.TIMEOUT);
        return response;
    }
    
    public static <T> FallbackResponse<T> withFallbackData(T fallbackData) {
        FallbackResponse<T> response = new FallbackResponse<>();
        response.setSuccess(true);
        response.setCode("FALLBACK_DATA");
        response.setMessage("返回默认数据");
        response.setData(fallbackData);
        response.setReason(FallbackReason.FALLBACK_DATA);
        return response;
    }
    
    public enum FallbackReason {
        RATE_LIMITED,
        CIRCUIT_OPEN,
        TIMEOUT,
        ERROR,
        FALLBACK_DATA
    }
}

# 七、总结

服务容错与降级是保障分布式系统高可用的关键技术。本文介绍了:

  1. 核心容错策略:超时控制、重试机制、限流、熔断、隔离
  2. 服务降级策略:自动降级、手动降级、功能降级、读写降级
  3. 工具与框架:Sentinel、Hystrix 的使用
  4. 最佳实践:策略组合、超时设计、熔断阈值、降级优先级、监控告警

关键原则:

  • 快速失败:不要让故障蔓延
  • 优雅降级:提供有损但可用的服务
  • 故障隔离:防止故障扩散
  • 自动恢复:系统能够自愈
  • 可观测性:完善的监控和告警

在实际应用中,需要根据业务特点、系统规模、容错要求选择合适的策略组合,并通过持续的监控和优化,不断提升系统的健壮性。

祝你变得更强!

编辑 (opens new window)
#高可用#容错#降级#熔断
上次更新: 2025/12/12
高可用-分布式基础之CAP理论
高性能-缓存架构设计

← 高可用-分布式基础之CAP理论 高性能-缓存架构设计→

最近更新
01
AI编程时代的一些心得
09-11
02
Claude Code与Codex的协同工作
09-01
03
Claude Code 最佳实践(个人版)
08-01
更多文章>
Theme by Vdoing | Copyright © 2018-2025 京ICP备2021021832号-2 | MIT License
  • 跟随系统
  • 浅色模式
  • 深色模式
  • 阅读模式