高可用-服务容错与降级策略
在分布式系统中,服务故障是常态而非异常。一个健壮的系统必须具备自我保护能力,当部分服务出现问题时,能够通过容错、降级、熔断等策略保证整体系统的可用性。本文将深入探讨服务容错与降级的核心原理、实现方式和最佳实践。
# 一、为什么需要容错与降级
# 1、分布式系统的脆弱性
在微服务架构中,一个业务请求可能需要经过多个服务的协作:
用户请求 → API网关 → 订单服务 → 库存服务 → 支付服务 → 物流服务
如果任何一个环节出现问题,整个链路都可能受到影响。常见的问题包括:
- 服务超时:下游服务响应缓慢,导致上游服务线程阻塞
- 服务异常:某个服务抛出异常,导致整个调用链失败
- 资源耗尽:数据库连接池、线程池被耗尽
- 雪崩效应:一个服务的故障引发连锁反应,导致整个系统崩溃
# 2、容错与降级的价值
容错是指系统能够在部分组件失效时继续运行,不至于完全崩溃。
降级是指在系统压力过大或部分功能不可用时,主动关闭或简化部分非核心功能,保证核心业务正常运行。
两者的核心目标都是:牺牲部分功能,保证整体可用。
# 3、容错降级的设计原则
- 快速失败:不要长时间等待无响应的服务
- 优雅降级:提供有损但可用的服务,而不是直接报错
- 故障隔离:防止故障扩散到其他服务
- 自动恢复:系统能够在故障消除后自动恢复
- 监控告警:及时发现问题并通知运维人员
# 二、核心容错策略
# 1、超时控制(Timeout)
问题:如果不设置超时,一个慢服务会拖垮整个系统。
解决方案:为每个远程调用设置合理的超时时间。
# 1.1、超时时间的设定
@RestController
public class OrderController {
@Autowired
private RestTemplate restTemplate;
@GetMapping("/order/{id}")
public Order getOrder(@PathVariable Long id) {
HttpComponentsClientHttpRequestFactory factory =
new HttpComponentsClientHttpRequestFactory();
factory.setConnectTimeout(1000);
factory.setReadTimeout(3000);
restTemplate.setRequestFactory(factory);
return restTemplate.getForObject(
"http://inventory-service/inventory/" + id,
Order.class
);
}
}
# 1.2、超时时间的计算
合理的超时时间应该考虑:
- 正常响应时间的 P99 值:99% 的请求能在这个时间内完成
- 业务容忍度:用户能接受的最长等待时间
- 重试次数:如果有重试机制,要预留重试时间
公式:
超时时间 = P99响应时间 × 1.5 + 网络延迟 + 容错时间
# 2、重试机制(Retry)
问题:网络抖动、临时故障可能导致偶发失败。
解决方案:对失败的请求进行重试,提高成功率。
# 2.1、重试策略
@Service
public class InventoryService {
private static final int MAX_RETRY = 3;
private static final long RETRY_DELAY = 100;
public Inventory getInventory(Long productId) {
int attempts = 0;
Exception lastException = null;
while (attempts < MAX_RETRY) {
try {
return inventoryClient.getInventory(productId);
} catch (Exception e) {
attempts++;
lastException = e;
if (attempts < MAX_RETRY) {
try {
Thread.sleep(RETRY_DELAY * attempts);
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
}
}
}
}
throw new ServiceException("Failed after " + MAX_RETRY + " retries",
lastException);
}
}
# 2.2、重试的注意事项
- 幂等性保证:重试的操作必须是幂等的,否则可能导致数据重复
- 退避策略:使用指数退避(Exponential Backoff)避免重试风暴
- 重试次数限制:避免无限重试导致资源耗尽
- 区分错误类型:网络错误可以重试,业务错误不应重试
指数退避示例:
第1次重试: 延迟100ms
第2次重试: 延迟200ms
第3次重试: 延迟400ms
第4次重试: 延迟800ms
# 3、限流(Rate Limiting)
问题:突发流量可能压垮服务。
解决方案:限制单位时间内的请求数量,超过阈值则拒绝服务。
# 3.1、常见限流算法
# 3.1.1、固定窗口计数器
public class FixedWindowRateLimiter {
private final int maxRequests;
private final long windowSizeInMs;
private final AtomicInteger counter = new AtomicInteger(0);
private volatile long windowStart = System.currentTimeMillis();
public FixedWindowRateLimiter(int maxRequests, long windowSizeInMs) {
this.maxRequests = maxRequests;
this.windowSizeInMs = windowSizeInMs;
}
public boolean allowRequest() {
long now = System.currentTimeMillis();
if (now - windowStart >= windowSizeInMs) {
synchronized (this) {
if (now - windowStart >= windowSizeInMs) {
counter.set(0);
windowStart = now;
}
}
}
return counter.incrementAndGet() <= maxRequests;
}
}
优点:实现简单,内存占用小。
缺点:窗口临界点流量不均匀,可能出现突刺。
# 3.1.2、滑动窗口
public class SlidingWindowRateLimiter {
private final int maxRequests;
private final long windowSizeInMs;
private final Queue<Long> requestTimestamps = new ConcurrentLinkedQueue<>();
public SlidingWindowRateLimiter(int maxRequests, long windowSizeInMs) {
this.maxRequests = maxRequests;
this.windowSizeInMs = windowSizeInMs;
}
public boolean allowRequest() {
long now = System.currentTimeMillis();
long threshold = now - windowSizeInMs;
while (!requestTimestamps.isEmpty() &&
requestTimestamps.peek() < threshold) {
requestTimestamps.poll();
}
if (requestTimestamps.size() < maxRequests) {
requestTimestamps.offer(now);
return true;
}
return false;
}
}
优点:流量控制更平滑。
缺点:内存占用较大,需要存储每次请求的时间戳。
# 3.1.3、令牌桶(Token Bucket)
public class TokenBucketRateLimiter {
private final long capacity;
private final long refillRate;
private double tokens;
private long lastRefillTime;
public TokenBucketRateLimiter(long capacity, long refillRate) {
this.capacity = capacity;
this.refillRate = refillRate;
this.tokens = capacity;
this.lastRefillTime = System.currentTimeMillis();
}
public synchronized boolean allowRequest() {
refillTokens();
if (tokens >= 1) {
tokens -= 1;
return true;
}
return false;
}
private void refillTokens() {
long now = System.currentTimeMillis();
long elapsedTime = now - lastRefillTime;
double tokensToAdd = (elapsedTime / 1000.0) * refillRate;
tokens = Math.min(capacity, tokens + tokensToAdd);
lastRefillTime = now;
}
}
优点:允许一定程度的突发流量。
缺点:实现相对复杂。
# 3.1.4、漏桶(Leaky Bucket)
public class LeakyBucketRateLimiter {
private final long capacity;
private final long leakRate;
private long water = 0;
private long lastLeakTime;
public LeakyBucketRateLimiter(long capacity, long leakRate) {
this.capacity = capacity;
this.leakRate = leakRate;
this.lastLeakTime = System.currentTimeMillis();
}
public synchronized boolean allowRequest() {
leak();
if (water < capacity) {
water++;
return true;
}
return false;
}
private void leak() {
long now = System.currentTimeMillis();
long elapsedTime = now - lastLeakTime;
long leaked = (elapsedTime / 1000) * leakRate;
water = Math.max(0, water - leaked);
lastLeakTime = now;
}
}
优点:流量绝对均匀,适合保护下游服务。
缺点:不支持突发流量。
# 3.2、分布式限流
单机限流在分布式环境下不够用,需要使用 Redis 实现分布式限流:
@Service
public class RedisRateLimiter {
@Autowired
private StringRedisTemplate redisTemplate;
public boolean allowRequest(String key, int maxRequests, long windowInSeconds) {
String redisKey = "rate_limit:" + key;
long now = System.currentTimeMillis();
long windowStart = now - windowInSeconds * 1000;
redisTemplate.opsForZSet().removeRangeByScore(redisKey, 0, windowStart);
Long count = redisTemplate.opsForZSet().zCard(redisKey);
if (count != null && count < maxRequests) {
redisTemplate.opsForZSet().add(redisKey, String.valueOf(now), now);
redisTemplate.expire(redisKey, windowInSeconds, TimeUnit.SECONDS);
return true;
}
return false;
}
}
# 4、熔断(Circuit Breaker)
问题:当下游服务持续故障时,继续调用只会浪费资源。
解决方案:检测故障率,达到阈值后自动熔断,快速失败。
# 4.1、熔断器的三种状态
+----------------+
| 关闭 | <-- 正常状态,请求正常通过
| (Closed) |
+--------+-------+
|
| 失败率超过阈值
v
+--------+-------+
| 打开 | <-- 拒绝所有请求,快速失败
| (Open) |
+--------+-------+
|
| 经过冷却时间
v
+--------+-------+
| 半开 | <-- 尝试部分请求,测试恢复
| (Half-Open) |
+----------------+
|
+-----------+-----------+
| |
失败率仍然高 失败率降低
| |
v v
打开 关闭
# 4.2、熔断器实现
public class CircuitBreaker {
private enum State {
CLOSED, OPEN, HALF_OPEN
}
private State state = State.CLOSED;
private final int failureThreshold;
private final long retryTimeout;
private int failureCount = 0;
private int successCount = 0;
private long lastFailureTime = 0;
public CircuitBreaker(int failureThreshold, long retryTimeout) {
this.failureThreshold = failureThreshold;
this.retryTimeout = retryTimeout;
}
public synchronized boolean allowRequest() {
if (state == State.OPEN) {
if (System.currentTimeMillis() - lastFailureTime >= retryTimeout) {
state = State.HALF_OPEN;
failureCount = 0;
successCount = 0;
return true;
}
return false;
}
return true;
}
public synchronized void recordSuccess() {
if (state == State.HALF_OPEN) {
successCount++;
if (successCount >= 3) {
state = State.CLOSED;
failureCount = 0;
}
} else if (state == State.CLOSED) {
failureCount = 0;
}
}
public synchronized void recordFailure() {
failureCount++;
lastFailureTime = System.currentTimeMillis();
if (state == State.HALF_OPEN) {
state = State.OPEN;
} else if (failureCount >= failureThreshold) {
state = State.OPEN;
}
}
public State getState() {
return state;
}
}
# 4.3、使用熔断器
@Service
public class PaymentService {
private final CircuitBreaker circuitBreaker =
new CircuitBreaker(5, 60000);
public PaymentResult makePayment(PaymentRequest request) {
if (!circuitBreaker.allowRequest()) {
return PaymentResult.fallback("服务暂时不可用,请稍后重试");
}
try {
PaymentResult result = paymentClient.pay(request);
circuitBreaker.recordSuccess();
return result;
} catch (Exception e) {
circuitBreaker.recordFailure();
return PaymentResult.fallback("支付失败,请稍后重试");
}
}
}
# 5、隔离(Isolation)
问题:一个服务的故障可能占用所有资源,影响其他服务。
解决方案:为不同的服务分配独立的资源池,实现故障隔离。
# 5.1、线程池隔离
@Configuration
public class ThreadPoolConfig {
@Bean("orderThreadPool")
public ThreadPoolExecutor orderThreadPool() {
return new ThreadPoolExecutor(
10,
20,
60L,
TimeUnit.SECONDS,
new ArrayBlockingQueue<>(100),
new ThreadPoolExecutor.CallerRunsPolicy()
);
}
@Bean("paymentThreadPool")
public ThreadPoolExecutor paymentThreadPool() {
return new ThreadPoolExecutor(
5,
10,
60L,
TimeUnit.SECONDS,
new ArrayBlockingQueue<>(50),
new ThreadPoolExecutor.CallerRunsPolicy()
);
}
}
@Service
public class OrderService {
@Autowired
@Qualifier("orderThreadPool")
private ThreadPoolExecutor orderThreadPool;
public CompletableFuture<Order> createOrderAsync(OrderRequest request) {
return CompletableFuture.supplyAsync(
() -> createOrder(request),
orderThreadPool
);
}
private Order createOrder(OrderRequest request) {
return null;
}
}
优点:
- 故障隔离:某个服务的线程池耗尽不影响其他服务
- 资源控制:可以精确控制每个服务的并发数
缺点:
- 资源开销:每个线程池都需要占用内存
- 上下文切换:频繁的线程切换可能影响性能
# 5.2、信号量隔离
public class SemaphoreIsolation {
private final Semaphore semaphore;
public SemaphoreIsolation(int permits) {
this.semaphore = new Semaphore(permits);
}
public <T> T execute(Supplier<T> command, Supplier<T> fallback) {
if (semaphore.tryAcquire()) {
try {
return command.get();
} finally {
semaphore.release();
}
} else {
return fallback.get();
}
}
}
@Service
public class InventoryService {
private final SemaphoreIsolation isolation = new SemaphoreIsolation(10);
public Inventory getInventory(Long productId) {
return isolation.execute(
() -> inventoryClient.getInventory(productId),
() -> Inventory.unavailable()
);
}
}
优点:
- 轻量级:不需要创建线程,开销小
- 性能好:没有线程切换的开销
缺点:
- 不支持超时:无法主动中断执行
- 隔离性较弱:共享同一个线程池
# 三、服务降级策略
# 1、降级的分类
# 1.1、自动降级
系统根据监控指标自动触发降级:
@Service
public class RecommendationService {
@Autowired
private MetricsCollector metrics;
public List<Product> getRecommendations(Long userId) {
if (metrics.getCpuUsage() > 80 || metrics.getMemoryUsage() > 85) {
return getCachedRecommendations(userId);
}
if (metrics.getResponseTime("recommendation") > 1000) {
return getDefaultRecommendations();
}
return generateRecommendations(userId);
}
private List<Product> getCachedRecommendations(Long userId) {
return null;
}
private List<Product> getDefaultRecommendations() {
return null;
}
private List<Product> generateRecommendations(Long userId) {
return null;
}
}
# 1.2、手动降级
运维人员通过开关手动触发降级:
@RestController
public class FeatureToggleController {
@Autowired
private FeatureToggleService toggleService;
@PostMapping("/admin/toggle/{feature}")
public void toggleFeature(@PathVariable String feature,
@RequestParam boolean enabled) {
toggleService.setFeatureEnabled(feature, enabled);
}
}
@Service
public class CommentService {
@Autowired
private FeatureToggleService toggleService;
public List<Comment> getComments(Long articleId) {
if (!toggleService.isFeatureEnabled("comment")) {
return Collections.emptyList();
}
return commentRepository.findByArticleId(articleId);
}
}
# 2、降级的级别
# 2.1、功能降级
关闭非核心功能:
@Service
public class ProductDetailService {
@Autowired
private FeatureToggleService toggleService;
public ProductDetail getProductDetail(Long productId) {
ProductDetail detail = new ProductDetail();
detail.setBasicInfo(getBasicInfo(productId));
if (toggleService.isFeatureEnabled("product.reviews")) {
detail.setReviews(getReviews(productId));
}
if (toggleService.isFeatureEnabled("product.qa")) {
detail.setQaList(getQAList(productId));
}
if (toggleService.isFeatureEnabled("product.recommendations")) {
detail.setRecommendations(getRecommendations(productId));
}
return detail;
}
private ProductBasicInfo getBasicInfo(Long productId) {
return null;
}
private List<Review> getReviews(Long productId) {
return null;
}
private List<QA> getQAList(Long productId) {
return null;
}
private List<Product> getRecommendations(Long productId) {
return null;
}
}
# 2.2、读降级
使用缓存或默认值:
@Service
public class UserService {
@Autowired
private RedisTemplate<String, User> redisTemplate;
@Autowired
private UserRepository userRepository;
public User getUser(Long userId) {
String cacheKey = "user:" + userId;
User cachedUser = redisTemplate.opsForValue().get(cacheKey);
if (cachedUser != null) {
return cachedUser;
}
try {
User user = userRepository.findById(userId).orElse(null);
if (user != null) {
redisTemplate.opsForValue().set(cacheKey, user, 10, TimeUnit.MINUTES);
}
return user;
} catch (Exception e) {
return getDefaultUser(userId);
}
}
private User getDefaultUser(Long userId) {
User user = new User();
user.setId(userId);
user.setNickname("用户" + userId);
return user;
}
}
# 2.3、写降级
异步处理或延迟写入:
@Service
public class LogService {
@Autowired
private KafkaTemplate<String, LogEntry> kafkaTemplate;
@Autowired
private LogRepository logRepository;
private final Queue<LogEntry> logQueue = new ConcurrentLinkedQueue<>();
public void log(LogEntry entry) {
try {
kafkaTemplate.send("logs", entry);
} catch (Exception e) {
logQueue.offer(entry);
if (logQueue.size() > 1000) {
flushLogs();
}
}
}
@Scheduled(fixedDelay = 5000)
public void flushLogs() {
List<LogEntry> logs = new ArrayList<>();
LogEntry entry;
while ((entry = logQueue.poll()) != null) {
logs.add(entry);
}
if (!logs.isEmpty()) {
logRepository.saveAll(logs);
}
}
}
# 3、降级的实现方式
# 3.1、返回默认值
@Service
public class WeatherService {
public Weather getWeather(String city) {
try {
return weatherClient.getWeather(city);
} catch (Exception e) {
return Weather.defaultWeather(city);
}
}
}
# 3.2、返回缓存数据
@Service
public class NewsService {
@Autowired
private RedisTemplate<String, List<News>> redisTemplate;
@Cacheable(value = "hot_news", unless = "#result == null")
public List<News> getHotNews() {
return newsRepository.findHotNews();
}
public List<News> getHotNewsWithFallback() {
try {
return getHotNews();
} catch (Exception e) {
List<News> cachedNews = redisTemplate.opsForValue().get("hot_news");
if (cachedNews != null) {
return cachedNews;
}
return Collections.emptyList();
}
}
}
# 3.3、返回兜底数据
@Service
public class BannerService {
private static final List<Banner> FALLBACK_BANNERS = Arrays.asList(
new Banner("默认广告1", "/images/default1.jpg"),
new Banner("默认广告2", "/images/default2.jpg")
);
public List<Banner> getBanners() {
try {
return bannerRepository.findActiveBanners();
} catch (Exception e) {
return FALLBACK_BANNERS;
}
}
}
# 四、Sentinel 实战
# 1、Sentinel 简介
Sentinel 是阿里开源的流量控制和熔断降级组件,提供:
- 流量控制:QPS 限流、并发线程数限流
- 熔断降级:慢调用比例、异常比例、异常数
- 系统自适应保护:CPU、内存、入口 QPS、线程数
- 热点参数限流:对特定参数进行限流
# 2、快速开始
# 2.1、引入依赖
<dependency>
<groupId>com.alibaba.cloud</groupId>
<artifactId>spring-cloud-starter-alibaba-sentinel</artifactId>
</dependency>
# 2.2、配置 Sentinel Dashboard
spring:
cloud:
sentinel:
transport:
dashboard: localhost:8080
port: 8719
eager: true
# 2.3、定义资源
@RestController
public class OrderController {
@GetMapping("/order/{id}")
@SentinelResource(
value = "getOrder",
blockHandler = "handleBlock",
fallback = "handleFallback"
)
public Order getOrder(@PathVariable Long id) {
return orderService.getOrder(id);
}
public Order handleBlock(Long id, BlockException ex) {
return Order.blocked("请求被限流或降级");
}
public Order handleFallback(Long id, Throwable ex) {
return Order.error("系统异常,请稍后重试");
}
}
# 3、流量控制
# 3.1、QPS 限流
@PostConstruct
public void initFlowRules() {
List<FlowRule> rules = new ArrayList<>();
FlowRule rule = new FlowRule();
rule.setResource("getOrder");
rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
rule.setCount(100);
rules.add(rule);
FlowRuleManager.loadRules(rules);
}
# 3.2、并发线程数限流
@PostConstruct
public void initFlowRules() {
List<FlowRule> rules = new ArrayList<>();
FlowRule rule = new FlowRule();
rule.setResource("createOrder");
rule.setGrade(RuleConstant.FLOW_GRADE_THREAD);
rule.setCount(20);
rules.add(rule);
FlowRuleManager.loadRules(rules);
}
# 3.3、关联流控
@PostConstruct
public void initFlowRules() {
List<FlowRule> rules = new ArrayList<>();
FlowRule rule = new FlowRule();
rule.setResource("readOrder");
rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
rule.setCount(100);
rule.setStrategy(RuleConstant.STRATEGY_RELATE);
rule.setRefResource("writeOrder");
rules.add(rule);
FlowRuleManager.loadRules(rules);
}
当 writeOrder 的 QPS 超过阈值时,限制 readOrder 的访问。
# 3.4、链路流控
@PostConstruct
public void initFlowRules() {
List<FlowRule> rules = new ArrayList<>();
FlowRule rule = new FlowRule();
rule.setResource("getUser");
rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
rule.setCount(50);
rule.setStrategy(RuleConstant.STRATEGY_CHAIN);
rule.setRefResource("orderService");
rules.add(rule);
FlowRuleManager.loadRules(rules);
}
只统计从 orderService 调用 getUser 的流量。
# 4、熔断降级
# 4.1、慢调用比例
@PostConstruct
public void initDegradeRules() {
List<DegradeRule> rules = new ArrayList<>();
DegradeRule rule = new DegradeRule();
rule.setResource("getPaymentInfo");
rule.setGrade(RuleConstant.DEGRADE_GRADE_RT);
rule.setCount(500);
rule.setTimeWindow(10);
rule.setSlowRatioThreshold(0.5);
rule.setMinRequestAmount(5);
rule.setStatIntervalMs(1000);
rules.add(rule);
DegradeRuleManager.loadRules(rules);
}
含义:
- 响应时间超过 500ms 的请求视为慢调用
- 1 秒内至少有 5 个请求
- 慢调用比例超过 50%
- 触发熔断,熔断时长 10 秒
# 4.2、异常比例
@PostConstruct
public void initDegradeRules() {
List<DegradeRule> rules = new ArrayList<>();
DegradeRule rule = new DegradeRule();
rule.setResource("sendSms");
rule.setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_RATIO);
rule.setCount(0.3);
rule.setTimeWindow(10);
rule.setMinRequestAmount(5);
rule.setStatIntervalMs(1000);
rules.add(rule);
DegradeRuleManager.loadRules(rules);
}
含义:
- 1 秒内至少有 5 个请求
- 异常比例超过 30%
- 触发熔断,熔断时长 10 秒
# 4.3、异常数
@PostConstruct
public void initDegradeRules() {
List<DegradeRule> rules = new ArrayList<>();
DegradeRule rule = new DegradeRule();
rule.setResource("uploadFile");
rule.setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_COUNT);
rule.setCount(10);
rule.setTimeWindow(60);
rule.setMinRequestAmount(5);
rule.setStatIntervalMs(60000);
rules.add(rule);
DegradeRuleManager.loadRules(rules);
}
含义:
- 1 分钟内至少有 5 个请求
- 异常数超过 10
- 触发熔断,熔断时长 60 秒
# 5、系统自适应保护
@PostConstruct
public void initSystemRules() {
List<SystemRule> rules = new ArrayList<>();
SystemRule rule = new SystemRule();
rule.setHighestSystemLoad(8.0);
rule.setHighestCpuUsage(0.8);
rule.setQps(10000);
rule.setAvgRt(100);
rule.setMaxThread(100);
rules.add(rule);
SystemRuleManager.loadRules(rules);
}
# 6、热点参数限流
@PostConstruct
public void initParamFlowRules() {
ParamFlowRule rule = new ParamFlowRule();
rule.setResource("getProduct");
rule.setParamIdx(0);
rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
rule.setCount(100);
ParamFlowItem item = new ParamFlowItem();
item.setObject("1001");
item.setClassType("long");
item.setCount(200);
rule.setParamFlowItemList(Collections.singletonList(item));
ParamFlowRuleManager.loadRules(Collections.singletonList(rule));
}
@GetMapping("/product/{id}")
@SentinelResource(value = "getProduct")
public Product getProduct(@PathVariable Long id) {
return productService.getProduct(id);
}
含义:
- 对第 0 个参数(productId)进行限流
- 普通商品 QPS 限制为 100
- 商品 ID 为 1001 的热点商品 QPS 限制为 200
# 五、Netflix Hystrix
虽然 Hystrix 已经停止维护,但其设计思想仍然值得学习。
# 1、基本使用
# 1.1、引入依赖
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
</dependency>
# 1.2、启用 Hystrix
@SpringBootApplication
@EnableCircuitBreaker
public class Application {
public static void main(String[] args) {
SpringApplication.run(Application.class, args);
}
}
# 1.3、定义 Fallback
@Service
public class UserService {
@HystrixCommand(
fallbackMethod = "getUserFallback",
commandProperties = {
@HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds",
value = "1000"),
@HystrixProperty(name = "circuitBreaker.requestVolumeThreshold",
value = "10"),
@HystrixProperty(name = "circuitBreaker.errorThresholdPercentage",
value = "50"),
@HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds",
value = "5000")
}
)
public User getUser(Long userId) {
return userClient.getUser(userId);
}
public User getUserFallback(Long userId, Throwable e) {
User fallbackUser = new User();
fallbackUser.setId(userId);
fallbackUser.setNickname("默认用户");
return fallbackUser;
}
}
# 2、线程池隔离
@HystrixCommand(
fallbackMethod = "getInventoryFallback",
threadPoolKey = "inventoryThreadPool",
threadPoolProperties = {
@HystrixProperty(name = "coreSize", value = "10"),
@HystrixProperty(name = "maxQueueSize", value = "100"),
@HystrixProperty(name = "queueSizeRejectionThreshold", value = "80")
}
)
public Inventory getInventory(Long productId) {
return inventoryClient.getInventory(productId);
}
# 3、请求合并
@HystrixCollapser(
batchMethod = "getBatchUsers",
collapserProperties = {
@HystrixProperty(name = "timerDelayInMilliseconds", value = "100"),
@HystrixProperty(name = "maxRequestsInBatch", value = "200")
}
)
public Future<User> getUser(Long userId) {
return null;
}
@HystrixCommand
public List<User> getBatchUsers(List<Long> userIds) {
return userClient.getBatchUsers(userIds);
}
# 4、请求缓存
@HystrixCommand
@CacheResult
public User getUserWithCache(Long userId) {
return userClient.getUser(userId);
}
@CacheRemove(commandKey = "getUserWithCache")
public void updateUser(User user) {
userClient.updateUser(user);
}
# 六、容错降级最佳实践
# 1、容错策略组合
不同的策略应该组合使用:
请求 → 限流 → 熔断 → 隔离 → 超时 → 重试 → 降级 → 返回
@Service
public class OrderService {
@SentinelResource(
value = "createOrder",
blockHandler = "handleBlock",
fallback = "handleFallback"
)
@HystrixCommand(
fallbackMethod = "createOrderFallback",
commandProperties = {
@HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds",
value = "3000")
},
threadPoolKey = "orderThreadPool"
)
@Retryable(
value = {TimeoutException.class, RemoteException.class},
maxAttempts = 3,
backoff = @Backoff(delay = 100, multiplier = 2)
)
public Order createOrder(OrderRequest request) {
return orderClient.createOrder(request);
}
public Order handleBlock(OrderRequest request, BlockException ex) {
return Order.blocked("系统繁忙,请稍后重试");
}
public Order handleFallback(OrderRequest request, Throwable ex) {
return Order.error("创建订单失败:" + ex.getMessage());
}
public Order createOrderFallback(OrderRequest request, Throwable ex) {
return orderQueue.enqueue(request);
}
}
# 2、超时时间设计
超时时间应该层层递减:
API Gateway: 5000ms
└─ Order Service: 4000ms
└─ Inventory Service: 3000ms
└─ Database: 2000ms
hystrix:
command:
default:
execution:
isolation:
thread:
timeoutInMilliseconds: 4000
getInventory:
execution:
isolation:
thread:
timeoutInMilliseconds: 3000
# 3、熔断阈值设计
熔断阈值应该根据业务特点设置:
| 服务类型 | 失败率阈值 | 最小请求数 | 熔断时间 | 说明 |
|---|---|---|---|---|
| 核心服务 | 50% | 20 | 10s | 宽松阈值,避免误熔断 |
| 一般服务 | 40% | 10 | 30s | 标准配置 |
| 边缘服务 | 30% | 5 | 60s | 严格阈值,快速熔断 |
| 第三方服务 | 20% | 3 | 120s | 非常严格,长时间熔断 |
# 4、降级优先级
降级应该按优先级进行:
@Service
public class ProductDetailService {
private enum FeaturePriority {
BASIC_INFO(1),
PRICE(2),
INVENTORY(3),
IMAGES(4),
REVIEWS(5),
RECOMMENDATIONS(6),
QA(7),
RELATED_PRODUCTS(8);
private final int priority;
FeaturePriority(int priority) {
this.priority = priority;
}
}
public ProductDetail getProductDetail(Long productId, int loadLevel) {
ProductDetail detail = new ProductDetail();
if (loadLevel >= 1) {
detail.setBasicInfo(getBasicInfo(productId));
}
if (loadLevel >= 2) {
detail.setPrice(getPrice(productId));
}
if (loadLevel >= 3) {
detail.setInventory(getInventory(productId));
}
if (loadLevel >= 4) {
detail.setImages(getImages(productId));
}
if (loadLevel >= 5) {
detail.setReviews(getReviews(productId));
}
if (loadLevel >= 6) {
detail.setRecommendations(getRecommendations(productId));
}
if (loadLevel >= 7) {
detail.setQaList(getQAList(productId));
}
if (loadLevel >= 8) {
detail.setRelatedProducts(getRelatedProducts(productId));
}
return detail;
}
private ProductBasicInfo getBasicInfo(Long productId) {
return null;
}
private Price getPrice(Long productId) {
return null;
}
private Inventory getInventory(Long productId) {
return null;
}
private List<String> getImages(Long productId) {
return null;
}
private List<Review> getReviews(Long productId) {
return null;
}
private List<Product> getRecommendations(Long productId) {
return null;
}
private List<QA> getQAList(Long productId) {
return null;
}
private List<Product> getRelatedProducts(Long productId) {
return null;
}
}
# 5、监控与告警
完善的监控是容错降级的基础:
@Component
public class CircuitBreakerMetrics {
@Autowired
private MeterRegistry meterRegistry;
public void recordCircuitBreakerState(String service, String state) {
meterRegistry.counter("circuit_breaker_state",
"service", service,
"state", state
).increment();
}
public void recordFallback(String service) {
meterRegistry.counter("fallback_count",
"service", service
).increment();
}
public void recordRateLimitRejection(String service) {
meterRegistry.counter("rate_limit_rejection",
"service", service
).increment();
}
}
# 6、优雅的降级响应
降级时应该给用户友好的提示:
public class FallbackResponse<T> {
private boolean success;
private String code;
private String message;
private T data;
private FallbackReason reason;
public static <T> FallbackResponse<T> rateLimited() {
FallbackResponse<T> response = new FallbackResponse<>();
response.setSuccess(false);
response.setCode("RATE_LIMITED");
response.setMessage("请求过于频繁,请稍后重试");
response.setReason(FallbackReason.RATE_LIMITED);
return response;
}
public static <T> FallbackResponse<T> circuitOpen() {
FallbackResponse<T> response = new FallbackResponse<>();
response.setSuccess(false);
response.setCode("SERVICE_UNAVAILABLE");
response.setMessage("服务暂时不可用,请稍后重试");
response.setReason(FallbackReason.CIRCUIT_OPEN);
return response;
}
public static <T> FallbackResponse<T> timeout() {
FallbackResponse<T> response = new FallbackResponse<>();
response.setSuccess(false);
response.setCode("TIMEOUT");
response.setMessage("请求超时,请稍后重试");
response.setReason(FallbackReason.TIMEOUT);
return response;
}
public static <T> FallbackResponse<T> withFallbackData(T fallbackData) {
FallbackResponse<T> response = new FallbackResponse<>();
response.setSuccess(true);
response.setCode("FALLBACK_DATA");
response.setMessage("返回默认数据");
response.setData(fallbackData);
response.setReason(FallbackReason.FALLBACK_DATA);
return response;
}
public enum FallbackReason {
RATE_LIMITED,
CIRCUIT_OPEN,
TIMEOUT,
ERROR,
FALLBACK_DATA
}
}
# 七、总结
服务容错与降级是保障分布式系统高可用的关键技术。本文介绍了:
- 核心容错策略:超时控制、重试机制、限流、熔断、隔离
- 服务降级策略:自动降级、手动降级、功能降级、读写降级
- 工具与框架:Sentinel、Hystrix 的使用
- 最佳实践:策略组合、超时设计、熔断阈值、降级优先级、监控告警
关键原则:
- 快速失败:不要让故障蔓延
- 优雅降级:提供有损但可用的服务
- 故障隔离:防止故障扩散
- 自动恢复:系统能够自愈
- 可观测性:完善的监控和告警
在实际应用中,需要根据业务特点、系统规模、容错要求选择合适的策略组合,并通过持续的监控和优化,不断提升系统的健壮性。
祝你变得更强!