## Preface

"Observability" has been talked to death over the past couple of years, yet the reality in many teams is: Prometheus owns metrics, ELK owns logs, Jaeger owns traces. Three systems operating in isolation, and troubleshooting means hopping between three UIs.
Last year we started rolling out OpenTelemetry (OTel for short), with the goal of unifying our data-collection standard. After the better part of a year, we finally have the three pillars (Metrics, Logs, Traces) wired together. This post shares our rollout experience: the architecture, the pitfalls we hit, and the end result.

## Why OpenTelemetry

The status quo first:

```
+-------------+      +--------------+      +-------------+
| Prometheus  |      |  ELK Stack   |      |   Jaeger    |
+-------------+      +--------------+      +-------------+
       |                    |                     |
       v                    v                     v
  metrics SDKs         log agents           tracing SDKs
(various exporters) (Filebeat/Fluentd)   (Jaeger client)
```

The problems are obvious:

- **Fragmented stack**: three collection schemes, three data formats
- **Broken context**: when an alert fires, you cannot find the matching logs and traces
- **High maintenance cost**: every language has to integrate three SDKs

This is exactly what OpenTelemetry solves: a unified collection standard.

```
+------------------+
|  OpenTelemetry   |
|    Collector     |
+------------------+
         ^
         |  unified wire format (OTLP)
         |
+------------------+
|     OTel SDK     |
| (Metrics, Logs,  |
|  Traces in one)  |
+------------------+
```

## Architecture

Our final architecture:

```
                    +-----------------+
                    |     Grafana     |
                    |  (unified UI)   |
                    +--------+--------+
                             |
      +--------------+-------+-------+--------------+
      |              |               |              |
      v              v               v              v
+-----------+  +-----------+  +-----------+  +------------+
| Prometheus|  |   Loki    |  |   Tempo   |  |   Jaeger   |
| (metrics) |  |  (logs)   |  | (traces)  |  | (fallback) |
+-----------+  +-----------+  +-----------+  +------------+
      ^              ^               ^              ^
      |              |               |              |
      +--------------+-------+-------+--------------+
                             |
                   +---------+---------+
                   |  OTel Collector   |
                   |  (gateway mode)   |
                   +---------+---------+
                             ^
                             | OTLP
            +----------------+----------------+
            |                |                |
      +-----+-----+    +-----+-----+    +-----+-----+
      | Service A |    | Service B |    | Service C |
      | (OTel SDK)|    | (OTel SDK)|    | (OTel SDK)|
      +-----------+    +-----------+    +-----------+
```

The core ideas:

- Applications integrate the OTel SDK and report data over the OTLP protocol
- The Collector acts as a gateway: it receives, processes, and fans out all data
- Storage backends are swappable, with no vendor lock-in
- Grafana is the single pane of glass, with Metrics/Logs/Traces cross-linked

## Deploying the Collector

The OpenTelemetry Collector is the core component, responsible for receiving, processing, and exporting data.

Docker deployment:

```yaml
# docker-compose.yml
version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.92.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # the Collector's own metrics
      - "8889:8889"   # Prometheus exporter
    restart: unless-stopped
```

The Collector configuration starts with the OTLP receiver:

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
```
The receivers also accept the Prometheus scrape format, for compatibility with existing monitoring; the rest of the file defines the processors, exporters, and the three pipelines:

```yaml
# otel-collector-config.yaml (continued)
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 10s
          static_configs:
            - targets: [localhost:8888]

processors:
  # batch to cut network overhead
  batch:
    timeout: 5s
    send_batch_size: 1000
  # memory limit to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  # attach common attributes
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  # metrics -> Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
  # traces -> Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # logs -> Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      attributes:
        service.name: service_name
        level: severity
  # debugging only
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki]
```

## Instrumenting Applications

### Go

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer() (*trace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(context.Background(),
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	res := resource.NewWithAttributes(semconv.SchemaURL,
		semconv.ServiceName("user-service"),
		semconv.ServiceVersion("1.0.0"),
		attribute.String("environment", "production"),
	)
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(res),
		trace.WithSampler(trace.TraceIDRatioBased(0.1)), // 10% sampling
	)
```
```go
	otel.SetTracerProvider(tp)
	return tp, nil
}

func initMeter() (*sdkmetric.MeterProvider, error) {
	exporter, err := otlpmetricgrpc.New(context.Background(),
		otlpmetricgrpc.WithEndpoint("otel-collector:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(15*time.Second))),
	)
	otel.SetMeterProvider(mp)
	return mp, nil
}

func main() {
	tp, _ := initTracer()
	defer tp.Shutdown(context.Background())
	mp, _ := initMeter()
	defer mp.Shutdown(context.Background())

	tracer := otel.Tracer("user-service")
	meter := otel.Meter("user-service")

	// create the instruments
	requestCounter, _ := meter.Int64Counter("http_requests_total")
	requestDuration, _ := meter.Float64Histogram("http_request_duration_seconds")

	http.HandleFunc("/api/user", func(w http.ResponseWriter, r *http.Request) {
		ctx, span := tracer.Start(r.Context(), "GetUser")
		defer span.End()
		start := time.Now()

		// business logic
		span.SetAttributes(attribute.String("user.id", r.URL.Query().Get("id")))

		// simulate a database query
		_, dbSpan := tracer.Start(ctx, "DB.Query")
		time.Sleep(50 * time.Millisecond)
		dbSpan.End()

		// record the metrics
		requestCounter.Add(ctx, 1, metric.WithAttributes(attribute.String("method", r.Method)))
		requestDuration.Record(ctx, time.Since(start).Seconds())

		w.Write([]byte(`{"name": "test"}`))
	})

	log.Println("Server starting on :8080")
	http.ListenAndServe(":8080", nil)
}
```

### Java

For Java, the agent approach is easier: no code changes required.

```shell
# download the agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.1.0/opentelemetry-javaagent.jar

# add the flags at startup
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.traces.sampler=traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -jar order-service.jar
```
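The same agent settings can also be supplied through environment variables instead of `-D` flags; these are the standard OpenTelemetry SDK environment variable names, which is often more convenient in container manifests (`order-service` is just the example service name used here):

```shell
# Standard OpenTelemetry environment variables, equivalent to the -D flags
export OTEL_SERVICE_NAME=order-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
# then start the app with only the -javaagent flag:
# java -javaagent:opentelemetry-javaagent.jar -jar order-service.jar
```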
Auto-instrumentation covers HTTP requests, database calls, Redis, Kafka, and more, out of the box.

### Python

```python
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

resource = Resource.create({"service.name": "payment-service"})

# tracer setup
trace_provider = TracerProvider(resource=resource)
trace_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)

# meter setup
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=15000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

tracer = trace.get_tracer("payment-service")
meter = metrics.get_meter("payment-service")

# usage
@tracer.start_as_current_span("process_payment")
def process_payment(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # business logic...
```

## Correlating Metrics, Logs, and Traces

This is where OTel delivers the most value: linking the three pillars together.

### Injecting the TraceID into logs

```go
import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"go.uber.org/zap"
)

func LogWithTrace(ctx context.Context, logger *zap.Logger) *zap.Logger {
	span := trace.SpanFromContext(ctx)
	if span.SpanContext().IsValid() {
		return logger.With(
			zap.String("trace_id", span.SpanContext().TraceID().String()),
			zap.String("span_id", span.SpanContext().SpanID().String()),
		)
	}
	return logger
}

// usage
func handleRequest(ctx context.Context) {
	logger := LogWithTrace(ctx, zap.L())
	logger.Info("Processing request", zap.String("user_id", "123"))
}
```

Once logs carry a trace_id, you can jump straight from a log line in Grafana to the corresponding trace.

### Exemplars

Prometheus 2.25+ supports exemplars, which link a metric sample to a TraceID:
```go
// record the metric with the active context; the TraceID rides along as an exemplar
requestDuration.Record(ctx, duration,
	metric.WithAttributes(attribute.String("method", "GET")),
)
```

When Grafana shows an anomalous metric, you can jump directly to the concrete trace behind it.

## Grafana Configuration

Data sources:

```yaml
# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: [service.name]
        mappedTags: [{ key: service.name, value: service_name }]
        mapTagNamesEnabled: true
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: "$${__value.raw}"
```

## The Result

With everything wired up, troubleshooting looks like this:

1. Prometheus alerts: some service's P99 latency spikes
2. Click the exemplar to jump to a concrete slow-request trace
3. In Tempo, the trace shows an abnormally slow DB query
4. Jump from the trace to the logs and see the exact SQL and error message

With the whole chain connected, the efficiency gain is enormous.

## Production Notes

### Sampling strategy

Collecting everything is unrealistic; set a sampling rate:

```go
// head sampling in the SDK: keep 10% of normal requests
trace.NewTracerProvider(
	trace.WithSampler(trace.ParentBased(
		trace.TraceIDRatioBased(0.1),
	)),
)
```

A smarter approach is tail sampling in the Collector, so errors and slow requests are always kept:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # always keep errors
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # always keep slow requests
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      # sample the rest randomly
      - name: randomized
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
```

### Resource control

The Collector itself also needs monitoring and limits:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400

extensions:
  health_check:
    endpoint: :13133
  zpages:
    endpoint: :55679   # debug pages
```

### Multi-cluster management

We run three Kubernetes clusters, each with its own Collector. To manage them, I used 星空组网 (a mesh-networking tool) to bridge the three clusters' private networks, so a single Grafana can query data from all clusters. Otherwise every cluster would need its own Grafana, and the operational overhead would be too high.

## Pitfalls

**Pitfall 1: Collector memory ballooning.** Right after launch, the Collector kept OOMing. The cause was the batch processor accumulating too much data. Fix: add `memory_limiter` and shrink the batch size.

**Pitfall 2: inconsistent SDK versions.** Different services used different OTel SDK versions, producing slightly different data formats. Fix: standardize on one SDK version, and use the Collector's transform processor for compatibility.

**Pitfall 3: log volume too large.** OTel log collection defaults to everything, and Loki could not keep up. Fix: filter at the application layer (only collect ERROR and above), or use a filter processor in the Collector:

```yaml
processors:
  filter:
    logs:
      exclude:
        match_type: strict
        severity_texts: [DEBUG, INFO]
```

## Summary

What OpenTelemetry changed for us:

- **One standard**: a single SDK covers all three pillars
- **Correlated data**: one-click jumps from metrics to traces to logs
- **Vendor neutrality**: storage backends can be swapped at any time
- **Active community**: official support for the mainstream languages and frameworks

The rollout cost is real, but so is the long-term payoff, especially the speed at which you can now pinpoint the exact code during an incident. For new projects, start with OTel directly; for legacy projects, migrate gradually: put the Collector in first, then replace each service's SDK over time.
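The gradual-migration path above can be sketched as a transitional Collector config that accepts legacy Jaeger traffic alongside OTLP; this is a minimal sketch assuming the `jaeger` receiver from the collector-contrib distribution (default Thrift HTTP port 14268), with the exporter name matching the config earlier in this post:

```yaml
# Transitional setup: not-yet-migrated services keep their Jaeger clients,
# migrated services speak OTLP; both land in the same traces pipeline.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  jaeger:
    protocols:
      thrift_http:
        endpoint: 0.0.0.0:14268   # legacy Jaeger clients keep working

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
```

Once the last service is on the OTel SDK, drop the `jaeger` receiver and the transition is complete.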