Trace data is being generated even though there are no requests from service A to service B #3689

finda-yeongjo · 2024-05-20T01:47:33Z

Describe the bug
When monitoring Java applications using OpenTelemetry Java Auto-Instrumentation, the trace data incorrectly shows service A calling service B (A -> B), even though there is no actual call between A and B.
Following a microservices design, A and B only exchange data by producing to and consuming from a Kafka topic, so they stay loosely coupled. The issue is visible in the traces_service_graph_request_total metric and in the Zipkin trace data, both of which suggest a relationship that does not exist.

If I enable the following two options in the OpenTelemetry instrumentation, service A is shown as "user" instead, but the connection is still reported.

      - name: OTEL_INSTRUMENTATION_MESSAGING_RECEIVE_TELEMETRY_ENABLED
        value: "true"
      - name: OTEL_INSTRUMENTATION_MESSAGING_SEND_TELEMETRY_ENABLED
        value: "true"

I initially raised this issue with the OpenTelemetry team, but their response suggested raising the issue with Grafana and Tempo instead.
Ref. open-telemetry/opentelemetry-java-instrumentation#11348

To Reproduce
Steps to reproduce the behavior:

Deploy two Java Spring services, A and B, in Kubernetes using the OpenTelemetry Java auto-instrumentation agent with the following configuration:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: sample-instrumentation
  namespace: test
spec:
  propagators:
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  java:
    image: "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.29.0"
    env:
      - name: OTEL_METRICS_EXPORTER
        value: "prometheus"
      - name: OTEL_METRICS_EXEMPLAR_FILTER
        value: "trace_based"
      - name: OTEL_TRACES_EXPORTER
        value: "zipkin"
      - name: OTEL_EXPORTER_ZIPKIN_ENDPOINT
        value: "http://SOME_TEST_OTELCOLLECTOR_ENDPOINT:9411/api/v2/spans"
      - name: OTEL_INSTRUMENTATION_HTTP_SERVER_CAPTURE_REQUEST_HEADERS
        value: "content-type"
      - name: OTEL_INSTRUMENTATION_HTTP_SERVER_CAPTURE_RESPONSE_HEADERS
        value: "content-type"

Observe the trace data in Tempo (with Grafana).
Notice that the trace data incorrectly indicates that service A is making requests to service B.
Check the A -> B edge in Grafana using the service graph or the traces_service_graph_request_total metric.
For a more accurate reproduction, create a Kafka topic that one service produces to and the other consumes from, and test with that.

Expected behavior
The trace data and metrics should accurately reflect the interactions between services. Specifically, no traces or metrics should suggest a direct interaction between service A and service B when there is none.

Environment:

Infrastructure: AWS EKS, AWS MSK
Deployment tool: Using Kubernetes manifests with gitops repo & ArgoCD

Additional Context
All services are deployed as pods in EKS.
The issue persists even after verifying that there are no overlapping or contaminated headers and that Trace IDs are unique and correctly configured.
The environment configuration for OpenTelemetry instrumentation includes settings for exporting to Prometheus and Zipkin, capturing content-type headers for HTTP requests and responses.

Service A
JDK: Amazon Corretto 17
Spring: 2.7.1
OS: Amazon Linux (EKS)

Service B
JDK: Amazon Corretto 17
Spring: 3.0.5
OS: Amazon Linux (EKS)

mapno · 2024-05-20T07:58:04Z

Hi! Service graphs have a number of ways of identifying communication between services; for Tempo, they're described in the docs. Connections do not necessarily represent HTTP requests.

* A request across a messaging system, where the outgoing and the incoming span must have `span.kind` set to `producer` and `consumer`, respectively.

This is what's identifying a connection between the two services.
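
To make the rule above concrete, here is a minimal sketch of how a producer span and a consumer span in the same trace can end up as an edge. It is a simplified model, not Tempo's actual implementation: the `Span` class, the `messaging_edges` function, and the service names are all hypothetical, and it assumes the consumer span is a direct child of the producer span.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, stripped-down span: only the fields this sketch needs."""
    span_id: str
    parent_id: Optional[str]
    service: str
    kind: str  # e.g. "producer", "consumer", "client", "server"


def messaging_edges(spans):
    """Return (producer_service, consumer_service) pairs.

    An edge is emitted when a span of kind "consumer" has a parent
    span of kind "producer", regardless of any HTTP traffic.
    """
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id else None
        if span.kind == "consumer" and parent is not None and parent.kind == "producer":
            edges.add((parent.service, span.service))
    return edges


spans = [
    # Service A publishes to a Kafka topic, service B consumes from it.
    Span("1", None, "service-a", "producer"),
    Span("2", "1", "service-b", "consumer"),
    # An unrelated HTTP call; not a messaging edge.
    Span("3", None, "service-c", "client"),
    Span("4", "3", "service-d", "server"),
]

print(messaging_edges(spans))  # {('service-a', 'service-b')}
```

So even with no direct call from A to B, the propagated trace context across the topic is enough to link the two spans and produce an A -> B edge.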

finda-yeongjo · 2024-05-20T09:28:41Z

Hey @mapno, your answer was fantastic. I have removed the problematic parts from the dashboards and graphs that use Tempo as a data source. I blame myself for not reading the docs carefully.

May I ask one more question?
When I filter on span_kind, there is no data (I tried span_kind_consumer, producer, server, client, and unspecified). Is there any additional configuration needed? Simply setting connection_type=messaging_system shows all servers communicating through MSK.

I am using auto-instrumentation because I cannot enforce span conventions on all technical teams, which makes it difficult for me to directly control headers, span kinds, IDs, and so on.

mapno · 2024-05-20T10:58:53Z

Hey! span_kind is not a label of service graph metrics (it's set on span-metrics though). I'm not sure it would make sense to add it in the first place, since it's implied by the connection type: if connection_type is messaging_system, the spans must have had kinds producer and consumer.
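
As a rough illustration of filtering on connection_type rather than span_kind, the sketch below builds a PromQL query string that drops messaging edges from the service graph metric. The helper name `service_graph_query` and the 5m window are assumptions made for this example; the `client`, `server`, and `connection_type` labels are the ones service graph metrics expose.

```python
def service_graph_query(exclude_messaging: bool = True, window: str = "5m") -> str:
    """Build a PromQL query over the service graph request counter.

    Hypothetical helper: when exclude_messaging is True, edges whose
    connection_type is "messaging_system" are filtered out. Series
    without a connection_type label still match the != matcher.
    """
    selector = '{connection_type!="messaging_system"}' if exclude_messaging else ""
    return (
        "sum by (client, server) "
        f"(rate(traces_service_graph_request_total{selector}[{window}]))"
    )


print(service_graph_query())
print(service_graph_query(exclude_messaging=False))
```

The first query is the kind of expression that could back a dashboard panel showing only direct request edges; the second keeps every edge, including those that go through Kafka.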
