Trace data is being generated even though there are no requests from service A to service B #3689

Open
finda-yeongjo opened this issue May 20, 2024 · 3 comments

Comments

@finda-yeongjo

Describe the bug
When monitoring Java applications with OpenTelemetry Java auto-instrumentation, the trace data incorrectly shows service A calling service B (A -> B), even though there is no actual call from A to B.
In keeping with microservice design, A and B only produce and consume data through a Kafka topic, which keeps them loosely coupled. The issue is visible in the traces_service_graph_request_total metric and in the Zipkin trace data, both of which suggest a relationship that does not exist.
[Screenshot: 2024-05-20, 10:44:48 AM]

If I enable the following two options in the OpenTelemetry instrumentation, service A is replaced by "user" in the graph, but the connection to service B is still reported.

      - name: OTEL_INSTRUMENTATION_MESSAGING_RECEIVE_TELEMETRY_ENABLED
        value: "true"
      - name: OTEL_INSTRUMENTATION_MESSAGING_SEND_TELEMETRY_ENABLED
        value: "true"
[Screenshot: 2024-05-20, 10:41:45 AM]

I initially raised this issue with the OpenTelemetry team, but their response suggested raising the issue with Grafana and Tempo instead.
Ref. open-telemetry/opentelemetry-java-instrumentation#11348

To Reproduce
Steps to reproduce the behavior:

  1. Deploy two Java Spring services, A and B, in Kubernetes using the OpenTelemetry Java auto-instrumentation agent with the following configuration:
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: sample-instrumentation
  namespace: test
spec:
  propagators:
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  java:
    image: "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.29.0"
    env:
      - name: OTEL_METRICS_EXPORTER
        value: "prometheus"
      - name: OTEL_METRICS_EXEMPLAR_FILTER
        value: "trace_based"
      - name: OTEL_TRACES_EXPORTER
        value: "zipkin"
      - name: OTEL_EXPORTER_ZIPKIN_ENDPOINT
        value: "http://SOME_TEST_OTELCOLLECTOR_ENDPOINT:9411/api/v2/spans"
      - name: OTEL_INSTRUMENTATION_HTTP_SERVER_CAPTURE_REQUEST_HEADERS
        value: "content-type"
      - name: OTEL_INSTRUMENTATION_HTTP_SERVER_CAPTURE_RESPONSE_HEADERS
        value: "content-type"
  2. Observe the trace data in Tempo (with Grafana).
  3. Notice that the trace data incorrectly indicates that service A is making requests to service B.
  4. Check the A -> B edge in Grafana using the service graph or the traces_service_graph_request_total metric.
  5. For a more accurate reproduction, create a Kafka topic that the two services produce to and consume from, and test with it.

Expected behavior
The trace data and metrics should accurately reflect the interactions between services. Specifically, no traces or metrics should suggest a direct interaction between service A and service B when there is none.

Environment:

  • Infrastructure: AWS EKS, AWS MSK
  • Deployment tool: Using Kubernetes manifests with gitops repo & ArgoCD

Additional Context
All services are deployed as pods in EKS.
The issue persists even after verifying that there are no overlapping or contaminated headers and that Trace IDs are unique and correctly configured.
The environment configuration for OpenTelemetry instrumentation includes settings for exporting to Prometheus and Zipkin, capturing content-type headers for HTTP requests and responses.

Service A
JDK: Amazon Corretto 17
Spring: 2.7.1
OS: Amazon Linux (EKS)

Service B
JDK: Amazon Corretto 17
Spring: 3.0.5
OS: Amazon Linux (EKS)

@mapno
Member

mapno commented May 20, 2024

Hi! Service graphs have a number of ways of identifying communication between services; for Tempo, they're described in the docs. Connections do not necessarily represent HTTP requests.

* A request across a messaging system where the outgoing and the incoming span must have `span.kind`, `producer`, and `consumer` respectively.

This is what's identifying a connection between the two services.
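As a rough illustration of the rule quoted above, the pairing can be sketched in Python. This is a simplified model and not Tempo's actual implementation: the `Span` class, the `PAIRINGS` table, the `service_graph_edges` helper, and the service names are all hypothetical, and the empty string is used here to stand for a direct request with no special connection type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    kind: str  # e.g. "client", "server", "producer", "consumer", "internal"


# (parent span kind, child span kind) -> connection type of the resulting edge.
# Hypothetical table for this sketch; "" stands for a direct request.
PAIRINGS = {
    ("client", "server"): "",
    ("producer", "consumer"): "messaging_system",
}


def service_graph_edges(spans):
    """Return a set of (from_service, to_service, connection_type) edges."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for child in spans:
        parent = by_id.get(child.parent_id)
        if parent is None:
            continue  # root span, or parent not part of this batch
        connection_type = PAIRINGS.get((parent.kind, child.kind))
        if connection_type is None:
            continue  # this kind pair does not form an edge in the sketch
        edges.add((parent.service, child.service, connection_type))
    return edges


# Service A publishes to a Kafka topic; service B consumes the message.
spans = [
    Span("1", None, "service-a", "producer"),
    Span("2", "1", "service-b", "consumer"),
    Span("3", "2", "service-b", "internal"),
]
edges = service_graph_edges(spans)
print(edges)
```

Under these assumptions, the producer span in service A and the consumer span in service B form an A -> B edge of type `messaging_system`, even though A never sends an HTTP request to B.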

@finda-yeongjo
Author

Hey @mapno, your answer was fantastic. Using Tempo as a data source, I have removed the problematic parts from the dashboard and the various graphs. I blame myself for not reading the docs carefully.

May I ask one more question?
When I filter by span_kind, there is no data for any value (span_kind_consumer, producer, server, client, or unspecified). Is any additional configuration needed? Simply setting connection_type=messaging_system shows all servers communicating through MSK.
[Screenshot: 2024-05-20, 6:24:14 PM]

I am using auto-instrumentation because I cannot require every technical team to create spans manually, which makes it difficult for me to directly control headers, span kinds, IDs, and so on.

@mapno
Member

mapno commented May 20, 2024

Hey! span_kind is not a label of service graph metrics (it is set on span-metrics, though). I'm not sure it would make sense to add it in the first place, since it is implied by the connection type: i.e., if connection_type is messaging_system, the spans must have had kinds producer and consumer.
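
Since `connection_type` is the label that distinguishes messaging edges, a dashboard query can exclude them with a selector such as `traces_service_graph_request_total{connection_type!="messaging_system"}`. The Python sketch below mimics that negative matcher on in-memory label sets. It only illustrates the filtering semantics: the `matches_not_equal` helper and the sample series are hypothetical, and it assumes that a series without the label is treated as having an empty value, as PromQL matchers do.

```python
def matches_not_equal(labels, name, value):
    """Mimic a PromQL `name!="value"` matcher on one series' label set.

    A missing label is treated as an empty string, so series without the
    label also pass a negative match against a non-empty value.
    """
    return labels.get(name, "") != value


# Hypothetical service graph series, reduced to their labels.
series = [
    {"client": "service-a", "server": "service-b", "connection_type": "messaging_system"},
    {"client": "service-a", "server": "service-c"},
]

# Keep only edges that are not messaging-system connections.
direct_edges = [
    s for s in series
    if matches_not_equal(s, "connection_type", "messaging_system")
]
print(direct_edges)
```

With this filter, the A -> B edge that exists only through the Kafka topic is dropped, while the edge without a `connection_type` label is kept.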
