
feat(processors): Traffic shaper processor plugin to shape uneven distribution of incoming metrics #15354

Closed
wants to merge 4 commits

Conversation

lakshmansai

feat(processors): Traffic shaper processor plugin to shape uneven distribution of incoming metrics

Summary

Use Case
An in-memory traffic shaper processor that evens out incoming traffic so that the output rate is uniform.

We use Telegraf as a proxy and receive data that is spiky in nature: every 10 minutes we receive a spike, and this affects our downstream systems, which must process at the same rate. This wastes resources, since CPU and memory need to be provisioned for the peaks.

Screenshot of spiky behavior before and after using this plugin (traffic_distribution): it is visible that after 1:00 the output rate is steady.

Checklist

  • [x] No AI generated code was used in this PR

Related issues

resolves #15353

@telegraf-tiger
Contributor

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label May 14, 2024
@lakshmansai
Author

lakshmansai commented May 14, 2024

!signed-cla

Contributor

@powersj powersj left a comment

Hi,

If we are going to take a processor like this, the messaging to the user needs to be improved, but I still need to talk to the rest of the team if this is something we wish to support as well. I've given some initial comments.

If I send 3 metrics and use your traffic shaper as follows:

[[inputs.exec]]
    commands = [
        "echo metric,host=a value=42",
        "echo metric,host=b value=1",
        "echo metric,host=c value=2",
    ]
    data_format = "influx"

[[processors.traffic_shaper]]
    samples = 1
    buffer_size = 10000

I still see all 3 metrics sent at each interval.

I see the time unit is not exposed in the config, which it should be, and defaults to 1 second. If I change this to 10 seconds to match the flush interval, I then see 1, sometimes 2, metrics get produced.

What I don't see is the processor's buffer size at any given time. I think this is a major issue, as a user would have no way to know or gauge whether they are sending enough metrics at any given time.

Thanks


## Number of samples to be emitted per time unit; default unit is seconds.
## This should be used in conjunction with number of telegraf instances.
samples = 20000
Contributor

Defaults can be commented out.

Author

done


## Buffer Size
## If buffer is full the incoming metrics will be dropped
buffer_size = 1000000
Contributor

Please expose the time unit option as a config.Duration.

Author

Done, added `rate` in the config.
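A hedged sketch of what the updated sample config might look like after this change. The `rate` field name is taken from the reply above; its exact name and syntax in the final plugin may differ:

```toml
[[processors.traffic_shaper]]
  ## Number of samples to be emitted per rate interval.
  # samples = 20000

  ## Interval over which `samples` metrics are emitted,
  ## parsed as a config.Duration (e.g. "1s", "10s").
  # rate = "1s"

  ## Buffer size. If the buffer is full, incoming metrics are dropped.
  # buffer_size = 1000000
```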

output traffic is uniform

Example of uneven traffic distribution
![traffic_distribution](./docs/traffic_distribution.png)
Contributor

I would prefer we omit the image.

Author

done

Comment on lines 21 to 22
Queue chan *telegraf.Metric
Acc telegraf.Accumulator
Contributor

Do these need to be exported?

Author

Nope, have changed them.

func (t *TrafficShaper) Stop() {
t.Log.Debugf("Got stop signal %s", time.Now().String())
close(t.Queue)
t.wg.Wait()
Contributor

This will block Telegraf from exiting until all metrics are flushed from the queue? I'm not sure this is the behavior we want. When someone closes or stops Telegraf, things should clean up, but this could block for hundreds or thousands of seconds.

Author

Added this as a config option so that users can choose accordingly.

"github.com/influxdata/telegraf/metric"
"github.com/influxdata/telegraf/testutil"
)

Contributor

Please include a test with tracking metrics. See the other processors for examples.

Author

done

@lakshmansai
Author

lakshmansai commented May 16, 2024

Have added the rate time interval as config, and we have exposed metrics like messagesInFlight for observability.


@powersj powersj closed this May 21, 2024