Wrong Counter metric data when ExportFailed #2366

Open
gliderkite opened this issue Nov 29, 2024 · 3 comments
Labels
triage:needmoreinfo Has been triaged and requires more information from the filer.

Comments

@gliderkite

gliderkite commented Nov 29, 2024

Related Problems?

We are using opentelemetry to send metrics via an OTLP exporter that is set up as follows:

global::set_error_handler(|e| warn!("opentelemetry: {e:?}"))?;

let provider = opentelemetry_otlp::new_pipeline()
    .metrics(runtime::Tokio)
    .with_exporter(
        opentelemetry_otlp::new_exporter()
            .tonic()
            .with_endpoint("http://endpoint:4317/"),
    )
    .with_resource(resource)
    .build()?;

global::set_meter_provider(provider.clone());

let meter = provider.meter("name");

In particular, we are interested in counting a particular occurrence, and we use a Counter for the job, set up as follows:

let counter = meter.u64_counter("requests-counter").init();

...

info!("querying API");
counter.add(1, &[KeyValue::new("endpoint", API_ENDPOINT)]);

Everything seems to work as expected: in the metrics explorer (DataDog) I can see the counter increasing as expected, with a corresponding INFO log for each increment.

However, it has happened multiple times that we see huge spikes in the counter metric at times when there is no reason to believe the counter was actually incremented (i.e., no calls to counter.add()):

[Screenshot: DataDog metrics explorer showing a spike in the counter at 6am]

As you can see in the example above, there was a spike at 6am, yet no INFO log was recorded for it (and no other trace recorded a call to that part of the code). Instead, for some unknown reason (maybe related to this other issue?), we see a log from the global error handler at the exact same time as the spike:

Trace(ExportFailed(Status { code: Unavailable, message: ", detailed error message: tcp connect error: Connection refused (os error 111)" }))

Describe the solution you'd like:

I would like the counter to report the correct number of times it was actually incremented; in particular, in the case of (unknown) connection-refused errors, no incorrect metric values should be reported.

If this is an error on my side (for example, a wrong setup or usage of the meter), please let me know.

Considered Alternatives

No response

Additional Context

No response

@gliderkite gliderkite added enhancement New feature or request triage:todo Needs to be triaged. labels Nov 29, 2024
@cijothomas
Member

Could you update to the newest version of the crates?
It's not clear what the exact issue is, but one guess: you are not explicitly setting temporality, so it defaults to cumulative, which always exports the total count since the start of the process; that may not be what you are looking for. Choosing cumulative vs. delta is usually dictated by the backend, so make sure you follow the backend's recommendation.
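For illustration, here is a minimal sketch of setting delta temporality explicitly, assuming the 0.27-era opentelemetry-otlp MetricExporter builder API; method names such as with_temporality, the Temporality import path, and the PeriodicReader signature have moved between releases, so check the docs for the exact versions in your Cargo.lock rather than treating this as a drop-in replacement:

use opentelemetry::global;
use opentelemetry_otlp::{MetricExporter, WithExportConfig};
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider, Temporality};
use opentelemetry_sdk::runtime;

fn init_metrics() -> Result<SdkMeterProvider, Box<dyn std::error::Error>> {
    // OTLP/gRPC exporter reporting deltas instead of cumulative totals;
    // check the backend's recommendation before choosing one.
    let exporter = MetricExporter::builder()
        .with_tonic()
        .with_endpoint("http://endpoint:4317/")
        .with_temporality(Temporality::Delta)
        .build()?;

    // Collect and export on a periodic reader (the runtime argument was
    // dropped in later SDK releases).
    let reader = PeriodicReader::builder(exporter, runtime::Tokio).build();

    let provider = SdkMeterProvider::builder().with_reader(reader).build();
    global::set_meter_provider(provider.clone());
    Ok(provider)
}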

We won't be able to dig into backend-specific issues, so it would be really appreciated if you could put together a repro that we can investigate.

@gliderkite
Author

gliderkite commented Dec 2, 2024

Could you update to the newest version of the crates?

Yes, I will, and if that still doesn't fix it I will provide a separate crate containing the part of the code related to the counter setup and usage (I posted this issue because I didn't see any fix related to this in the CHANGELOG for the latest version).

It's not clear what the exact issue is, but one guess: you are not explicitly setting temporality, so it defaults to cumulative, which always exports the total count since the start of the process; that may not be what you are looking for. Choosing cumulative vs. delta is usually dictated by the backend, so make sure you follow the backend's recommendation.

Could you please elaborate on what exactly you mean by this? The issue is this:

  1. I send, for example, 10 requests between 10 and 11am. What I see is:
    • a value of 10 when looking at that time interval in the metrics explorer UI
    • 10 INFO logs
  2. I then send another 10 requests between 11am and 12pm. What I see is:
    • a value of 10 when looking at that time interval in the metrics explorer UI
    • a value of 20 when looking at the whole interval from 10am to 12pm
    • 10 INFO logs (or 20 when looking at the whole interval from 10am)

In none of these cases would I expect a spike, with no related INFO logs, during a time interval when no requests were sent. Yet that is exactly what happens at the same time we get the ExportFailed error (or sometimes Metric(ExportErr(Status { code: Unavailable, message: ", detailed error message: tcp connect error: Connection refused (os error 111)" }))). And this has been happening consistently, even on days when no requests at all were sent.

In summary, both the backend and the counter implementation seem to work as expected, except when the metrics/traces cannot be exported.

@cijothomas
Member

We need a repro application to investigate this. (Note that we won't be able to look into your backend, so the repro should use either the stdout exporter or the OTLP exporter pointed at a local collector with debug output, so we can see what is emitted from the application itself.)
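For anyone putting such a repro together, a minimal sketch along these lines might work. It assumes the stdout metric exporter from the opentelemetry-stdout crate plus PeriodicReader/SdkMeterProvider from roughly the same SDK era as the snippet above; type and method names have shifted across releases (for example, the exporter was called MetricsExporter in older opentelemetry-stdout versions, and the instrument builder's init() later became build()), so adjust to your versions:

use opentelemetry::{global, KeyValue};
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use opentelemetry_sdk::runtime;

#[tokio::main]
async fn main() {
    // Print every collected metric batch to stdout instead of exporting it
    // over OTLP, so the application's own output can be compared against
    // the number of recorded increments (no backend involved).
    let exporter = opentelemetry_stdout::MetricExporter::default();
    let reader = PeriodicReader::builder(exporter, runtime::Tokio).build();
    let provider = SdkMeterProvider::builder().with_reader(reader).build();
    global::set_meter_provider(provider.clone());

    let meter = global::meter("repro");
    let counter = meter.u64_counter("requests-counter").init();

    // Record a known number of increments, then flush on shutdown and
    // compare with what the stdout exporter prints.
    for _ in 0..10 {
        counter.add(1, &[KeyValue::new("endpoint", "/api")]);
    }
    provider.shutdown().unwrap();
}

Running this against the versions pinned in your Cargo.lock and pasting the stdout output into the issue would show whether the extra increments are produced by the SDK itself or only appear on the backend side.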

@cijothomas cijothomas added triage:needmoreinfo Has been triaged and requires more information from the filer. and removed enhancement New feature or request triage:todo Needs to be triaged. labels Dec 3, 2024