Rate-limiting Sentry issues on the client
I love Sentry as an error tracker, but one issue I keep running into is a few noisy errors consuming the entire quota. After that, Sentry stops recording errors. That means no error tracking until your quota is renewed in 27 days 😢.
This can be caused by all sorts of things: infrastructure issues, a third-party API outage triggering repeated failures, or a scheduled job triggering thousands of failing tasks.
Whatever the cause, letting things crash is often reasonable¹. It shouldn't break error tracking!
I did not find a satisfying solution with existing tools, so I built a basic client-side rate limiter. The code can be found here.
Read the "Existing solutions" sectionExisting solutions
Sentry offers built-in ways to control which errors get recorded/sent:
- Spike protection (enabled by default) prevents massive bursts from consuming your quota. Great, but does not help with sustained high-volume errors.
- Inbound filters are a nice, simple way to completely ignore a few specific errors, but they require a 'Business' plan.
- Sentry rate limiting runs on the Sentry servers and applies to all events equally. If a noisy issue triggers rate limiting, rare but important events that happen later are dropped too.
- The client's `sample_rate` option also applies equally to all errors. Not sure when that one is ever a good idea 🙂.
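For illustration, this is roughly what a global sample rate looks like (the DSN below is a placeholder); it keeps a fixed fraction of all events, picked at random, so noisy and rare errors are thinned out equally:

```python
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    # Keep ~25% of error events, chosen at random. A rare but important
    # error has the same chance of being dropped as a noisy one.
    sample_rate=0.25,
)
```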
None of these fully address the "noisy errors" situation, so I explored a different approach: client-side rate limiting per issue.
Read the "Client-side rate limiting per issue" sectionClient-side rate limiting per issue
Sentry's `before_send` callback lets us use custom logic to modify or drop events before they are sent. Perfect for filtering noisy errors.
What we want:
- If an error occurs 1000 times per hour, record only the first 100.
- Rare events should never be dropped.
The API should look like this:
sentry_sdk.init(before_send=drop_event_if_we_seen_many_of_these_recently)
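As a minimal sketch, the callback could be shaped like this. The SDK calls `before_send` with the event and a hint; returning `None` drops the event, returning the event sends it. `seen_too_many_recently` is a hypothetical helper, a stub for now, with the actual bookkeeping sketched in the next section:

```python
from typing import Any, Optional


def seen_too_many_recently(event: dict[str, Any], hint: dict[str, Any]) -> bool:
    # Stub: the per-issue rate-limit check is sketched further down.
    return False


def drop_event_if_we_seen_many_of_these_recently(
    event: dict[str, Any], hint: dict[str, Any]
) -> Optional[dict[str, Any]]:
    # before_send may return a (possibly modified) event to send it,
    # or None to drop it on the client.
    if seen_too_many_recently(event, hint):
        return None
    return event
```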
Sentry groups similar errors into issues by computing a fingerprint (some kind of hash). This happens on the server. To apply rate limiting per issue on the client, we:
- Compute a fingerprint for each event: a best-effort approximation of the issue fingerprint Sentry computes on the server.
- Track when we have seen each fingerprint.
- Drop events when they exceed the configured rate limit.
This requires storing timestamps of past events. It is done in memory, which means the rate limiting is per process: if we run 4 processes, each tracks its limits separately. This is relevant when choosing the rate-limit numbers.
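Here is a sketch of that bookkeeping, assuming the numbers from the example above (at most 100 events per issue per hour) and a sliding one-hour window; the names are illustrative. It works on fingerprint strings, which the next section shows how to compute:

```python
import time
from collections import defaultdict, deque

# Per-process, in-memory history: fingerprint -> timestamps of recently kept events.
_recent_events: defaultdict[str, deque[float]] = defaultdict(deque)

WINDOW_SECONDS = 3600        # look at the last hour
MAX_EVENTS_PER_WINDOW = 100  # keep at most 100 events per issue per window


def should_drop(fingerprint: str) -> bool:
    now = time.monotonic()
    timestamps = _recent_events[fingerprint]

    # Forget events that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()

    if len(timestamps) >= MAX_EVENTS_PER_WINDOW:
        return True  # this issue is being noisy right now, drop the event

    timestamps.append(now)
    return False
```

Because everything lives in a module-level dict, the counts reset on process restart and are not shared between workers, which is exactly the per-process caveat above.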
Read the "Computing a fingerprint" sectionComputing a fingerprint
We really don't want to reimplement all of Sentry's heuristics, and we don't have to: 'good enough' is good enough here².
Here's how the fingerprint for an exception could be computed (just a string representation of the stacktrace):
import traceback
from traceback import StackSummary
from types import TracebackType

exc_tb: TracebackType  # traceback of the exception being reported
tb_summary: StackSummary = traceback.extract_tb(exc_tb)
# One "filename:lineno" entry per frame, joined into a single string
fingerprint = "\n".join([f"{frame.filename}:{frame.lineno}" for frame in tb_summary])
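Inside `before_send`, the traceback is available through the hint: the SDK includes the original `exc_info` tuple when the event was created from an exception. A small wrapper (the `compute_fingerprint` name is mine) could look like this:

```python
import traceback
from typing import Any, Optional


def compute_fingerprint(hint: dict[str, Any]) -> Optional[str]:
    exc_info = hint.get("exc_info")
    if exc_info is None:
        return None  # not an exception event (e.g. a captured message)

    _exc_type, _exc_value, exc_tb = exc_info
    tb_summary = traceback.extract_tb(exc_tb)
    return "\n".join(f"{frame.filename}:{frame.lineno}" for frame in tb_summary)
```

Glued together, the hypothetical `seen_too_many_recently` helper from earlier boils down to `should_drop(compute_fingerprint(hint))`, skipping events that have no fingerprint.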
It might also be possible to reuse some of the server-side fingerprinting code from Sentry for a more robust solution.
Read the "Writing code in before_send" sectionWriting code in before_send
Testing `before_send` isn't always straightforward. Some practical tips:
- Use `init(debug=True, ...)` in development (see the snippet after this list). Without this, exceptions raised in `before_send` are silently discarded (my favorite dev experience!).
- Manual testing with Kent feels much better than testing against a real Sentry DSN, and it was easy to set up.
- Avoid database or network calls inside `before_send`: it runs in the request/response cycle (assuming a web context).
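Putting the tips together, an init call for local development might look roughly like this (the DSN and the module path for the callback are placeholders):

```python
import sentry_sdk

# Placeholder import: wherever your before_send callback lives.
from myapp.error_tracking import drop_event_if_we_seen_many_of_these_recently

sentry_sdk.init(
    # Point this at a local Kent instance or a throwaway Sentry project.
    dsn="http://publickey@localhost:8000/1",
    # Surface exceptions raised inside before_send instead of silently dropping them.
    debug=True,
    before_send=drop_event_if_we_seen_many_of_these_recently,
)
```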
Read the "Conclusion" sectionConclusion
I haven't seen a lot of discussion on this topic. If you've solved it differently, I'd love to hear about it!
Regardless of the implementation, client-side filtering of noisy errors feels like a better design: it saves unnecessary network calls and bandwidth, and prevents extra work on both the client and the server.
Now we can let things crash and stop worrying about our Sentry quota 🚀.
Read the "Footnotes" sectionFootnotes
1. You get reports in your error tracker and can make an informed decision on which code to write. Also, if you can't do anything about the error, at least you know it happened.
2. If two events get grouped differently on the client vs. Sentry, one issue just gets more reports (or the reports are split in two issues). Not a big deal.