FE Observability
Decision
We will use Sentry (SaaS) as our frontend observability provider. The firefighter is responsible for monitoring frontend errors. Teams must fix reported errors with high priority so that the error feed stays clean and new issues remain visible.
Problems
We don't have an observability solution that alerts us in real time about exceptions and errors in our frontend applications. As a result, we only discover bugs when customers contact the Retention team and it notifies us, or when we stumble on them by chance during development.
Context
- We previously ran a self-hosted Sentry instance but eventually removed it because its errors were not consistently monitored, neither as part of firefighting nor by the owning teams.
- We need to consider the GDPR implications of the tooling. In general, compliance is easier for us if the tooling is hosted in the EU.
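One way to act on the GDPR concern in code is to scrub direct identifiers in the browser before events are sent. The sketch below assumes events pass through the SDK's `beforeSend` hook; the `MinimalEvent` type and the `scrubEvent` and `stripQueryAndFragment` helpers are hypothetical and only model the fields being scrubbed, not the SDK's full event type.

```typescript
// Hypothetical, minimal subset of an error event. The SDK's real event type
// has many more fields; only the ones scrubbed here are modeled.
interface MinimalEvent {
  message?: string;
  user?: { id?: string; email?: string; username?: string; ip_address?: string };
  request?: { url?: string };
}

// Drop query string and fragment, where tokens or e-mail addresses often end up.
function stripQueryAndFragment(url: string): string {
  return url.split(/[?#]/)[0] ?? url;
}

// Keep only a pseudonymous user id (still useful for "users affected" counts)
// and remove direct identifiers before the event leaves the browser.
function scrubEvent(event: MinimalEvent): MinimalEvent {
  const scrubbed: MinimalEvent = { ...event };
  if (event.user) {
    scrubbed.user = event.user.id !== undefined ? { id: event.user.id } : {};
  }
  const url = event.request?.url;
  if (url !== undefined) {
    scrubbed.request = { ...event.request, url: stripQueryAndFragment(url) };
  }
  return scrubbed;
}

// Example: e-mail and IP address are dropped, the URL loses its token.
const cleaned = scrubEvent({
  user: { id: "u-123", email: "jane@example.com", ip_address: "203.0.113.7" },
  request: { url: "https://dashboard.example.com/reports?token=abc#top" },
});
console.log(cleaned);
```

In the app this would be wired up roughly as `beforeSend: (event) => scrubEvent(event)`, possibly with a cast to the SDK's own event type. Leaving the SDK's `sendDefaultPii` option at its default of `false` complements this.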
Options
The following options were considered:
- Sentry
- Sentry self-hosted
- No frontend observability: rely on reports from the Retention team
Reasoning
Sentry
Sentry focuses on error reporting, with additional performance-monitoring features. For each error it offers a detailed breakdown, including:
- Tags (e.g. handled vs. unhandled exceptions, OS, etc.)
- Stack trace
- Breadcrumbs (the trail of events leading up to the exception)
- User context metadata
- Replays (similar to the recordings feature in PostHog)
- Statistics (e.g. error frequency, first seen, last seen, etc.)
For performance monitoring:
- Statistical analysis of transactions (e.g. transactions per minute, failure rate, slow HTTP operations)
- Configurable sampling rate for transactions
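The SDK can sample transactions either with a fixed `tracesSampleRate` or with a `tracesSampler` function that returns a rate per transaction. The sketch below shows the second approach under stated assumptions: `SamplingContext` is a simplified stand-in (the real context object differs between SDK versions), and the route prefixes and rates are hypothetical placeholders, not measured values.

```typescript
// Simplified stand-in for the context the SDK passes to `tracesSampler`;
// the real object has more fields and its shape varies by SDK version.
interface SamplingContext {
  name: string;            // transaction name, e.g. a route such as "/reports"
  parentSampled?: boolean; // decision inherited from an upstream trace, if any
}

// Hypothetical per-route rates: sample a key flow fully, skip pure noise.
const ROUTE_RATES: ReadonlyArray<readonly [prefix: string, rate: number]> = [
  ["/login", 1.0],
  ["/healthcheck", 0],
];
const DEFAULT_RATE = 0.2; // hypothetical: keep 20% of all other transactions

function tracesSampler(ctx: SamplingContext): number {
  // Honor an upstream decision so a distributed trace is kept or dropped as a whole.
  if (ctx.parentSampled !== undefined) {
    return ctx.parentSampled ? 1 : 0;
  }
  const match = ROUTE_RATES.find(([prefix]) => ctx.name.startsWith(prefix));
  return match !== undefined ? match[1] : DEFAULT_RATE;
}

console.log(tracesSampler({ name: "/reports/monthly" }));
```

Routes without an explicit entry fall back to `DEFAULT_RATE`, which is the main lever for keeping transaction volume within the plan's monthly allocation.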
DevX:
- A React error boundary component that automatically catches and reports rendering exceptions
- Redux integration that adds extra context to breadcrumbs
- A profiler higher-order component (HOC) for React components
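The Redux integration is set up through `Sentry.createReduxEnhancer`, which accepts optional `actionTransformer` and `stateTransformer` callbacks that control what ends up in breadcrumbs and event context. As a sketch, assuming a hypothetical state shape with an `auth` slice, the two callbacks below keep secrets out of Sentry while preserving useful context; the `AppState` and `Action` types and all action names are illustrative only.

```typescript
// Hypothetical shape of the dashboard's Redux state.
interface AppState {
  auth: { token: string | null; userId: string | null };
  ui: { theme: string };
}

interface Action {
  type: string;
  payload?: unknown;
}

// State snapshot attached to events: same as AppState but without the token.
type SafeState = Omit<AppState, "auth"> & { auth: { userId: string | null } };

// Hypothetical high-frequency actions that would only flood the breadcrumbs.
const NOISY_ACTIONS = new Set(["ui/pointerMoved", "ui/scrolled"]);

function stateTransformer(state: AppState): SafeState {
  return { ...state, auth: { userId: state.auth.userId } };
}

function actionTransformer(action: Action): Action | null {
  if (NOISY_ACTIONS.has(action.type)) return null; // dropped from breadcrumbs
  if (action.type.startsWith("auth/")) return { type: action.type }; // keep type, drop payload
  return action;
}

console.log(stateTransformer({ auth: { token: "secret", userId: "u-1" }, ui: { theme: "dark" } }));
```

These would be passed as `Sentry.createReduxEnhancer({ stateTransformer, actionTransformer })`, and the resulting enhancer added when creating the store.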
External integrations:
- Slack
- AWS SQS
- Asana
- Jira
- GitHub
- and others
Sentry SaaS vs Self-hosted
The Team tier of Sentry SaaS includes the following monthly allocations:
- 50k errors for monitoring
- 100k performance transaction units
- 500 session replays
- 150 file attachments to errors
With our current ~500 monthly active users, the Team tier should comfortably cover our needs.
Self-hosting, on the other hand, would require paying for AWS infrastructure (ECS, load balancers, RDS, etc.) as well as developer time for maintenance and upgrades.
At 26 USD per month for the Team tier, SaaS is clearly the most cost-effective option for us.
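A quick per-user breakdown helps sanity-check that the plan covers our usage. The sketch divides each monthly allocation listed above by our ~500 monthly active users; `MonthlyQuota`, `TEAM_TIER` and `perUserBudget` are illustrative names, and the figures are the ones stated in this ADR, not values fetched from Sentry.

```typescript
interface MonthlyQuota {
  errors: number;
  transactions: number;
  replays: number;
  attachments: number;
}

// Monthly allocations as listed in this ADR.
const TEAM_TIER: MonthlyQuota = {
  errors: 50_000,
  transactions: 100_000,
  replays: 500,
  attachments: 150,
};

// Divide every monthly allocation evenly across active users.
function perUserBudget(quota: MonthlyQuota, monthlyActiveUsers: number): MonthlyQuota {
  if (monthlyActiveUsers <= 0) {
    throw new RangeError("monthlyActiveUsers must be positive");
  }
  return {
    errors: quota.errors / monthlyActiveUsers,
    transactions: quota.transactions / monthlyActiveUsers,
    replays: quota.replays / monthlyActiveUsers,
    attachments: quota.attachments / monthlyActiveUsers,
  };
}

console.log(perUserBudget(TEAM_TIER, 500));
// → { errors: 100, transactions: 200, replays: 1, attachments: 0.3 }
```

Errors and transactions leave ample headroom, but replays come out at roughly one per active user per month, so replays would likely need a low `replaysSessionSampleRate`, relying mainly on `replaysOnErrorSampleRate` to capture sessions that hit an error.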
Consequences
How do we implement this change?
An initial implementation was done alongside writing this ADR.
Who will implement the change?
The Create team will start by implementing frontend monitoring for client-dashboard-2. Once we have gathered learnings there, we can expand monitoring to the admin dashboard and the legacy client dashboard.
How do we teach this change?
Through learning journeys and possibly a demo of the Sentry dashboard at a Learning Friday, so that the rest of the department is comfortable using it for firefighting.
What could go wrong?
We might fall back into the old pattern of ignoring frontend errors and exceptions, wasting the money and effort invested.
What do we do if something goes wrong?
Frontend observability SDKs are usually exposed as a provider plus a number of optional integrations. Removing them is straightforward and should not affect the rest of the code.
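Keeping removal cheap is easier if all monitoring setup sits behind one small function called from the app entry point. A minimal sketch, assuming a hypothetical `MonitoringConfig` populated from build-time environment variables: it returns SDK init options only when monitoring is enabled and a DSN is present, so disabling or removing monitoring touches a single place. The option names mirror Sentry's `dsn`, `environment` and `tracesSampleRate`; the sample rate is a placeholder.

```typescript
// Hypothetical app configuration, e.g. populated from build-time env vars.
interface MonitoringConfig {
  enabled: boolean;
  dsn?: string;
  environment: string;
}

// Subset of SDK init options set by this wrapper (names mirror Sentry's).
interface SdkInitOptions {
  dsn: string;
  environment: string;
  tracesSampleRate: number;
}

// Single switch for frontend monitoring: null means "do not initialize".
function buildMonitoringOptions(cfg: MonitoringConfig): SdkInitOptions | null {
  if (!cfg.enabled || !cfg.dsn) {
    return null;
  }
  return {
    dsn: cfg.dsn,
    environment: cfg.environment,
    tracesSampleRate: 0.2, // placeholder; tune against the transaction allocation
  };
}

const options = buildMonitoringOptions({ enabled: false, environment: "staging" });
console.log(options === null ? "monitoring disabled" : "monitoring enabled");
```

At the entry point this would be used as `const opts = buildMonitoringOptions(config); if (opts) Sentry.init(opts);`, with the error boundary and other integrations only mounted when `opts` is non-null.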
What is still unclear?
How do we best integrate frontend and backend monitoring? Should we also use Sentry for the backend, or introduce a dedicated observability solution such as Honeycomb or AWS X-Ray?