FE Observability

Decision

We use Sentry as our observability provider. The firefighter is responsible for monitoring frontend errors. Teams must fix errors with high priority so that the monitoring stays clean.

Problems

We don't have an observability solution that alerts us in real time about exceptions and errors occurring in our frontend applications. This means we only discover bugs or problems when customers reach out to the Retention team, who in turn notify us, or when we discover them by chance during development.

Context

We previously had a self-hosted version of Sentry but eventually removed it because the errors were not being strictly monitored, neither as part of firefighting nor by the corresponding teams.
We need to consider the GDPR aspect of the tooling. In general, it is easier for us if the tooling is hosted in the EU.
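
Independent of where the tooling is hosted, the GDPR concern can also be reduced by removing personal data from error events before they leave the browser. The following is a minimal sketch of that idea, not our actual configuration: `MinimalEvent` is a hypothetical, simplified stand-in for an SDK event payload, and the choice of which fields to drop is an assumption. Sentry's JavaScript SDK provides a `beforeSend` callback from which a function like this could be called; that wiring is not shown here.

```typescript
// Hypothetical, simplified stand-in for the payload an error-reporting SDK sends.
// Real SDK event types carry many more fields.
interface MinimalEvent {
  message?: string;
  user?: { id?: string; email?: string; ip_address?: string };
  request?: { url?: string; headers?: Record<string, string> };
}

const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Returns a copy of the event with directly identifying data removed.
// The input event is not mutated.
function scrubEvent(event: MinimalEvent): MinimalEvent {
  const scrubbed: MinimalEvent = { ...event };

  if (event.user) {
    // Keep only the opaque internal id; drop email and IP address.
    scrubbed.user = event.user.id !== undefined ? { id: event.user.id } : {};
  }

  if (event.message !== undefined) {
    // Mask email addresses that ended up inside error messages.
    scrubbed.message = event.message.replace(EMAIL_PATTERN, "[redacted-email]");
  }

  if (event.request) {
    // Copy the headers and drop the ones that can carry credentials.
    const headers: Record<string, string> = { ...(event.request.headers ?? {}) };
    delete headers["Authorization"];
    delete headers["Cookie"];
    scrubbed.request = { ...event.request, headers };
  }

  return scrubbed;
}
```

Keeping the scrubbing logic in a plain function makes it testable without the SDK and reusable if the provider changes.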

Options

The following options were considered:

Sentry
Sentry self-hosted
Communication with the Retention team is enough; continue without frontend observability

Reasoning

Sentry

Sentry focuses on error reporting, with additional features for performance monitoring. It offers a comprehensive breakdown of each error, including:

Tags (e.g. handled vs. unhandled exceptions, OS, etc.)
Stack trace
Breadcrumbs (the trail of events leading up to the exception)
Metadata context from the user
Replays (similar to the recordings feature from PostHog)
Statistics reporting (e.g. frequency of the error, first seen, last seen, etc.)

From the performance monitoring aspect:

Statistical analysis of transactions (e.g. transactions per minute, failure rate, slow HTTP operations, etc.)
Configurable sampling rate for transactions
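
To illustrate what defining a sampling rate means in practice, here is a minimal sketch of a per-transaction sampling decision. The route prefixes and rates are hypothetical and chosen only for illustration. Sentry's JavaScript SDK exposes `tracesSampleRate` and `tracesSampler` options for this purpose; the code below does not call the SDK and only shows the underlying logic.

```typescript
// Hypothetical route prefixes and rates, for illustration only.
const RATE_RULES: ReadonlyArray<{ prefix: string; rate: number }> = [
  { prefix: "/health", rate: 0 }, // never trace health checks
  { prefix: "/checkout", rate: 1 }, // always trace a critical flow
];

// Fraction of all other transactions to keep.
const DEFAULT_RATE = 0.1;

// Returns the sampling rate (between 0 and 1) for a transaction name.
function sampleRateFor(transactionName: string): number {
  const rule = RATE_RULES.find((r) => transactionName.startsWith(r.prefix));
  return rule ? rule.rate : DEFAULT_RATE;
}

// Decides whether a single transaction is recorded.
// `random` is injectable so the decision can be tested deterministically.
function shouldSample(
  transactionName: string,
  random: () => number = Math.random,
): boolean {
  return random() < sampleRateFor(transactionName);
}
```

A lower default rate consumes the monthly transaction allocation more slowly, at the cost of less complete performance data.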

DevX:

React error boundary component that automatically catches and reports exceptions
Integration with Redux for added context within breadcrumbs
Profiler HOC for React components

External integrations:

Slack
AWS SQS
Asana
Jira
GitHub
among others...

Sentry SaaS vs Self-hosted

The Team tier of Sentry SaaS gives us the following monthly allocations:

50k errors for monitoring
100k performance transaction units
500 session replays
150 file attachments to errors

With our current ~500 monthly active users, the Teams tier should be more than enough to cover our needs.
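
As a rough sanity check of this estimate, the allocations can be divided by the number of monthly active users. The sketch below uses only the figures quoted above; the constant and function names are made up for this example.

```typescript
// Monthly allocations and user count as quoted in this ADR.
const MONTHLY_QUOTA = {
  errors: 50_000,
  transactionUnits: 100_000,
  replays: 500,
} as const;
const MONTHLY_ACTIVE_USERS = 500;

// Average budget per active user per month for one quota item.
function perUserBudget(quota: number, users: number): number {
  if (users <= 0) {
    throw new RangeError("users must be positive");
  }
  return quota / users;
}

const errorsPerUser = perUserBudget(MONTHLY_QUOTA.errors, MONTHLY_ACTIVE_USERS); // 100
const transactionsPerUser = perUserBudget(MONTHLY_QUOTA.transactionUnits, MONTHLY_ACTIVE_USERS); // 200
const replaysPerUser = perUserBudget(MONTHLY_QUOTA.replays, MONTHLY_ACTIVE_USERS); // 1
```

This gives about 100 errors, 200 performance transaction units, and 1 session replay per active user per month. Session replays have the smallest per-user budget of the three.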

On the other hand, if we take the self-hosted path, we would need to pay for AWS hosting services (ECS, load balancers, RDS, etc.) and for the developer time spent on software maintenance. At 26 USD per month for the Team tier, SaaS is clearly the most cost-effective solution for us.

Consequences

How do we implement this change?

The initial implementation was done as part of writing this ADR.

Who will implement the change?

The Create team can start implementing frontend monitoring for client-dashboard-2. After gathering our learnings, we can expand monitoring to the admin dashboard as well as the legacy client dashboard.

How do we teach this change?

Through learning journeys and potentially a demo of the accompanying dashboard solution at a Learning Friday, so that the rest of the department is comfortable using it for firefighting.

What could go wrong?

We might fall back into the same behavior as before, where frontend errors and exceptions are ignored, wasting resources and effort.

What do we do if something goes wrong?

Frontend observability SDKs are usually exposed in the form of a provider with a number of optional integrations. Removing them is easy and should not have any impact on the rest of the code.
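
One way to keep removal this cheap is to have application code report errors through a small interface of our own rather than calling the SDK directly. The sketch below is an assumption about how that could look, not a description of the current implementation; `ErrorReporter`, `noopReporter`, `InMemoryReporter`, `setReporter`, and `reportError` are hypothetical names.

```typescript
// Hypothetical facade: the only error-reporting surface application code sees.
interface ErrorReporter {
  captureException(error: unknown, context?: Record<string, string>): void;
}

// Used when no provider is configured, or after the SDK has been removed.
const noopReporter: ErrorReporter = {
  captureException: () => {},
};

// Simple in-memory implementation, useful in tests or local development.
class InMemoryReporter implements ErrorReporter {
  readonly captured: Array<{ error: unknown; context: Record<string, string> }> = [];

  captureException(error: unknown, context: Record<string, string> = {}): void {
    this.captured.push({ error, context });
  }
}

let activeReporter: ErrorReporter = noopReporter;

// Swaps the active implementation, e.g. to an adapter wrapping the vendor SDK.
function setReporter(reporter: ErrorReporter): void {
  activeReporter = reporter;
}

// Call sites use this function and never import the vendor SDK themselves.
function reportError(error: unknown, context?: Record<string, string>): void {
  activeReporter.captureException(error, context);
}
```

With this shape, dropping the provider means deleting one adapter and leaving `noopReporter` active; the call sites stay unchanged.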

What is still unclear?

How do we best integrate frontend and backend monitoring? Should we also use Sentry for the backend, or should we introduce an observability solution like Honeycomb or AWS X-Ray?

Related ADRs

PostHog