Monitoring & SRE Questions

Observability, SLIs/SLOs, and Incident Management.

1. Golden Signals

Question: What are the "Four Golden Signals" of monitoring?

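  • Latency: How long it takes to serve a request (track failed requests separately, since fast errors can hide slow successes).
  • Traffic: How much demand is placed on the system (e.g., HTTP requests per second).
  • Errors: The rate of requests that fail.
  • Saturation: How "full" the service is, measured on its most constrained resource (e.g., CPU, memory, or queue depth).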
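As a rough illustration, all four signals can be derived from a window of request records plus a capacity figure. This is a minimal sketch, not a production collector: the Request type, the golden_signals function, and the in_flight/capacity inputs are hypothetical names made up for this example, and "error" is assumed to mean an HTTP 5xx status.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests: List[Request], window_s: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    """Summarise one observation window into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,  # assumption: concurrency is the bottleneck
    }


# Made-up sample: 18 fast successes and 2 slow failures in a 10-second window.
sample = [Request(100, 200)] * 18 + [Request(900, 500)] * 2
signals = golden_signals(sample, window_s=10, in_flight=40, capacity=50)
print(signals)
```

A percentile is used for latency rather than the mean because a small number of slow requests can be invisible in an average.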

2. Push vs Pull Monitoring

Question: Compare Push vs Pull based monitoring systems.

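Pull (Prometheus): The server scrapes metrics from each target on a schedule. A failed scrape tells the server directly that a target is down or unreachable.
Push (Graphite/InfluxDB): Agents send metrics to the server. Better for short-lived jobs that may exit before they can be scraped, and for agents behind firewalls that the server cannot reach.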
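The difference is mostly about who initiates the connection. The following in-memory sketch models that and nothing else: PullServer and PushServer are hypothetical classes invented for illustration, not the API of Prometheus, Graphite, or InfluxDB. The "up" value is loosely modelled on the per-target up metric that Prometheus records for each scrape.

```python
from typing import Callable, Dict


class PullServer:
    """Server-initiated: the server calls every known target itself."""

    def __init__(self, targets: Dict[str, Callable[[], Dict[str, float]]]):
        self.targets = targets
        self.samples: Dict[str, Dict[str, float]] = {}

    def scrape_once(self) -> None:
        for name, fetch in self.targets.items():
            try:
                metrics = dict(fetch())
                metrics["up"] = 1.0
            except Exception:
                # A failed scrape is itself a signal: the target is down.
                metrics = {"up": 0.0}
            self.samples[name] = metrics


class PushServer:
    """Agent-initiated: the server only sees what agents choose to send."""

    def __init__(self) -> None:
        self.samples: Dict[str, Dict[str, float]] = {}

    def receive(self, source: str, metrics: Dict[str, float]) -> None:
        self.samples[source] = dict(metrics)


def healthy_target() -> Dict[str, float]:
    return {"requests_total": 42.0}


def dead_target() -> Dict[str, float]:
    raise ConnectionError("connection refused")


pull = PullServer({"api": healthy_target, "worker": dead_target})
pull.scrape_once()

push = PushServer()
# A short-lived batch job reports once and exits; no scrape is required.
push.receive("batch-job", {"rows_processed": 10.0})
```

In the pull model the dead worker shows up explicitly with up set to 0. In the push model a dead agent is simply silent, so detecting it requires a separate staleness check.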

3. SLI vs SLO vs SLA

Question: Define SLI, SLO, and SLA.

  • SLI (Service Level Indicator): A quantitative measure of some aspect of the level of service (e.g., request latency).
  • SLO (Service Level Objective): A target value or range of values for a service level that is measured by an SLI (e.g., 99.9% of requests should complete in under 200 ms).
  • SLA (Service Level Agreement): An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs it contains (e.g., a refund if downtime exceeds 1 hour).
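The relationship between an SLI and an SLO is easiest to see as arithmetic: the SLI is a measured ratio, the SLO is the target, and the gap between the target and 100% is the error budget. The sketch below assumes a request-based SLI (good requests divided by total requests) and a 30-day window; the function names are illustrative only.

```python
def sli_good_ratio(good: int, total: int) -> float:
    """SLI: fraction of requests that met the criterion (e.g. under 200 ms)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total downtime an availability SLO allows over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative means SLO missed)."""
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


# Made-up numbers: 1,000,000 requests, 500 of them too slow or failed.
sli = sli_good_ratio(good=999_500, total=1_000_000)
print(f"SLI: {sli:.4%}")
print(f"Budget for 99.9% over 30 days: {error_budget_minutes(0.999, 30):.1f} min")
print(f"Budget remaining: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

A 99.9% availability target over 30 days works out to about 43.2 minutes of allowed downtime, which is why each additional "nine" is so much more expensive than the last.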

4. Distributed Tracing

Question: What is Distributed Tracing and why is it needed?

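Definition: A method of following a single request as it propagates across service boundaries. Each service records a timed span, and all spans share a trace ID so they can be reassembled into one end-to-end view. It is used to profile and monitor applications, especially those built using a microservices architecture.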

Need: In microservices, a single user request might travel through dozens of services. Tracing (using tools like Jaeger or Zipkin) helps visualize the entire request lifecycle to identify bottlenecks and failures.
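The core data model is small: every span carries the trace ID of the request it belongs to and the ID of its parent span. The sketch below is a hand-rolled illustration of that model, not the Jaeger or Zipkin client API; Span, start_span, and slowest_child are hypothetical names, and the durations are made-up values rather than real measurements.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    service: str
    duration_ms: float


def start_span(service: str, duration_ms: float,
               parent: Optional[Span] = None) -> Span:
    """Create a span, inheriting the trace ID from the caller if there is one."""
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, service, duration_ms)


def slowest_child(spans: List[Span], parent: Span) -> Span:
    """Find the direct child that contributed most to the parent's latency."""
    children = [s for s in spans if s.parent_id == parent.span_id]
    return max(children, key=lambda s: s.duration_ms)


# One request fanning out through four services.
root = start_span("gateway", 120)
auth = start_span("auth", 15, parent=root)
orders = start_span("orders", 95, parent=root)
orders_db = start_span("orders-db", 80, parent=orders)
trace = [root, auth, orders, orders_db]

bottleneck = slowest_child(trace, root)
print(f"slowest call from {root.service}: {bottleneck.service}")
```

In a real system the trace ID and parent span ID travel between services in request headers, which is what lets a backend stitch spans from separate processes into one tree.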

5. Alert Fatigue

Question: What is Alert Fatigue and how do you prevent it?

Definition: The state in which engineers become desensitized to alerts because the alerts fire too frequently or are often false positives, so real incidents get ignored or answered late.

Prevention:

  • Make alerts actionable.
  • Tune thresholds to avoid noise.
  • Group related alerts.
  • Only page humans for urgent, user-impacting issues.
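Several of these practices can be expressed as simple rules. The sketch below shows three of them under stated assumptions: a threshold must be breached for several consecutive samples before firing, alerts are grouped by service, and only a group containing an urgent, user-impacting alert pages a human. The Alert fields and function names are hypothetical and not taken from any particular alerting tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    severity: str         # assumption: either "page" or "ticket"
    user_impacting: bool


def sustained(values: List[float], threshold: float, min_consecutive: int) -> bool:
    """Fire only if the last min_consecutive samples all exceed the threshold."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    recent = values[-min_consecutive:]
    return len(recent) == min_consecutive and all(v > threshold for v in recent)


def group_alerts(alerts: List[Alert]) -> Dict[str, List[Alert]]:
    """Collapse related alerts into one notification per service."""
    groups: Dict[str, List[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[alert.service].append(alert)
    return dict(groups)


def should_page(group: List[Alert]) -> bool:
    """Page a human only for urgent, user-impacting problems."""
    return any(a.severity == "page" and a.user_impacting for a in group)


alerts = [
    Alert("HighErrorRate", "checkout", "page", True),
    Alert("HighLatency", "checkout", "ticket", True),
    Alert("DiskFillingSlowly", "logs", "ticket", False),
]
groups = group_alerts(alerts)
for service, group in groups.items():
    action = "page on-call" if should_page(group) else "open ticket"
    print(f"{service}: {len(group)} alert(s), {action}")
```

Requiring a sustained breach filters out brief spikes that resolve on their own, at the cost of a slightly later notification for real incidents.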