Infrastructural Documentation

L1/L2 Alerts

Our approach depends on the monitoring tools in use:

  • Prometheus and Grafana: If you’re using this setup, we can deploy Prometheus agents on all Kubernetes nodes to collect metrics such as CPU and memory, and configure alerts in Grafana with Prometheus as the data source.

  • Managed Services: For managed resources such as Kafka and MemoryDB, we’ll route their CloudWatch metrics into Grafana for alerting.

  • Functional Alerts: For functional checks (e.g., failed runs), we’ll send metrics to Prometheus via a connector, enabling alerting in Grafana; a minimal sketch of such a connector follows this list.

  • Datadog/Splunk: If Datadog or Splunk is the monitoring tool, we can deploy the corresponding agents (e.g., the Datadog Agent) to monitor the entire infrastructure.
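
For the Functional Alerts item above, here is a minimal sketch of one possible connector: a script that pushes a failed-run count to a Prometheus Pushgateway so Grafana can alert on it. The metric name, check label, job name, and Pushgateway address are illustrative assumptions, not values taken from this document.

```python
# Minimal sketch of a functional-alert "connector": push the number of failed
# runs to a Prometheus Pushgateway so Prometheus can scrape it and Grafana can
# alert on it. Metric/job names and the Pushgateway address are placeholders.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
failed_runs = Gauge(
    "functional_check_failed_runs",   # hypothetical metric name
    "Failed runs detected by a functional check",
    ["check"],
    registry=registry,
)

# In a real connector this value would come from the system under check.
failed_runs.labels(check="nightly-batch").set(3)

# Push to the Pushgateway (hypothetical address); Prometheus scrapes it from there.
push_to_gateway(
    "pushgateway.monitoring.svc.cluster.local:9091",
    job="functional_checks",
    registry=registry,
)
```

A Grafana alert rule on `functional_check_failed_runs > 0` over the desired window would then raise the corresponding L1/L2 notification.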

Alerts

| Alert Name | Threshold | Severity |
| --- | --- | --- |
| CPU Utilization (Percent) | > 70% | L1 Warning |
| CPU Utilization (Percent) | > 90% | L2 Critical |
| FreeLocalStorage (Bytes) | < 10 GB | L1 Warning |
| FreeLocalStorage (Bytes) | < 5 GB | L2 Critical |
| Database Connections (Count) | < 50 | L1 Warning |
| Database Connections (Count) | < 25 | L2 Critical |
| Read Latency (Seconds) | > 3 sec | L1 Warning |
| Read Latency (Seconds) | > 5 sec | L2 Critical |
| Write Latency (Seconds) | > 3 sec | L1 Warning |
| Write Latency (Seconds) | > 5 sec | L2 Critical |
| Database Memory Usage (Percent) | > 85% | L1 Warning |
| Database Memory Usage (Percent) | > 90% | L2 Critical |
| Engine CPU Utilization (Percent) | > 85% | L1 Warning |
| Engine CPU Utilization (Percent) | > 90% | L2 Critical |
| Client connections over last hour | < 15 | L2 Critical |
| Authentication failures over last hour | > 3 | L2 Critical |
| Disk usage by broker | > 80% | L1 Warning |
| Disk usage by broker | > 90% | L2 Critical |
| CPU (User) usage by broker | > 80% | L1 Warning |
| CPU (User) usage by broker | > 90% | L2 Critical |
| Lag on topics | > 500 | L1 Warning |
| Lag on topics | > 1000 | L2 Critical |
| Kafka partition count per broker | > 1000 | L2 Critical |
| Kafka connection count per broker | <= 0 | L2 Critical |
| Container in waiting status | > 1 min | L2 Critical |
| Container restarts (over last 10 min) | > 5 | L2 Critical |
| Container terminated with error (over last 10 min) | 1 | L2 Critical |
| Pod high CPU usage | > 80% | L1 Warning |
| Pod high CPU usage | > 90% | L2 Critical |
| Pod high memory usage | > 80% | L1 Warning |
| Pod high memory usage | > 90% | L2 Critical |
| Kubernetes pod crash looping (over last 5 min) | 1 | L2 Critical |
| Node Not Ready (duration) | > 4 min | L1 Warning |
| Node Not Ready (duration) | > 5 min | L2 Critical |
| High node CPU utilization | > 80% | L1 Warning |
| High node CPU utilization | > 90% | L2 Critical |
| Kubernetes PVC available space | < 20% | L1 Warning |
| Kubernetes PVC available space | < 10% | L2 Critical |
| Kubernetes PVC Pending / Lost (over last 10 min) | 1 | L2 Critical |
| Full GC alerts on pods (over last 5 min) | > 0 | L2 Critical |
| Subnet free IP addresses remaining | 0 | L2 Critical |
| 401 status codes | > 50 hits within 5 minutes | L1 Warning |
| 500 status codes | > 10 hits within 1 minute | L1 Warning |
| Status codes above 500 | > 10 hits within 1 minute | L1 Warning |
| SQL routine load errors | > 10 hits within 5 minutes | L1 Warning |
| gRPC exceptions | > 5 hits within 5 minutes | L1 Warning |
| NullPointerException errors | > 5 hits within 5 minutes | L1 Warning |
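
To make the thresholds above concrete, the sketch below shows how two of the L1/L2 pairs (node CPU utilization at 70%/90%, and container restarts over the last 10 minutes) could be encoded as Prometheus alerting rules and written to a rules file. It assumes Python with PyYAML, node_exporter and kube-state-metrics metric names, and a 5-minute evaluation window; none of these specifics come from this document, and the actual rule names, windows, and label conventions may differ.

```python
# Hedged sketch: generate Prometheus alerting rules for two L1/L2 pairs from
# the table above. Assumes PyYAML, node_exporter, and kube-state-metrics.
import yaml

def node_cpu_rule(severity: str, threshold: int) -> dict:
    """Node CPU utilization above `threshold` percent, sustained for 5 minutes."""
    return {
        "alert": f"NodeCPUUtilization{severity}",
        # CPU busy % = 100 - idle %, averaged per instance over 5 minutes.
        "expr": (
            "100 - (avg by (instance) "
            '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) '
            f"> {threshold}"
        ),
        "for": "5m",
        "labels": {"severity": severity},
        "annotations": {
            "summary": f"CPU above {threshold}% on {{{{ $labels.instance }}}}"
        },
    }

rules = {
    "groups": [
        {
            "name": "l1-l2-alerts",
            "rules": [
                node_cpu_rule("L1Warning", 70),   # table: 70% -> L1 Warning
                node_cpu_rule("L2Critical", 90),  # table: 90% -> L2 Critical
                {
                    # Table: container restarts (over last 10 min) > 5 -> L2 Critical.
                    "alert": "ContainerRestartsL2Critical",
                    "expr": "increase(kube_pod_container_status_restarts_total[10m]) > 5",
                    "labels": {"severity": "L2Critical"},
                    "annotations": {
                        "summary": "Container {{ $labels.container }} restarted "
                                   "more than 5 times in 10 minutes"
                    },
                },
            ],
        }
    ]
}

# Write a Prometheus-style rules file for the server (or a config pipeline) to load.
with open("l1-l2-alerts.rules.yml", "w") as fh:
    yaml.safe_dump(rules, fh, sort_keys=False)
```

The same pattern extends to the remaining rows: each L1/L2 pair becomes two rules with the same expression and different thresholds and severity labels, which Grafana or Alertmanager can then route to the appropriate notification channel.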