
Common Prometheus Alerting Rules

Date: 2023-08-11 08:22:06


1. Detecting server restarts

- alert: CentosServiceRestart
  expr: time() - node_boot_time_seconds < 180
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Instance restarted"
    description: "Instance restarted, uptime < 3 min"

- alert: WindowsServiceRestart
  expr: time() - windows_system_system_up_time < 180
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Instance restarted"
    description: "Instance restarted, uptime < 3 min"
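
The rules in this article are written as bare YAML list items. To be loaded by Prometheus they need to sit under a `rules:` key inside a rule group. A minimal sketch of that wrapper, assuming a file called `node_alerts.yml` and a group called `host-alerts` (both names are illustrative and not from the original):

```yaml
# node_alerts.yml (hypothetical file name)
groups:
  - name: host-alerts          # hypothetical group name
    rules:
      # Each "- alert:" item from this article goes here,
      # indented one level under "rules:".
      - alert: CentosServiceRestart
        expr: time() - node_boot_time_seconds < 180
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Instance restarted"
          description: "Instance restarted, uptime < 3 min"
```

The remaining rules can be appended to the same `rules:` list, or split into several groups if they should be evaluated separately.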

2. High memory usage

- alert: InstanceMemUsageHigh
  expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 98
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Memory usage high"
    description: "Memory usage above 98%. (current usage: {{ $value }}%)"

- alert: WinInstanceMemUsageHigh
  expr: 100 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100 > 98
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "Instance memory usage high"
    description: "Instance memory usage above 98%. (current usage: {{ $value }}%)"

3. High CPU usage

- alert: CPUUsageHigh
  expr: 100 - (avg by (instance,region) (irate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "CPU usage high"
    description: "CPU usage above 90%. (current usage: {{ $value }}%)"

- alert: WinCpuUsage
  expr: 100 - (avg by (instance,region) (irate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 90
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Instance CPU usage high"
    description: "Instance CPU usage is more than 90%. (current usage: {{ $value }}%)"
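
`irate` looks only at the last two samples in the range, so it reacts quickly but can be spiky. If that proves too noisy, one possible variant is to average over a longer window with `rate`. The sketch below is an assumption about what such a variant could look like, not part of the original rule set; the alert name and the 5m window are illustrative:

```yaml
- alert: CPUUsageHighSmoothed        # hypothetical name
  # rate() averages over the whole 5m window instead of the last two samples
  expr: 100 - (avg by (instance,region) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "CPU usage high"
    description: "CPU usage above 90% over 5 minutes. (current usage: {{ $value }}%)"
```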

4. High disk usage

- alert: DiskUsageHigh
  expr: 100 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100 > 95
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Disk usage high"
    description: "Disk {{ $labels.mountpoint }} usage above 95%. (current usage: {{ $value }}%)"

- alert: WinDiskUsageHigh
  expr: 100 - (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) * 100 > 95
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Instance disk usage high"
    description: "Instance disk {{ $labels.volume }} usage above 95%. (current usage: {{ $value }}%)"

5. Network throughput

- alert: HostUnusualNetworkThroughputIn
  expr: sum by (instance,device,region) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 30
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual network throughput in"
    description: "Host network interfaces are receiving too much data (> 30 MB/s). (current speed: {{ $value }} MB/s)"

- alert: WinHostUnusualNetworkThroughputIn
  expr: sum by (instance,nic,region) (irate(windows_net_bytes_received_total{nic=~".*VirtIO.*"}[2m])) / 1024 / 1024 > 30
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual network throughput in"
    description: "Host network interfaces are probably receiving too much data (> 30 MB/s). (current speed: {{ $value }} MB/s)"

- alert: HostUnusualNetworkThroughputOut
  expr: sum by (instance,device,region) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 30
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual network throughput out"
    description: "Host network interfaces are sending too much data (> 30 MB/s). (current speed: {{ $value }} MB/s)"
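
The original list covers inbound traffic on both Linux and Windows but outbound traffic only on Linux. A Windows outbound rule could be sketched by mirroring the inbound one. The alert name below is made up for illustration, and the metric `windows_net_bytes_sent_total` is assumed to be exposed by the windows_exporter `net` collector in your version, so check it against your own `/metrics` output first:

```yaml
- alert: WinHostUnusualNetworkThroughputOut   # hypothetical name
  # Assumes windows_net_bytes_sent_total is available; the nic filter is copied
  # from the inbound rule and may need adjusting for your adapters.
  expr: sum by (instance,nic,region) (irate(windows_net_bytes_sent_total{nic=~".*VirtIO.*"}[2m])) / 1024 / 1024 > 30
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual network throughput out"
    description: "Host network interfaces are probably sending too much data (> 30 MB/s). (current speed: {{ $value }} MB/s)"
```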

6. TCP connections

- alert: TCPEstablishedNum
  expr: node_netstat_Tcp_CurrEstab > 2000
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Too many established TCP connections"
    description: "Established TCP connection count exceeds 2000. (current count: {{ $value }})"

7. Network transmit errors

- alert: HostNetworkTransmitErrors
  expr: increase(node_network_transmit_errs_total[5m]) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host Network Transmit Errors"
    #description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%v" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    description: "Interface {{ $labels.device }} has transmit errors in the last five minutes. (current error count: {{ $value }})"
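
The same pattern can be applied to the receive side. The rule below is a sketch that is not in the original list. It reuses the threshold and durations of the transmit rule as-is, and the alert name is only a suggestion:

```yaml
- alert: HostNetworkReceiveErrors             # hypothetical name
  # node_network_receive_errs_total is the node_exporter counterpart
  # of the transmit error counter used above.
  expr: increase(node_network_receive_errs_total[5m]) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host Network Receive Errors"
    description: "Interface {{ $labels.device }} has receive errors in the last five minutes. (current error count: {{ $value }})"
```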

8. Disk read/write latency

- alert: HostUnusualDiskReadLatency
  expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) * 1000 > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual disk read latency"
    description: "Disk read latency is growing (read operations > 100ms). (current latency: {{ $value }}ms)"

- alert: HostUnusualDiskWriteLatency
  expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) * 1000 > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual disk write latency"
    description: "Disk write latency is growing (write operations > 100ms). (current latency: {{ $value }}ms)"

9. High disk I/O

- alert: DiskIOTimePerSec
  expr: irate(node_disk_io_time_seconds_total[1m]) * 100 > 60
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Host disk I/O time high"
    description: "Disk {{ $labels.device }} I/O time is above 60%. (current rate: {{ $value }}%)"
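
Once the rules are collected in a rule file, Prometheus has to be told to load it. A minimal sketch of the relevant part of `prometheus.yml`, assuming the rule files live under a `rules/` directory next to the config (the path and the interval value are assumptions, not from the original):

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m      # how often alert expressions are evaluated (assumed value)

rule_files:
  - "rules/*.yml"              # hypothetical location of the alert rule files
```

Before reloading Prometheus, the file can be validated with `promtool check rules <file>`, which reports YAML and PromQL syntax errors.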
