1. Introduction
Round-the-clock (24/7/365) node monitoring and incident response are key to keeping your node healthy, which in turn makes the Sui network more robust.
However, for small companies and community members, it is hard to build an operations center.
This article describes how to build a 24/7/365 monitoring system by integrating three popular tools: Prometheus, Grafana, and PagerDuty.
2. Procedure
2-1. Install Prometheus
2-1-1. Install Prometheus
Please refer to the official guide.
2-1-2. Edit prometheus YAML
Add the Sui metrics target to prometheus.yml.
Change the port in targets to match the metrics-address port set in fullnode.yaml or validator.yaml.
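As a rough illustration, a scrape job for a Sui node might look like the fragment below. The job name sui and the scratch file name sui-scrape.yml are assumptions made for this sketch; port 9184 matches the curl example later in this article and should be replaced with your own metrics-address port. The commands write the fragment to a scratch file so that you can review it before merging the job into the scrape_configs section of prometheus.yml.

```shell
# Write a hypothetical scrape job to a scratch file for review.
# "sui" is an assumed job name; 9184 must match your metrics-address port.
cat > sui-scrape.yml <<'EOF'
scrape_configs:
  - job_name: "sui"
    static_configs:
      - targets: ["127.0.0.1:9184"]
EOF

# Show the fragment before copying it into prometheus.yml.
cat sui-scrape.yml
```

After merging the job, reload or restart Prometheus and check the targets page in the Prometheus web UI to confirm that the node is being scraped.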
2-2. Install Grafana
2-2-1. Install Grafana
Please refer to the official guide.
2-2-2. Create Dashboard
Set up your own dashboard.
You can explore the available metrics with the following command.
curl -s 127.0.0.1:9184/metrics
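The raw metrics output is long, so a small filter can help when deciding what to put on a dashboard. The sketch below reduces Prometheus text-format output to a sorted list of unique metric names. metric_names is a hypothetical helper name, and the port is the one used in this article.

```shell
# List unique metric names from Prometheus text-format output on stdin.
# Comment lines (# HELP, # TYPE) are dropped, then labels and values are cut off.
metric_names() {
  grep -v '^#' | sed -E 's/[{ ].*$//' | grep -v '^$' | sort -u
}

# Usage against a live node (uncomment to run):
# curl -s 127.0.0.1:9184/metrics | metric_names
```

Piping the result through grep, for example grep network, narrows the list to one area of interest.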
Reference: My Fullnode Dashboard
Download the template here.
Click “Dashboards > + Import”
Click “Upload JSON file” and select the json file you downloaded.
Select your data source and click “Import”
Now you can monitor node metrics in real time on the Grafana dashboard.
However, this dashboard is not perfect.
I’ll open a topic on the key metrics you should monitor.
Let’s discuss and create a better one!
2-3. Generate PagerDuty Integration Key
In PagerDuty, click “Services”, then “+ New Service”
Fill out the service “Name” and “Description”, then click “Next”
Select “Generate a new Escalation Policy” and the appropriate Teams, then click “Next”
Copy the Integration Key
This key is used to integrate PagerDuty with Grafana.
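Before wiring the key into Grafana, it can be useful to confirm that the key actually pages you. The sketch below assumes the service uses an Events API v2 integration, in which case the Integration Key serves as the routing_key. pd_test_event is a hypothetical helper name, and the summary, source, and severity values are placeholders.

```shell
# Build a minimal Events API v2 "trigger" body.
# $1 is the Integration Key; the other field values are placeholders.
pd_test_event() {
  printf '{"routing_key":"%s","event_action":"trigger","payload":{"summary":"Test alert from Sui node monitoring","source":"sui-node","severity":"critical"}}\n' "$1"
}

# Send the event (uncomment and set PD_KEY to your Integration Key):
# pd_test_event "$PD_KEY" | curl -s -X POST -H 'Content-Type: application/json' -d @- https://events.pagerduty.com/v2/enqueue
```

If the key is valid, a test incident should appear on the service and notify whoever is on call, so resolve it afterwards.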
2-4. Integrate Grafana and PagerDuty
2-4-1. Set Grafana Contact Points
Click “Alerting > Contact Points”
Click “+ New contact point”
Set “Name”
Select “PagerDuty” for “Contact point type”
Paste the Integration Key you copied from PagerDuty
Click “Save contact point”
2-4-2. Create Alert rule
Select a vital metric you want to monitor and click “Edit”
(Ex)
sui_network_peers is vital: a validator needs peers to join consensus, and a fullnode needs them to download the latest blocks and relay transactions.
Click “Alert” tab
Click “Create alert rule from this panel”
Set the rule condition that should trigger an incident, then save it
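To sanity-check a rule's logic outside Grafana, the same condition can be evaluated by hand against the metrics endpoint. The sketch below assumes sui_network_peers is exposed without labels and uses a threshold of 1. peer_count and should_alert are hypothetical helper names, and treating a missing metric as an alert condition is a choice made for this sketch, not necessarily how your Grafana rule handles missing data.

```shell
# Print the current value of sui_network_peers from metrics text on stdin.
peer_count() {
  awk '$1 == "sui_network_peers" { print $2; exit }'
}

# Succeed (exit 0) when an alert should fire: peers below 1 or metric missing.
should_alert() {
  count=$(peer_count)
  [ -z "$count" ] || awk -v c="$count" 'BEGIN { exit !(c < 1) }'
}

# Usage against a live node (uncomment to run):
# curl -s 127.0.0.1:9184/metrics | should_alert && echo "would alert"
```

In Grafana, the equivalent is a rule that queries sui_network_peers and fires when the value is below 1.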
3. Summary
You can build an active monitoring system using Prometheus, Grafana, and PagerDuty.
If you set the same rule as above, an alert is issued when sui_network_peers drops below 1.
In short, when your node is not connected to the Sui network at all, your phone rings!
I hope you find this article useful!
Feel free to ask any questions.