Monitoring and Alerts integration for your node

1. Introduction

24/7/365 node monitoring and incident response are key to keeping your node healthy, which in turn makes the Sui network more robust.
However, for small companies and community members, it is hard to build an operations center.

This article describes how to build a 24/7/365 monitoring system by integrating three popular tools: Prometheus, Grafana, and PagerDuty.


2. Procedure

2-1. Install Prometheus

2-1-1. Install Prometheus
Please refer to the official guide.

2-1-2. Edit prometheus.yml
Add the Sui metrics target to prometheus.yml.
Set the target port to match the metrics-address port in fullnode.yaml or validator.yaml.
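
For illustration, a minimal sketch of such a scrape entry. It assumes the node exposes metrics on 127.0.0.1:9184 (the address used in the curl command in 2-2-2); the job name "sui_node" is a hypothetical label, not something the setup requires.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "sui_node"               # hypothetical name, choose your own
    static_configs:
      # Assumed address: change the port to match metrics-address
      # in fullnode.yaml or validator.yaml.
      - targets: ["127.0.0.1:9184"]
```

Prometheus scrapes the /metrics path by default, so no extra path setting should be needed for this case.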

2-2. Install Grafana

2-2-1. Install Grafana
Please refer to the official guide.

2-2-2. Create Dashboard
Please set up your own dashboard.
You can explore the available metrics with the following command.
curl -s 127.0.0.1:9184/metrics
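
The raw output is long, so it can help to list only the metric names before choosing dashboard panels. Below is a rough Python sketch of that idea, not an official tool. The function names are hypothetical, and the URL assumes the same 127.0.0.1:9184 address as the command above.

```python
import urllib.request


def metric_names(exposition_text, prefix=""):
    """Return the sorted, distinct metric names found in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="x"} 1.0   or   name 1.0
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def fetch_metrics(url="http://127.0.0.1:9184/metrics", timeout=5):
    """Download the raw metrics text from the node (assumed address)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

On the node host, print(metric_names(fetch_metrics(), prefix="sui_")) would list only the metrics whose names start with "sui_".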

Reference: My Fullnode Dashboard
Download the template here.
Click “Dashboards > + Import”

Click “Upload JSON file” and select the json file you downloaded.

Select your data source and Click “Import”

Now you can monitor real-time node metrics on the Grafana dashboard.
However, this dashboard is not perfect.
I’ll open a topic about the key metrics you should monitor.
Let’s discuss and create a better one! :+1:



2-3. Generate PagerDuty Integration Key

In PagerDuty, click “Services” and then “+ New Service”

Fill out the service “Name” and “Description”, and then click “Next”

Select “Generate a new Escalation Policy” and the appropriate Teams, and then click “Next”

Copy the Integration Key.
This key is used to integrate PagerDuty with Grafana.
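
If you want to check the key before wiring up Grafana, one option is to send a test event yourself. The sketch below assumes the service's integration is an Events API v2 type, which this guide does not confirm. The function names, summary, and source values are hypothetical.

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(integration_key, summary, source, severity="critical"):
    """Build an Events API v2 "trigger" event as a plain dict."""
    return {
        "routing_key": integration_key,  # the Integration Key copied above
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # short description shown on the incident
            "source": source,      # e.g. the hostname of your node
            "severity": severity,  # critical, error, warning, or info
        },
    }


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling send_event(build_trigger_event("<your key>", "test alert", "sui-node-1")) opens a real incident, so resolve it in PagerDuty afterwards.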

2-4. Integrate Grafana and PagerDuty

2-4-1. Set Grafana Contact Points

Click “Alerting > Contact Points”

Click “+ New contact point”

Set “Name”
Select “PagerDuty” at “Contact point type”
Paste the Integration Key you copied from PagerDuty
Click “Save contact point”

2-4-2. Create Alert rule

Select the panel for a vital metric you should monitor and click “Edit”

(Ex)
sui_network_peers is vital: a validator needs peers to join consensus, and a fullnode needs them to download the latest blocks and relay transactions.

Click “Alert” tab

Click “Create alert rule from this panel”

Set the rule that triggers an incident, and save


3. Summary

You can build an active monitoring system using Prometheus, Grafana, and PagerDuty.
If you set the same rule as above, an alert is issued when sui_network_peers drops below 1.
In short, when your node is not connected to the Sui network at all, your phone rings!!! :telephone_receiver: :phone:
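
To make the condition concrete, here is a small Python sketch of the same check (sui_network_peers below 1) run against the raw metrics text. It only illustrates the logic and is not how Grafana evaluates rules. The function names are hypothetical, and treating a missing metric as an alert is a choice made for this sketch.

```python
def peer_count(exposition_text, metric="sui_network_peers"):
    """Return the metric's first sample value, or None if it is absent."""
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name != metric:
            continue
        rest = line[len(name):]
        if rest.startswith("{"):
            # Drop the label set; label values containing "}" are not handled.
            rest = rest[rest.index("}") + 1:]
        # The first field is the value; an optional timestamp may follow.
        return float(rest.split()[0])
    return None


def should_alert(exposition_text, threshold=1):
    """True when the peer count is below the threshold or the metric is missing."""
    value = peer_count(exposition_text)
    # Assumption of this sketch: no data is treated like zero peers.
    return value is None or value < threshold
```

For example, should_alert("sui_network_peers 0\n") is True, while a node reporting 3 peers does not alert.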

I hope this article is useful for you! :muscle:
Feel free to ask any questions.


Do you have a guide for newbies? :sunglasses:


Hello,

First things first, awesome topic. I appreciate the effort you put in creating it.

Second, I’ve modified your account to have a higher trust level. I believe if you attempt to edit your OP you should be able to include more pictures now if you’d like.

Finally, I moved this thread to the ‘Sui Network’ section. It’s a bit less spammy over here and there’s a better shot that the people who will benefit from this topic will see it here.

Thanks for your post!


Thank you for your consideration and assistance in this matter. :blush:
I’ll add guide pics.