Monitoring
Time estimate: 20 minutes
The main purpose of this hands-on session is to familiarize you with the basics of monitoring SCION services.
At the end of this hands-on session, you should be able to monitor the services running in a SCION network using Grafana, building on top of Prometheus, and explore the logs of different SCION services.
Task 1. Dashboards
For monitoring, we use Grafana, an observability platform for monitoring and analyzing metrics, logs, and traces. In particular, it lets us create dashboards to query and visualize our monitoring data. In our setting, Grafana uses Prometheus under the hood, which collects and stores metrics as time series data. You can access Grafana by navigating to Grafana.
By default, very little is displayed on the landing page. However, some monitoring dashboards have already been created for you. To see them, click on the menu button in the top-left corner of the page. This opens a side panel. Then click on Dashboards and select autoprovision in the displayed list. You can now see the list of dashboards that are included to monitor the SCION services.
More information about Grafana and Prometheus can be found on the Grafana project page and the Prometheus page, respectively.
To get started with investigating the dashboards, select the EDGE Overview dashboard. This will load a view containing runtime information about different services running on the EDGEs.
At the top, you can see several filtering options such as PROJECT, HOST, ISD-AS and so forth. They filter the monitoring data shown in the diagrams below. Please go ahead and familiarize yourself with the tool by experimenting with different filters. For example, try to discover how to:
- display only certain time series in a panel;
- filter the time series to only display ISD-AS 1-ff00:1:1;
- focus on a certain window of time;
- view additional information about a certain panel.
Now, let us look more closely at the data visualized in the diagrams. Have a look at the first diagram on the top left, whose title should be Interface State. In its top-left corner, there is a small info field. When you hover your cursor over it, a description of the diagram is displayed. In this case, it explains that 1 (resp. 0) indicates that an interface is up (resp. down). If your SCION training set-up is working correctly, all the interfaces should be up.
If you click on the title of this diagram, i.e., Interface State, you can choose the option Edit from the list. In the Metrics entry, you can see the query used for this diagram, written in PromQL (Prometheus Query Language). This is only meant to show you an example of a query used to generate a diagram; for this task you do not need to work with PromQL queries directly.
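For readers curious about what such a query looks like outside the Grafana UI, the following is a minimal sketch in Python. It builds a PromQL query restricted to one ISD-AS, wraps it in a URL for the Prometheus instant-query endpoint (`/api/v1/query`), and interprets a response in which 1 means up and 0 means down. The base URL, the metric name `router_interface_up`, and the label names `isd_as` and `interface` are assumptions made for illustration; the actual query behind the Interface State diagram is the one shown in the Metrics entry. The sample response is hand-written and only mirrors the general shape of a Prometheus instant-query result.

```python
import json
from urllib.parse import urlencode

# Assumptions for illustration only: the Prometheus address, the metric name
# and the label names are placeholders, not taken from the training set-up.
PROMETHEUS_URL = "http://localhost:9090"
METRIC = "router_interface_up"


def build_query_url(isd_as: str) -> str:
    """Return an instant-query URL for the interface state of one ISD-AS."""
    promql = f'{METRIC}{{isd_as="{isd_as}"}}'
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


def interfaces_down(response_body: str) -> list[str]:
    """List the 'interface' label of every series whose latest value is 0."""
    payload = json.loads(response_body)
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # Prometheus sends values as strings
        if float(value) == 0:
            down.append(series["metric"].get("interface", "unknown"))
    return down


# Hand-written sample in the shape of an instant-query response (not real data).
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"interface": "1", "isd_as": "1-ff00:1:1"},
             "value": [1700000000, "1"]},
            {"metric": {"interface": "2", "isd_as": "1-ff00:1:1"},
             "value": [1700000000, "0"]},
        ],
    },
})

print(build_query_url("1-ff00:1:1"))
print(interfaces_down(SAMPLE))
```

In a healthy set-up the list of down interfaces would be empty, which corresponds to every series in the diagram showing the value 1.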
How many paths are available?
Solution
There should be four paths available.
Go back to Home and again select autoprovision in the displayed panel, but this time choose the Control Plane PKI dashboard.
How long are the certificates valid at the moment?
Solution
It should be a value between 2 and 3 days.
Feel free to explore some of the other dashboards that are already created, such as the IP-in-SCION Tunneling and Router dashboards.
Task 2. Logging
In this task, you will learn how to explore the logs of the services that we are running. As we learned in the previous task, we can discover whether a service is misbehaving through our monitoring endpoints. (Of course, in practice one is informed about such misbehavior by an alerting system.) To understand what is causing an error, it is useful to look at the logs generated by the misbehaving service and by the services that interact with it.
To learn how to investigate the logs produced by our services, open Grafana and click on the Explore button (the fourth option in the left bar). Then, at the top, select Loki from the dropdown. Loki is a log aggregation system that we use to collect the logs. You can learn more about Loki on the Loki page.
In the panel below, click Select Label, select unit as the label and appliance-controller as the value. Then click the Run query button. This shows you all the logs of the appliance controller from the last hour.
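The label and value chosen in the panel correspond to a LogQL stream selector, the query language Loki uses. As a minimal sketch, the Python helper below (a hypothetical function, not part of Grafana or Loki) assembles such a selector from label/value pairs and optionally appends a `|=` line filter that keeps only lines containing a given substring.

```python
from typing import Optional


def _quote(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def logql_query(labels: dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL stream selector, optionally followed by a line filter."""
    matchers = [f"{name}={_quote(value)}" for name, value in sorted(labels.items())]
    query = "{" + ", ".join(matchers) + "}"
    if contains is not None:
        # '|=' keeps only log lines that contain the given substring.
        query += f" |= {_quote(contains)}"
    return query


print(logql_query({"unit": "appliance-controller"}))
print(logql_query({"unit": "appliance-controller"}, contains="error"))
```

The first call yields the selector for the appliance controller used above; the second narrows the same stream to lines mentioning "error", which is one way to cut down a noisy log while debugging.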
As with the dashboards, there are some options at the top that allow you to filter the logs based on various criteria. For example, you might try to:
- Remove the time tag from the logs;
- Reverse the order in which the logs appear;
- Display the logs from the last two hours instead of the default of one hour.
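The time range and ordering options above also exist as parameters of Loki's HTTP API (`/loki/api/v1/query_range`), which takes `start` and `end` as Unix timestamps in nanoseconds and a `direction` of `forward` or `backward`. The sketch below, in Python, computes these parameters for a two-hour window. The Loki address and the function names are assumptions for illustration; in the training set-up you only need the Grafana UI.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumption for illustration only: the address of the Loki instance.
LOKI_URL = "http://localhost:3100"


def query_range_params(query: str, end: datetime, window: timedelta,
                       oldest_first: bool = False, limit: int = 100) -> dict:
    """Return query_range parameters covering `window` up to `end`."""
    start = end - window
    return {
        "query": query,
        # Loki accepts nanosecond Unix timestamps; whole seconds suffice here.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
        # 'backward' returns the newest entries first, 'forward' the oldest.
        "direction": "forward" if oldest_first else "backward",
    }


def query_range_url(params: dict) -> str:
    """Encode the parameters into a query_range URL."""
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


now = datetime.now(timezone.utc)
params = query_range_params('{unit="appliance-controller"}', now,
                            timedelta(hours=2), oldest_first=True)
print(query_range_url(params))
```

Switching `oldest_first` corresponds to reversing the order of the displayed logs, and changing `window` corresponds to picking a different time range in the Explore view.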
Now, go to Label filters again and select some of the other jobs, for example control or gateway, and then click on the Run query button to see the relevant logs. It is worth taking some time to skim through some of the messages to familiarize yourself with the kind of logs these services generate.