Monitoring your Anapaya SCION environment
Telemetry
Each Anapaya appliance can be configured to expose a telemetry endpoint that can be used to retrieve telemetry data from the appliance. To enable telemetry, refer to Set up a monitoring host.
Please note that the metrics on the telemetry endpoint are available on the /metrics path. If the management API is configured, the metrics are by default available on the management API endpoint, on the /metrics path (without the /api/v1 prefix).
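As a minimal sketch of the path rule above, the following helper derives the metrics URL from a management API URL by dropping the /api/v1 prefix and appending /metrics. The host name and port are hypothetical placeholders, and the helper name `metrics_url` is introduced only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

API_PREFIX = "/api/v1"  # prefix named in the text above


def metrics_url(management_api_url: str) -> str:
    """Return the /metrics URL that sits next to the management API."""
    parts = urlsplit(management_api_url)
    path = parts.path.rstrip("/")
    # Metrics are served without the /api/v1 prefix, so drop it if present.
    if path.endswith(API_PREFIX):
        path = path[: -len(API_PREFIX)]
    return urlunsplit((parts.scheme, parts.netloc, path + "/metrics", "", ""))


# Hypothetical appliance address, used only for illustration.
print(metrics_url("https://appliance.example.net:8443/api/v1"))
# prints https://appliance.example.net:8443/metrics
```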
The telemetry data is exported in the form of Prometheus metrics. Prometheus is an open-source systems monitoring and alerting tool. It collects and stores metrics as time series data alongside optional key-value pairs called labels. A metric is a numeric measurement of a specific event or condition, e.g., the number of packets sent on a specific interface. Recording metrics as time series then provides higher-level insights, such as the rate of change of the sent-packet counter, from which the throughput of the interface can be calculated. Labels add dimensions to a metric; e.g., the name of the interface for which the packet count is collected is attached as a label.
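To illustrate how a rate is derived from a counter, the sketch below computes the per-second increase between two scrapes and converts a byte counter into throughput. The sample values and the 15-second scrape interval are assumptions; in practice a Prometheus server performs this calculation itself (for example with its `rate()` function).

```python
def counter_rate(prev_value: float, curr_value: float, interval_s: float) -> float:
    """Per-second increase of a counter between two scrapes."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_value - prev_value
    # A counter only ever goes up; a drop means the process restarted and the
    # counter started again from zero, so the new value is the whole increase.
    if delta < 0:
        delta = curr_value
    return delta / interval_s


# Two hypothetical scrapes of router_output_bytes_total, 15 seconds apart.
bytes_per_s = counter_rate(1_200_000, 1_950_000, 15)
print(f"{bytes_per_s * 8 / 1e6:.2f} Mbit/s")  # prints 0.40 Mbit/s
```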
Each Anapaya appliance internally has several modules that expose some of their internal states as metrics. Each module manages a particular part of the system, such as the SCION control plane, the SCION data plane, or the IP-in-SCION tunneling service. For each module, we list the exposed metrics, their names, the type of the metric, a brief description, and the attached labels. Please refer to the individual sections below for more information.
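For ad-hoc inspection, the name, labels, and value described above can be read directly from the Prometheus text format. The following is a simplified sketch, not a complete parser: it skips comment lines, ignores optional timestamps, and does not unescape label values. The label values in the sample text are invented for illustration.

```python
import re

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


# Hypothetical scrape output; the label values are made up.
SAMPLE = """\
# HELP router_interface_up 1 indicates the interface is up, 0 otherwise.
# TYPE router_interface_up gauge
router_interface_up{interface="1",isd_as="1-ff00:0:110",link_to="core",neighbor_isd_as="1-ff00:0:111"} 1
trustengine_latest_trc_serial_number 3
"""

for name, labels, value in parse_samples(SAMPLE):
    print(name, labels, value)
```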
To access these metrics, a Prometheus server is needed that ingests the metrics from each appliance. How to set up a Prometheus server to collect appliance metrics is outside the scope of this document. Please refer to Set up a monitoring host. Should you require assistance with integrating appliance metrics in your monitoring setup, please contact Anapaya’s support team at support@anapaya.net.
Control plane metrics
Metric | Description | Labels | Type |
---|---|---|---|
control_beaconing_originated_beacons_total | Total number of beacons originated. | egress_interface , result | counter |
control_beaconing_propagated_beacons_total | Total number of beacons propagated. | start_isd_as , ingress_interface , egress_interface , result | counter |
control_beaconing_received_beacons_total | Total number of beacons received. | ingress_interface , neighbor_isd_as , result | counter |
control_beaconing_registered_segments_total | Total number of segments registered. | start_isd_as , ingress_interface , seg_type , result | counter |
control_segment_expiration_deficient | Indicates whether the expiration time of the segment is below the configured maximum. This happens when the signer expiration time is lower than the maximum segment expiration time. | None | gauge |
control_segment_lookup_requests_total | Total number of path segment requests received. | dst_isd , seg_type , result | counter |
control_segment_registry_segments_received_total | Total number of path segments received through registrations. | src , seg_type , result | counter |
renewal_ca_health_status | Exposes the status of the CA (available, unavailable, starting, stopping), if the host acts as CA and is delegating certificate renewal to the CA service. | status | gauge |
renewal_handled_requests_total | Total number of renewal requests served by each handler type (legacy, in-process, delegating). | result , type | counter |
renewal_received_requests_total | Total number of renewal requests served. | result | counter |
renewal_registered_handlers | Exposes which handler type (legacy, in-process, delegating) is registered. | type | gauge |
trustengine_latest_trc_not_after_time_seconds | The not_after time of the latest TRC for the local ISD in seconds since UNIX epoch. | None | gauge |
trustengine_latest_trc_not_before_time_seconds | The not_before time of the latest TRC for the local ISD in seconds since UNIX epoch. | None | gauge |
trustengine_latest_trc_serial_number | The serial number of the latest TRC for the local ISD. | None | gauge |
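
Because the two TRC time metrics are expressed in seconds since UNIX epoch, the remaining validity can be computed by simple subtraction. The sketch below shows one way to do this; the 14-day warning threshold and the sample timestamps are assumptions chosen for illustration.

```python
import time

SECONDS_PER_DAY = 86_400
WARN_DAYS = 14  # hypothetical alerting threshold


def trc_days_remaining(not_after_s, now_s=None):
    """Days until the not_after time of the latest TRC is reached."""
    if now_s is None:
        now_s = time.time()
    return (not_after_s - now_s) / SECONDS_PER_DAY


def trc_is_valid(not_before_s, not_after_s, now_s):
    """True if now_s lies inside the TRC validity window."""
    return not_before_s <= now_s <= not_after_s


# Hypothetical values of trustengine_latest_trc_not_before_time_seconds and
# trustengine_latest_trc_not_after_time_seconds at a fixed point in time.
now = 1_700_000_000
not_before = now - 10 * SECONDS_PER_DAY
not_after = now + 30 * SECONDS_PER_DAY

days = trc_days_remaining(not_after, now)
print(f"valid: {trc_is_valid(not_before, not_after, now)}, days left: {days:.1f}")
if days < WARN_DAYS:
    print("warning: TRC expires soon")
```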
Data plane metrics
Metric | Description | Labels | Type |
---|---|---|---|
dataplane_control_dataplane_sync_error | Indicates whether the last dataplane sync had an error (1) or not (0). | None | gauge |
dataplane_control_vrrp_state | Current state of VRRP nodes. 0=down, 1=init, 2=backup, 3=master. | interface , vr_id , protocol , vip | gauge |
router_dropped_pkts_total | Total number of packets dropped. | interface , isd_as , neighbor_isd_as , type | counter |
router_input_bytes_total | Total number of bytes received. | interface , isd_as , neighbor_isd_as | counter |
router_input_pkts_total | Total number of packets received. | interface , isd_as , neighbor_isd_as | counter |
router_interface_up | 1 indicates the interface is up, 0 otherwise. | interface , isd_as , link_to , neighbor_isd_as | gauge |
router_output_bytes_total | Total number of bytes sent. | interface , isd_as , neighbor_isd_as | counter |
router_output_pkts_total | Total number of packets sent. | interface , isd_as , neighbor_isd_as | counter |
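
The numeric values of dataplane_control_vrrp_state listed above can be mapped back to state names when building dashboards or checks. The sketch below does that and, as an additional illustration, counts how many samples report the master state per virtual router. The sample label values are hypothetical, and the expectation of one master per virtual router is general VRRP behavior rather than something stated in this document.

```python
from collections import Counter
from enum import IntEnum


class VrrpState(IntEnum):
    """Values of dataplane_control_vrrp_state as listed in the table."""
    DOWN = 0
    INIT = 1
    BACKUP = 2
    MASTER = 3


def describe_vrrp(value) -> str:
    """Translate a sample value into a state name."""
    try:
        return VrrpState(int(value)).name.lower()
    except ValueError:
        return f"unknown({value})"


def masters_per_router(samples):
    """Count samples in master state per (vr_id, vip) pair."""
    counts = Counter()
    for labels, value in samples:
        key = (labels["vr_id"], labels["vip"])
        counts[key] += 1 if int(value) == VrrpState.MASTER else 0
    return dict(counts)


# Hypothetical samples collected from two appliances.
samples = [
    ({"interface": "eth0", "vr_id": "10", "protocol": "ipv4", "vip": "192.0.2.1"}, 3),
    ({"interface": "eth0", "vr_id": "10", "protocol": "ipv4", "vip": "192.0.2.1"}, 2),
]
print([describe_vrrp(value) for _, value in samples])
print(masters_per_router(samples))
```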
IP-in-SCION tunneling metrics
Metric | Description | Labels | Type |
---|---|---|---|
gateway_as_certificate_expiration_time_second | The expiration time of the AS certificate. | isd_as | gauge |
gateway_domain_paths_total | The metric indicates the number of paths available for a domain and traffic matcher. The status indicates more details about the paths: 'total' indicates the total number of paths available to the domain, 'eligible' indicates the number of paths that were accepted by the path policies, 'monitored' indicates the number of paths out of the eligible that are being actively monitored, 'alive' indicates the number of paths out of the monitored that were recently seen alive. | domain , traffic_matcher , status | counter |
gateway_domain_traffic_matcher_sessions_total | The number of live sessions per traffic matcher in a domain. | domain , traffic_matcher | gauge |
gateway_domain_traffic_redirections_total | The metric is incremented each time some subset of traffic is potentially sent either via a different SCION path or to a different remote gateway instance. The reason indicates the cause of the redirection. | domain , traffic_matcher , reason | counter |
gateway_flow_exporter_export_errors | Number of errors encountered during flow exporting. | reason | counter |
gateway_flow_exporter_last_export_time | The timestamp of the last successful export of flow metrics. Measured in seconds since UNIX epoch. | None | gauge |
gateway_flow_exporter_last_import_time | The timestamp of the last time flow metrics were received from the dataplane. Measured in seconds since UNIX epoch. | None | gauge |
gateway_flow_exporter_records_exported | Number of IPFIX records successfully sent to the flow collector. | None | counter |
gateway_flow_exporter_records_imported | Number of IPFIX records received from the dataplane. | None | counter |
gateway_frame_bytes_received_total | Total frame bytes received from remote gateways. | isd_as , remote_isd_as , frame_type | counter |
gateway_frame_bytes_sent_total | Total frame bytes sent to remote gateways. | isd_as , remote_isd_as , domain , traffic_class , path_filter , remote_address , frame_type | counter |
gateway_frames_discarded_total | Total number of discarded frames received from remote gateways. It can have one of the following reasons: INVALID : there was a validation/check error in the frame, e.g., truncated packet, wrong version, etc. FRAGMENTS_EVICTED : fragments (partial IP packets) dropped. SEQ_TOO_HIGH : the seq_num of the received frame is higher than the highest expected seq_num in the istream. SEQ_TOO_LOW : the seq_num of the received frame is lower than the highest seq_num seen in the istream. DUPLICATE : (consecutive duplicates) the seq_num of the received frame is the same as the highest seq_num seen in the istream. | isd_as , remote_isd_as , reason | counter |
gateway_frames_received_total | Total number of frames received from remote gateways. | isd_as , remote_isd_as , frame_type | counter |
gateway_frames_sent_total | Total number of frames sent to remote gateways. | isd_as , remote_isd_as , domain , traffic_class , path_filter , remote_address , frame_type | counter |
gateway_info_fetch_errors_total | Total number of errors fetching gateway info. | isd_as , remote_isd_as , remote_address | counter |
gateway_info_seccom_addresses_fetched | The number of fetched seccom addresses from the remote. | isd_as , remote_isd_as , remote_address | gauge |
gateway_ippkt_bytes_local_received_total | Total IP packet bytes received from the local network. | None | counter |
gateway_ippkt_bytes_local_sent_total | Total IP packet bytes sent to the local network. | isd_as , remote_isd_as | counter |
gateway_ippkt_bytes_received_filtered_total | Total IP packet bytes received from remote gateways that were filtered. | isd_as , remote_isd_as , reason | counter |
gateway_ippkt_bytes_received_total | Total IP packet bytes received from remote gateways. | isd_as , remote_isd_as | counter |
gateway_ippkt_bytes_sent_total | Total IP packet bytes sent to remote gateways. | isd_as , remote_isd_as , domain , traffic_class , path_filter , remote_address , frame_type | counter |
gateway_ippkts_discarded_total | Total number of discarded IP packets received from the local network. | reason | counter |
gateway_ippkts_local_received_total | Total number of IP packets received from the local network. | None | counter |
gateway_ippkts_local_sent_total | Total number of IP packets sent to the local network. | isd_as , remote_isd_as | counter |
gateway_ippkts_received_filtered_total | Total number of IP packets received from remote gateways that were filtered. | isd_as , remote_isd_as , reason | counter |
gateway_ippkts_received_total | Total number of IP packets received from remote gateways. | isd_as , remote_isd_as | counter |
gateway_ippkts_sent_total | Total number of IP packets sent to remote gateways. | isd_as , remote_isd_as , domain , traffic_class , path_filter , remote_address , frame_type | counter |
gateway_netlink_listener_subscribed | Flag reflecting whether the netlink listener is subscribed to route updates. | object | gauge |
gateway_netlink_listener_updates_errors_total | Total number of netlink route updates errors. | object | counter |
gateway_next_hop_reachable | Set to 1 if the local IP address is reachable, 0 otherwise. | address | gauge |
gateway_path_fetch_errors_total | Total number of errors fetching paths from the daemon. | isd_as | counter |
gateway_paths_monitored | Total number of paths being monitored by the gateway. | isd_as , remote_isd_as | gauge |
gateway_ping_reachability_changes | The number of times the reachability of the gateway changed. | isd_as , remote_isd_as , remote_address , interface_group | counter |
gateway_ping_reachable | Whether the gateway is reachable via a specific SCION interface group. | isd_as , remote_isd_as , remote_address , interface_group | gauge |
gateway_ping_received_total | Total number of probe replies received from remote gateways. | isd_as , remote_isd_as , remote_address , interface_group | counter |
gateway_ping_sent_total | Total number of probes sent to remote gateways. | isd_as , remote_isd_as , remote_address , interface_group | counter |
gateway_prefix_fetch_errors_total | Total number of errors fetching prefixes via SGRP. | isd_as , remote_isd_as , remote_address | counter |
gateway_prefix_fetch_invalid_total | Total number of invalid prefixes received via SGRP. | isd_as , remote_isd_as , remote_address | gauge |
gateway_prefixes_advertised | Total number of IP prefixes advertised over SGRP. | isd_as , remote_isd_as , remote_address | gauge |
gateway_prefixes_fetched | Total number of IP prefixes fetched via SGRP. | isd_as , remote_isd_as , remote_address | gauge |
gateway_remote_discovery_errors_total | Total number of errors discovering remote gateways. | isd_as , remote_isd_as | counter |
gateway_remote_discovery_paths_available | Total number of SCION paths available to the remote gateway discovery. | isd_as , remote_isd_as , status | gauge |
gateway_remotes | Total number of discovered remote gateways. | isd_as , remote_isd_as | gauge |
gateway_remotes_changes | The number of times the remotes number changed. | isd_as , remote_isd_as | counter |
gateway_seccom_egress_sa_expiration | The timestamp the current SAs expire. Measured in seconds since UNIX epoch. | isd_as , remote_isd_as , remote_address , domain , traffic_class | gauge |
gateway_seccom_egress_sa_last_update | The timestamp the current SAs were created. Measured in seconds since UNIX epoch. | isd_as , remote_isd_as , remote_address , domain , traffic_class | gauge |
gateway_seccom_egress_sa_update_errors | Total number of failed updates of the egress SAs. | isd_as , remote_isd_as , remote_address , domain , traffic_class | counter |
gateway_seccom_egress_sas | Number of egress SAs that are currently configured. | isd_as , remote_isd_as , remote_address , domain , traffic_class | gauge |
gateway_seccom_ingress_request_errors_total | Total number of errors processing incoming security communication requests. | isd_as , remote_isd_as , remote_address , type , reason | counter |
gateway_seccom_ingress_requests_total | Total number of incoming security communication requests. | isd_as , remote_isd_as , remote_address , type | counter |
gateway_seccom_ingress_sas | Number of ingress SAs that are currently configured. | isd_as , remote_isd_as , remote_address | gauge |
gateway_seccom_ingress_sas_limit | The maximum number of ingress SAs that can be established. | None | gauge |
gateway_seccom_per_remote_ingress_sas_limit | The maximum number of ingress SAs that can be established per remote ISD-AS. | None | gauge |
gateway_session_is_healthy | Flag reflecting session healthiness. | isd_as , remote_isd_as , remote_address , path_filter , domain | gauge |
gateway_session_latest_path_expiration | Latest path expiration per session monitor. | isd_as , remote_isd_as , remote_address , path_filter , domain | gauge |
gateway_session_path_changes | Number of path changes per session monitor. | isd_as , remote_isd_as , remote_address , path_filter , domain | counter |
gateway_session_paths_available | Total number of paths available per session. | isd_as , remote_isd_as , remote_address , path_filter , domain , status | gauge |
gateway_session_state_changes | Number of state changes per session monitor. | isd_as , remote_isd_as , remote_address , path_filter , domain | counter |
gateway_sgrp_paths_available | Total number of paths available for SGRP per remote gateway. | remote_isd_as , remote_address , status | gauge |
scion_ipfix_active_flows_total | Total number of active SCION IPFIX flows (includes router and gateway IPFIX flows). | None | counter |
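
The status values of gateway_domain_paths_total form a funnel: total, then eligible, then monitored, then alive. The sketch below, which treats the sample values as current path counts as the description suggests, shows how many paths are lost at each stage. The counts are invented for illustration.

```python
# Stages in the order given by the metric description.
STAGES = ("total", "eligible", "monitored", "alive")


def path_funnel(counts):
    """Return (stage, count, lost_from_previous_stage) for each stage."""
    rows = []
    previous = None
    for stage in STAGES:
        count = counts.get(stage, 0)
        lost = 0 if previous is None else previous - count
        rows.append((stage, count, lost))
        previous = count
    return rows


# Hypothetical values for one domain and traffic matcher, keyed by status.
counts = {"total": 12, "eligible": 8, "monitored": 4, "alive": 3}
for stage, count, lost in path_funnel(counts):
    print(f"{stage:<10} {count:>3}  (-{lost})")
```

A large drop between total and eligible points at the path policies, while a drop between monitored and alive points at paths that recently stopped responding.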
Appliance controller metrics
Metric | Description | Labels | Type |
---|---|---|---|
appliance_controller_enforcer_license_expiry | Time when the current license expires or when the current trial/grace period ends. | None | gauge |
appliance_core_dumps | Number of core dumps on the appliance. | None | gauge |
appliance_core_dumps_removed_total | Number of core dumps that were removed on the appliance. | result | counter |
appliance_extra_bgp_routes | Number of BGP routes present in the Linux routing table, but not known by FRR. | None | gauge |
appliance_missing_bgp_routes | Number of BGP routes known by FRR, but not present in the Linux routing table. | None | gauge |
frr_bgp_peer_prefixes_advertised_expected_diff | The difference between the actual number of prefixes advertised to a peer and what is learned from IP-in-SCION tunneling. For now, the difference is equal to the number of configured networks (/bgp/global/networks) of the correct family. This metric can be added to the planner_node_count_total metric such that it can be compared to the frr_bgp_peer_prefixes_advertised_count_total metric. | local_as , peer_as , peer | gauge |
nodesync_topology_fetch_errors_total | The number of errors when fetching topology information from a remote node. | remote | counter |
nodesync_topology_merge_interface_conflicts_total | The number of topology merge conflicts. This indicates a severe misconfiguration of appliances. It means that multiple appliances have the same interfaces configured. | isd_as , interface | counter |
nodesync_topology_merge_service_conflicts_total | The number of topology merge conflicts. This indicates a severe misconfiguration of appliances. It means that multiple appliances have services configured with the same configuration. | service , isd_as , shard | counter |
Installer metrics
Metric | Description | Labels | Type |
---|---|---|---|
appliance_installer_checksum_consistent | Whether the checksum of the installed package matches the checksum in the package signature file. This may fail if a different package with the same version number was uploaded but it hasn't been installed. | pkgtype | gauge |
appliance_installer_controller_watchdog_errors_total | Total number of errors encountered by the appliance controller watchdog. If this counter increases, the installer logs should be inspected for more details. | None | counter |
appliance_installer_installed_package_versions | The version of the installed scion and system package. | pkgtype , version | gauge |
appliance_installer_metastore_inconsistent | Whether the appliance installer's metastore is in an inconsistent state. Value is 1 if the metastore is in an inconsistent state, 0 otherwise. | None | gauge |
appliance_installer_rollback_installations_total | Total number of rollback installations. Result label is the result of the installation. | result | counter |
appliance_installer_scion_installations_total | Total number of scion package installations. Result label is the result of the installation. | result | counter |
appliance_installer_system_installations_total | Total number of system package installations. Result label is the result of the installation. | result | counter |
BGP metrics
Metric | Description | Labels | Type |
---|---|---|---|
frr_bgp_peer_groups_count_total | Number of peer groups configured. | vrf , afi , safi , local_as | gauge |
frr_bgp_peer_groups_memory_bytes | Memory consumed by peer groups. | vrf , afi , safi , local_as | gauge |
frr_bgp_peer_message_received_total | Number of received messages. | vrf , afi , safi , local_as , peer , peer_as | counter |
frr_bgp_peer_message_sent_total | Number of sent messages. | vrf , afi , safi , local_as , peer , peer_as | counter |
frr_bgp_peer_prefixes_advertised_count_total | Number of prefixes advertised. | vrf , afi , safi , local_as , peer , peer_as | gauge |
frr_bgp_peer_prefixes_received_count_total | Number of prefixes received. | vrf , afi , safi , local_as , peer , peer_as | gauge |
frr_bgp_peer_state | State of the peer (2 = Administratively Down, 1 = Established, 0 = Down). | vrf , afi , safi , local_as , peer , peer_as | gauge |
frr_bgp_peer_types_up | Total number of peer types that are up. | type , afi , safi | gauge |
frr_bgp_peer_uptime_seconds | How long the peer has been up. | vrf , afi , safi , local_as , peer , peer_as | gauge |
frr_bgp_peers_count_total | Number of peers configured. | vrf , afi , safi , local_as | gauge |
frr_bgp_peers_memory_bytes | Memory consumed by peers. | vrf , afi , safi , local_as | gauge |
frr_bgp_rib_count_total | Number of routes in the RIB. | vrf , afi , safi , local_as | gauge |
frr_bgp_rib_memory_bytes | Memory consumed by the RIB. | vrf , afi , safi , local_as | gauge |
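
As an example of using frr_bgp_peer_state, the sketch below maps the values from the table to names and lists the peers that are down. It treats only state 0 as needing attention, on the assumption that an administratively down peer was shut down on purpose. The peer addresses are hypothetical.

```python
# Values of frr_bgp_peer_state as listed in the table.
BGP_PEER_STATES = {0: "down", 1: "established", 2: "administratively down"}


def describe_peer_state(value) -> str:
    """Translate a sample value into a state name."""
    return BGP_PEER_STATES.get(int(value), f"unknown({value})")


def down_peers(samples):
    """Return the sorted peer labels of all samples whose state is 0 (down)."""
    return sorted(labels["peer"] for labels, value in samples if int(value) == 0)


# Hypothetical samples; only the peer label is shown.
samples = [
    ({"peer": "192.0.2.1"}, 1),
    ({"peer": "192.0.2.2"}, 0),
    ({"peer": "192.0.2.3"}, 2),
]
for labels, value in samples:
    print(labels["peer"], describe_peer_state(value))
print("down:", down_peers(samples))
```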
Host metrics
Metric | Description | Labels | Type |
---|---|---|---|
node_cpu_seconds_total | Seconds the CPU spends in each mode. | cpu , mode | counter |
node_filesystem_avail_bytes | Filesystem available bytes. | device , fstype , mountpoint | gauge |
node_filesystem_size_bytes | Filesystem size in bytes. | device , fstype , mountpoint | gauge |
node_load1 | 1 minute load average. | None | gauge |
node_load15 | 15 minute load average. | None | gauge |
node_load5 | 5 minute load average. | None | gauge |
node_memory_MemAvailable_bytes | Amount of available memory in the node. | None | gauge |
node_memory_MemTotal_bytes | Total amount of memory in the node. | None | gauge |
node_network_receive_bytes_total | Number of bytes received from the network. | device | counter |
node_network_transmit_bytes_total | Number of bytes transmitted to the network. | device | counter |
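
The host metrics are most useful as ratios. The sketch below derives memory and filesystem utilization from the available and total values; the sample numbers are assumptions chosen for illustration.

```python
def used_fraction(available_bytes: float, total_bytes: float) -> float:
    """Fraction of a resource in use, given its available and total size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - available_bytes / total_bytes


GIB = 1024 ** 3

# Hypothetical values of node_memory_MemAvailable_bytes and
# node_memory_MemTotal_bytes.
memory_used = used_fraction(2 * GIB, 8 * GIB)

# Hypothetical values of node_filesystem_avail_bytes and
# node_filesystem_size_bytes for one mountpoint.
disk_used = used_fraction(30 * GIB, 40 * GIB)

print(f"memory used: {memory_used:.0%}")      # prints memory used: 75%
print(f"filesystem used: {disk_used:.0%}")    # prints filesystem used: 25%
```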