Monitoring your Anapaya SCION environment

Telemetry

Each Anapaya appliance can be configured to expose a telemetry endpoint that can be used to retrieve telemetry data from the appliance. To enable telemetry, refer to Set up a monitoring host.

Metrics Path

The telemetry endpoint serves the metrics on the /metrics path. If the management API is configured, the metrics are by default available on the management API endpoint, on the /metrics path (without the /api/v1 prefix).
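
As an illustration of the path rule above, the following minimal sketch derives the /metrics URL from an endpoint address and fetches the raw metrics text. The function names and the example addresses are hypothetical, and the sketch assumes the endpoint is reachable without additional authentication or TLS configuration, which may not match your deployment.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def metrics_url(endpoint: str) -> str:
    """Return the /metrics URL for a telemetry or management API endpoint.

    Any path on the given endpoint (for example /api/v1) is replaced,
    because the metrics are served directly under /metrics.
    """
    parts = urlsplit(endpoint)
    return urlunsplit((parts.scheme, parts.netloc, "/metrics", "", ""))


def fetch_metrics(endpoint: str, timeout: float = 5.0) -> str:
    """Fetch the raw Prometheus text exposition from an appliance.

    Assumption: the endpoint is reachable from this host without extra
    authentication or TLS settings; adapt this to your deployment.
    """
    with urlopen(metrics_url(endpoint), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, metrics_url("https://appliance.example:443/api/v1") yields "https://appliance.example:443/metrics".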

The telemetry data is exported in the form of Prometheus metrics. Prometheus is an open-source systems monitoring and alerting tool. It collects and stores metrics as time series data alongside optional key-value pairs called labels. A metric is a numeric measurement of a specific event or condition, e.g., the number of packets sent on a specific interface. Recording metrics as time series then provides higher-level insights, e.g., the rate of change of the sent-packet counter gives the throughput of the interface. Labels add further dimensions to a metric, e.g., the name of the interface for which the packet count is collected is attached as a label.
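
To make the relationship between metrics, labels, and rates concrete, here is a minimal sketch that parses a single sample line in the Prometheus text format and computes a throughput from two counter readings. The parser only handles simple lines (no escaped quotes, no timestamps), the label values are made-up examples, and the helper names are hypothetical. In practice a Prometheus server does this work for you, for instance with its rate() function.

```python
import re

# Matches: metric_name{label="value",...} 123.0  (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str):
    """Split one exposition line into (metric name, labels dict, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def throughput_bps(bytes_then: float, bytes_now: float, seconds: float) -> float:
    """Average bits per second between two readings of a byte counter.

    Counter resets (for example after a restart) are not handled here;
    a decreasing counter is reported as an error instead.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if bytes_now < bytes_then:
        raise ValueError("counter decreased, probably a reset")
    return (bytes_now - bytes_then) * 8 / seconds


# Two readings of the same time series, taken 30 seconds apart.
name, labels, first = parse_sample(
    'router_output_bytes_total{interface="1",isd_as="1-ff00:0:110",'
    'neighbor_isd_as="1-ff00:0:111"} 1500000'
)
rate = throughput_bps(first, 4500000, 30)  # 800000.0 bit/s
```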

Each Anapaya appliance internally has several modules that expose some of their internal states as metrics. Each module manages a particular part of the system, such as the SCION control plane, the SCION data plane, or the IP-in-SCION tunneling service. For each module, we list the exposed metrics, their names, the type of the metric, a brief description, and the attached labels. Please refer to the individual sections below for more information.

To access these metrics, you need a Prometheus server that ingests the metrics from each appliance. Setting up a Prometheus server to collect appliance metrics is outside the scope of this document; please refer to Set up a monitoring host. Should you require assistance with integrating appliance metrics into your monitoring setup, please contact Anapaya’s support team at support@anapaya.net.

Control plane metrics

| Metric | Description | Labels | Type |
| --- | --- | --- | --- |
| control_beaconing_originated_beacons_total | Total number of beacons originated. | egress_interface, result | counter |
| control_beaconing_propagated_beacons_total | Total number of beacons propagated. | start_isd_as, ingress_interface, egress_interface, result | counter |
| control_beaconing_received_beacons_total | Total number of beacons received. | ingress_interface, neighbor_isd_as, result | counter |
| control_beaconing_registered_segments_total | Total number of segments registered. | start_isd_as, ingress_interface, seg_type, result | counter |
| control_segment_expiration_deficient | Indicates whether the expiration time of the segment is below the configured maximum. This happens when the signer expiration time is lower than the maximum segment expiration time. | None | gauge |
| control_segment_lookup_requests_total | Total number of path segment requests received. | dst_isd, seg_type, result | counter |
| control_segment_registry_segments_received_total | Total number of path segments received through registrations. | src, seg_type, result | counter |
| renewal_ca_health_status | Exposes the status of the CA (available, unavailable, starting, stopping), if the host acts as CA and is delegating certificate renewal to the CA service. | status | gauge |
| renewal_handled_requests_total | Total number of renewal requests served by each handler type (legacy, in-process, delegating). | result, type | counter |
| renewal_received_requests_total | Total number of renewal requests served. | result | counter |
| renewal_registered_handlers | Exposes which handler type (legacy, in-process, delegating) is registered. | type | gauge |
| trustengine_latest_trc_not_after_time_seconds | The not_after time of the latest TRC for the local ISD in seconds since UNIX epoch. | None | gauge |
| trustengine_latest_trc_not_before_time_seconds | The not_before time of the latest TRC for the local ISD in seconds since UNIX epoch. | None | gauge |
| trustengine_latest_trc_serial_number | The serial number of the latest TRC for the local ISD. | None | gauge |

Data plane metrics

| Metric | Description | Labels | Type |
| --- | --- | --- | --- |
| dataplane_control_dataplane_sync_error | Indicates whether the last dataplane sync had an error (1) or not (0). | None | gauge |
| dataplane_control_vrrp_state | Current state of VRRP nodes. 0=down, 1=init, 2=backup, 3=master. | interface, vr_id, protocol, vip | gauge |
| router_dropped_pkts_total | Total number of packets dropped. | interface, isd_as, neighbor_isd_as, type | counter |
| router_input_bytes_total | Total number of bytes received. | interface, isd_as, neighbor_isd_as | counter |
| router_input_pkts_total | Total number of packets received. | interface, isd_as, neighbor_isd_as | counter |
| router_interface_up | 1 indicates the interface is up, 0 otherwise. | interface, isd_as, link_to, neighbor_isd_as | gauge |
| router_output_bytes_total | Total number of bytes sent. | interface, isd_as, neighbor_isd_as | counter |
| router_output_pkts_total | Total number of packets sent. | interface, isd_as, neighbor_isd_as | counter |
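
The numeric values of dataplane_control_vrrp_state can be mapped back to state names when building dashboards or alerts. A minimal sketch of that mapping, using the values listed in the table (the class and function names are hypothetical):

```python
from enum import IntEnum


class VrrpState(IntEnum):
    """Values of dataplane_control_vrrp_state as listed in the table."""
    DOWN = 0
    INIT = 1
    BACKUP = 2
    MASTER = 3


def describe_vrrp_state(value: float) -> str:
    """Translate a gauge sample into a readable state name."""
    try:
        return VrrpState(int(value)).name.lower()
    except ValueError:
        # Values outside the documented range are reported as-is.
        return f"unknown ({value})"
```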

IP-in-SCION tunneling metrics

| Metric | Description | Labels | Type |
| --- | --- | --- | --- |
| gateway_as_certificate_expiration_time_second | The expiration time of the AS certificate. | isd_as | gauge |
| gateway_domain_paths_total | The metric indicates the number of paths available for a domain and traffic matcher. The status indicates more details about the paths: 'total' indicates the total number of paths available to the domain, 'eligible' indicates the number of paths that were accepted by the path policies, 'monitored' indicates the number of paths out of the eligible that are being actively monitored, 'alive' indicates the number of paths out of the monitored that were recently seen alive. | domain, traffic_matcher, status | counter |
| gateway_domain_traffic_matcher_sessions_total | The number of live sessions per traffic matcher in a domain. | domain, traffic_matcher | gauge |
| gateway_domain_traffic_redirections_total | The metric is incremented each time some subset of traffic is potentially sent either via a different SCION path or to a different remote gateway instance. The reason indicates the cause of the redirection. | domain, traffic_matcher, reason | counter |
| gateway_flow_exporter_export_errors | Number of errors encountered during flow exporting. | reason | counter |
| gateway_flow_exporter_last_export_time | The timestamp of the last time the flow metrics were successfully exported. Measured in seconds since UNIX epoch. | None | gauge |
| gateway_flow_exporter_last_import_time | The timestamp of the last time the flow metrics were received from the dataplane. Measured in seconds since UNIX epoch. | None | gauge |
| gateway_flow_exporter_records_exported | Number of IPFIX records successfully sent to the flow collector. | None | counter |
| gateway_flow_exporter_records_imported | Number of IPFIX records received from the dataplane. | None | counter |
| gateway_frame_bytes_received_total | Total frame bytes received from remote gateways. | isd_as, remote_isd_as, frame_type | counter |
| gateway_frame_bytes_sent_total | Total frame bytes sent to remote gateways. | isd_as, remote_isd_as, domain, traffic_class, path_filter, remote_address, frame_type | counter |
| gateway_frames_discarded_total | Total number of discarded frames received from remote gateways. It can have one of the following reasons: INVALID: there was a validation/check error in the frame, e.g., truncated packet, wrong version, etc. FRAGMENTS_EVICTED: fragments (partial IP packets) dropped. SEQ_TOO_HIGH: the seq_num of the received frame is higher than the highest expected seq_num in the istream. SEQ_TOO_LOW: the seq_num of the received frame is lower than the highest seq_num seen in the istream. DUPLICATE: (consecutive duplicates) the seq_num of the received frame is the same as the highest seq_num seen in the istream. | isd_as, remote_isd_as, reason | counter |
| gateway_frames_received_total | Total number of frames received from remote gateways. | isd_as, remote_isd_as, frame_type | counter |
| gateway_frames_sent_total | Total number of frames sent to remote gateways. | isd_as, remote_isd_as, domain, traffic_class, path_filter, remote_address, frame_type | counter |
| gateway_info_fetch_errors_total | Total number of errors fetching gateway info. | isd_as, remote_isd_as, remote_address | counter |
| gateway_info_seccom_addresses_fetched | The number of fetched seccom addresses from the remote. | isd_as, remote_isd_as, remote_address | gauge |
| gateway_ippkt_bytes_local_received_total | Total IP packet bytes received from the local network. | None | counter |
| gateway_ippkt_bytes_local_sent_total | Total IP packet bytes sent to the local network. | isd_as, remote_isd_as | counter |
| gateway_ippkt_bytes_received_filtered_total | Total IP packet bytes received from remote gateways that were filtered. | isd_as, remote_isd_as, reason | counter |
| gateway_ippkt_bytes_received_total | Total IP packet bytes received from remote gateways. | isd_as, remote_isd_as | counter |
| gateway_ippkt_bytes_sent_total | Total IP packet bytes sent to remote gateways. | isd_as, remote_isd_as, domain, traffic_class, path_filter, remote_address, frame_type | counter |
| gateway_ippkts_discarded_total | Total number of discarded IP packets received from the local network. | reason | counter |
| gateway_ippkts_local_received_total | Total number of IP packets received from the local network. | None | counter |
| gateway_ippkts_local_sent_total | Total number of IP packets sent to the local network. | isd_as, remote_isd_as | counter |
| gateway_ippkts_received_filtered_total | Total number of IP packets received from remote gateways that were filtered. | isd_as, remote_isd_as, reason | counter |
| gateway_ippkts_received_total | Total number of IP packets received from remote gateways. | isd_as, remote_isd_as | counter |
| gateway_ippkts_sent_total | Total number of IP packets sent to remote gateways. | isd_as, remote_isd_as, domain, traffic_class, path_filter, remote_address, frame_type | counter |
| gateway_netlink_listener_subscribed | Flag reflecting whether the netlink listener is subscribed to route updates. | object | gauge |
| gateway_netlink_listener_updates_errors_total | Total number of netlink route update errors. | object | counter |
| gateway_next_hop_reachable | Set to 1 if the local IP address is reachable, 0 otherwise. | address | gauge |
| gateway_path_fetch_errors_total | Total number of errors fetching paths from the daemon. | isd_as | counter |
| gateway_paths_monitored | Total number of paths being monitored by the gateway. | isd_as, remote_isd_as | gauge |
| gateway_ping_reachability_changes | The number of times the reachability of the gateway changed. | isd_as, remote_isd_as, remote_address, interface_group | counter |
| gateway_ping_reachable | Whether the gateway is reachable via a specific SCION interface group. | isd_as, remote_isd_as, remote_address, interface_group | gauge |
| gateway_ping_received_total | Total number of probe replies received from remote gateways. | isd_as, remote_isd_as, remote_address, interface_group | counter |
| gateway_ping_sent_total | Total number of probes sent to remote gateways. | isd_as, remote_isd_as, remote_address, interface_group | counter |
| gateway_prefix_fetch_errors_total | Total number of errors fetching prefixes via SGRP. | isd_as, remote_isd_as, remote_address | counter |
| gateway_prefix_fetch_invalid_total | Total number of invalid prefixes received via SGRP. | isd_as, remote_isd_as, remote_address | gauge |
| gateway_prefixes_advertised | Total number of IP prefixes advertised over SGRP. | isd_as, remote_isd_as, remote_address | gauge |
| gateway_prefixes_fetched | Total number of IP prefixes fetched via SGRP. | isd_as, remote_isd_as, remote_address | gauge |
| gateway_remote_discovery_errors_total | Total number of errors discovering remote gateways. | isd_as, remote_isd_as | counter |
| gateway_remote_discovery_paths_available | Total number of SCION paths available to the remote gateway discovery. | isd_as, remote_isd_as, status | gauge |
| gateway_remotes | Total number of discovered remote gateways. | isd_as, remote_isd_as | gauge |
| gateway_remotes_changes | The number of times the number of remotes changed. | isd_as, remote_isd_as | counter |
| gateway_seccom_egress_sa_expiration | The timestamp at which the current SAs expire. Measured in seconds since UNIX epoch. | isd_as, remote_isd_as, remote_address, domain, traffic_class | gauge |
| gateway_seccom_egress_sa_last_update | The timestamp at which the current SAs were created. Measured in seconds since UNIX epoch. | isd_as, remote_isd_as, remote_address, domain, traffic_class | gauge |
| gateway_seccom_egress_sa_update_errors | Total number of failed updates of the egress SAs. | isd_as, remote_isd_as, remote_address, domain, traffic_class | counter |
| gateway_seccom_egress_sas | Number of egress SAs that are currently configured. | isd_as, remote_isd_as, remote_address, domain, traffic_class | gauge |
| gateway_seccom_ingress_request_errors_total | Total number of errors processing incoming security communication requests. | isd_as, remote_isd_as, remote_address, type, reason | counter |
| gateway_seccom_ingress_requests_total | Total number of incoming security communication requests. | isd_as, remote_isd_as, remote_address, type | counter |
| gateway_seccom_ingress_sas | Number of ingress SAs that are currently configured. | isd_as, remote_isd_as, remote_address | gauge |
| gateway_seccom_ingress_sas_limit | The maximum number of ingress SAs that can be established. | None | gauge |
| gateway_seccom_per_remote_ingress_sas_limit | The maximum number of ingress SAs that can be established per remote ISD-AS. | None | gauge |
| gateway_session_is_healthy | Flag reflecting session healthiness. | isd_as, remote_isd_as, remote_address, path_filter, domain | gauge |
| gateway_session_latest_path_expiration | Latest path expiration per session monitor. | isd_as, remote_isd_as, remote_address, path_filter, domain | gauge |
| gateway_session_path_changes | Number of path changes per session monitor. | isd_as, remote_isd_as, remote_address, path_filter, domain | counter |
| gateway_session_paths_available | Total number of paths available per session. | isd_as, remote_isd_as, remote_address, path_filter, domain, status | gauge |
| gateway_session_state_changes | Number of state changes per session monitor. | isd_as, remote_isd_as, remote_address, path_filter, domain | counter |
| gateway_sgrp_paths_available | Total number of paths available for SGRP per remote gateway. | remote_isd_as, remote_address, status | gauge |
| scion_ipfix_active_flows_total | Total number of active SCION IPFIX flows (includes router and gateway IPFIX flows). | None | counter |
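
The four status values of gateway_domain_paths_total form a funnel: eligible paths are a subset of the total, monitored paths a subset of the eligible, and alive paths a subset of the monitored. One way to use this when troubleshooting is to find the first stage at which no paths remain. The following is a minimal sketch under that reading of the description; the function name and the sample counts are hypothetical.

```python
# Status values in funnel order, as described for gateway_domain_paths_total.
PATH_STATUSES = ("total", "eligible", "monitored", "alive")


def first_empty_stage(counts: dict) -> "str | None":
    """Return the first status with zero paths, or None if all stages have paths.

    counts maps a status label value to the sample value for one
    domain and traffic matcher. Missing statuses are treated as zero.
    """
    for status in PATH_STATUSES:
        if counts.get(status, 0) == 0:
            return status
    return None
```

For instance, a result of "eligible" would suggest that paths exist but the path policies accepted none of them, whereas "alive" would suggest that monitored paths exist but none was recently seen alive.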

Appliance controller metrics

| Metric | Description | Labels | Type |
| --- | --- | --- | --- |
| appliance_controller_enforcer_license_expiry | Time when the current license expires or when the current trial/grace period ends. | None | gauge |
| appliance_core_dumps | Number of core dumps on the appliance. | None | gauge |
| appliance_core_dumps_removed_total | Number of core dumps that were removed on the appliance. | result | counter |
| appliance_extra_bgp_routes | Number of BGP routes present in the Linux routing table, but not known by FRR. | None | gauge |
| appliance_missing_bgp_routes | Number of BGP routes known by FRR, but not present in the Linux routing table. | None | gauge |
| frr_bgp_peer_prefixes_advertised_expected_diff | The difference between the actual number of prefixes advertised to a peer and what is learned from IP-in-SCION tunneling. For now, the difference is equal to the number of configured networks (/bgp/global/networks) of the correct family. This metric can be added to the planner_node_count_total metric such that it can be compared to frr_bgp_peer_prefixes_advertised_count_total. | local_as, peer_as, peer | gauge |
| nodesync_topology_fetch_errors_total | The number of errors when fetching topology information from a remote node. | remote | counter |
| nodesync_topology_merge_interface_conflicts_total | The number of topology merge conflicts. This indicates a severe misconfiguration of appliances. It means that multiple appliances have the same interfaces configured. | isd_as, interface | counter |
| nodesync_topology_merge_service_conflicts_total | The number of topology merge conflicts. This indicates a severe misconfiguration of appliances. It means that multiple appliances have services configured with the same configuration. | service, isd_as, shard | counter |
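
The comparison described for frr_bgp_peer_prefixes_advertised_expected_diff amounts to a simple sum. The sketch below spells it out with plain numbers; planner_node_count_total is only mentioned, not documented, on this page, so its exact semantics are an assumption here, and the function name is hypothetical.

```python
def advertised_matches_expected(advertised_count: float,
                                planner_node_count: float,
                                expected_diff: float) -> bool:
    """Check whether a peer is advertised the expected number of prefixes.

    advertised_count   sample of frr_bgp_peer_prefixes_advertised_count_total
    planner_node_count sample of planner_node_count_total (assumed semantics)
    expected_diff      sample of frr_bgp_peer_prefixes_advertised_expected_diff
    """
    return advertised_count == planner_node_count + expected_diff
```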

Installer metrics

| Metric | Description | Labels | Type |
| --- | --- | --- | --- |
| appliance_installer_checksum_consistent | Whether the checksum of the installed package matches the checksum in the package signature file. This may fail if a different package with the same version number was uploaded but it hasn't been installed. | pkgtype | gauge |
| appliance_installer_controller_watchdog_errors_total | Total number of errors encountered by the appliance controller watchdog. If this counter increases, the installer logs should be inspected for more details. | None | counter |
| appliance_installer_installed_package_versions | The version of the installed scion and system package. | pkgtype, version | gauge |
| appliance_installer_metastore_inconsistent | Whether the appliance installer's metastore is in an inconsistent state. Value is 1 if the metastore is in an inconsistent state, 0 otherwise. | None | gauge |
| appliance_installer_rollback_installations_total | Total number of rollback installations. Result label is the result of the installation. | result | counter |
| appliance_installer_scion_installations_total | Total number of scion package installations. Result label is the result of the installation. | result | counter |
| appliance_installer_system_installations_total | Total number of system package installations. Result label is the result of the installation. | result | counter |

BGP metrics

| Metric | Description | Labels | Type |
| --- | --- | --- | --- |
| frr_bgp_peer_groups_count_total | Number of peer groups configured. | vrf, afi, safi, local_as | gauge |
| frr_bgp_peer_groups_memory_bytes | Memory consumed by peer groups. | vrf, afi, safi, local_as | gauge |
| frr_bgp_peer_message_received_total | Number of received messages. | vrf, afi, safi, local_as, peer, peer_as | counter |
| frr_bgp_peer_message_sent_total | Number of sent messages. | vrf, afi, safi, local_as, peer, peer_as | counter |
| frr_bgp_peer_prefixes_advertised_count_total | Number of prefixes advertised. | vrf, afi, safi, local_as, peer, peer_as | gauge |
| frr_bgp_peer_prefixes_received_count_total | Number of prefixes received. | vrf, afi, safi, local_as, peer, peer_as | gauge |
| frr_bgp_peer_state | State of the peer (2 = Administratively Down, 1 = Established, 0 = Down). | vrf, afi, safi, local_as, peer, peer_as | gauge |
| frr_bgp_peer_types_up | Total number of peer types that are up. | type, afi, safi | gauge |
| frr_bgp_peer_uptime_seconds | How long the peer has been up. | vrf, afi, safi, local_as, peer, peer_as | gauge |
| frr_bgp_peers_count_total | Number of peers configured. | vrf, afi, safi, local_as | gauge |
| frr_bgp_peers_memory_bytes | Memory consumed by peers. | vrf, afi, safi, local_as | gauge |
| frr_bgp_rib_count_total | Number of routes in the RIB. | vrf, afi, safi, local_as | gauge |
| frr_bgp_rib_memory_bytes | Memory consumed by the RIB. | vrf, afi, safi, local_as | gauge |
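
A common use of frr_bgp_peer_state is to list the peers whose session is not established. The sketch below does this for samples that have already been collected into a dictionary keyed by the peer label; the function name and the peer addresses in the usage note are hypothetical.

```python
# Values of frr_bgp_peer_state as listed in the table.
BGP_PEER_STATES = {0: "down", 1: "established", 2: "administratively down"}


def peers_not_established(states: dict) -> dict:
    """Map each peer that is not established to a readable state name.

    states maps the value of the peer label to the gauge sample.
    """
    return {
        peer: BGP_PEER_STATES.get(int(value), "unknown")
        for peer, value in sorted(states.items())
        if int(value) != 1
    }
```

Called with {"192.0.2.1": 1, "192.0.2.2": 0, "192.0.2.3": 2}, the function reports 192.0.2.2 as down and 192.0.2.3 as administratively down.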

Host metrics

| Metric | Description | Labels | Type |
| --- | --- | --- | --- |
| node_cpu_seconds_total | Seconds the CPU spends in each mode. | cpu, mode | counter |
| node_filesystem_avail_bytes | Filesystem available bytes. | device, fstype, mountpoint | gauge |
| node_filesystem_size_bytes | Filesystem size in bytes. | device, fstype, mountpoint | gauge |
| node_load1 | 1 minute load average. | None | gauge |
| node_load15 | 15 minute load average. | None | gauge |
| node_load5 | 5 minute load average. | None | gauge |
| node_memory_MemAvailable_bytes | Amount of available memory in the node. | None | gauge |
| node_memory_MemTotal_bytes | Total amount of memory in the node. | None | gauge |
| node_network_receive_bytes_total | Number of bytes received from the network. | device | counter |
| node_network_transmit_bytes_total | Number of bytes transmitted to the network. | device | counter |
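
The size and availability gauges are typically combined into a usage ratio, for memory (node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes) as well as for filesystems (node_filesystem_size_bytes and node_filesystem_avail_bytes). A minimal sketch of that calculation, with a hypothetical function name and made-up sample values:

```python
def used_fraction(total_bytes: float, available_bytes: float) -> float:
    """Fraction of a resource in use, between 0.0 and 1.0.

    Works for any pair of total/available gauges, e.g. memory or a filesystem.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - available_bytes) / total_bytes


# 16 GB of memory with 4 GB available means 75% is in use.
memory_used = used_fraction(16e9, 4e9)
```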