Monitoring

Monitoring Validators and Proxies

Logging

Several command line options control logging:

  • --verbosity: Sets logging verbosity. 3 outputs logs up to INFO level and is recommended. 4 outputs up to DEBUG level; 5 is TRACE.

  • --vmodule: Overrides the verbosity for specific modules. For example, to configure TRACE level logging of consensus activity, use consensus/istanbul/*=5.

  • --consoleoutput: Sends output to the given path, or to stdout.

  • --consoleformat: Formats logs for easy viewing in a terminal (term), or as structured JSON (json).

Useful messages to record or set up log-based metrics on:

  • msg="Validator Election Results": When the last block of any epoch (number) has been agreed, elected shows whether the validator was selected in the validator election.

  • msg="Elected but didn't sign block": This validator was elected but did not have its signature included in the block given by number (in fact, in the child's parent seal). This block could count towards downtime if 12 successive blocks are missed.
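As a rough illustration of a log-based metric, the sketch below counts the two messages above in logs written with --consoleformat json. It assumes each log line is a JSON object with the message text in a "msg" field; that field name is an assumption about the JSON layout, and count_watched_messages is a hypothetical helper, not part of Celo Blockchain.

```python
import json
from collections import Counter

# Messages named in the list above. The "msg" key is an assumption about
# the layout of JSON log lines; adjust it to match your actual output.
WATCHED = ("Validator Election Results", "Elected but didn't sign block")


def count_watched_messages(lines):
    """Count occurrences of the watched messages in JSON-formatted log lines."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        msg = record.get("msg")
        if msg in WATCHED:
            counts[msg] += 1
    return counts
```

A rising count of "Elected but didn't sign block" between two scans of the log is the signal worth alerting on.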

Metrics

Celo Blockchain inherits go-ethereum's metrics system, but additional Celo-specific metrics have been added.

Metrics reporting is enabled with the --metrics flag.

Pull-based metrics are available using the --pprof flag. This enables the pprof debugging HTTP server, by default on http://localhost:6060. The --pprofaddr and --pprofport options can be used to configure the interface and port respectively. If the node is running inside a Docker container, you will need to set --pprofaddr 0.0.0.0, then on your Docker command line add -p 127.0.0.1:6060:6060 so that the port is published only on the host's loopback interface.

Be sure never to expose the pprof service to the public internet.

Prometheus format metrics are available at http://localhost:6060/debug/metrics/prometheus.

ExpVar format metrics are available at http://localhost:6060/debug/metrics.
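The following is a minimal sketch of reading the Prometheus endpoint from a script, assuming the default address above and assuming that sample lines carry no trailing timestamp. The function names are hypothetical. The parser is deliberately simple and is not a full implementation of the Prometheus text format.

```python
from urllib.request import urlopen


def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {name_with_labels: float}.

    Assumes each sample line is "<name or name{labels}> <value>" with no
    trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        key, _, value = line.rpartition(" ")
        try:
            samples[key.strip()] = float(value)
        except ValueError:
            continue  # ignore lines whose last field is not a number
    return samples


def fetch_metrics(url="http://localhost:6060/debug/metrics/prometheus", timeout=5):
    """Fetch and parse metrics from the pprof server (must be reachable)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

For example, fetch_metrics().get("p2p_peers") would return the current peer count on a node started with --metrics and --pprof.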

Support for pushing metrics to InfluxDB is available via --metrics.influxdb and related flags. This works without the pprof server.

Note that metric name separators differ between these endpoints.

All metrics are soft-state and are cleared when the process is restarted.

Memory metrics

Memory metrics derived from mstats:

  • system_memory_held: Gauge of virtual address space allocated by the Celo Blockchain process, measured in bytes.

  • system_memory_used: Gauge of memory in use by the Celo Blockchain process, measured as bytes of allocated heap objects.

  • system_memory_allocs: Counter for memory allocations made, measured in bytes. Consider monitoring the rate.

  • system_memory_pauses: Counter for stop-the-world Garbage Collection pauses, measured in nanoseconds. Consider monitoring the rate.

CPU metrics

  • system_cpu_sysload: Gauge of load average for the system.

  • system_cpu_syswait: Gauge of IO wait time for the system.

  • system_cpu_procload: Gauge of load average for the Celo Blockchain process.

Network metrics

  • p2p_peers: The number of connected peers. This should remain at exactly 1 for a proxied validator (just its proxy). It should remain at a relatively steady level for proxy nodes.

  • p2p_ingress: Counter for total inbound traffic, measured in bytes. Consider monitoring the rate.

  • p2p_egress: Counter for total outbound traffic, measured in bytes. Consider monitoring the rate.

  • p2p_dials: Counter for outbound connection attempts. Consider monitoring the rate.

  • p2p_serves: Counter for accepted inbound connection attempts. Consider monitoring the rate.

Blockchain metrics

  • chain_inserts_count: The count of insertions of new blocks into this node's chain. The rate of this metric should be close to constant at 0.2/second, i.e. one block every 5 seconds.
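Several metrics on this page are counters whose rate is the useful signal. A minimal sketch of computing a per-second rate from two samples and checking it against the expected 0.2/second is shown below. The function names and the 25% tolerance are illustrative assumptions, not recommended values.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Per-second rate between two counter samples, or None if undefined."""
    elapsed = curr_time - prev_time
    if elapsed <= 0 or curr_value < prev_value:
        # Non-increasing clock, or the counter was reset by a process restart.
        return None
    return (curr_value - prev_value) / elapsed


def insert_rate_ok(rate, expected=0.2, tolerance=0.25):
    """True if rate is within a fractional tolerance of the expected rate."""
    # The tolerance default is an arbitrary illustrative choice.
    return rate is not None and abs(rate - expected) <= expected * tolerance
```

Because all metrics are cleared on restart, a drop in the counter value is treated here as "rate unknown" rather than as a negative rate.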

Validator health metrics

A number of metrics are tracked for the parent of the last sealed block received (i.e. this is always two fewer than the current consensus sequence):

  • consensus_istanbul_blocks_elected: Counts the number of blocks for which this validator has been elected.

  • consensus_istanbul_blocks_signedbyus: Counts the blocks for which this validator was elected and its signature was included in the seal. This means the validator completed consensus correctly and sent a COMMIT, and that its COMMIT either arrived in time to be included in the parent seal received by the next proposer, or was received directly by the next proposer itself. Such a block will not count as downtime. Consider monitoring the rate.

  • consensus_istanbul_blocks_missedbyus: Counts the blocks for which this validator was elected but not included in the child's parent seal (this block could count towards downtime if 12 successive blocks are missed). Consider monitoring the rate.

  • consensus_istanbul_blocks_missedbyusinarow: (since 1.0.2) Counts the blocks for which this validator was elected but not included in the child's parent seal in a row. Consider monitoring the gauge.

  • consensus_istanbul_blocks_proposedbyus: (since 1.0.2) Counts the blocks for which this validator was elected and for which a block it proposed was successfully included in the chain. Consider monitoring the rate.

  • consensus_istanbul_blocks_downtimeevent: (since 1.0.2) Counts the blocks for which this validator was elected and is considered down (occurs when missedbyusinarow is >= 12). Consider monitoring the rate.
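One way to turn these metrics into alerts is sketched below. It uses the threshold of 12 successive missed blocks from the descriptions above. The warn_at level and both function names are hypothetical choices for illustration.

```python
DOWNTIME_THRESHOLD = 12  # successive missed blocks, as described above


def downtime_status(missed_in_a_row, warn_at=6):
    """Classify the missedbyusinarow gauge as 'ok', 'warning' or 'down'."""
    # warn_at is an arbitrary early-warning level, not a protocol value.
    if missed_in_a_row >= DOWNTIME_THRESHOLD:
        return "down"
    if missed_in_a_row >= warn_at:
        return "warning"
    return "ok"


def signed_ratio(signed_delta, elected_delta):
    """Fraction of elected blocks signed over an interval, or None if none."""
    # Pass the increase in signedbyus and elected over the same interval.
    if elected_delta <= 0:
        return None
    return signed_delta / elected_delta
```

Alerting at a level below 12 leaves time to react before a run of missed blocks is counted as downtime.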

Consensus metrics

  • consensus_istanbul_core_desiredround: Current desired round for this validator, i.e. the round for which it is waiting to see a quorum of validators send RoundChange messages. Usually this value should be 0. The desired round increments with each timeout, and timeouts back off exponentially. A value of 5 indicates consensus has stalled for more than 30 seconds. Values above that mean the validator is unable to participate in a quorum (because it is disconnected or out of sync, or because of a network partition or the failure of other validators).

  • consensus_istanbul_core_round: Current consensus round for this validator, i.e. the round for which this validator has received a quorum of RoundChange messages. Usually this value should be 0. If this value is less than consensus_istanbul_core_desiredround, the validator is not connected to a quorum of other validators that are also unable to participate (for instance, they did see a proposed block, but this validator did not). If it is equal, the validator remains connected to a quorum of other validators but they cannot agree on a block.

  • consensus_istanbul_core_sequence: Current consensus sequence number, i.e. the block number currently being proposed.
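The relationship between the round and desired round gauges can be summarised in a small classifier. This is a sketch of the interpretation given above, with hypothetical state labels and a hypothetical function name. The stall level of 5 comes from the description of desiredround.

```python
STALL_ROUND = 5  # per the text, desired round 5 means a stall of over 30 seconds


def classify_rounds(current_round, desired_round):
    """Return (state, stalled) for one validator's round gauges."""
    stalled = desired_round >= STALL_ROUND
    if desired_round == 0:
        state = "healthy"
    elif current_round < desired_round:
        # No quorum of RoundChange messages seen for the desired round.
        state = "no quorum for desired round"
    else:
        # Quorum reached for this round, but no block agreed yet.
        state = "quorum connected, no agreement"
    return state, stalled
```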

Network consensus health metrics

  • consensus_istanbul_blocks_totalsigs: The number of validators whose signatures were included in the child's parent seal. This can be used to determine how many validators are up and contributing to consensus. If this number falls towards two thirds of validator set size, network block production is at risk.

  • consensus_istanbul_blocks_missedrounds: Sum of the round included in the parentAggregatedSeal for the blocks seen. That is, the cumulative number of consensus round changes these blocks needed before a block was agreed. This metric is only incremented when a block is successfully produced after one or more consensus rounds fail, indicating down validators or network issues.

  • consensus_istanbul_blocks_missedroundsasproposer: (since 1.0.2) A meter noting when this validator was elected and could have proposed a block with their signature but did not. In some cases this could be required by the Istanbul BFT protocol.

  • consensus_istanbul_blocks_validators: (since 1.0.2) Total number of validators eligible to sign blocks.

  • consensus_istanbul_core_consensus_count: Count and timer for successful completions of consensus. (Use the quantile tag to find percentiles: 0.5, 0.75, 0.95, 0.99, 0.999.)
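To watch how close the network is to the two-thirds mark, totalsigs can be compared with the validator count. The sketch below rounds two thirds up to a whole number of validators. That rounding is an assumption for illustration and may not match the exact quorum size used by the consensus implementation. The function name is hypothetical.

```python
def signature_margin(total_sigs, validators):
    """Signatures above two thirds of the validator set (rounded up).

    Small or negative values suggest block production is at risk.
    """
    # Integer ceiling of 2 * validators / 3, avoiding floating point.
    two_thirds = (2 * validators + 2) // 3
    return total_sigs - two_thirds
```

Here total_sigs would come from consensus_istanbul_blocks_totalsigs and validators from consensus_istanbul_blocks_validators.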

Management APIs

Celo Blockchain inherits and extends go-ethereum's JavaScript console, exposing management APIs and web3 DApp APIs.

Connect a client using a variant of the attach command line option:

geth attach --datadir DATADIR
geth attach ipc:PATH/TO/geth.ipc
geth attach http://localhost:8545
geth attach ws://localhost:8546

Monitoring Attestation Service

Logging

The Attestation Service provides JSON-format structured logs.

Metrics

It also exposes the following Prometheus format metrics at /metrics:

  • attestation_requests_total: Counter for the number of attestation requests.

  • attestation_requests_already_sent: Counter for the number of attestation requests that were received but dropped because the local database records that they have already been completed.

  • attestationRequestsWrongIssuer: Counter for the number of attestation requests that were received but dropped because they specified the incorrect validator.

  • attestationRequestsWOIncompleteAttestation: Counter for the number of attestation requests that were received but when querying the blockchain no matching incomplete attestation could be found.

  • attestationRequestsValid: Counter for the number of requests received that are for the correct issuer and an incomplete attestation exists.

  • attestationRequestsAttestationErrors: Counter for the number of requests for which producing the attestation failed. This could be because the phone number or salt does not match the hash, or because the attestation was recorded fewer than 4 blocks ago.

  • attestationRequestsUnableToServe: Counter for the number of requests that could not be served because no SMS provider was configured for the phone number in the request.

  • attestationRequestsSentSms: Counter for the number of SMS successfully sent.

  • attestationRequestsFailedToSendSms: Counter for the number of SMS that failed to send.

  • attestationRequestUnexpectedErrors: Counter for the number of unexpected errors.
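As an example of combining these counters, the sketch below computes the share of SMS that failed to send. It assumes the metrics have already been scraped from /metrics into a dictionary keyed by the names listed above. The function name is hypothetical.

```python
def sms_failure_ratio(metrics):
    """Fraction of SMS send attempts that failed, or None if none were made."""
    sent = metrics.get("attestationRequestsSentSms", 0.0)
    failed = metrics.get("attestationRequestsFailedToSendSms", 0.0)
    total = sent + failed
    if total == 0:
        return None  # no attempts recorded yet
    return failed / total
```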

Test Endpoint

Attestation Service provides a test endpoint.

You can run the following command (reference) to test an Attestation Service and send an SMS to yourself:

celocli identity:test-attestation-service --from $CELO_ATTESTATION_SIGNER_ADDRESS --phoneNumber <YOUR-PHONE-NUMBER-E164-FORMAT> --message <YOUR_MESSAGE>

Community Monitoring Tools

Shows current and historic data on validator signatures collected in each block on Mainnet.

Please raise a Pull Request against this page to add/amend details of any community services!