Version 2.5 of the documentation is no longer actively maintained. The site that you are currently viewing is an archived snapshot. For up-to-date documentation, see the latest version.

Metrics Reference

Description of all metrics available for monitoring

Blob-storage implementation metrics

com.engflow.blobstore/ops (no unit)

Fires every time an operation takes place.

Tags
  • operation

BEP Event Storage and Replay

com.engflow.eventstore/build_event_owners (no unit)

The total number of build owners residing in memory. A build owner is an internal piece of state associated with a particular build that is reporting build events.

com.engflow.eventstore/builds_finished (no unit)

The total number of BuildFinished BEP tool events seen.

Tags
  • exit_code: Bazel's exit code
com.engflow.eventstore/inbound_bep_events (no unit)

Fired whenever an event is received on an inbound stream.

Tags
  • type
com.engflow.eventstore/invocation_attempts (no unit)

Fired whenever a new BEP invocation attempt started event is received.

com.engflow.eventstore/new_outbound_streams (no unit)

Fired whenever a new outbound BEP stream is read.

com.engflow.eventstore/ongoing_streams (no unit)

The total number of streams that are inbound, outbound, or both.

com.engflow.eventstore/outbound_bep_events (no unit)

Fired whenever an event is sent on an outbound stream.

Virtual Machine Instances

com.engflow.instance/gc_count (no unit)

The total number of garbage collections during the lifecycle of this process.

Tags
  • gc_type
com.engflow.instance/gc_time (milliseconds)

The total estimated time, in milliseconds, spent performing garbage collection.

Tags
  • gc_type
com.engflow.instance/total_disk_space (bytes)

The size of the volume.

Tags
  • volume
com.engflow.instance/total_system_memory (bytes)

The total amount of system memory in bytes.

com.engflow.instance/used_disk_percentage (percentage)

The percentage of the volume that is currently used.

Tags
  • volume
com.engflow.instance/used_disk_space (bytes)

The total number of bytes used on the volume.

Tags
  • volume
com.engflow.instance/used_system_memory (bytes)

The amount of used system memory in bytes.

com.engflow.instance/used_system_memory_percentage (percentage)

The percentage of system memory used.

Netty monitoring

com.engflow.thirdparty.netty/used_direct_memory (bytes)

Direct (non-heap) memory use.

Tags
  • buffer_name
com.engflow.thirdparty.netty/used_heap_memory (bytes)

Heap memory use.

Tags
  • buffer_name
io.netty.buffer/used_direct_memory (bytes)

Direct (non-heap) memory use.

Tags
  • buffer_name
io.netty.buffer/used_heap_memory (bytes)

Heap memory use.

Tags
  • buffer_name

Action scheduling

com.engflow.re.scheduler/availability_map_size (no unit)

Number of busy executors in all pools.

Details

Only schedulers report this metric. All schedulers report their own values. You should sum up the time series to get the total number of busy executors in the cluster. The result should be equal to the sum of com.engflow.re.exec/used_executors.
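
The cross-check described above can be sketched as follows. This is a minimal illustration, assuming the latest sample of each time series has already been fetched from your monitoring backend into plain dictionaries. The instance names, sample values, and the helper function are hypothetical and are not part of any EngFlow API.

```python
def total_busy_executors(latest_by_instance):
    """Sum the latest sample reported by every instance."""
    return sum(latest_by_instance.values())


# Hypothetical latest samples, keyed by instance name.
# com.engflow.re.scheduler/availability_map_size, one series per scheduler:
scheduler_samples = {"scheduler-1": 7, "scheduler-2": 5}
# com.engflow.re.exec/used_executors, one series per worker:
worker_samples = {"worker-1": 4, "worker-2": 3, "worker-3": 5}

busy_per_schedulers = total_busy_executors(scheduler_samples)
busy_per_workers = total_busy_executors(worker_samples)

# The two totals are expected to match; a lasting gap is worth investigating.
totals_agree = busy_per_schedulers == busy_per_workers
```

In practice the two series are sampled at slightly different times, so a short-lived difference between the totals is not necessarily a problem.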

com.engflow.re.scheduler/available_workers (no unit)

Deprecated; number of idle executors, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
Details

Deprecated. Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. This metric is deprecated, because it may be imprecise: schedulers that are started while workers are busy may report a higher value than they should for several minutes. We recommend monitoring com.engflow.re.scheduler/existing_executors instead.

com.engflow.re.scheduler/existing_executors (no unit)

Number of existing executors, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
Details

Number of existing executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

com.engflow.re.scheduler/existing_schedulers (no unit)

Number of existing schedulers.

Details

Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
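
One way to use the constant "1" for detecting silent schedulers is sketched below. It assumes you keep your own list of schedulers that should be running and that you can fetch the most recent sample per scheduler. The scheduler names and the helper function are hypothetical.

```python
def silent_schedulers(expected, reported):
    """Return expected schedulers that sent no existing_schedulers sample."""
    return sorted(set(expected) - set(reported))


# Hypothetical inventory of schedulers that should be running.
expected = ["scheduler-1", "scheduler-2", "scheduler-3"]
# Latest com.engflow.re.scheduler/existing_schedulers sample per scheduler.
reported = {"scheduler-1": 1, "scheduler-3": 1}

missing = silent_schedulers(expected, reported)
# Summing the series gives the number of schedulers that are reporting.
reporting_count = sum(reported.values())
```

Here the sum of the series is 2 while three schedulers are expected, which points at "scheduler-2".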

com.engflow.re.scheduler/owner_map_size (no unit)

Number of entries in the owner map.

Details

Deprecated. This is rarely useful and measures an internal data structure that is subject to change. Schedulers use the owner map to keep track of actions. The size of this map indicates how many actions are being executed.

com.engflow.re.scheduler/pool_utilization (percentage)

Current executor utilization, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
Details

Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
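
The rule above, including the special case for an empty pool, can be written out as a small function. This is a sketch of the documented formula only, not the scheduler's actual implementation. The function and parameter names are assumptions made for illustration.

```python
def pool_utilization(used, total, waiting_actions):
    """Return utilization as a percentage in the range [0..100].

    used:            number of busy executors in the pool
    total:           number of existing executors in the pool
    waiting_actions: number of actions queued for the pool
    """
    if total == 0:
        # Empty pool: report 100 if work is waiting, otherwise 0.
        return 100.0 if waiting_actions > 0 else 0.0
    return used * 100.0 / total


busy_pool = pool_utilization(used=3, total=4, waiting_actions=0)
empty_pool_with_queue = pool_utilization(used=0, total=0, waiting_actions=5)
empty_pool_idle = pool_utilization(used=0, total=0, waiting_actions=0)
```

The empty-pool case means an autoscaler that scales up at high utilization also reacts to a pool that has been scaled down to zero executors but still receives actions.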

com.engflow.re.scheduler/queue_age (milliseconds)

Min/max age of queued actions, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
  • statistic: "min" (youngest) or "max" (oldest) action in the pool's queue
Details

Reports the minimum and maximum age of queued actions in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports the values for its own queues. Changes in these values indicate a change in the cluster's throughput.

com.engflow.re.scheduler/queue_size (no unit)

Number of waiting actions, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
Details

Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.

Storage implementation metrics

com.engflow.storage.ops/in_flight (no unit)

How many operations have not completed on the pool that initializes operations to the backing storage mechanism.

Tags
  • name: name of the storage service
com.engflow.storage.ops/stream_in_flight (no unit)

How many operations have not completed on the pool that proxies data to and from the backing storage mechanism.

Tags
  • name: name of the storage service
com.engflow.storage.read/time_per_gb (milliseconds)

Time taken per 1 billion bytes (1 GB) to download a file from storage.

Tags
  • name: name of the storage service
  • status: op result
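
To make the unit concrete, the sketch below normalizes a single download to milliseconds per 1 billion bytes. How the server computes the value internally is not documented here, so treat the formula, the function name, and the sample numbers as assumptions.

```python
BYTES_PER_GB = 1_000_000_000  # 1 GB as defined above: 1 billion bytes


def time_per_gb_ms(elapsed_ms, bytes_downloaded):
    """Scale the elapsed download time to a 1 GB transfer."""
    if bytes_downloaded <= 0:
        raise ValueError("bytes_downloaded must be positive")
    return elapsed_ms * BYTES_PER_GB / bytes_downloaded


# Hypothetical download: 500 million bytes in 250 ms.
normalized = time_per_gb_ms(elapsed_ms=250, bytes_downloaded=500_000_000)
```

Normalizing by size makes downloads of different sizes comparable, which a raw duration is not.
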
com.engflow.storage.read/time_to_first_byte (milliseconds)

Time taken between initiating a download and receiving the first byte.

Tags
  • name: name of the storage service
  • status: op result

Action execution

com.engflow.re.exec/completed_actions (no unit)

Number of actions that ran to completion, grouped by exit code.

Tags
  • exit_code: the action's exit code
Details

This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by exit_code=0 and exit_code!=0, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
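
The recommended grouping can be sketched as follows, assuming one reporting interval's samples have been exported as (worker, exit_code, count) tuples. The tuple layout, worker names, and function name are hypothetical.

```python
from collections import defaultdict


def completion_rates(samples):
    """Sum completed actions across workers, split by exit_code == 0 or not."""
    totals = defaultdict(int)
    for _worker, exit_code, count in samples:
        group = "successful" if exit_code == 0 else "unsuccessful"
        totals[group] += count
    return dict(totals)


# Hypothetical samples from two workers for one reporting interval.
samples = [
    ("worker-1", 0, 40),
    ("worker-1", 1, 2),
    ("worker-2", 0, 35),
    ("worker-2", 137, 1),
]
rates = completion_rates(samples)
```

Because each measurement already covers only the time since the previous report, the sums are per-interval rates and need no further differencing.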

com.engflow.re.exec/executors_existing (no unit)

Total number of executors on this worker, in all pools combined.

Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.

com.engflow.re.exec/used_executors (no unit)

Number of busy executors, in all pools.

Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.

Hazelcast monitoring

com.engflow.re.hazelcast/is_master (no unit)

Whether a machine is a cluster master; if this sums up to more than one (with the same name), then the cluster is unhealthy.

Tags
  • name: name of the Hazelcast cluster.
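
A health check based on this rule might look like the sketch below. It assumes the latest is_master sample of every machine is available as a (cluster name, value) pair with a value of 0 or 1. The cluster names and the function are hypothetical.

```python
from collections import defaultdict


def clusters_with_multiple_masters(samples):
    """Return names of clusters whose is_master samples sum to more than one."""
    masters = defaultdict(int)
    for cluster_name, is_master in samples:
        masters[cluster_name] += is_master
    return sorted(name for name, count in masters.items() if count > 1)


# Hypothetical latest samples, one per machine.
samples = [
    ("cluster-a", 1), ("cluster-a", 0), ("cluster-a", 0),
    ("cluster-b", 1), ("cluster-b", 1), ("cluster-b", 0),
]
unhealthy = clusters_with_multiple_masters(samples)
```

Grouping by the name tag before summing matters: two masters in two different Hazelcast clusters are normal, two masters in the same cluster are not.
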
com.engflow.re.hazelcast/member_count (no unit)

The number of members in the cluster; only the master reports this value.

Tags
  • name: name of the Hazelcast cluster.
com.engflow.re.hazelcast/op_time (milliseconds)

Distribution of operation time.

Tags
  • name: name of the distributed hash map
  • status: op result
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.thirdparty.hazelcast/partition_migration_time (milliseconds)

Time of Hazelcast partition migrations, per Hazelcast cluster.

Tags
  • name: name of the Hazelcast cluster.
Details

Reports the time of Hazelcast partition migrations.

Uncaught exceptions

com.engflow.re/uncaught_exceptions (no unit)

Fires every time there is an uncaught exception.

CAS server metrics

com.engflow.re.cas/missing_digests (no unit)

The total number of missing digests seen by findMissingBlobs.

com.engflow.re.cas/requested_digests (no unit)

The total number of digests requested by a findMissingBlobs call.

Invocation index monitoring

com.engflow.resultstore.index/sql_invocation_index_database_queue_size (no unit)

All enqueued or in-progress invocation index database operations.

Details

Reflects the number of incomplete operations (either queued or being worked on).

Every instance reports this metric. Every instance reports its own stats.

CAS usage

com.engflow.re.cas/available_replica_space (bytes)

Available storage space in the CAS that can be used for replicas.

Details

Only workers report this metric. All workers report their own values.

com.engflow.re.cas/available_space (bytes)

Available storage space in the CAS.

Details

Only workers report this metric. All workers report their own values.

com.engflow.re.cas/free_time (milliseconds)

Distribution of time needed to free space in the CAS.

Details

This is a distribution. It refers to the deletion of expired replicas.

Only workers report this metric. All workers report their own values.

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.cas/gc_time (milliseconds)

Distribution of time needed for the GC.

Details

This is a distribution. It refers to the collection of expired replicas.

Only workers report this metric. All workers report their own values.

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.cas/lost_files_count (no unit)

The number of files that were lost from the CAS.

Details

The number of files that were deleted by some other process, or that the CAS instance detected as no longer matching the expected digest.

Only workers report this metric. All workers report their own values.

com.engflow.re.cas/max_total_replica_size (bytes)

The max total replica size.

Details

This is the maximum amount of storage space the CAS is allowed to use for replicas.

Only workers report this metric. All workers report their own values.

com.engflow.re.cas/max_total_size (bytes)

The max total CAS size on the node.

Details

This is the maximum amount of storage space the CAS is allowed to use.

Only workers report this metric. All workers report their own values.

Client authorization

com.engflow.re.auth.async/call_count (no unit)

Number of calls made.

Details

Deprecated. Though it may seem so, this metric doesn't actually track client connection attempts accurately.

Use com.engflow.re.auth.async/duration aggregated by count instead.

com.engflow.re.auth.async/duration (milliseconds)

Authentication call duration.

Details

This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

External storage use

com.engflow.re.storage.existence_cache/evictions (no unit)

Evictions from the ExternalStorage CAS existence cache.

com.engflow.re.storage.existence_cache/hits (no unit)

Hits on the ExternalStorage CAS existence cache.

com.engflow.re.storage.existence_cache/misses (no unit)

Misses on the ExternalStorage CAS existence cache.

com.engflow.re.storage/gc_check (no unit)

GC status updates.

Tags
  • result: result of the gc check, e.g. "changed", "unchanged", "failed"
Details

Logged when GC status is recomputed.

com.engflow.re.storage/gc_deleted_objects (no unit)

Count of objects deleted by GC.

Details

Logged when GC deletes an object.

com.engflow.re.storage/ops (no unit)

All completed external storage operations.

Tags
  • operation: the type of operation, e.g. "cas_check", "ac_upload"
  • result: the result of the operation, e.g. "successful", "interrupted"
Details

This metric reflects the rate of change. Each measurement indicates how many operations completed on this instance since the last time this metric was reported.

Every instance reports this metric. Every instance reports its own stats.

com.engflow.re.storage/ops_queue_size (no unit)

All enqueued or in-progress external storage operations.

Tags
  • operation: the type of operation, e.g. "cas_check", "ac_upload"
Details

Reflects the number of incomplete operations (either queued or being worked on).

Every instance reports this metric. Every instance reports its own stats.

com.engflow.re.storage/traffic (bytes)

All external storage traffic.

Tags
  • operation
Details

This metric may be imprecise; the source of truth is the set of metrics published by the storage backend itself.

Every instance reports this metric. Every instance reports its own stats.

Docker use

com.engflow.re.exec.docker/container_shutdown_time (milliseconds)

The time needed to shut down a docker container.

Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/container_startup_time (milliseconds)

The time needed to start a docker container.

Tags
  • status: result of the operation, e.g. "OK", "FAILED"
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/containers_failed (no unit)

The number of docker containers that failed.

com.engflow.re.exec.docker/image_pull_time (milliseconds)

The time needed to pull a docker image.

Tags
  • status: result of the operation, e.g. "OK", "FAILED"
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/network_create_time (milliseconds)

The time needed to create a docker network.

Tags
  • status: result of the operation, e.g. "OK", "FAILED"
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/network_destroy_time (milliseconds)

The time needed to destroy a docker network.

Tags
  • status: result of the operation, e.g. "OK", "FAILED"
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

Persistent worker use

com.engflow.re.exec.worker/actions (no unit)

The number of persistent worker actions run.

Tags
  • reuse_status: `new` or `reused`
Details

The number of persistent worker actions run, aggregated by whether or not they reused a previous persistent worker process.

Scheduler metrics

com.engflow.re.ac.distributed/entries (no unit)

The number of action cache entries on a specific scheduler instance.

com.engflow.re.ac.distributed/memory_used (bytes)

The amount of memory used for the action cache, in bytes.

com.engflow.re.cas/entries_evicted (no unit)

The number of CAS entries that were evicted due to memory size limitations.

com.engflow.re.cas/entries_lost (no unit)

The number of CAS entries that could not be recovered on CAS node shutdown events.

com.engflow.re.profiler/events (no unit)

The number of server-side profile events recorded.

com.engflow.re.profiler/live_handles (no unit)

The number of profiles being streamed to the eventstore.

com.engflow.re.scheduler/build_id (no unit)

The number of distinct build ids for which the service received at least one action.

com.engflow.re/remaining_license_time (days)

The number of remaining days before the license expires.

Java memory metrics

com.engflow.re/java_heap (bytes)

The amount of heap memory used.

Details

Every instance reports this metric. Every instance reports its own stats.

DB Connection Pool usage

com.engflow.resultstore.index/db_cp_active_connections (no unit)

The number of active connections in the pool.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_acquire_time (ns)

The time it takes for the connection pool to acquire a DB connection.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_create_time (milliseconds)

The time it takes for the connection pool to create a new DB connection.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_timeout_count (no unit)

The count of timed-out connections.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_usage_time (milliseconds)

The duration of a use of a connection given by the connection pool.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_idle_connections (no unit)

The number of idle connections in the pool.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_max_connections (no unit)

Maximum number of connections existing in the pool.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_min_connections (no unit)

Minimum number of connections existing in the pool.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_pending_connections (no unit)

The number of pending connections in the pool.

Tags
  • db_connection_pool_name
com.engflow.resultstore.index/db_cp_total_connections (no unit)

The number of all currently existing connections in the pool.

Tags
  • db_connection_pool_name

DB Query stats

com.engflow.resultstore.index/duration (milliseconds)

The duration of a query.

Tags
  • query_name
  • query_outcome
com.engflow.resultstore.index/preparation (milliseconds)

The duration of creating a preparedQuery.

Tags
  • query_name
2022-04-28