Version 2.5 of the documentation is no longer actively maintained. The site that you are currently viewing is an archived snapshot. For up-to-date documentation, see the latest version.
This document gives a high-level overview of the EngFlow Remote Execution software, details technical requirements, and provides a brief summary of deployment options.
An EngFlow Remote Execution cluster consists of two types of instances: schedulers and workers. The scheduler instances are the “brain” of the cluster: they coordinate with each other to distribute the incoming work and perform automated routine maintenance operations such as recovering from an instance loss. All client calls first go to a scheduler, which in turn calls other schedulers and workers to fulfill each operation.
The worker instances are the storage and execution units of the cluster. They store all uploaded and generated files persistently and are also responsible for executing actions. Typically, the majority of instances in a cluster are workers.
Clients (user and CI machines) connect to the service through a Load Balancer, which proxies requests to Schedulers.
Schedulers:
Workers:
By default, the EngFlow Remote Execution cluster uses the worker’s disks for persistent storage, automatically replicating files to multiple disks to ensure high availability. When a worker instance fails, the files it contains are automatically re-replicated from the remaining instances.
Configuring storage requires planning and monitoring to maintain disk space availability in production, because the system cannot increase or decrease disk space autonomously.
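The re-replication behavior described above can be pictured with a small sketch. This is only an illustration, not EngFlow's implementation: the function name, the placement map, and the replication factor of 2 are all assumptions made for the example.

```python
from typing import Dict, Set


def re_replicate(
    placement: Dict[str, Set[str]],
    workers: Set[str],
    failed: str,
    factor: int = 2,  # assumed replication factor, purely illustrative
) -> Dict[str, Set[str]]:
    """Return a new file-to-workers placement after `failed` is lost."""
    alive = workers - {failed}
    result: Dict[str, Set[str]] = {}
    for digest, holders in placement.items():
        remaining = holders - {failed}
        if not remaining:
            # Every copy lived on the failed worker; nothing to copy from.
            raise LookupError(f"all replicas of {digest} were lost")
        # Pick new holders among surviving workers that lack the file.
        # Sorting only makes the choice deterministic for this sketch.
        candidates = sorted(alive - remaining)
        needed = max(0, factor - len(remaining))
        result[digest] = remaining | set(candidates[:needed])
    return result


placement = {"file-a": {"w1", "w2"}, "file-b": {"w2", "w3"}}
new_placement = re_replicate(placement, {"w1", "w2", "w3"}, failed="w2")
```

In this toy run, both files end up on the two surviving workers. A real cluster would also have to account for free disk space on each target, which is why capacity planning matters.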
Alternatively, the cluster can be connected to an external storage service. In this setup, all incoming and generated files are written through to the storage service, and files are fetched from the storage service as needed for action execution. For performance, worker instances cache and reuse files locally as much as possible.
Note that the worker instances still require significant disk space to cache files and to temporarily store input and output trees of executed actions.
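As a rough mental model of the external-storage setup, a worker behaves like a write-through cache in front of the storage service. The sketch below assumes a plain dictionary standing in for the storage service; the class name `WriteThroughStore` and its methods are hypothetical and not part of any EngFlow API.

```python
class WriteThroughStore:
    """Toy model of a worker-local cache in front of external storage."""

    def __init__(self, backend: dict):
        self.backend = backend    # stands in for the external storage service
        self.local = {}           # the worker's local disk cache
        self.backend_fetches = 0  # counts reads that missed the local cache

    def put(self, digest: str, data: bytes) -> None:
        # Write-through: the file reaches external storage immediately
        # and is also kept locally for reuse.
        self.backend[digest] = data
        self.local[digest] = data

    def get(self, digest: str) -> bytes:
        if digest in self.local:
            return self.local[digest]
        # Local miss: fetch from external storage and cache the result.
        data = self.backend[digest]
        self.backend_fetches += 1
        self.local[digest] = data
        return data


shared_backend = {}
worker_a = WriteThroughStore(shared_backend)
worker_b = WriteThroughStore(shared_backend)
worker_a.put("digest-1", b"object file")
first = worker_b.get("digest-1")   # fetched from the backend
second = worker_b.get("digest-1")  # served from worker_b's local cache
```

The sketch omits cache eviction, which a real worker needs because its local disk is finite.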
Each worker instance can execute a certain number of actions concurrently. To that end, each worker provides a number of executors. Each executor can execute one action at a time, and has a number of properties such as the operating system it runs on, CPU and RAM provisioned, as well as the availability of additional hardware or software resources. Schedulers consider executors interchangeable if they have identical properties.
Each incoming action has a set of requirements, such as the OS, CPU, RAM, and availability of other resources. The cluster can schedule an action to run on any executor that provides at least the required properties.
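The matching rule can be sketched as follows. This is a simplified illustration under stated assumptions: the property set (OS, CPU count, RAM, extra resources) and all class and function names are hypothetical, and the actual scheduler may consider more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Executor:
    os: str
    cpus: int
    ram_gb: int
    extras: frozenset = frozenset()  # e.g. additional hardware or software


@dataclass(frozen=True)
class ActionRequirements:
    os: str
    cpus: int = 1
    ram_gb: int = 1
    extras: frozenset = frozenset()


def can_run(executor: Executor, action: ActionRequirements) -> bool:
    """True if the executor provides at least what the action requires."""
    return (
        executor.os == action.os
        and executor.cpus >= action.cpus
        and executor.ram_gb >= action.ram_gb
        and action.extras <= executor.extras  # subset check
    )


def eligible_executors(executors, action):
    return [e for e in executors if can_run(e, action)]


small = Executor(os="linux", cpus=1, ram_gb=2)
large = Executor(os="linux", cpus=4, ram_gb=8, extras=frozenset({"gpu"}))
test_action = ActionRequirements(os="linux", cpus=2, ram_gb=4)
matches = eligible_executors([small, large], test_action)
```

Here only `large` qualifies for `test_action`. Because the dataclasses are frozen and compared by value, two executors with identical properties compare equal, which mirrors the notion of interchangeable executors.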
In order to provide predictable behavior and consistent performance, worker instances enforce a configurable level of action isolation. Fully isolated actions are allowed to access exactly the resources they are allocated, but nothing else.
Of particular interest is CPU isolation: actions are typically limited to a certain number of logical CPUs (CPU cores). Many compilers and tools are single-threaded and restricting them to a single core is fine. However, tests, especially integration tests, are often multi-threaded, and may run slowly when restricted to a single logical CPU.
You have to configure the required number of CPUs on the client side. When configuring CPU counts, you need to balance cluster utilization vs. build latency. Running with fewer CPUs improves utilization, while running multi-threaded actions with more CPUs reduces latency.
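To make the trade-off concrete, here is a back-of-the-envelope sketch. It assumes a worker's CPUs are split evenly among executors and uses Amdahl's law as a stand-in model for multi-threaded speedup; both are simplifications chosen for illustration, and the numbers are made up.

```python
def executors_per_worker(worker_cpus: int, cpus_per_action: int) -> int:
    """How many actions fit on one worker at the same time."""
    return worker_cpus // cpus_per_action


def estimated_runtime(single_core_seconds: float,
                      parallel_fraction: float,
                      cpus: int) -> float:
    """Amdahl's law: only the parallel part of the work speeds up."""
    serial = 1.0 - parallel_fraction
    return single_core_seconds * (serial + parallel_fraction / cpus)


# A hypothetical 16-CPU worker and a test that is 75% parallelizable
# and takes 60 seconds on a single core.
slots_1cpu = executors_per_worker(16, 1)
slots_4cpu = executors_per_worker(16, 4)
runtime_1cpu = estimated_runtime(60.0, 0.75, 1)
runtime_4cpu = estimated_runtime(60.0, 0.75, 4)
```

Under these assumptions, giving the test four CPUs cuts its runtime to well under half, but the worker can run only a quarter as many actions concurrently.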
The EngFlow Remote Execution Service implements the open-source Remote Execution API version 2.0.0. Any client that faithfully implements the same version should work. We have successfully tested these clients:
Bazel versions 3.7.{0,1,2} have an issue in the bundled Java builder, which sometimes suppresses Java compiler errors when used with remote execution (see the upstream bug).
Bazel versions before 4.0.0 have an issue with dynamic linking when used with remote execution on macOS. Bazel 4.0.0 has a flag --incompatible_macos_set_install_name to fix the issue. As of 2021-02-04, this flag is not enabled by default yet (see the upstream bug).
As of 2020-07-13, Bazel does not support cross-platform remote execution: the host OS (what Bazel runs on) must be the same type of OS as the execution OS (what remote executors run on). For example, you cannot run the remote executors on Linux and Bazel on Windows; both have to run Linux. Note that there’s no restriction on the target OS or target architecture (what the compiled binaries will run on), so you can cross-compile to any platform as long as the compiler can run on the execution OS.
Update: as of 2021-02-04, we have successfully run builds on macOS, with remote execution running on Linux.
To deploy the EngFlow Remote Execution Service, you need to provide machines, virtual machines, or a Kubernetes cluster. The minimum hardware and software requirements per instance are:
General
Worker
Scheduler
Note:
The schedulers do not have to run on the same OS as the workers.
If you have multiple schedulers, you also need to provide a gRPC-compatible load balancer that distributes the incoming calls to the schedulers.
As of 2020-08-05, EngFlow RE does not support distributed clusters where there is significant network latency between one part of the cluster and the rest, e.g., an on-prem worker pool with dedicated hardware connected to a main cluster hosted in the cloud.
You must provide a reliable high-performance network connection between all the instances in a cluster, 1 Gigabit Ethernet equivalent or better. We recommend setting up a private network for each cluster such that a) only access from trusted networks is allowed, and b) only the public scheduler ports are reachable from outside the cluster. If your cluster is reachable from the public internet, you must provide an appropriately configured firewall, and also configure server and client authentication.
Ports on worker nodes must not be reachable from the public internet. Either the cluster must run on a private network, or an appropriately configured firewall must be used.
Only the public port on scheduler nodes may be reachable from the public internet. If a scheduler’s public port is reachable, then both server- and client-authentication must be configured such that it is impossible for an external party to impersonate either server (man-in-the-middle attacks) or client (no access to unauthenticated clients).
The only supported mechanism for server authentication is TLS (see Authentication). The included demo certificates should be considered public information, and must not be used for publicly accessible schedulers.
Supported mechanisms for client authentication include TLS client certificates as well as GCP authentication tokens (see Authentication).
Depending on configuration, actions can access and modify critical local resources as well as access any secrets stored on worker instances. Such clusters must only run trusted code.
EngFlow GmbH cannot be held liable for source code leaks, loss of data, or other issues arising from an improper network configuration.
The EngFlow Remote Execution software uses a monitoring middleware that supports integrating with a variety of monitoring backends and services. Out of the box, we support the following monitoring services:
The deployment process differs based on what operating system you deploy on, and whether you deploy to a set of bare-metal machines, virtual machines, or an existing Kubernetes cluster.
This section provides a brief summary of the pros and cons of the different deployment types.
Setup instructions? See our guides.
Virtual machine deployments are easier to set up and maintain than bare-metal, with both providing high performance and strong action isolation.
On Linux, we recommend using Docker for action isolation. This means that actions are run in individual Docker containers which restrict CPU and memory usage, as well as access to the underlying machine. Docker prevents actions from interfering with each other and with the underlying system, providing predictable performance and high reliability.
Docker action isolation requires client configuration. We recommend the open-source Bazel Toolchains Rules for configuring Bazel. Since the execution is controlled through client configuration, the action execution environment can be changed or updated without having to reconfigure the remote execution cluster.
Finally, these clusters can dynamically reconfigure their executor configuration according to the incoming load.
Setup on Kubernetes? See our guide.
Kubernetes deployments are about as easy to set up as VM deployments, but can require higher maintenance due to their lower level of action isolation, which can cause increased build latency or unexpected and difficult-to-debug build failures.
The reason for the lower level of action isolation is that remote execution shares certain traits with Kubernetes which are difficult to replicate inside of Kubernetes.
Kubernetes runs individual services inside of Docker containers to isolate services from each other while allowing multiple services to run on the same underlying machine. Remote execution runs individual actions on the same underlying machine, and attempts to isolate these actions from each other.
However, processes running in Kubernetes only have limited options for isolating their subprocesses, since Kubernetes restricts access to the relevant kernel APIs. For example, it is not generally possible to start a Docker container from inside Kubernetes, which itself uses Docker to isolate services.
At the same time, Kubernetes is designed for fewer long-lived processes rather than many short-lived ones. This makes it impractical to use dedicated Kubernetes Pods for individual actions, because actions often only take tens to hundreds of milliseconds to run.
Therefore, in a Kubernetes deployment, we run actions inside the same Docker container that the worker instance itself runs in, with only limited isolation from each other and from the worker instance itself.
That said, a Kubernetes deployment is still a viable option with some restrictions:
Changing the executor environment requires rebuilding the Docker images and restarting an existing set of workers, or starting an additional set of workers using the new images. To minimize the need for rebuilding Docker images and restarting clusters, we recommend avoiding dependencies on pre-installed tools and compilers, and instead using checked-in ones (either source or binary).
Dynamic reconfiguration of executors requires larger nodes, which increases the risk of action conflicts. We recommend using static executor configurations with a very small set of different executor types, ideally only one or two. For example, use a large set of single-core executors for all build actions, and a small set of quad-core executors for multi-threaded tests.
Limited action isolation facilities increase the risk of action conflicts. We recommend using smaller nodes, ideally single-core nodes, to maintain as much isolation as possible.
Bare-Metal Deployments
Pros:
Cons:
Virtual Machine Deployments
Pros:
Cons:
Kubernetes Deployments
Pros:
Cons: