About the GovStack execution environment
Background information
The "Govstack execution environment" is an environment for running a Govstack instance, a set of building block applications connected using the "information mediator BB" (X-Road).
For the purposes of this discussion, we assume that the instance is self-contained; that is, everything it needs to function runs within the sandbox environment. This restriction seems necessary to provide a "safe environment to play with (and even break)". It also seems logical for building blocks like messaging or payment -- we do not actually want to send unsolicited SMS messages or spam email, or connect to banking infrastructure -- those functions should be mocked in the sandbox.
We have bet on Kubernetes to provide the abstraction for the execution environment. The main reason for this is the requirement for portability -- it should be sufficiently easy to reproduce the sandbox somewhere else (another public cloud, or even on-premises servers). We also need an orchestration abstraction that scales beyond one node (e.g. Docker Compose is limited to a single machine), since the environment needs to host GovStack instances with several building blocks and to run several different GovStack instances simultaneously.
The "execution environment" at the core should be the most portable artifact -- meaning that deploying building blocks should use only suitable Kubernetes abstractions (OCI container images, Kubernetes manifests, and Helm charts, which we publish in Github). One Govstack instance (set of building blocks, configuration, and the necessary mock data) in the sandbox should be treated as ephemeral -- it will be deployed, it will run some time (hours, days, but not months), and it will be thrown away. Also, there can be multiple separate Govstack instances deployed at the same time.
The Kubernetes cluster does not materialize out of thin air, so we use suitable provisioning tools (Terraform) for creating and configuring the infrastructure around the execution environment. This infrastructure can be considered long-running, and it can be more dependent on the cloud provider we have selected (AWS).
Within AWS, we use the managed Kubernetes service (EKS) to provide the cluster for the execution environment. But EKS itself offers options: we can go "as managed as possible" with EKS+Fargate, or use EKS with (managed) node groups (or some combination thereof). Neither option is perfect, and each comes with challenges, e.g.:
The workloads (building block applications) are going to be stateful applications, requiring databases and file storage. Fargate nodes do not support EBS volumes, only EFS, which is a network file system. This means that running databases within the cluster becomes problematic, since network file system semantics (and properties like latency) are not really suitable for database storage. The same may be true for services requiring file storage with semantics a network file system does not provide at a sufficient level (atomic operations, durability guarantees, locks, etc.).
One possible workaround for the database issue is to use the ACK RDS controller, which makes it possible to provision RDS databases from Kubernetes (see the sketch after this list). If a workload requires a database engine that RDS does not support, we need to provide some other workaround.
Applications likely have user interfaces. While it is possible to expose these somehow (with obvious challenges, such as how to handle authentication), we have worked on the assumption that applications are headless and can be operated and configured through APIs.
Kubernetes in general (and Fargate specifically) more or less only supports Linux-based workloads. Windows containers exist, but they also come with restrictions (e.g. no EFS volume support).
On EKS Fargate, the EFS driver does not (yet) support dynamic volume provisioning. This means that creating (or removing) the EFS volumes (access points) cannot easily be done from Kubernetes, complicating deployment.
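To illustrate the ACK RDS workaround mentioned above: a database can be declared as a Kubernetes resource, and the controller provisions the matching RDS instance. This is a minimal sketch only -- the resource names and the referenced credentials Secret are hypothetical, and the exact spec fields should be checked against the ACK RDS controller version in use.

```yaml
# Hypothetical DBInstance for a building block database, provisioned via the
# ACK RDS controller instead of running the database inside the cluster.
apiVersion: rds.services.k8s.aws/v1alpha1
kind: DBInstance
metadata:
  name: registration-bb-db                 # illustrative name
spec:
  dbInstanceIdentifier: registration-bb-db
  dbInstanceClass: db.t3.micro
  engine: postgres
  allocatedStorage: 20
  masterUsername: registration
  masterUserPassword:                      # reference to a pre-created Secret
    namespace: govstack-demo
    name: registration-bb-db-credentials
    key: password
```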
Our current tradeoff is to use AWS EKS with managed node groups -- Fargate has too many restrictions that would likely require modifying the existing workloads, and it would make the execution environment more AWS-specific. Auto-scaling the node groups is not yet solved, but solutions for that exist (e.g. Karpenter).
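As one possible direction for auto-scaling, the sketch below shows roughly what a Karpenter NodePool could look like. It assumes the Karpenter v1beta1 API and a separately defined EC2NodeClass named govstack-default; the names and limits are illustrative, and field names may differ between Karpenter versions.

```yaml
# Hypothetical Karpenter NodePool letting the cluster add and remove worker
# capacity on demand, instead of resizing managed node groups by hand.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: govstack-default            # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        name: govstack-default      # EC2NodeClass defined elsewhere
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "64"                       # cap on total provisioned CPU
  disruption:
    consolidationPolicy: WhenUnderutilized
```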
The current choice of execution environment also creates some constraints on the building blocks that can be deployed into it. For example (a hypothetical configuration sketch illustrating these points follows the list):
Is the building block already available as one or more OCI (Docker) container images?
Is it designed to operate in a cloud environment (or, specifically, in a Kubernetes cluster)?
Is it packaged as a Helm chart for easy deployment?
The Helm chart should be modular and allow selecting a minimal core setup with essential functionality only.
Does the building block require databases?
Relatedly, does the building block require a caching layer (memcached, Redis, ...)?
If it does, does the cache expect durable file-backed storage?
What are the compute resource requirements (memory, CPU, ...)?
Are there requirements for persistent file storage?
Are there other special requirements (e.g. the building block itself is deployed as a Kubernetes cluster, or needs access to special hardware such as GPUs)?
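For a building block packaged as a Helm chart, the checklist above could translate into a values fragment along these lines. All key names here are assumptions for illustration, not part of any published GovStack chart:

```yaml
# Hypothetical Helm values fragment answering the questions above for one building block.
coreOnly: true              # deploy only the minimal core components
database:
  external: true            # use an externally provisioned database (e.g. RDS via ACK)
cache:
  enabled: true             # Redis/memcached-style caching layer
  persistence: false        # in-memory only, no durable cache storage required
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
persistence:
  enabled: true             # persistent file storage
  storageClass: efs-sc      # network-file-system-backed storage class
  size: 10Gi
```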