X-Road central server cloud-native deployment
Overview of issues Nortal is facing and progress that has been made.
Making the X-Road central server deployable in a cloud environment has proven more difficult than initially expected. We anticipated some difficulties and made provisions for them during planning; however, the scope of the problems quickly expanded beyond our estimates.
While the security servers had adequate containerization, the central server did not. Instead of being divided into separately deployable components, it was distributed as a set of Debian packages installed on a single system. All modules therefore ran within a single operating system and relied on localhost to exchange data.
We realized that the available, previously used central server deployments are based on a development image from NIIS, which is not production-ready and, at best, amounts to little more than using a container as a makeshift virtual machine. That makes it suitable for development, testing, and demos, but not a viable option for a complete, operational solution.
Currently, a new deployment of the X-Road central server requires manual work by a dedicated engineer familiar with the software. This should not be confused with deploying security servers, a mix-up we encountered often while researching solutions: several cloud deployment options exist for security servers, and the process of setting one up is far better understood and documented.
The central server's documentation, particularly regarding its .deb packages and their construction, is sparse compared to that of the security servers. This is mainly because security servers are regularly deployed by NIIS members running active X-Road instances (in countries like Estonia, Finland, and Iceland), while the central server, once deployed, is rarely, if ever, redeployed, even for updates.
While modernizing the central server deployment, we discovered multiple circular dependencies, meaning some components cannot start without each other. In several places, the central server's internal services are addressed with hard-coded references to a local environment (127.0.0.1 and a fixed port).
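To illustrate the pattern (and the workaround direction) without quoting the actual X-Road code, the sketch below keeps the loopback default but lets an environment variable point a component at a Kubernetes service name instead. All names in it are hypothetical.

```java
// Hypothetical sketch: resolve a peer component's address from the
// environment, falling back to the hard-coded loopback default that a
// single-machine install assumes. Variable names are illustrative.
public final class ServiceAddress {

    private static final String DEFAULT_HOST = "127.0.0.1"; // the single-system assumption
    private static final int DEFAULT_PORT = 4000;           // illustrative port

    /** Prefers env overrides so each pod can target a Kubernetes
     *  service name instead of localhost. */
    public static String resolve() {
        String host = System.getenv().getOrDefault("XROAD_PEER_HOST", DEFAULT_HOST);
        String port = System.getenv().getOrDefault("XROAD_PEER_PORT",
                String.valueOf(DEFAULT_PORT));
        return host + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println("Connecting to " + resolve());
    }
}
```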
Communication between the central server core and the signer currently runs through the Akka toolkit, whose integration relies on custom-built C code that uses the shared memory and inode information of the Debian machine it runs on to generate and exchange passwords.
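Such a handshake cannot work across containers, which share neither memory nor inodes. Below is a minimal sketch of the direction a containerized setup can take instead, assuming the secret is injected through a mounted file (for example, a Kubernetes Secret volume); the path and names are illustrative, not X-Road's.

```java
// Hypothetical sketch: every pod reads the same inter-component secret
// from a mounted file instead of deriving it from host-local shared
// memory and inode data. Path and env variable name are illustrative.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class SharedSecret {

    private static final Path SECRET_FILE = Path.of(
            System.getenv().getOrDefault("XROAD_COMPONENT_SECRET_FILE",
                    "/run/secrets/xroad-component-secret"));

    /** Loads the secret that replaces the host-bound handshake;
     *  every container mounting the same Secret gets the same value. */
    public static String load() throws IOException {
        return Files.readString(SECRET_FILE).trim();
    }
}
```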
Identity and Access Management (IAM) was implemented using Linux's own Pluggable Authentication Modules (PAM), which cannot easily be exposed or shared between components running in different containers.
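The sketch below shows why PAM is so firmly host-bound: the check runs against the container's own PAM configuration and user database, so credentials valid in one container mean nothing in another. It assumes the libpam4j library and an illustrative "xroad" PAM service name, not X-Road's actual wiring.

```java
// Minimal PAM check via libpam4j. The lookup consults /etc/pam.d/<service>
// and the local user database of whichever container it runs in, which is
// exactly what makes it hard to share between containers.
import org.jvnet.libpam.PAM;
import org.jvnet.libpam.PAMException;
import org.jvnet.libpam.UnixUser;

public final class PamLogin {

    public static boolean authenticate(String user, String password) {
        try {
            UnixUser u = new PAM("xroad").authenticate(user, password);
            // Group membership (typically used for role mapping) is
            // likewise local to this container's OS.
            System.out.println(user + " groups: " + u.getGroups());
            return true;
        } catch (PAMException e) {
            return false;
        }
    }
}
```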
All of this made it exceedingly clear that the central server's design included no provisions for a cloud-native deployment and assumed a single-system setup.
This left us with two options: fork NIIS's code and create our own version of X-Road with the problems fixed or (where safe) bypassed, or create a deployment that takes minimal liberties with the original code base and works around the issues. From a time and effort perspective, the options are comparable.
However, since designing yet another fork of X-Road (akin to UXP, for example) seemed more problematic in the long run, we opted for the second option: working with the original code as much as possible.
By now, we have made substantial progress. The central server has been successfully decomposed into pods (exposed via internal services) running containers. The components, derived from the official Debian packages, are deployed into a cloud environment through automation, which lets us either use pre-built packages from the NIIS Artifactory (or another artifact repository) or build the artifacts ourselves using NIIS's own, unaltered build scripts.
We have resolved or circumvented most of the issues arising from the central server's current design, including the hard-coded IP addresses and the shared memory usage. The central server components report in and connect to each other within a local (k3d) Kubernetes cluster. The system generates the GlobalConfig and is able to supply it to prospective security servers. We are about ready to start interfacing security servers with the deployment and to move it all from the local cluster into AWS.
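As a quick end-to-end check of that last point, the sketch below fetches the configuration over plain HTTP the way a security server would. It assumes the standard X-Road /internalconf download path and a hypothetical in-cluster service name; it is a smoke test, not part of the deployment itself.

```java
// Smoke test: download the signed global configuration directory from the
// central server. "central-server" stands in for the Kubernetes service
// name in our local cluster and is an assumption, as is the default port 80.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class GlobalConfCheck {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://central-server/internalconf"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // HTTP 200 plus a multipart configuration directory means security
        // servers should be able to download and verify the configuration.
        System.out.println("HTTP " + response.statusCode());
    }
}
```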
However, the process has been anything but smooth. To understand what a deployment really needs and how the components connect, we relied heavily on analyzing the maintainer scripts Debian runs when installing the .deb packages. Oftentimes, we've had to fall back on the source code of the X-Road central server itself to understand why something isn't behaving the way it should. The deployment is rather opaque, with problems presenting themselves only after previous ones have been solved. That makes the effort involved hard to estimate, a problem we are still facing and likely will keep facing until everything runs end-to-end.
In several cases, we've had to experiment with various solutions to avoid modifying the original X-Road code. Even then, a few small (and backward-compatible) modifications remain absolutely crucial, and we need to get them accepted into the NIIS X-Road development upstream to avoid having to fork.