The Transporter

Dec 23, 2017

This is a story about software architecture, a personal itch, and scalability. And like any good tech story, it begins with a shaky architecture.

Our journey began with humble Bash: a plethora of scripts maneuvering Virtual Machines (VMs). Each VM was dedicated to a specific organization we were evaluating, running successive batch tasks that emulated the reconnaissance phase of a cyber-attacker's lifecycle. To achieve parallelism across companies, additional VMs were commissioned, which led to our in-house orchestration mechanism: a marriage of Cron and Bash. Although this setup was functional, it was far from perfect. Parallelism was limited to the company level, the process lacked transparency, server utilization remained disappointingly low, and manual triggering was still a necessity.

This was when the 'Transporter' made its entrance. The Transporter, a dynamic workflow engine, formulates workflows and executes them as Kubernetes Jobs. Its container-based architecture offers the flexibility of independent job configuration and the efficiency required for scaling. It maximizes parallelism within the bounds of workflow dependencies, and its fully automated pipeline, driven by a REST API, heralds a new chapter in our journey.

Like its namesake, the Transporter adheres to a few fundamental rules. The first, "The Deal is the Deal," entrusts workflow definition to you, while execution is the Transporter's responsibility. A 'job' in this context is equivalent to running a Docker container, and a group of jobs forms a phase. Phases can be sequential or parallel, and a workflow is a sequence of phases. This architecture lets us enjoy parallelism within a defined set of rules.
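Under these rules, a workflow can be sketched as plain data: a list of phases, each holding jobs that may run in parallel. The phase, job, and image names below are hypothetical, not the actual workflow:

```python
# Hypothetical workflow definition: each phase lists jobs (Docker images)
# that may run in parallel; the phases themselves run in sequence.
workflow = [
    {"phase": "discover", "jobs": [{"name": "dns-enum", "image": "recon/dns:1.2"}]},
    {"phase": "scan", "jobs": [
        {"name": "port-scan", "image": "recon/ports:0.9"},
        {"name": "web-crawl", "image": "recon/crawler:2.0"},
    ]},
    {"phase": "report", "jobs": [{"name": "aggregate", "image": "recon/report:1.0"}]},
]

def job_names(wf):
    """Flatten a workflow into the ordered list of job names."""
    return [job["name"] for phase in wf for job in phase["jobs"]]
```

Keeping the definition as data is what makes the engine dynamic: the caller ships a structure like this, and the Transporter takes it from there.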

Another rule, "Never Make a Promise You Can't Keep," underlines the Transporter's distributed task queue architecture. Tasks are transported to queues, where workers consume them and execute the required operations. This architecture enables retrying failed tasks, setting timeouts and priorities, scheduling tasks for later, and sending notifications on workflow start, success, and failure.

Now, a peek under the hood. The Transporter exposes endpoints to manipulate a workflow. Behind the scenes, the workflow is converted into Celery tasks, using Celery chains and groups to express dependencies. These tasks are then transported to queues according to their dependencies. On the other end, Celery workers consume tasks from the queues and deploy the corresponding Kubernetes Job. The result? A workflow executed in accordance with job dependencies. We've also added endpoints to control workers, which is convenient and lets the number of running workers define the concurrent job limit.

In our revamped deployment process, security researchers build and push Docker images to the Registry. The Transporter then schedules the respective jobs according to the workflow and a ConfigMap that pins the version of each job. Kubernetes serves as the engine that executes the underlying Docker containers.
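A minimal sketch of how a ConfigMap-style version map could resolve the image each job runs; the registry URL, job names, and tags are made up for illustration:

```python
# Hypothetical ConfigMap data: job name -> pinned image tag.
versions = {"dns-enum": "1.2", "port-scan": "0.9"}

def resolve_image(job_name, registry="registry.example.com/recon"):
    """Build the full image reference a Kubernetes Job would run."""
    tag = versions[job_name]
    return f"{registry}/{job_name}:{tag}"
```

Because the tags live in a ConfigMap rather than in the workflow itself, researchers can roll a job forward or back without touching the workflow definition.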

Kubernetes Job names were initially derived from the original job name plus a unique identifier. However, we ran up against Kubernetes naming limitations (restricted character sets and length limits), which led us to rely extensively on labels instead.
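One way to work around those limits, sketched as an assumption rather than taken from the actual code: sanitize and truncate the Job name to satisfy Kubernetes' lowercase-alphanumeric rules, and preserve the original identity in labels:

```python
import re
import uuid

def k8s_job_name(base, limit=63):
    """Derive a Kubernetes-safe Job name: lowercase, alphanumeric and
    dashes only, truncated to leave room for a unique hex suffix."""
    suffix = uuid.uuid4().hex[:8]
    safe = re.sub(r"[^a-z0-9-]", "-", base.lower()).strip("-")
    return f"{safe[:limit - len(suffix) - 1]}-{suffix}"

def job_labels(base, workflow_id):
    """Keep the original (possibly invalid-as-a-name) identity in labels,
    where it can be queried with label selectors."""
    return {"transporter/job": base[:63], "transporter/workflow": workflow_id}
```

Labels sidestep the naming constraints for querying purposes: a selector like `transporter/workflow=<id>` can fetch every Job in a workflow regardless of how the names were mangled.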

Before settling on our own workflow engine, we evaluated existing solutions such as Airflow and Google Pub/Sub. Each had its strengths, but ultimately neither fit our dynamic needs, leading us to create the Transporter.

Our next steps include creating a user interface to simplify monitoring and troubleshooting workflows and making the Transporter more generic, potentially releasing it as an open-source tool.

Made with 🤍 & AI by TP