Request for Comments: 001 - The KAT System (v1.0 Specification)
Network Working Group: DWS LLC
Author: T. Dubey
Contact: [email protected]
Organization: DWS LLC
URI: https://www.dws.rip
Date: May 2025
Obsoletes: None
Updates: None
The KAT System: A Simplified Container Orchestration Protocol and Architecture Design (Version 1.0)
Status of This Memo
This document specifies Version 1.0 of the KAT (pronounced "cat") system, a simplified container orchestration protocol and architecture developed by DWS LLC. It defines the system's components, operational semantics, resource model, networking, state management, and Application Programming Interface (API). This specification is intended for implementation, discussion, and interoperability. Distribution of this memo is unlimited.
Abstract
The KAT system provides a lightweight, opinionated container orchestration platform specifically designed for resource-constrained environments such as single servers, small clusters, development sandboxes, home labs, and edge deployments. It contrasts with complex, enterprise-scale orchestrators by prioritizing simplicity, minimal resource overhead, developer experience, and direct integration with Git-based workflows. KAT manages containerized long-running services and batch jobs using a declarative "Quadlet" configuration model. Key features include an embedded etcd state store, a Leader-Agent architecture, automated on-agent builds from Git sources, rootless container execution, integrated overlay networking (WireGuard-based), distributed agent-local DNS for service discovery, resource-based scheduling with basic affinity/anti-affinity rules, and structured workload updates. This document provides a comprehensive specification for KAT v1.0 implementation and usage.
Table of Contents
1. Introduction
   1.1. Motivation
   1.2. Goals
   1.3. Design Philosophy
   1.4. Scope of KAT v1.0
   1.5. Terminology
2. System Architecture
   2.1. Overview
   2.2. Components
   2.3. Node Communication Protocol
3. Resource Model: KAT Quadlets
   3.1. Overview
   3.2. Workload Definition (workload.kat)
   3.3. Virtual Load Balancer Definition (VirtualLoadBalancer.kat)
   3.4. Job Definition (job.kat)
   3.5. Build Definition (build.kat)
   3.6. Volume Definition (volume.kat)
   3.7. Namespace Definition (namespace.kat)
   3.8. Node Resource (Internal)
   3.9. Cluster Configuration (cluster.kat)
4. Core Operations and Lifecycle Management
   4.1. System Bootstrapping and Node Lifecycle
   4.2. Workload Deployment and Source Management
   4.3. Git-Native Build Process
   4.4. Scheduling
   4.5. Workload Updates and Rollouts
   4.6. Container Lifecycle Management
   4.7. Volume Lifecycle Management
   4.8. Job Execution Lifecycle
   4.9. Detached Node Operation and Rejoin
5. State Management
   5.1. State Store Interface (Go)
   5.2. etcd Implementation Details
   5.3. Leader Election
   5.4. State Backup (Leader Responsibility)
   5.5. State Restore Procedure
6. Container Runtime Interface
   6.1. Runtime Interface Definition (Go)
   6.2. Default Implementation: Podman
   6.3. Rootless Execution Strategy
7. Networking
   7.1. Integrated Overlay Network
   7.2. IP Address Management (IPAM)
   7.3. Distributed Agent DNS and Service Discovery
   7.4. Ingress (Opinionated Recipe via Traefik)
8. API Specification (KAT v1.0 Alpha)
   8.1. General Principles and Authentication
   8.2. Resource Representation (Proto3 & JSON)
   8.3. Core API Endpoints
9. Observability
   9.1. Logging
   9.2. Metrics
   9.3. Events
10. Security Considerations
   10.1. API Security
   10.2. Rootless Execution
   10.3. Build Security
   10.4. Network Security
   10.5. Secrets Management
   10.6. Internal PKI
11. Comparison to Alternatives
12. Future Work
13. Acknowledgements
14. Author's Address
1. Introduction
1.1. Motivation
The landscape of container orchestration is dominated by powerful, feature-rich platforms designed for large-scale, enterprise deployments. While capable, these systems (e.g., Kubernetes) introduce significant operational complexity and resource requirements (CPU, memory overhead) that are often prohibitive or unnecessarily burdensome for smaller use cases. Developers and operators managing personal projects, home labs, CI/CD runners, small business applications, or edge devices frequently face a choice between the friction of manual deployment (SSH, scripts, `docker-compose`) and the excessive overhead of full-scale orchestrators. This gap highlights the need for a solution that provides core orchestration benefits (declarative management, automation, scheduling, self-healing) without the associated complexity and resource cost. KAT aims to be that solution.
1.2. Goals
The primary objectives guiding the design of KAT v1.0 are:
- Simplicity: Offer an orchestration experience that is significantly easier to install, configure, learn, and operate than existing mainstream platforms. Minimize conceptual surface area and required configuration.
- Minimal Overhead: Design KAT's core components (Leader, Agent, etcd) to consume minimal system resources, ensuring maximum availability for application workloads, particularly critical in single-node or low-resource scenarios.
- Core Orchestration: Provide robust management for the lifecycle of containerized long-running services, scheduled/batch jobs, and basic daemon sets.
- Automation: Enable automated deployment updates, on-agent image builds triggered directly from Git repository changes, and fundamental self-healing capabilities (container restarts, service replica rescheduling).
- Git-Native Workflow: Facilitate a direct "push-to-deploy" model, integrating seamlessly with common developer workflows centered around Git version control.
- Rootless Operation: Implement container execution using unprivileged users by default to enhance security posture and reduce system dependencies.
- Integrated Experience: Provide built-in solutions for fundamental requirements like overlay networking and service discovery, reducing reliance on complex external components for basic operation.
1.3. Design Philosophy
KAT adheres to the following principles:
- Embrace Simplicity (Grug Brained): Actively combat complexity. Prefer simpler solutions even if they cover slightly fewer edge cases initially. Provide opinionated defaults based on common usage patterns. (The Grug Brained Developer)
- Declarative Configuration: Users define the desired state via Quadlet files; KAT implements the control loops to achieve and maintain it.
- Locality of Behavior: Group related configurations logically (Quadlet directories) rather than by arbitrary type separation across the system. (HTMX: Locality of Behaviour)
- Leverage Stable Foundations: Utilize proven, well-maintained technologies like etcd (for consensus) and Podman (for container runtime).
- Explicit is Better than Implicit (Mostly): While providing defaults, make configuration options clear and understandable. Avoid overly "magic" behavior.
- Build for the Common Case: Focus V1 on solving the 80-90% of use cases for the target audience extremely well.
- Fail Fast, Recover Simply: Design components to handle failures predictably. Prioritize simple recovery mechanisms (like etcd snapshots, agent restarts converging state) over complex distributed failure handling protocols where possible for V1.
1.4. Scope of KAT v1.0
This specification details KAT Version 1.0. It includes:
- Leader-Agent architecture with etcd-based state and leader election.
- Quadlet resource model (`Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `VolumeDefinition`, `Namespace`).
- Deployment of Services, Jobs, and DaemonServices.
- Source specification via direct image name or Git repository (with on-agent builds using Podman). Optional build caching via registry.
- Resource-based scheduling with `nodeSelector` and Taint/Toleration support, using a "most empty" placement strategy.
- Workload updates via `Simultaneous` or `Rolling` strategies (`maxSurge` control). Manual rollback support.
- Container lifecycle management including restart policies (`Never`, `MaxCount`, `Always` with reset timer) and optional health checks.
- Volume support for `HostMount` and `SimpleClusterStorage`.
. - Integrated WireGuard-based overlay networking.
- Distributed agent-local DNS for service discovery, synchronized via etcd.
- Detached node operation mode with simplified rejoin logic.
- Basic state backup via Leader-driven etcd snapshots.
- Rootless container execution via systemd user services.
- A Proto3-defined, JSON-over-HTTP RESTful API (v1alpha1).
- Opinionated Ingress recipe using Traefik.
Features explicitly deferred beyond v1.0 are listed in Section 12.
1.5. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
- KAT System (Cluster): The complete set of KAT Nodes forming a single operational orchestration domain.
- Node: An individual machine (physical or virtual) running the KAT Agent software. Each Node has a unique name within the cluster.
- Leader Node (Leader): The single Node currently elected via the consensus mechanism to perform authoritative cluster management tasks.
- Agent Node (Agent): A Node running the KAT Agent software, responsible for local workload execution and status reporting. Includes the Leader node.
- Namespace: A logical partition within the KAT cluster used to organize resources (Workloads, Volumes). Defined by `namespace.kat`. Default is "default". System namespace is "kat-core".
- Workload: The primary unit of application deployment, defined by a set of Quadlet files specifying desired state. Types: `Service`, `Job`, `DaemonService`.
. - Service: A Workload type representing a long-running application.
- Job: A Workload type representing a task that runs to completion.
- DaemonService: A Workload type ensuring one instance runs on each eligible Node.
- KAT Quadlet (Quadlet): A set of co-located YAML files (`*.kat`) defining a single Workload. Submitted and managed atomically.
- Container: A running instance managed by the container runtime (Podman).
- Image: The template for a container, specified directly or built from Git.
- Volume: Persistent or ephemeral storage attached to a Workload's container(s). Types: `HostMount`, `SimpleClusterStorage`.
- Overlay Network: KAT-managed virtual network (WireGuard) for inter-node/inter-container communication.
- Service Discovery: Mechanism (distributed agent DNS) for resolving service names to overlay IPs.
- Ingress: Exposing internal services externally, typically via the Traefik recipe.
- Tick: Configurable interval for Agent heartbeats to the Leader.
- Taint: Key/Value/Effect marker on a Node to repel workloads.
- Toleration: Marker on a Workload allowing it to schedule on Nodes with matching Taints.
- API: Application Programming Interface (HTTP/JSON based on Proto3).
- CLI: Command Line Interface (`katcall`).
- etcd: Distributed key-value store used for consensus and state.
2. System Architecture
2.1. Overview
KAT operates using a distributed Leader-Agent model built upon an embedded etcd consensus layer. One `kat-agent` instance is elected Leader, responsible for maintaining the cluster's desired state, making scheduling decisions, and serving the API. All other `kat-agent` instances act as workers (Agents), executing tasks assigned by the Leader and reporting their status. Communication occurs primarily between Agents and the Leader, facilitated by an integrated overlay network.
2.2. Components
- `kat-agent` (Binary): The single executable deployed on all nodes. Runs in one of two primary modes internally based on leader election status: Agent or Leader.
  - Common Functions: Node registration, heartbeating, overlay network participation, local container runtime interaction (Podman via the container runtime interface), local state monitoring, execution of Leader commands.
  - Rootless Execution: Manages container workloads under distinct, unprivileged user accounts via systemd user sessions (preferred method).
- Leader Role (internal state within an elected `kat-agent`):
  - Hosts API endpoints.
  - Manages desired/actual state in etcd.
  - Runs the scheduling logic.
  - Executes the reconciliation control loop.
  - Manages IPAM for the overlay network.
  - Updates DNS records in etcd.
  - Coordinates node join/leave/failure handling.
  - Initiates etcd backups.
- Embedded etcd: Linked library providing Raft consensus for leader election and strongly consistent key-value storage for all cluster state (desired specs, actual status, network config, DNS records). Runs within the `kat-agent` process on quorum members (typically 1, 3, or 5 nodes).
2.3. Node Communication Protocol
- Transport: HTTP/1.1 or HTTP/2 over mandatory mTLS. KAT includes a simple internal PKI bootstrapped during `init` and `join`.
- Agent -> Leader: Periodic `POST /v1alpha1/nodes/{nodeName}/status` containing heartbeat and detailed node/workload status. Triggered every `Tick`. Immediate reports for critical events MAY be sent.
- Leader -> Agent: Commands (create/start/stop/remove container, update config) sent via `POST`/`PUT`/`DELETE` to agent-specific endpoints (e.g., `https://{agentOverlayIP}:{agentPort}/agent/v1alpha1/...`).
- Payload: JSON, derived from Proto3 message definitions.
- Discovery/Join: Initial contact via leader hint uses the HTTP API; subsequent peer discovery for etcd/overlay uses information distributed by the Leader via the API/etcd.
- Detached Mode Communication: Multicast/broadcast UDP for `REJOIN_REQUEST` messages on local network segments. Direct HTTP response from the parent Leader.
3. Resource Model: KAT Quadlets
3.1. Overview
KAT configuration is declarative, centered around the "Quadlet" concept. A Workload is defined by a directory containing YAML files (`*.kat`), each specifying a different aspect (`kind`). This promotes modularity and locality of behavior.
3.2. Workload Definition (`workload.kat`)
REQUIRED. Defines the core identity, source, type, and lifecycle policies.
```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: Workload
metadata:
  name: string                 # REQUIRED. Workload name.
  namespace: string            # OPTIONAL. Defaults to "default".
spec:
  type: enum                   # REQUIRED: Service | Job | DaemonService
  source:                      # REQUIRED. Exactly ONE of image or git must be present.
    image: string              # OPTIONAL. Container image reference.
    git:                       # OPTIONAL. Build from Git.
      repository: string       # REQUIRED if git. URL of Git repo.
      branch: string           # OPTIONAL. Defaults to repo default.
      tag: string              # OPTIONAL. Overrides branch.
      commit: string           # OPTIONAL. Overrides tag/branch.
    cacheImage: string         # OPTIONAL. Registry path for build cache layers.
                               # Used only if 'git' source is specified.
  replicas: int                # REQUIRED for type: Service. Desired instance count.
                               # Ignored for Job, DaemonService.
  updateStrategy:              # OPTIONAL for Service/DaemonService.
    type: enum                 # REQUIRED: Rolling | Simultaneous. Default: Rolling.
    rolling:                   # Relevant if type is Rolling.
      maxSurge: int | string   # OPTIONAL. Max extra instances during update. Default 1.
  restartPolicy:               # REQUIRED for container lifecycle.
    condition: enum            # REQUIRED: Never | MaxCount | Always
    maxRestarts: int           # OPTIONAL. Used if condition=MaxCount. Default 5.
    resetSeconds: int          # OPTIONAL. Used if condition=MaxCount. Window to reset count. Default 3600.
  nodeSelector: map[string]string  # OPTIONAL. Schedule only on nodes matching all labels.
  tolerations:                 # OPTIONAL. List of taints this workload can tolerate.
    - key: string
      operator: enum           # OPTIONAL. Exists | Equal. Default: Exists.
      value: string            # OPTIONAL. Needed if operator=Equal.
      effect: enum             # OPTIONAL. NoSchedule | PreferNoSchedule. Matches taint effect.
                               # Empty matches all effects for the key/value pair.
  # --- Container specification (V1 assumes one primary container per workload) ---
  container:                   # REQUIRED.
    name: string               # OPTIONAL. Informational name for the container.
    command: [string]          # OPTIONAL. Override image CMD.
    args: [string]             # OPTIONAL. Override image ENTRYPOINT args or CMD args.
    env:                       # OPTIONAL. Environment variables.
      - name: string
        value: string
    volumeMounts:              # OPTIONAL. Mount volumes defined in spec.volumes.
      - name: string           # Volume name.
        mountPath: string      # Path inside container.
        subPath: string        # Optional. Mount sub-directory.
        readOnly: bool         # Optional. Default false.
    resources:                 # OPTIONAL. Resource requests and limits.
      requests:                # Used for scheduling. Defaults to limits if unspecified.
        cpu: string            # e.g., "100m"
        memory: string         # e.g., "64Mi"
      limits:                  # Enforced by runtime. Container killed if memory exceeded.
        cpu: string            # CPU throttling limit (e.g., "1")
        memory: string         # e.g., "256Mi"
      gpu:                     # OPTIONAL. Request GPU resources.
        driver: enum           # OPTIONAL: any | nvidia | amd
        minVRAM_MB: int        # OPTIONAL. Minimum GPU memory required.
  # --- Volume Definitions for this Workload ---
  volumes:                     # OPTIONAL. Defines volumes used by container.volumeMounts.
    - name: string             # REQUIRED. Name referenced by volumeMounts.
      simpleClusterStorage: {} # OPTIONAL. Creates dir under agent's volumeBasePath.
                               # Use ONE OF simpleClusterStorage or hostMount.
      hostMount:               # OPTIONAL. Mounts a specific path from the host node.
        hostPath: string       # REQUIRED if hostMount. Absolute path on host.
        ensureType: enum       # OPTIONAL: DirectoryOrCreate | Directory | FileOrCreate | File | Socket
```
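As an illustration of the schema above, a hypothetical `workload.kat` for a small Git-built web service might look like the following. All names, URLs, and values are invented for the example; a Git-sourced workload like this would also require a `build.kat` (Section 3.5).
```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: Workload
metadata:
  name: my-webapp              # hypothetical name
  namespace: default
spec:
  type: Service
  source:
    git:
      repository: https://git.example.com/me/my-webapp.git
      branch: main
    cacheImage: registry.example.com/cache/my-webapp
  replicas: 2
  updateStrategy:
    type: Rolling
    rolling:
      maxSurge: 1
  restartPolicy:
    condition: MaxCount
    maxRestarts: 5
    resetSeconds: 3600
  container:
    name: web
    env:
      - name: LISTEN_PORT
        value: "8080"
    volumeMounts:
      - name: data
        mountPath: /var/lib/my-webapp
    resources:
      requests:
        cpu: "100m"
        memory: "64Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
  volumes:
    - name: data
      simpleClusterStorage: {}
```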
3.3. Virtual Load Balancer Definition (`VirtualLoadBalancer.kat`)
OPTIONAL. Only relevant for `Workload` of `type: Service`. Defines networking endpoints and health criteria for load balancing and ingress.
```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: VirtualLoadBalancer       # Identifies this Quadlet file's purpose
spec:
  ports:                        # REQUIRED if this file exists. List of ports to expose/balance.
    - name: string              # OPTIONAL. Informational name (e.g., "web", "grpc").
      containerPort: int        # REQUIRED. Port the application listens on inside container.
      protocol: string          # OPTIONAL. TCP | UDP. Default TCP.
  healthCheck:                  # OPTIONAL. Used for readiness in rollouts and LB target selection.
                                # If omitted, container running status implies health.
    exec:
      command: [string]         # REQUIRED. Exit 0 = healthy.
    initialDelaySeconds: int    # OPTIONAL. Default 0.
    periodSeconds: int          # OPTIONAL. Default 10.
    timeoutSeconds: int         # OPTIONAL. Default 1.
    successThreshold: int       # OPTIONAL. Default 1.
    failureThreshold: int       # OPTIONAL. Default 3.
  ingress:                      # OPTIONAL. Hints for external ingress controllers (like the Traefik recipe).
    - host: string              # REQUIRED. External hostname.
      path: string              # OPTIONAL. Path prefix. Default "/".
      servicePortName: string   # OPTIONAL. Name of port from spec.ports to target.
      servicePort: int          # OPTIONAL. Port number from spec.ports. Overrides name.
                                # One of servicePortName or servicePort MUST be provided if ports exist.
      tls: bool                 # OPTIONAL. If true, signal ingress controller to manage TLS via ACME.
```
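Continuing the hypothetical example from Section 3.2, a matching `VirtualLoadBalancer.kat` might be (all values illustrative only):
```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: VirtualLoadBalancer
spec:
  ports:
    - name: web
      containerPort: 8080
      protocol: TCP
  healthCheck:
    exec:
      command: ["/bin/sh", "-c", "wget -q -O /dev/null http://localhost:8080/healthz"]
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 3
  ingress:
    - host: webapp.example.com
      path: /
      servicePortName: web
      tls: true
```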
3.4. Job Definition (`job.kat`)
REQUIRED if `spec.type` in `workload.kat` is `Job`.
```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: JobDefinition             # Identifies this Quadlet file's purpose
spec:
  schedule: string              # OPTIONAL. Cron schedule string.
  completions: int              # OPTIONAL. Desired successful completions. Default 1.
  parallelism: int              # OPTIONAL. Max concurrent instances. Default 1.
  activeDeadlineSeconds: int    # OPTIONAL. Timeout for the job run.
  backoffLimit: int             # OPTIONAL. Max failed instance restarts before job fails. Default 3.
```
3.5. Build Definition (`build.kat`)
REQUIRED if `spec.source.git` is specified in `workload.kat`.
```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: BuildDefinition
spec:
  buildContext: string          # OPTIONAL. Path relative to repo root. Defaults to ".".
  dockerfilePath: string        # OPTIONAL. Path relative to buildContext. Defaults to "./Dockerfile".
  buildArgs: map[string]string  # OPTIONAL. Build arguments, e.g., {"VERSION": "1.2.3"}.
  targetStage: string           # OPTIONAL. Target stage name for multi-stage builds.
  platform: string              # OPTIONAL. Target platform (e.g., "linux/arm64").
  cache:                        # OPTIONAL. Defines build caching strategy.
    registryPath: string        # OPTIONAL. Registry path (e.g., "myreg.com/cache/myapp").
                                # Agent tags cache image with commit SHA.
```
3.6. Volume Definition (`volume.kat`)
DEPRECATED in favor of defining volumes directly within `workload.kat -> spec.volumes`. This enhances locality of behavior; Section 3.2 reflects this change. This file kind is reserved for potential future use with cluster-wide persistent volumes.
3.7. Namespace Definition (`namespace.kat`)
REQUIRED for defining non-default namespaces.
```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: Namespace
metadata:
  name: string                  # REQUIRED. Name of the namespace.
  # labels: map[string]string   # OPTIONAL.
```
3.8. Node Resource (Internal)
Represents node state managed by the Leader, queryable via the API. Not defined by user Quadlets. Contains fields such as `name`, `status`, `addresses`, `capacity`, `allocatable`, `labels`, and `taints`.
3.9. Cluster Configuration (`cluster.kat`)
Used only during `kat-agent init` via a flag (e.g., `--config cluster.kat`). Defines immutable cluster-wide parameters.
```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: ClusterConfiguration
metadata:
  name: string                  # REQUIRED. Informational name for the cluster.
spec:
  clusterCIDR: string           # REQUIRED. CIDR for overlay network IPs (e.g., "10.100.0.0/16").
  serviceCIDR: string           # REQUIRED. CIDR for internal virtual IPs (used by future internal proxy/LB).
                                # Not directly used by containers in V1 networking model.
  nodeSubnetBits: int           # OPTIONAL. Number of bits for node subnets within clusterCIDR.
                                # Default 7 (yielding /23 subnets if clusterCIDR=/16).
  clusterDomain: string         # OPTIONAL. DNS domain suffix. Default "kat.cluster.local".
  # --- Port configurations ---
  agentPort: int                # OPTIONAL. Port agent listens on (internal). Default 9116.
  apiPort: int                  # OPTIONAL. Port leader listens on for API. Default 9115.
  etcdPeerPort: int             # OPTIONAL. Default 2380.
  etcdClientPort: int           # OPTIONAL. Default 2379.
  # --- Path configurations ---
  volumeBasePath: string        # OPTIONAL. Agent base path for SimpleClusterStorage. Default "/var/lib/kat/volumes".
  backupPath: string            # OPTIONAL. Path on Leader for etcd backups. Default "/var/lib/kat/backups".
  # --- Interval configurations ---
  backupIntervalMinutes: int    # OPTIONAL. Frequency of etcd backups. Default 30.
  agentTickSeconds: int         # OPTIONAL. Agent heartbeat interval. Default 15.
  nodeLossTimeoutSeconds: int   # OPTIONAL. Time before marking node NotReady. Default 60.
```
4. Core Operations and Lifecycle Management
This section details the operational logic, state transitions, and management processes within the KAT system, from cluster initialization to workload execution and node dynamics.
4.1. System Bootstrapping and Node Lifecycle
4.1.1. Initial Leader Setup
The first KAT Node is initialized to become the Leader and establish the cluster.
- Command: The administrator executes `kat-agent init --config <path_to_cluster.kat>` on the designated initial node. The `cluster.kat` file (see Section 3.9) provides essential cluster-wide parameters.
- Action:
  - The `kat-agent` process starts.
  - It parses the `cluster.kat` file to obtain parameters such as ClusterCIDR, ServiceCIDR, domain, agent/API ports, etcd ports, volume paths, and backup settings.
  - It generates a new internal Certificate Authority (CA) key and certificate (for the PKI, see Section 10.6) if one does not already exist at a predefined path.
  - It initializes and starts an embedded single-node etcd server, using the configured etcd ports. The etcd data directory is created.
  - The agent campaigns for leadership via etcd's election mechanism (Section 5.3) and, being the only member, becomes the Leader. It writes its identity (e.g., its advertise IP and API port) to a well-known key in etcd (e.g., `/kat/config/leader_endpoint`).
  - The Leader initializes its IPAM module (Section 7.2) for the defined ClusterCIDR.
  - It generates its own WireGuard key pair, stores the private key securely, and publishes its public key and overlay endpoint (external IP and WireGuard port) to etcd.
  - It sets up its local `kat0` WireGuard interface using its assigned overlay IP (the first available from its own initial subnet).
  - It starts the API server on the configured API port.
  - It starts its local DNS resolver (Section 7.3).
  - The `kat-core` and `default` Namespaces are created in etcd if they do not exist.
4.1.2. Agent Node Join
Subsequent Nodes join an existing KAT cluster to act as workers (and potential future etcd quorum members or leaders if so configured, though V1 focuses on a static initial quorum).
- Command: `kat-agent join --leader-api <leader_api_ip:port> --advertise-address <ip_or_interface_name> [--etcd-peer]` (the `--etcd-peer` flag indicates this node should attempt to join the etcd quorum).
- Action:
  - The `kat-agent` process starts.
  - It generates a WireGuard key pair.
  - It contacts the specified Leader API endpoint to request joining the cluster, sending its intended `advertise-address` (for inter-node WireGuard communication) and its WireGuard public key. It also sends a Certificate Signing Request (CSR) for its mTLS client/server certificate.
  - The Leader, upon validating the join request (V1 has no strong token validation and relies on network trust):
    - Assigns a unique Node Name (if not provided by the agent, the Leader generates one) and a Node Subnet from the ClusterCIDR (Section 7.2).
    - Signs the Agent's CSR using the cluster CA, returning the signed certificate and the CA certificate.
    - Records the new Node's name, advertise address, WireGuard public key, and assigned subnet in etcd (e.g., under `/kat/nodes/registration/{nodeName}`).
    - If `--etcd-peer` was requested and the quorum has capacity, the Leader MAY instruct the node to join the etcd quorum by providing current peer URLs. (For V1, etcd peer addition post-init is considered an advanced operation; the default is a static initial quorum.)
    - Provides the new Agent with the list of all current Nodes' WireGuard public keys, overlay endpoint addresses, and their assigned overlay subnets (for `AllowedIPs`).
  - The joining Agent:
    - Stores the received mTLS certificate and CA certificate.
    - Configures its local `kat0` WireGuard interface with an IP from its assigned subnet (typically the `.1` address) and sets up peers for all other known nodes.
    - If instructed to join the etcd quorum, configures and starts its embedded etcd as a peer.
    - Registers itself formally with the Leader via a status update.
    - Starts its local DNS resolver and begins syncing DNS state from etcd.
    - Becomes `Ready` and available for scheduling workloads.
4.1.3. Node Heartbeat and Status Reporting
Each Agent Node (including the Leader acting as an Agent for its local workloads) MUST periodically send a status update to the active Leader.
- Interval: Configurable `agentTickSeconds` (from `cluster.kat`, e.g., default 15 seconds).
- Content: The payload is a JSON object reflecting the Node's current state:
  - `nodeName`: string (its unique identifier)
  - `nodeUID`: string (a persistent unique ID for the node instance)
  - `timestamp`: int64 (Unix epoch seconds)
  - `resources`:
    - `capacity`: `{"cpu": "2000m", "memory": "4096Mi"}`
    - `allocatable`: `{"cpu": "1800m", "memory": "3800Mi"}` (capacity minus system overhead)
  - `workloadInstances`: array of objects, each detailing a locally managed container:
    - `workloadName`: string
    - `namespace`: string
    - `instanceID`: string (unique ID for this replica/run of the workload)
    - `containerID`: string (from Podman)
    - `imageID`: string (from Podman)
    - `state`: string ("running", "exited", "paused", "unknown")
    - `exitCode`: int (if exited)
    - `healthStatus`: string ("healthy", "unhealthy", "pending_check") (from the `VirtualLoadBalancer.kat` health check)
    - `restarts`: int (count of Agent-initiated restarts for this instance)
  - `overlayNetwork`: `{"status": "connected", "lastPeerSync": "timestamp"}`
- Protocol: HTTP `POST` to the Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint, authenticated via mTLS. The Leader updates the Node's actual state in etcd (e.g., `/kat/nodes/actual/{nodeName}/status`).
4.1.4. Node Departure and Failure Detection
- Graceful Departure:
  - Admin action: `katcall drain <nodeName>`. This sets a `NoSchedule` Taint on the Node object in etcd and marks its desired state as "Draining".
  - Leader reconciliation loop:
    - Stops scheduling new workloads to the Node.
    - For existing `Service` and `DaemonService` instances on the draining node, it initiates a process to reschedule them to other eligible nodes (respecting update strategies where applicable, e.g., not violating `maxUnavailable` for the service cluster-wide).
    - For `Job` instances, they are allowed to complete. If a Job is actively running and the node is drained, KAT V1's behavior is to let it finish; more sophisticated preemption is future work.
  - Once all managed workload instances are terminated or rescheduled, the Agent MAY send a final "departing" message, and the admin can decommission the node. The Leader eventually removes the Node object from etcd after a timeout if it stops heartbeating.
- Failure Detection:
  - The Leader monitors Agent heartbeats. If an Agent misses `nodeLossTimeoutSeconds` (from `cluster.kat`, e.g., 3 * `agentTickSeconds`), the Leader marks the Node's status in etcd as `NotReady`.
  - Reconciliation loop for a `NotReady` Node:
    - For `Service` instances previously on the `NotReady` node: the Leader attempts to schedule replacement instances on other `Ready` eligible nodes to maintain `spec.replicas`.
    - For `DaemonService` instances: no action, as the node is not eligible.
    - For `Job` instances: if the job has a restart policy allowing it, the Leader MAY attempt to reschedule the failed job instance on another eligible node.
  - If the Node rejoins (starts heartbeating again): the Leader marks it `Ready`. The reconciliation loop then assesses which workloads should be on this node (e.g., DaemonServices, or pending Services for which it is now the best fit). Any instances that were rescheduled off this node and are now redundant (because the original instance may still be running locally on the rejoined node if it only experienced a network partition) are identified, and the Leader instructs the rejoined Agent to stop such zombie/duplicate containers based on `instanceID` tracking.
4.2. Workload Deployment and Source Management
Workloads are the primary units of deployment, defined by Quadlet directories.
- Submission:
  - Client (e.g., `katcall apply -f ./my-workload-dir/`) archives the Quadlet directory (e.g., `my-workload-dir/`) into a `tar.gz` file.
  - Client sends an HTTP `POST` (for new) or `PUT` (for update) to `/v1alpha1/n/{namespace}/workloads` (if the name is in `workload.kat`) or `/v1alpha1/n/{namespace}/workloads/{workloadName}` (for `PUT`). The body is the `tar.gz` archive.
  - Leader validates the `metadata.name` in `workload.kat` against the URL path for `PUT`.
- Validation & Storage:
  - Leader unpacks the archive.
  - It validates each `.kat` file against its known schema (e.g., `Workload`, `VirtualLoadBalancer`, `BuildDefinition`, `JobDefinition`).
  - Cross-Quadlet file consistency is checked (e.g., referenced port names in `VirtualLoadBalancer.kat -> spec.ingress` exist in `VirtualLoadBalancer.kat -> spec.ports`).
  - If valid, the Leader persists each Quadlet file's content into etcd under `/kat/workloads/desired/{namespace}/{workloadName}/{fileName}`. The `metadata.generation` for the workload is incremented on spec changes.
  - Leader responds `201 Created` or `200 OK` with the workload's metadata.
- Source Handling Precedence by Agent (upon receiving a deployment command):
  - If `workload.kat -> spec.source.git` is defined:
    a. If `workload.kat -> spec.source.cacheImage` is also defined, the Agent first attempts to pull this `cacheImage` (see Section 4.3.3). If successful and the image hash matches an expected value (e.g., if a git commit is also specified and used to tag the cache), this image is used and the local build MAY be skipped.
    b. If there is no cache image, or the cache pull fails or mismatches, proceed to a Git build (Section 4.3). The resulting locally built image is used.
  - Else if `workload.kat -> spec.source.image` is defined (and no `git` source): the Agent pulls this image (Section 4.6.1).
  - If neither `git` nor `image` is specified, it is a validation error rejected by the Leader.
4.3. Git-Native Build Process
Triggered when an Agent is instructed to run a Workload instance with `spec.source.git`.

- Setup: Agent creates a temporary, isolated build directory.
- Cloning: `git clone --depth 1 --branch <branch_or_tag_or_default> <repository_url> .` (or `git fetch origin <commit> && git checkout <commit>`).
- Context & Dockerfile Path: Agent uses `buildContext` and `dockerfilePath` from `build.kat` (defaults to `.` and `./Dockerfile` respectively).
- Build Execution:
  - Construct the `podman build` command with:
    - `-t <internal_image_tag>` (e.g., `kat-local/{namespace}_{workloadName}:{git_commit_sha_short}`)
    - `-f {dockerfilePath}` within the `{buildContext}`.
    - `--build-arg` for each entry in `build.kat -> spec.buildArgs`.
    - `--target {targetStage}` if specified.
    - `--platform {platform}` if specified (else Podman defaults).
    - The build context path.
  - Execute as the Agent's rootless user or a dedicated build user for that workload.
- Build Caching (`build.kat -> spec.cache.registryPath`):
  - Pre-Build Pull (Cache Hit): Before cloning, the Agent constructs a tag based on `registryPath` and the specific Git commit SHA (if available, else the latest of the branch/tag) and attempts `podman pull`. If successful, it uses this image and skips the local build steps.
  - Post-Build Push (Cache Miss/New Build): After a successful local build, the Agent tags the new image with `{registryPath}:{git_commit_sha_short}` and attempts `podman push`. Registry credentials MUST be configured locally on the Agent (e.g., in Podman's auth file for the build user). KAT v1 does not manage these credentials centrally.
- Outcome: Agent reports build success (with the internal image tag) or failure to the Leader.
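A minimal sketch of how an Agent might assemble the `podman build` invocation described above. The `BuildSpec` type, `podmanBuildCmd` helper, and package name are hypothetical, and only the flags listed in this section are used; error handling and output parsing are omitted.
```go
// buildcmd_sketch.go: assembles a "podman build" argument list from a parsed build.kat spec.
package build

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// BuildSpec mirrors a subset of build.kat -> spec (illustrative only).
type BuildSpec struct {
	BuildContext   string
	DockerfilePath string
	BuildArgs      map[string]string
	TargetStage    string
	Platform       string
}

func podmanBuildCmd(repoDir, imageTag string, s BuildSpec) *exec.Cmd {
	ctxDir := filepath.Join(repoDir, orDefault(s.BuildContext, "."))
	args := []string{"build", "-t", imageTag, "-f", orDefault(s.DockerfilePath, "./Dockerfile")}
	for k, v := range s.BuildArgs {
		args = append(args, "--build-arg", fmt.Sprintf("%s=%s", k, v))
	}
	if s.TargetStage != "" {
		args = append(args, "--target", s.TargetStage)
	}
	if s.Platform != "" {
		args = append(args, "--platform", s.Platform)
	}
	args = append(args, ctxDir) // the build context path comes last
	return exec.Command("podman", args...)
}

func orDefault(v, d string) string {
	if v == "" {
		return d
	}
	return v
}
```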
4.4. Scheduling
The Leader performs scheduling in its reconciliation loop for new or rescheduled Workload instances.
- Filter Nodes - Resource Requests:
  - Identify `spec.container.resources.requests` (CPU, memory).
  - Filter out Nodes whose `status.allocatable` resources are less than requested.
- Filter Nodes - nodeSelector:
  - If `spec.nodeSelector` is present, filter out Nodes whose labels do not match all key-value pairs in the selector.
- Filter Nodes - Taints and Tolerations:
  - For each remaining Node, check its `taints`.
  - A Workload instance is repelled if the Node has a taint with `effect=NoSchedule` that is not tolerated by `spec.tolerations`.
  - (Nodes with untolerated `PreferNoSchedule` taints are kept but deprioritized in scoring.)
- Filter Nodes - GPU Requirements:
  - If `spec.container.resources.gpu` is specified:
    - Filter out Nodes that do not report matching GPU capabilities (e.g., `gpu.nvidia.present=true` based on the `driver` request).
    - Filter out Nodes whose reported available VRAM (a node-level attribute, potentially dynamically tracked by the agent) is less than `minVRAM_MB`.
- Score Nodes ("Most Empty" Proportional): for each remaining candidate Node (a sketch of this scoring appears after this list):
  - `cpu_used_percent = (node_total_cpu_requested_by_workloads / node_allocatable_cpu) * 100`
  - `mem_used_percent = (node_total_mem_requested_by_workloads / node_allocatable_mem) * 100`
  - `score = (100 - cpu_used_percent) + (100 - mem_used_percent)` (higher is better, giving weight to balanced free resources), or alternatively `score = 100 - max(cpu_used_percent, mem_used_percent)`.
- Select Node:
  - Prioritize nodes without untolerated `PreferNoSchedule` taints.
  - Among those (or all, if all preferred nodes are full), pick the Node with the highest score.
  - If multiple nodes tie for the highest score, pick one randomly.
- Replica Spreading (Services/DaemonServices): for multi-instance workloads, when choosing among equally scored nodes, the scheduler MAY prefer nodes currently running fewer instances of the same workload to achieve basic anti-affinity. For `DaemonService`, it schedules one instance on every eligible node identified after filtering.
- If no suitable node is found, the instance remains `Pending`.
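The sketch referenced above: one possible implementation of the proportional "most empty" score, assuming requested and allocatable totals have already been aggregated into millicores and bytes. The struct and function names are illustrative, not part of the KAT codebase.
```go
// score_sketch.go: "most empty" node scoring from Section 4.4 (illustrative).
package scheduler

// nodeUsage carries aggregated requests vs. allocatable capacity for one candidate node.
type nodeUsage struct {
	CPURequestedMilli   int64
	CPUAllocatableMilli int64
	MemRequestedBytes   int64
	MemAllocatableBytes int64
}

// score returns a higher value for nodes with more (and more balanced) free
// resources, matching score = (100 - cpu_used%) + (100 - mem_used%).
func score(n nodeUsage) float64 {
	cpuUsed := 100 * float64(n.CPURequestedMilli) / float64(n.CPUAllocatableMilli)
	memUsed := 100 * float64(n.MemRequestedBytes) / float64(n.MemAllocatableBytes)
	return (100 - cpuUsed) + (100 - memUsed)
}
```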
4.5. Workload Updates and Rollouts
Triggered by a `PUT` to the Workload API endpoint with changed Quadlet specs. The Leader compares the new `desiredSpecHash` with `status.observedSpecHash`.

- `Simultaneous` strategy (`spec.updateStrategy.type`):
  - Leader instructs Agents to stop and remove all old-version instances.
  - Once confirmed (or after a timeout), the Leader schedules all new-version instances as per Section 4.4. This causes downtime.
- `Rolling` strategy (`spec.updateStrategy.type`):
  - `max_surge_val = calculate_absolute(spec.updateStrategy.rolling.maxSurge, new_replicas_count)` (a sketch of this calculation appears after this list).
  - Total allowed instances = `new_replicas_count + max_surge_val`.
  - The Leader updates instances incrementally:
    a. Scales up by launching new-version instances until `total_running_instances` reaches `new_replicas_count` or `old_replicas_count + max_surge_val`, whichever is smaller and appropriate for making progress. New instances use the updated Quadlet spec.
    b. Once a new-version instance becomes `Healthy` (passes `VirtualLoadBalancer.kat` health checks, or simply starts if no checks are defined), an old-version instance is selected and terminated.
    c. The process continues until all instances are new-version and `new_replicas_count` are healthy.
    d. If `new_replicas_count < old_replicas_count`, surplus old instances are terminated first, respecting a conceptual limit to maintain availability (not explicitly defined in V1; `max_surge_val` effectively acts as `maxUnavailable`).
- Rollbacks (Manual):
  - Leader stores the Quadlet files of the previous successfully deployed version in etcd (e.g., at `/kat/workloads/archive/{namespace}/{workloadName}/{generation-1}/`).
  - User command: `katcall rollback workload {namespace}/{name}`.
  - Leader retrieves the archived Quadlets, treats them as a new desired state, and applies the workload's configured `updateStrategy` to revert.
4.6. Container Lifecycle Management
Managed by the Agent based on Leader commands and local policies.
- Image Pull/Availability: Before creating, Agent ensures the target image (from Git build, cache, or direct ref) is locally available, pulling if necessary.
- Creation & Start: Agent uses the `ContainerRuntime` to create and start the container with parameters derived from `workload.kat -> spec.container` and `VirtualLoadBalancer.kat -> spec.ports` (translated to runtime port mappings). The node-allocated overlay IP is assigned.
- Health Checks (for Services with `VirtualLoadBalancer.kat`): Agent periodically runs `spec.healthCheck.exec.command` inside the container after `initialDelaySeconds`. Status (Healthy/Unhealthy), based on `successThreshold`/`failureThreshold`, is reported in heartbeats.
- Restart Policy (`workload.kat -> spec.restartPolicy`):
  - `Never`: No automatic restart by the Agent. The Leader reschedules for Services/DaemonServices.
  - `Always`: Agent always restarts the container on exit, with exponential backoff.
  - `MaxCount`: Agent restarts on non-zero exit, up to `maxRestarts` times. If `resetSeconds` elapses since the first restart in a series without hitting `maxRestarts`, the restart count for that series resets. Persistent failure after `maxRestarts` within the `resetSeconds` window causes the instance to be marked `Failed` by the Agent; the Leader acts accordingly.
4.7. Volume Lifecycle Management
Defined in `workload.kat -> spec.volumes` and mounted via `spec.container.volumeMounts`.

- Agent Responsibility: Before container start, the Agent ensures the specified volumes are available:
  - `SimpleClusterStorage`: Creates the directory `{agent.volumeBasePath}/{namespace}/{workloadName}/{volumeName}` if it does not exist. Permissions should allow container user access.
  - `HostMount`: Validates that `hostPath` exists. If `ensureType` is `DirectoryOrCreate` or `FileOrCreate`, attempts creation. Mounts it into the container.
- Persistence: Data in `SimpleClusterStorage` on a node persists across container restarts on that same node. If the underlying `agent.volumeBasePath` is on network storage (user-managed), it is cluster-persistent. `HostMount` data persists with the host path.
4.8. Job Execution Lifecycle
Defined by `workload.kat -> spec.type: Job` and `job.kat`.

- Leader schedules Job instances based on `schedule`, `completions`, and `parallelism`.
- Agent runs the container. On exit:
  - Exit code 0: instance `Succeeded`.
  - Non-zero: instance `Failed`. The Agent applies `restartPolicy` up to `job.kat -> spec.backoffLimit` for the Job instance (distinct from container restarts).
- Leader tracks `completions` and `activeDeadlineSeconds`.
4.9. Detached Node Operation and Rejoin
Revised mechanism for dynamic nodes (e.g., laptops):
- Configuration: Agents have `--parent-cluster-name` and `--node-type` (e.g., `laptop`, `stable`) flags.
- Detached Mode: If the Agent cannot reach the parent Leader after `nodeLossTimeoutSeconds`, it sets an internal `detached=true` flag.
- Local Leadership: The Agent becomes its own single-node Leader (trivial election).
- Local Operations:
  - Continues running pre-detachment workloads.
  - New workloads submitted to its local API get an automatic `nodeSelector` constraint: `kat.dws.rip/nodeName: <current_node_name>`.
- Rejoin Attempt: Periodically multicasts `(REJOIN_REQUEST, <parent_cluster_name>, ...)` on the local LAN.
- Parent Response & Rejoin: The parent Leader responds. The detached Agent clears the flag, submits its locally-created (nodeSelector-constrained) workloads to the parent Leader API, then performs a standard Agent join.
- Parent Reconciliation: Parent Leader accepts new workloads, respecting their nodeSelector.
5. State Management
5.1. State Store Interface (Go)
KAT components interact with etcd via a Go interface for abstraction.
```go
package store

import (
	"context"
)

type KV struct { Key string; Value []byte; Version int64 /* etcd ModRevision */ }
type WatchEvent struct { Type EventType; KV KV; PrevKV *KV }
type EventType int

const ( EventTypePut EventType = iota; EventTypeDelete )

type StateStore interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) (*KV, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string) ([]KV, error)
	Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error) // Added startRevision
	Close() error
	Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (leadershipCtx context.Context, err error) // Returns context cancelled on leadership loss
	Resign(ctx context.Context) error // Uses context from Campaign to manage lease
	GetLeader(ctx context.Context) (leaderID string, err error)
	DoTransaction(ctx context.Context, checks []Compare, onSuccess []Op, onFailure []Op) (committed bool, err error) // For CAS operations
}

type Compare struct { Key string; ExpectedVersion int64 /* 0 for key not exists */ }
type Op struct { Type OpType; Key string; Value []byte /* for Put */ }
type OpType int

const ( OpPut OpType = iota; OpDelete; OpGet /* not typically used in Txn success/fail ops */ )
```
The `Campaign` method returns a context that is cancelled when leadership is lost or `Resign` is called, simplifying leadership management. `DoTransaction` enables conditional writes for atomicity.
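As an illustration of the intended usage of `Campaign`, a Leader loop might look roughly like the following. This is a sketch against the interface above, placed in the same package for brevity; `runReconciliation` is a placeholder and the retry/backoff behaviour is an assumption.
```go
// leaderloop_sketch.go: illustrative use of StateStore.Campaign.
package store

import (
	"context"
	"time"
)

func runLeaderLoop(ctx context.Context, s StateStore, nodeName string) error {
	for ctx.Err() == nil {
		// Campaign is assumed to block until this node wins the election.
		leadershipCtx, err := s.Campaign(ctx, nodeName, 15 /* leaseTTLSeconds */)
		if err != nil {
			time.Sleep(2 * time.Second) // brief backoff before campaigning again
			continue
		}
		runReconciliation(leadershipCtx, s) // reconcile desired vs. actual state
		<-leadershipCtx.Done()              // returns when leadership is lost or Resign is called
	}
	return ctx.Err()
}

// runReconciliation is a placeholder for the Leader's control loop.
func runReconciliation(ctx context.Context, s StateStore) {}
```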
5.2. etcd Implementation Details
- Client: Uses `go.etcd.io/etcd/client/v3`.
- Embedded Server: Uses `go.etcd.io/etcd/server/v3/embed` within `kat-agent` on quorum nodes. Configuration (data-dir, peer/client URLs) comes from `cluster.kat` and agent flags.
- Key Schema Examples:
  - `/kat/schema_version`: `v1.0`
  - `/kat/config/cluster_uid`: UUID generated at init.
  - `/kat/config/leader_endpoint`: Current Leader's API endpoint.
  - `/kat/nodes/registration/{nodeName}`: Node's static registration info (UID, WireGuard pubkey, advertise addr).
  - `/kat/nodes/status/{nodeName}`: Node's dynamic status (heartbeat timestamp, resources, local instances). Leased by the agent.
  - `/kat/workloads/desired/{namespace}/{workloadName}/manifest/{fileName}`: Content of each Quadlet file.
  - `/kat/workloads/desired/{namespace}/{workloadName}/meta`: Workload metadata (generation, overall spec hash).
  - `/kat/workloads/status/{namespace}/{workloadName}`: Leader-maintained status of the workload.
  - `/kat/network/config/overlay_cidr`: ClusterCIDR.
  - `/kat/network/nodes/{nodeName}/subnet`: Assigned overlay subnet.
  - `/kat/network/allocations/{instanceID}/ip`: Assigned container overlay IP. Leased by the agent managing the instance.
  - `/kat/dns/{namespace}/{workloadName}/{recordType}/{value}`: Flattened DNS records.
  - `/kat/leader_election/` (etcd prefix): Used by `clientv3/concurrency/election`.
5.3. Leader Election
Utilizes `go.etcd.io/etcd/client/v3/concurrency#NewElection` and `Campaign`. All agents configured as potential quorum members participate. The elected Leader renews its lease continuously. If the lease expires (e.g., the Leader crashes), other candidates campaign.
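For reference, the underlying etcd primitives behave roughly as follows. This is a generic sketch of `clientv3/concurrency` usage with illustrative endpoints, TTL, and key prefix, not KAT's actual election code.
```go
// election_sketch.go: generic etcd leader election via clientv3/concurrency.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // illustrative etcd client endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session's lease keeps leadership alive; if the process dies, the
	// lease expires and another candidate wins its Campaign call.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(15))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/kat/leader_election")
	if err := election.Campaign(context.Background(), "node-01"); err != nil { // blocks until elected
		log.Fatal(err)
	}
	log.Println("acting as leader")

	<-session.Done() // closes when the lease can no longer be renewed
	log.Println("leadership lost")
}
```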
5.4. State Backup (Leader Responsibility)
The active Leader periodically performs an etcd snapshot.
- Interval: `backupIntervalMinutes` from `cluster.kat`.
- Action: Executes `etcdctl snapshot save {backupPath}/{timestamped_filename.db}` against its own embedded etcd member.
- Path: `backupPath` from `cluster.kat`.
- Rotation: Leader maintains the last N snapshots locally (e.g., N=5, configurable), deleting older ones.
- User Responsibility: These are local snapshots on the Leader node. Users MUST implement external mechanisms to copy these snapshots to secure, off-node storage.
5.5. State Restore Procedure
For disaster recovery (total cluster loss or etcd quorum corruption):
1. STOP all `kat-agent` processes on all nodes.
2. Identify the desired etcd snapshot file (`.db`).
3. On one designated node (intended to be the first new Leader):
   - Clear its old etcd data directory (`--data-dir` for etcd).
   - Restore the snapshot: `etcdctl snapshot restore <snapshot.db> --name <member_name> --initial-cluster <member_name>=http://<node_ip>:<etcdPeerPort> --initial-cluster-token <new_token> --data-dir <new_data_dir_path>`
   - Modify the `kat-agent` startup for this node to use the `new_data_dir_path` and configure it as if initializing a new cluster but pointing to this restored data (specific flags for etcd embed).
4. Start the `kat-agent` on this restored node. It will become Leader of a new single-member cluster with the restored state.
5. On all other KAT nodes:
   - Clear their old etcd data directories.
   - Clear any KAT agent local state (e.g., WireGuard configs, runtime state).
   - Join them to the new Leader using `kat-agent join` as if joining a fresh cluster.
6. The Leader's reconciliation loop will then redeploy workloads according to the restored desired state. In-flight data or state not captured in the last etcd snapshot is lost.
6. Container Runtime Interface
6.1. Runtime Interface Definition (Go)
Defines the abstraction KAT uses to manage containers.
```go
package runtime

import (
	"context"
	"io"
	"time"
)

type ImageSummary struct { ID string; Tags []string; Size int64 }

type ContainerState string

const (
	ContainerStateRunning  ContainerState = "running"
	ContainerStateExited   ContainerState = "exited"
	ContainerStateCreated  ContainerState = "created"
	ContainerStatePaused   ContainerState = "paused"
	ContainerStateRemoving ContainerState = "removing"
	ContainerStateUnknown  ContainerState = "unknown"
)

type HealthState string

const (
	HealthStateHealthy       HealthState = "healthy"
	HealthStateUnhealthy     HealthState = "unhealthy"
	HealthStatePending       HealthState = "pending_check"  // Health check defined but not yet run
	HealthStateNotApplicable HealthState = "not_applicable" // No health check defined
)

type ContainerStatus struct {
	ID         string
	ImageID    string
	ImageName  string // Image used to create container
	State      ContainerState
	ExitCode   int
	StartedAt  time.Time
	FinishedAt time.Time
	Health     HealthState
	Restarts   int // Number of times runtime restarted this specific container instance
	OverlayIP  string
}

type BuildOptions struct { // From Section 3.5, expanded
	ContextDir     string
	DockerfilePath string
	ImageTag       string // Target tag for the build
	BuildArgs      map[string]string
	TargetStage    string
	Platform       string
	CacheTo        []string // e.g., ["type=registry,ref=myreg.com/cache/img:latest"]
	CacheFrom      []string // e.g., ["type=registry,ref=myreg.com/cache/img:latest"]
	NoCache        bool
	Pull           bool // Whether to attempt to pull base images
}

type PortMapping struct { HostPort int; ContainerPort int; Protocol string /* TCP, UDP */; HostIP string /* 0.0.0.0 default */ }

type VolumeMount struct {
	Name        string // User-defined name of the volume from workload.spec.volumes
	Type        string // "hostMount", "simpleClusterStorage" (translated to "bind" for Podman)
	Source      string // Resolved host path for the volume
	Destination string // Mount path inside container
	ReadOnly    bool
	// SELinuxLabel, Propagation options if needed later
}

type GPUOptions struct { DeviceIDs []string /* e.g., ["0", "1"] or ["all"] */; Capabilities [][]string /* e.g., [["gpu"], ["compute","utility"]] */ }

type ResourceSpec struct {
	CPUShares        int64 // Relative weight
	CPUQuota         int64 // Microseconds per period (e.g., 50000 for 0.5 CPU with 100000 period)
	CPUPeriod        int64 // Microseconds (e.g., 100000)
	MemoryLimitBytes int64
	GPUSpec          *GPUOptions // If GPU requested
}

type ContainerCreateOptions struct {
	WorkloadName  string
	Namespace     string
	InstanceID    string // KAT-generated unique ID for this replica/run
	ImageName     string // Image to run (after pull/build)
	Hostname      string
	Command       []string
	Args          []string
	Env           map[string]string
	Labels        map[string]string // Include KAT ownership labels
	RestartPolicy string            // "no", "on-failure", "always" (Podman specific values)
	Resources     ResourceSpec
	Ports         []PortMapping
	Volumes       []VolumeMount
	NetworkName   string // Name of Podman network to join (e.g., for overlay)
	IPAddress     string // Static IP within Podman network, if assigned by KAT IPAM
	User          string // User to run as inside container (e.g., "1000:1000")
	CapAdd        []string
	CapDrop       []string
	SecurityOpt   []string
	HealthCheck   *ContainerHealthCheck // Podman native healthcheck config
	Systemd       bool                  // Run container with systemd as init
}

type ContainerHealthCheck struct {
	Test        []string // e.g., ["CMD", "curl", "-f", "http://localhost/health"]
	Interval    time.Duration
	Timeout     time.Duration
	Retries     int
	StartPeriod time.Duration
}

type ContainerRuntime interface {
	BuildImage(ctx context.Context, opts BuildOptions) (imageID string, err error)
	PullImage(ctx context.Context, imageName string, platform string) (imageID string, err error)
	PushImage(ctx context.Context, imageName string, destinationRegistry string) error
	CreateContainer(ctx context.Context, opts ContainerCreateOptions) (containerID string, err error)
	StartContainer(ctx context.Context, containerID string) error
	StopContainer(ctx context.Context, containerID string, timeoutSeconds uint) error
	RemoveContainer(ctx context.Context, containerID string, force bool, removeVolumes bool) error
	GetContainerStatus(ctx context.Context, containerOrName string) (*ContainerStatus, error)
	StreamContainerLogs(ctx context.Context, containerID string, follow bool, since time.Time, stdout io.Writer, stderr io.Writer) error
	PruneAllStoppedContainers(ctx context.Context) (reclaimedSpace int64, err error)
	PruneAllUnusedImages(ctx context.Context) (reclaimedSpace int64, err error)
	EnsureNetworkExists(ctx context.Context, networkName string, driver string, subnet string, gateway string, options map[string]string) error
	RemoveNetwork(ctx context.Context, networkName string) error
	ListManagedContainers(ctx context.Context) ([]ContainerStatus, error) // Lists containers labelled by KAT
}
```
6.2. Default Implementation: Podman
The default and only supported `ContainerRuntime` for KAT v1.0 is Podman. The implementation primarily shells out to the `podman` CLI, using appropriate JSON output flags for parsing. It assumes `podman` is installed and correctly configured for rootless operation on Agent nodes. Key commands used: `podman build`, `podman pull`, `podman push`, `podman create`, `podman start`, `podman stop`, `podman rm`, `podman inspect`, `podman logs`, `podman system prune`, `podman network create/rm/inspect`.
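A sketch of the shell-out pattern for one method, backing `GetContainerStatus` with `podman inspect`. The struct mirrors only a small subset of Podman's inspect output, and field availability may vary across Podman versions; the package and function names are illustrative.
```go
// podman_sketch.go: minimal shell-out to "podman inspect" with JSON decoding.
package podman

import (
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
)

// inspectOut mirrors a small subset of "podman inspect" output.
type inspectOut struct {
	Id    string `json:"Id"`
	Image string `json:"Image"`
	State struct {
		Status   string `json:"Status"`
		ExitCode int    `json:"ExitCode"`
	} `json:"State"`
}

func inspectContainer(ctx context.Context, nameOrID string) (*inspectOut, error) {
	out, err := exec.CommandContext(ctx, "podman", "inspect", "--type", "container", nameOrID).Output()
	if err != nil {
		return nil, fmt.Errorf("podman inspect %s: %w", nameOrID, err)
	}
	var results []inspectOut // podman inspect returns a JSON array
	if err := json.Unmarshal(out, &results); err != nil {
		return nil, err
	}
	if len(results) == 0 {
		return nil, fmt.Errorf("container %s not found", nameOrID)
	}
	return &results[0], nil
}
```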
6.3. Rootless Execution Strategy
KAT Agents MUST orchestrate container workloads rootlessly. The PREFERRED strategy is:
- Dedicated User per Workload/Namespace: The `kat-agent` (running as root, or with specific sudo rights for `useradd`, `loginctl`, `systemctl --user`) creates a dedicated, unprivileged system user account (e.g., `kat_wl_mywebapp`) when a workload is first scheduled to the node, or uses a pre-existing user from a pool.
- Enable Linger: `loginctl enable-linger <username>`.
- Generate Systemd Unit: The Agent translates the KAT workload definition into container create options and uses `podman generate systemd --new --name {instanceID} --files --time 10 {imageName} {command...}` to produce a `.service` unit file. This unit includes environment variables, volume mounts, port mappings (if host-mapped), resource limits, etc. The `Restart=` directive in the systemd unit is set according to `workload.kat -> spec.restartPolicy`.
- Place and Manage Unit: The unit file is placed in `/etc/systemd/user/` (if the agent is root, enabling it for the target user) or `~{username}/.config/systemd/user/`. The Agent then uses `systemctl --user --machine={username}@.host daemon-reload` and `systemctl --user --machine={username}@.host enable --now {service_name}.service` to start and manage it.
- Status and Logs: Agent queries `systemctl --user --machine... status` and `journalctl --user-unit ...` for status and logs.
This leverages systemd's robust process supervision and cgroup management for rootless containers.
7. Networking
7.1. Integrated Overlay Network
KAT v1.0 implements a mandatory, simple, encrypted Layer 3 overlay network connecting all Nodes using WireGuard.
- Configuration: Defined by `cluster.kat -> spec.clusterCIDR`.
- Key Management:
  - Each Agent generates a WireGuard key pair locally upon first start/join. The private key is stored securely (e.g., `/etc/kat/wg_private.key`, mode 0600). The public key is reported to the Leader during registration.
  - The Leader stores all registered Node public keys and their external advertise IPs (for the WireGuard endpoint) in etcd under `/kat/network/nodes/{nodeName}/wg_pubkey` and `/kat/network/nodes/{nodeName}/wg_endpoint`.
- Peer Configuration: Each Agent watches `/kat/network/nodes/` in etcd. When a new node joins or an existing node's WireGuard info changes, the Agent updates its local WireGuard configuration (e.g., for interface `kat0`):
  - Adds/updates a `[Peer]` section for every other node:
    - `PublicKey = {peer_public_key}`
    - `Endpoint = {peer_advertise_ip}:{configured_wg_port}`
    - `AllowedIPs = {peer_assigned_overlay_subnet_cidr}` (see IPAM below).
  - `PersistentKeepalive` MAY be used if nodes are behind NAT.
- Interface Setup: Agent ensures the `kat0` interface is up with its assigned overlay IP. Standard OS routing rules handle traffic for the `clusterCIDR` via `kat0`.
7.2. IP Address Management (IPAM)
The Leader manages IP allocation for the overlay network.
- Node Subnets: From `clusterCIDR` and `nodeSubnetBits` (from `cluster.kat`), the Leader carves out a distinct subnet for each Node that joins (e.g., if clusterCIDR is `10.100.0.0/16` and `nodeSubnetBits` is 7, each node gets a /23, such as `10.100.0.0/23`, `10.100.2.0/23`, etc.). This Node-to-Subnet mapping is stored in etcd.
- Container IPs: When the Leader schedules a Workload instance to a Node, it allocates the next available IP address from that Node's assigned subnet. This `instanceID -> containerIP` mapping is stored in etcd, possibly with a lease. The Agent is informed of this IP to pass to `podman create --ip ...`.
- Maximum Instances: The size of the node subnet implicitly limits the number of container instances per node.
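A minimal sketch of the subnet-carving arithmetic described above, assuming IPv4 and interpreting `nodeSubnetBits` as the number of prefix bits added to `clusterCIDR` (consistent with the /16 + 7 = /23 example). The function name is illustrative, not KAT's actual IPAM code.
```go
// ipam_sketch.go: carve the idx-th per-node subnet out of clusterCIDR.
package main

import (
	"fmt"
	"math/big"
	"net"
)

func nodeSubnet(clusterCIDR string, nodeSubnetBits, idx int) (*net.IPNet, error) {
	_, base, err := net.ParseCIDR(clusterCIDR)
	if err != nil {
		return nil, err
	}
	basePrefix, totalBits := base.Mask.Size()
	subnetPrefix := basePrefix + nodeSubnetBits
	if subnetPrefix > totalBits {
		return nil, fmt.Errorf("nodeSubnetBits too large for %s", clusterCIDR)
	}
	if idx >= 1<<nodeSubnetBits {
		return nil, fmt.Errorf("subnet index %d exhausts %d-bit space", idx, nodeSubnetBits)
	}
	// Offset the base address by idx * (addresses per node subnet). IPv4 assumed.
	hostBits := totalBits - subnetPrefix
	offset := new(big.Int).Lsh(big.NewInt(int64(idx)), uint(hostBits))
	ip := new(big.Int).SetBytes(base.IP.To4())
	ip.Add(ip, offset)
	out := make(net.IP, 4)
	ip.FillBytes(out)
	return &net.IPNet{IP: out, Mask: net.CIDRMask(subnetPrefix, totalBits)}, nil
}

func main() {
	for i := 0; i < 3; i++ {
		s, _ := nodeSubnet("10.100.0.0/16", 7, i)
		fmt.Println(s) // 10.100.0.0/23, 10.100.2.0/23, 10.100.4.0/23
	}
}
```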
7.3. Distributed Agent DNS and Service Discovery
Each KAT Agent runs an embedded DNS resolver, synchronized via etcd, providing service discovery.
- DNS Server Implementation: Agents use `github.com/miekg/dns` to run a DNS server goroutine, listening on their `kat0` overlay IP (port 53).
- Record Source:
  - When a `Workload` instance (especially `Service` or `DaemonService`) with an assigned overlay IP becomes healthy (or starts, if no health check is defined), the Leader writes DNS A records to etcd: `A <instanceID>.<workloadName>.<namespace>.<clusterDomain>` -> `<containerOverlayIP>`.
  - For Services with `VirtualLoadBalancer.kat -> spec.ports`: `A <workloadName>.<namespace>.<clusterDomain>` -> `<containerOverlayIP>` (multiple A records are created, one per healthy instance).
  - The etcd key structure might be `/kat/dns/{clusterDomain}/{namespace}/{workloadName}/{instanceID_or_service_A}`.
- Agent DNS Sync: Each Agent's DNS server watches the `/kat/dns/` prefix in etcd. On changes, it updates its in-memory DNS zone data.
- Container Configuration: Agents configure the `/etc/resolv.conf` of all managed containers to use the Agent's own `kat0` overlay IP as the sole nameserver.
- Query Handling (a resolver sketch follows this list):
  - The local Agent DNS resolver first attempts to resolve queries relative to the source container's namespace (e.g., `app` from `ns-foo` tries `app.ns-foo.kat.cluster.local`).
  - If not found, it tries the fully qualified name as-is.
  - It implements basic negative caching (NXDOMAIN with a short TTL) to reduce load.
  - It does NOT forward KAT domain names to upstream resolvers. For external names, it may forward, or containers must have a secondary upstream resolver configured (V1: no upstream forwarding by the agent DNS, for simplicity).
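A minimal sketch of the agent-side resolver using `github.com/miekg/dns`, with an in-memory map standing in for the zone data synced from `/kat/dns/`; the listen address, the record contents, and the omission of namespace-relative search logic are simplifications for illustration.

```go
package main

import (
	"log"
	"net"
	"strings"
	"sync"

	"github.com/miekg/dns"
)

// zone is the in-memory A-record table the Agent keeps in sync with the
// /kat/dns/ prefix in etcd (names are lower-case, fully qualified).
var (
	mu   sync.RWMutex
	zone = map[string]net.IP{
		"app.ns-foo.kat.cluster.local.": net.ParseIP("10.100.0.7"),
	}
)

func handle(w dns.ResponseWriter, req *dns.Msg) {
	m := new(dns.Msg)
	m.SetReply(req)
	m.Authoritative = true
	if len(req.Question) == 1 && req.Question[0].Qtype == dns.TypeA {
		q := req.Question[0]
		mu.RLock()
		ip, ok := zone[strings.ToLower(q.Name)]
		mu.RUnlock()
		if ok {
			m.Answer = append(m.Answer, &dns.A{
				Hdr: dns.RR_Header{Name: q.Name, Rrtype: dns.TypeA, Class: dns.ClassINET, Ttl: 30},
				A:   ip.To4(),
			})
			w.WriteMsg(m)
			return
		}
	}
	// NXDOMAIN; the real agent would also apply the namespace search order
	// and the short negative-caching TTL described above.
	m.SetRcode(req, dns.RcodeNameError)
	w.WriteMsg(m)
}

func main() {
	dns.HandleFunc(".", handle)                            // serve all queries arriving on the kat0 IP
	srv := &dns.Server{Addr: "10.100.0.1:53", Net: "udp"} // Agent's kat0 overlay IP (placeholder)
	log.Fatal(srv.ListenAndServe())
}
```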
7.4. Ingress (Opinionated Recipe via Traefik)
KAT provides a standardized way to deploy Traefik for ingress.
- Ingress Node Designation: Admins label Nodes intended for ingress with `kat.dws.rip/role=ingress`.
- `kat-traefik-ingress` Quadlet: DWS LLC provides standard Quadlet files:
  - `workload.kat`: Deploys Traefik as a `DaemonService` with a `nodeSelector` for `kat.dws.rip/role=ingress`. Includes the `kat-ingress-updater` container.
  - `VirtualLoadBalancer.kat`: Exposes Traefik's ports (80, 443) via `HostPort` on the ingress Nodes. Specifies health checks for Traefik itself.
  - `volume.kat`: Mounts host paths for `/etc/traefik/traefik.yaml` (static config), `/data/traefik/dynamic_conf/` (for `kat-ingress-updater`), and `/data/traefik/acme/` (for Let's Encrypt certs).
- `kat-ingress-updater` Container (a config-generation sketch follows this list):
  - Runs alongside Traefik. Watches the KAT API for `VirtualLoadBalancer` Quadlets with `spec.ingress` stanzas.
  - Generates Traefik dynamic configuration files (routers, services) mapping external host/path to internal KAT service FQDNs (e.g., `<service>.<namespace>.kat.cluster.local:<port>`).
  - Configures the Traefik `certResolver` for Let's Encrypt for services requesting TLS.
  - Traefik watches its dynamic configuration directory.
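A sketch of the kind of dynamic-configuration generation `kat-ingress-updater` performs, assuming a hypothetical flattened `IngressRule` view of a `spec.ingress` stanza; the resolver name `letsencrypt` and the file layout are assumptions, and only the general shape of Traefik's file-provider format is relied upon.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"text/template"
)

// IngressRule is a hypothetical flattened view of one VirtualLoadBalancer
// spec.ingress entry as the updater might see it via the KAT API.
type IngressRule struct {
	Name      string // workload name, e.g. "web"
	Namespace string // e.g. "prod"
	Host      string // external host
	Port      int    // service port inside the overlay
	TLS       bool   // request a Let's Encrypt cert via certResolver
}

// tmpl emits one dynamic-config file per rule, in Traefik's file-provider style.
var tmpl = template.Must(template.New("dyn").Parse(`http:
  routers:
    {{.Name}}-{{.Namespace}}:
      rule: "Host(` + "`{{.Host}}`" + `)"
      service: {{.Name}}-{{.Namespace}}-svc
{{- if .TLS}}
      tls:
        certResolver: letsencrypt
{{- end}}
  services:
    {{.Name}}-{{.Namespace}}-svc:
      loadBalancer:
        servers:
          - url: "http://{{.Name}}.{{.Namespace}}.kat.cluster.local:{{.Port}}"
`))

func writeDynamicConf(dir string, r IngressRule) error {
	f, err := os.Create(filepath.Join(dir, fmt.Sprintf("%s-%s.yaml", r.Namespace, r.Name)))
	if err != nil {
		return err
	}
	defer f.Close()
	return tmpl.Execute(f, r) // Traefik picks up the new file from its watched directory
}

func main() {
	err := writeDynamicConf("/data/traefik/dynamic_conf",
		IngressRule{Name: "web", Namespace: "prod", Host: "example.com", Port: 8080, TLS: true})
	if err != nil {
		fmt.Println("write failed:", err)
	}
}
```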
8. API Specification (KAT v1.0 Alpha)
8.1. General Principles and Authentication
- Protocol: HTTP/1.1 or HTTP/2. mTLS is mandatory for Agent-Leader and CLI-Leader communication.
- Data Format: Request/Response bodies MUST be JSON.
- Versioning: Endpoints are prefixed with `/v1alpha1`.
- Authentication: Static Bearer Token in the `Authorization` header for CLI/external API clients. For KAT v1, this token grants full cluster admin rights. Agent-to-Leader mTLS serves as agent authentication.
- Error Reporting: Standard HTTP status codes, with a JSON body for errors: `{"error": "code", "message": "details"}`.
8.2. Resource Representation (Proto3 & JSON)
All API resources (Workloads, Namespaces, Nodes, etc., and their Quadlet file contents) are defined using Protocol Buffer v3 messages. The HTTP API transports these as JSON. Common metadata (name, namespace, uid, generation, resourceVersion, creationTimestamp) and status structures are standardized.
8.3. Core API Endpoints
The core v1alpha1 endpoints are (see the client sketch after this list):
- Namespace CRUD.
- Workload CRUD: `POST`/`PUT` accept a `tar.gz` of the Quadlet directory; `GET` returns metadata and status. Additional endpoints cover individual Quadlet file content (`.../files/{fileName}`), instance logs (`.../instances/{instanceID}/logs`), and rollback (`.../rollback`).
- Node read endpoints: `GET /nodes`, `GET /nodes/{name}`. Agent status update: `POST /nodes/{name}/status`. Admin taint update: `PUT /nodes/{name}/taints`.
- Event query endpoint: `GET /events`.
- ClusterConfiguration read endpoint: `GET /config/cluster` (shows the sanitized running configuration).

There is no separate top-level Volume API in KAT v1; volumes are defined within workloads.
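A minimal client sketch against the v1alpha1 API, assuming a Leader reachable at `https://leader.kat.example:9115` and an admin token in the `KAT_TOKEN` environment variable; the host, port, and token source are illustrative, not part of this specification.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Trust the cluster CA for the Leader's serving certificate.
	caPEM, err := os.ReadFile("/var/lib/kat/pki/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}},
	}

	// Hypothetical Leader address; the bearer token grants full admin rights in V1.
	req, _ := http.NewRequest("GET", "https://leader.kat.example:9115/v1alpha1/nodes", nil)
	req.Header.Set("Authorization", "Bearer "+os.Getenv("KAT_TOKEN"))

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s\n%s\n", resp.Status, body) // JSON list of Node resources
}
```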
9. Observability
9.1. Logging
- Container Logs: Agents capture stdout/stderr, make it available via the `podman logs` mechanism, and stream it via the API to `katcall logs`. Logs are rotated locally on the agent node.
- Agent Logs: `kat-agent` logs to the systemd journal or to local files.
- API Audit (Basic): The Leader logs API requests (method, path, source IP, and user where distinguishable) at a configurable level.
9.2. Metrics
- Agent Metrics: Node resource usage (CPU, memory, disk, network), container resource usage. Included in heartbeats.
- Leader Metrics: API request latencies/counts, scheduling attempts/successes/failures, etcd health.
- Exposure (V1): Minimal exposure via a `/metrics` JSON endpoint on the Leader and Agent; not yet Prometheus-formatted (a handler sketch follows this list).
- Future: Standardized Prometheus exposition format.
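For illustration, a minimal sketch of what an Agent's V1 JSON `/metrics` endpoint could look like; the field names, values, and port are assumptions, since this specification does not fix the payload schema.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// agentMetrics is a hypothetical snapshot; the spec does not fix these fields.
type agentMetrics struct {
	Timestamp     time.Time `json:"timestamp"`
	CPUPercent    float64   `json:"cpuPercent"`
	MemoryBytes   uint64    `json:"memoryBytes"`
	DiskBytesFree uint64    `json:"diskBytesFree"`
	Containers    int       `json:"containers"`
}

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		snap := agentMetrics{Timestamp: time.Now(), CPUPercent: 12.5,
			MemoryBytes: 512 << 20, DiskBytesFree: 40 << 30, Containers: 3}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(snap)
	})
	log.Fatal(http.ListenAndServe(":9116", nil)) // port is an assumption
}
```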
9.3. Events
The Leader records significant cluster events (Workload create/update/delete, instance schedule/fail/health_change, Node ready/not_ready/join/leave, build success/fail, detached/rejoin actions) into a capped, time-series-like structure in etcd.
- API: `GET /v1alpha1/events?[resourceType=X][&resourceName=Y][&namespace=Z]`
- Fields per event: Timestamp, Type, Reason, InvolvedObject (kind, name, ns, uid), Message (a sketch of the record as a Go struct follows this list).
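A sketch of the event record as a Go struct with JSON tags, derived from the fields listed above; the exact field names and types are not fixed by this specification.

```go
package kat

import "time"

// InvolvedObject identifies the resource an event refers to.
type InvolvedObject struct {
	Kind      string `json:"kind"` // e.g. "Workload", "Node"
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
	UID       string `json:"uid"`
}

// Event mirrors the fields listed above; stored in a capped structure in etcd
// and returned by GET /v1alpha1/events.
type Event struct {
	Timestamp      time.Time      `json:"timestamp"`
	Type           string         `json:"type"`   // e.g. "Normal", "Warning"
	Reason         string         `json:"reason"` // e.g. "InstanceScheduled", "BuildFailed"
	InvolvedObject InvolvedObject `json:"involvedObject"`
	Message        string         `json:"message"`
}
```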
10. Security Considerations
10.1. API Security
- mTLS REQUIRED for all inter-KAT component communication (Agent-Leader).
- Bearer token for external API clients (e.g., `katcall`). V1: a single admin token; no granular RBAC.
- The API server should implement rate limiting.
10.2. Rootless Execution
Rootless execution is a core design principle. Agents execute workloads via Podman in rootless mode, leveraging systemd user sessions for enhanced isolation; this minimizes the impact of a container escape.
10.3. Build Security
- Building arbitrary Git repositories on Agent nodes is a potential risk.
- Builds run as unprivileged users via rootless Podman.
- Network access during build MAY be restricted in future (V1: unrestricted).
- Users are responsible for trusting Git sources; `cacheImage` provides a way to use pre-vetted images.
10.4. Network Security
- WireGuard overlay provides inter-node and inter-container encryption.
- Host firewalls are the user's responsibility. `nodePort` or Ingress exposure requires careful firewall configuration.
- API/Agent communication ports should be firewalled from public access.
10.5. Secrets Management
- KAT v1 has NO dedicated secret management.
- Sensitive data passed via environment variables in `workload.kat -> spec.container.env` is stored in plain text in etcd. This is NOT secure for production secrets.
- Registry credentials for `cacheImage` push/pull are part of local Agent configuration.
- Recommendation: For sensitive data, users should use application-level encryption or sidecars that fetch from external secret stores (e.g., Vault), outside KAT's direct management in V1.
10.6. Internal PKI
- Initialization (`kat-agent init`):
  - Generates a self-signed CA key (`ca.key`) and CA certificate (`ca.crt`), stored securely on the initial Leader node (e.g., `/var/lib/kat/pki/`).
  - Generates a Leader server key/cert signed by this CA for its API and Agent communication endpoints.
  - Generates a Leader client key/cert signed by this CA for authenticating to etcd and Agents.
- Node Join (`kat-agent join`):
  - The Agent generates a keypair and a CSR.
  - It sends the CSR to the Leader over an initial (potentially untrusted, or token-protected if implemented later) channel.
  - The Leader signs the Agent's CSR using the CA key and returns the signed Agent certificate along with the CA certificate.
  - The Agent stores its key, its signed certificate, and the CA certificate for mTLS.
- mTLS Usage: All Agent-Leader and Leader-Agent (for commands) communications use mTLS, validating peer certificates against the cluster CA (a CSR-signing sketch follows this list).
- Certificate Lifespan & Rotation: For V1, certificates might have a long lifespan (e.g., 1-10 years). Automated rotation is deferred; manual regeneration and redistribution would be needed.
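To make the join flow concrete, a minimal sketch of how the Leader might sign an Agent's CSR with the cluster CA using Go's standard library; loading of `ca.key`/`ca.crt`, the serial-number scheme, and the five-year lifespan are assumptions, not requirements of this specification.

```go
package pki

import (
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// signAgentCSR verifies a PEM-encoded CSR and returns a PEM-encoded certificate
// signed by the cluster CA. caCert and caKey (which must be a crypto.Signer)
// would be loaded from /var/lib/kat/pki/ on the Leader.
func signAgentCSR(csrPEM []byte, caCert *x509.Certificate, caKey any) ([]byte, error) {
	block, _ := pem.Decode(csrPEM)
	if block == nil || block.Type != "CERTIFICATE REQUEST" {
		return nil, fmt.Errorf("invalid CSR PEM")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, err // proves the agent holds the matching private key
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      pkix.Name{CommonName: csr.Subject.CommonName}, // e.g. the node name
		NotBefore:    time.Now().Add(-5 * time.Minute),
		NotAfter:     time.Now().AddDate(5, 0, 0), // long V1 lifespan; rotation is manual
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}
```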
13. Acknowledgements
The KAT system design, while aiming for novel simplicity, stands on the shoulders of giants. Its architecture and concepts draw inspiration and incorporate lessons learned from numerous preceding systems and bodies of work in distributed computing and container orchestration. We specifically acknowledge the influence of:
- Kubernetes: For establishing many of the core concepts and terminology in modern container orchestration, even as KAT diverges in implementation complexity and API specifics.
- k3s and MicroK8s: For demonstrating the demand and feasibility of lightweight Kubernetes distributions, validating the need KAT aims to fill more radically.
- Podman & Quadlets: For pioneering robust rootless containerization and providing the direct inspiration for KAT's declarative Quadlet configuration model and systemd user service execution strategy.
- Docker Compose: For setting the standard in single-host multi-container application definition simplicity.
- HashiCorp Nomad: For demonstrating an alternative, successful approach to simplified, flexible orchestration beyond the Kubernetes paradigm, particularly its use of HCL and clear deployment primitives.
- Google Borg: For concepts in large-scale cluster management, scheduling, and the importance of introspection, as documented in their published research.
- The "Hints for Computer System Design" (Butler Lampson): For principles regarding simplicity, abstraction, performance trade-offs, and fault tolerance that heavily influenced KAT's philosophy.
- "A Note on Distributed Computing" (Waldo et al.): For articulating the fundamental differences between local and distributed computing that KAT attempts to manage pragmatically, rather than hide entirely.
- The Grug Brained Developer: For the essential reminder to relentlessly fight complexity and prioritize understandability.
- Open Source Community: For countless libraries, tools, discussions, and prior art that make a project like KAT feasible.
Finally, thanks to Simba, my cat, for providing naming inspiration.
14. Author's Address
Tanishq Dubey
DWS LLC
Email: [email protected]
URI: https://www.dws.rip