Request for Comments: 001 - The KAT System (v1.0 Specification)

Network Working Group: DWS LLC
Author: T. Dubey
Contact: [email protected]
Organization: DWS LLC
URI: https://www.dws.rip
Date: May 2025
Obsoletes: None
Updates: None

The KAT System: A Simplified Container Orchestration Protocol and Architecture Design (Version 1.0)

Status of This Memo

This document specifies Version 1.0 of the KAT (pronounced "cat") system, a simplified container orchestration protocol and architecture developed by DWS LLC. It defines the system's components, operational semantics, resource model, networking, state management, and Application Programming Interface (API). This specification is intended for implementation, discussion, and interoperability. Distribution of this memo is unlimited.

Abstract

The KAT system provides a lightweight, opinionated container orchestration platform specifically designed for resource-constrained environments such as single servers, small clusters, development sandboxes, home labs, and edge deployments. It contrasts with complex, enterprise-scale orchestrators by prioritizing simplicity, minimal resource overhead, developer experience, and direct integration with Git-based workflows. KAT manages containerized long-running services and batch jobs using a declarative "Quadlet" configuration model. Key features include an embedded etcd state store, a Leader-Agent architecture, automated on-agent builds from Git sources, rootless container execution, integrated overlay networking (WireGuard-based), distributed agent-local DNS for service discovery, resource-based scheduling with basic affinity/anti-affinity rules, and structured workload updates. This document provides a comprehensive specification for KAT v1.0 implementation and usage.

Table of Contents

  1. Introduction
    1.1. Motivation
    1.2. Goals
    1.3. Design Philosophy
    1.4. Scope of KAT v1.0
    1.5. Terminology
  2. System Architecture
    2.1. Overview
    2.2. Components
    2.3. Node Communication Protocol
  3. Resource Model: KAT Quadlets
    3.1. Overview
    3.2. Workload Definition (workload.kat)
    3.3. Virtual Load Balancer Definition (VirtualLoadBalancer.kat)
    3.4. Job Definition (job.kat)
    3.5. Build Definition (build.kat)
    3.6. Volume Definition (volume.kat)
    3.7. Namespace Definition (namespace.kat)
    3.8. Node Resource (Internal)
    3.9. Cluster Configuration (cluster.kat)
  4. Core Operations and Lifecycle Management
    4.1. System Bootstrapping and Node Lifecycle
    4.2. Workload Deployment and Source Management
    4.3. Git-Native Build Process
    4.4. Scheduling
    4.5. Workload Updates and Rollouts
    4.6. Container Lifecycle Management
    4.7. Volume Lifecycle Management
    4.8. Job Execution Lifecycle
    4.9. Detached Node Operation and Rejoin
  5. State Management
    5.1. State Store Interface (Go)
    5.2. etcd Implementation Details
    5.3. Leader Election
    5.4. State Backup (Leader Responsibility)
    5.5. State Restore Procedure
  6. Container Runtime Interface
    6.1. Runtime Interface Definition (Go)
    6.2. Default Implementation: Podman
    6.3. Rootless Execution Strategy
  7. Networking
    7.1. Integrated Overlay Network
    7.2. IP Address Management (IPAM)
    7.3. Distributed Agent DNS and Service Discovery
    7.4. Ingress (Opinionated Recipe via Traefik)
  8. API Specification (KAT v1.0 Alpha)
    8.1. General Principles and Authentication
    8.2. Resource Representation (Proto3 & JSON)
    8.3. Core API Endpoints
  9. Observability
    9.1. Logging
    9.2. Metrics
    9.3. Events
  10. Security Considerations
    10.1. API Security
    10.2. Rootless Execution
    10.3. Build Security
    10.4. Network Security
    10.5. Secrets Management
    10.6. Internal PKI
  11. Comparison to Alternatives
  12. Future Work
  13. Acknowledgements
  14. Author's Address

1. Introduction

1.1. Motivation

The landscape of container orchestration is dominated by powerful, feature-rich platforms designed for large-scale, enterprise deployments. While capable, these systems (e.g., Kubernetes) introduce significant operational complexity and resource requirements (CPU, memory overhead) that are often prohibitive or unnecessarily burdensome for smaller use cases. Developers and operators managing personal projects, home labs, CI/CD runners, small business applications, or edge devices frequently face a choice between the friction of manual deployment (SSH, scripts, docker-compose) and the excessive overhead of full-scale orchestrators. This gap highlights the need for a solution that provides core orchestration benefits – declarative management, automation, scheduling, self-healing – without the associated complexity and resource cost. KAT aims to be that solution.

1.2. Goals

The primary objectives guiding the design of KAT v1.0 are:

  • Simplicity: Offer an orchestration experience that is significantly easier to install, configure, learn, and operate than existing mainstream platforms. Minimize conceptual surface area and required configuration.
  • Minimal Overhead: Design KAT's core components (Leader, Agent, etcd) to consume minimal system resources, ensuring maximum availability for application workloads, particularly critical in single-node or low-resource scenarios.
  • Core Orchestration: Provide robust management for the lifecycle of containerized long-running services, scheduled/batch jobs, and basic daemon sets.
  • Automation: Enable automated deployment updates, on-agent image builds triggered directly from Git repository changes, and fundamental self-healing capabilities (container restarts, service replica rescheduling).
  • Git-Native Workflow: Facilitate a direct "push-to-deploy" model, integrating seamlessly with common developer workflows centered around Git version control.
  • Rootless Operation: Implement container execution using unprivileged users by default to enhance security posture and reduce system dependencies.
  • Integrated Experience: Provide built-in solutions for fundamental requirements like overlay networking and service discovery, reducing reliance on complex external components for basic operation.

1.3. Design Philosophy

KAT adheres to the following principles:

  • Embrace Simplicity (Grug Brained): Actively combat complexity. Prefer simpler solutions even if they cover slightly fewer edge cases initially. Provide opinionated defaults based on common usage patterns. (The Grug Brained Developer)
  • Declarative Configuration: Users define the desired state via Quadlet files; KAT implements the control loops to achieve and maintain it.
  • Locality of Behavior: Group related configurations logically (Quadlet directories) rather than by arbitrary type separation across the system. (HTMX: Locality of Behaviour)
  • Leverage Stable Foundations: Utilize proven, well-maintained technologies like etcd (for consensus) and Podman (for container runtime).
  • Explicit is Better than Implicit (Mostly): While providing defaults, make configuration options clear and understandable. Avoid overly "magic" behavior.
  • Build for the Common Case: Focus V1 on solving the 80-90% of use cases for the target audience extremely well.
  • Fail Fast, Recover Simply: Design components to handle failures predictably. Prioritize simple recovery mechanisms (like etcd snapshots, agent restarts converging state) over complex distributed failure handling protocols where possible for V1.

1.4. Scope of KAT v1.0

This specification details KAT Version 1.0. It includes:

  • Leader-Agent architecture with etcd-based state and leader election.
  • Quadlet resource model (Workload, VirtualLoadBalancer, JobDefinition, BuildDefinition, VolumeDefinition, Namespace).
  • Deployment of Services, Jobs, and DaemonServices.
  • Source specification via direct image name or Git repository (with on-agent builds using Podman). Optional build caching via registry.
  • Resource-based scheduling with nodeSelector and Taint/Toleration support, using a "most empty" placement strategy.
  • Workload updates via Simultaneous or Rolling strategies (maxSurge control). Manual rollback support.
  • Container lifecycle management including restart policies (Never, MaxCount, Always with reset timer) and optional health checks.
  • Volume support for HostMount and SimpleClusterStorage.
  • Integrated WireGuard-based overlay networking.
  • Distributed agent-local DNS for service discovery, synchronized via etcd.
  • Detached node operation mode with simplified rejoin logic.
  • Basic state backup via Leader-driven etcd snapshots.
  • Rootless container execution via systemd user services.
  • A Proto3-defined, JSON-over-HTTP RESTful API (v1alpha1).
  • Opinionated Ingress recipe using Traefik.

Features explicitly deferred beyond v1.0 are listed in Section 12.

1.5. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

  • KAT System (Cluster): The complete set of KAT Nodes forming a single operational orchestration domain.
  • Node: An individual machine (physical or virtual) running the KAT Agent software. Each Node has a unique name within the cluster.
  • Leader Node (Leader): The single Node currently elected via the consensus mechanism to perform authoritative cluster management tasks.
  • Agent Node (Agent): A Node running the KAT Agent software, responsible for local workload execution and status reporting. Includes the Leader node.
  • Namespace: A logical partition within the KAT cluster used to organize resources (Workloads, Volumes). Defined by namespace.kat. Default is "default". System namespace is "kat-core".
  • Workload: The primary unit of application deployment, defined by a set of Quadlet files specifying desired state. Types: Service, Job, DaemonService.
  • Service: A Workload type representing a long-running application.
  • Job: A Workload type representing a task that runs to completion.
  • DaemonService: A Workload type ensuring one instance runs on each eligible Node.
  • KAT Quadlet (Quadlet): A set of co-located YAML files (*.kat) defining a single Workload. Submitted and managed atomically.
  • Container: A running instance managed by the container runtime (Podman).
  • Image: The template for a container, specified directly or built from Git.
  • Volume: Persistent or ephemeral storage attached to a Workload's container(s). Types: HostMount, SimpleClusterStorage.
  • Overlay Network: KAT-managed virtual network (WireGuard) for inter-node/inter-container communication.
  • Service Discovery: Mechanism (distributed agent DNS) for resolving service names to overlay IPs.
  • Ingress: Exposing internal services externally, typically via the Traefik recipe.
  • Tick: Configurable interval for Agent heartbeats to the Leader.
  • Taint: Key/Value/Effect marker on a Node to repel workloads.
  • Toleration: Marker on a Workload allowing it to schedule on Nodes with matching Taints.
  • API: Application Programming Interface (HTTP/JSON based on Proto3).
  • CLI: Command Line Interface (katcall).
  • etcd: Distributed key-value store used for consensus and state.

2. System Architecture

2.1. Overview

KAT operates using a distributed Leader-Agent model built upon an embedded etcd consensus layer. One kat-agent instance is elected Leader, responsible for maintaining the cluster's desired state, making scheduling decisions, and serving the API. All other kat-agent instances act as workers (Agents), executing tasks assigned by the Leader and reporting their status. Communication occurs primarily between Agents and the Leader, facilitated by an integrated overlay network.

2.2. Components

  • kat-agent (Binary): The single executable deployed on all nodes. Runs in one of two primary modes internally based on leader election status: Agent or Leader.
    • Common Functions: Node registration, heartbeating, overlay network participation, local container runtime interaction (Podman via CRI interface), local state monitoring, execution of Leader commands.
    • Rootless Execution: Manages container workloads under distinct, unprivileged user accounts via systemd user sessions (preferred method).
  • Leader Role (Internal state within an elected kat-agent):
    • Hosts API endpoints.
    • Manages desired/actual state in etcd.
    • Runs the scheduling logic.
    • Executes the reconciliation control loop.
    • Manages IPAM for the overlay network.
    • Updates DNS records in etcd.
    • Coordinates node join/leave/failure handling.
    • Initiates etcd backups.
  • Embedded etcd: Linked library providing Raft consensus for leader election and strongly consistent key-value storage for all cluster state (desired specs, actual status, network config, DNS records). Runs within the kat-agent process on quorum members (typically 1, 3, or 5 nodes).

2.3. Node Communication Protocol

  • Transport: HTTP/1.1 or HTTP/2 over mandatory mTLS. KAT includes a simple internal PKI bootstrapped during init and join.
  • Agent -> Leader: Periodic POST /v1alpha1/nodes/{nodeName}/status containing heartbeat and detailed node/workload status. Triggered every Tick. Immediate reports for critical events MAY be sent.
  • Leader -> Agent: Commands (create/start/stop/remove container, update config) sent via POST/PUT/DELETE to agent-specific endpoints (e.g., https://{agentOverlayIP}:{agentPort}/agent/v1alpha1/...).
  • Payload: JSON, derived from Proto3 message definitions.
  • Discovery/Join: Initial contact via leader hint uses HTTP API; subsequent peer discovery for etcd/overlay uses information distributed by the Leader via the API/etcd.
  • Detached Mode Communication: Multicast/Broadcast UDP for REJOIN_REQUEST messages on local network segments. Direct HTTP response from parent Leader.
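
As an illustration of the Agent-to-Leader channel described above, the following Go sketch builds an mTLS-authenticated HTTP client and posts a status payload to the Leader's /v1alpha1/nodes/{nodeName}/status endpoint. The certificate paths, helper names, and payload type are assumptions of this sketch, not part of the specification.

package agent

import (
	"bytes"
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// newMTLSClient builds an HTTP client that presents the node's certificate and
// trusts only the cluster CA (paths are illustrative).
func newMTLSClient(caPath, certPath, keyPath string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("invalid CA certificate at %s", caPath)
	}
	cert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{
			RootCAs:      pool,
			Certificates: []tls.Certificate{cert},
		}},
	}, nil
}

// postStatus sends one heartbeat; status is any JSON-serializable node status payload.
func postStatus(c *http.Client, leaderAPI, nodeName string, status any) error {
	body, err := json.Marshal(status)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("https://%s/v1alpha1/nodes/%s/status", leaderAPI, nodeName)
	resp, err := c.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("leader returned %s", resp.Status)
	}
	return nil
}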

3. Resource Model: KAT Quadlets

3.1. Overview

KAT configuration is declarative, centered around the "Quadlet" concept. A Workload is defined by a directory containing YAML files (*.kat), each specifying a different aspect (kind). This promotes modularity and locality of behavior.

3.2. Workload Definition (workload.kat)

REQUIRED. Defines the core identity, source, type, and lifecycle policies.

apiVersion: kat.dws.rip/v1alpha1
kind: Workload
metadata:
  name: string # REQUIRED. Workload name.
  namespace: string # OPTIONAL. Defaults to "default".
spec:
  type: enum # REQUIRED: Service | Job | DaemonService
  source: # REQUIRED. Exactly ONE of image or git must be present.
    image: string # OPTIONAL. Container image reference.
    git: # OPTIONAL. Build from Git.
      repository: string # REQUIRED if git. URL of Git repo.
      branch: string # OPTIONAL. Defaults to repo default.
      tag: string # OPTIONAL. Overrides branch.
      commit: string # OPTIONAL. Overrides tag/branch.
    cacheImage: string # OPTIONAL. Registry path for build cache layers.
                       # Used only if 'git' source is specified.
  replicas: int # REQUIRED for type: Service. Desired instance count.
                # Ignored for Job, DaemonService.
  updateStrategy: # OPTIONAL for Service/DaemonService.
    type: enum # REQUIRED: Rolling | Simultaneous. Default: Rolling.
    rolling: # Relevant if type is Rolling.
      maxSurge: int | string # OPTIONAL. Max extra instances during update. Default 1.
  restartPolicy: # REQUIRED for container lifecycle.
    condition: enum # REQUIRED: Never | MaxCount | Always
    maxRestarts: int # OPTIONAL. Used if condition=MaxCount. Default 5.
    resetSeconds: int # OPTIONAL. Used if condition=MaxCount. Window to reset count. Default 3600.
  nodeSelector: map[string]string # OPTIONAL. Schedule only on nodes matching all labels.
  tolerations: # OPTIONAL. List of taints this workload can tolerate.
    - key: string
      operator: enum # OPTIONAL. Exists | Equal. Default: Exists.
      value: string # OPTIONAL. Needed if operator=Equal.
      effect: enum # OPTIONAL. NoSchedule | PreferNoSchedule. Matches taint effect.
                   # Empty matches all effects for the key/value pair.
  # --- Container specification (V1 assumes one primary container per workload) ---
  container: # REQUIRED.
    name: string # OPTIONAL. Informational name for the container.
    command: [string] # OPTIONAL. Override image CMD.
    args: [string] # OPTIONAL. Override image ENTRYPOINT args or CMD args.
    env: # OPTIONAL. Environment variables.
      - name: string
        value: string
    volumeMounts: # OPTIONAL. Mount volumes defined in spec.volumes.
      - name: string # Volume name.
        mountPath: string # Path inside container.
        subPath: string # Optional. Mount sub-directory.
        readOnly: bool # Optional. Default false.
    resources: # OPTIONAL. Resource requests and limits.
      requests: # Used for scheduling. Defaults to limits if unspecified.
        cpu: string # e.g., "100m"
        memory: string # e.g., "64Mi"
      limits: # Enforced by runtime. Container killed if memory exceeded.
        cpu: string # CPU throttling limit (e.g., "1")
        memory: string # e.g., "256Mi"
      gpu: # OPTIONAL. Request GPU resources.
        driver: enum # OPTIONAL: any | nvidia | amd
        minVRAM_MB: int # OPTIONAL. Minimum GPU memory required.
  # --- Volume Definitions for this Workload ---
  volumes: # OPTIONAL. Defines volumes used by container.volumeMounts.
    - name: string # REQUIRED. Name referenced by volumeMounts.
      simpleClusterStorage: {} # OPTIONAL. Creates dir under agent's volumeBasePath.
                               # Use ONE OF simpleClusterStorage or hostMount.
      hostMount: # OPTIONAL. Mounts a specific path from the host node.
        hostPath: string # REQUIRED if hostMount. Absolute path on host.
        ensureType: enum # OPTIONAL: DirectoryOrCreate | Directory | FileOrCreate | File | Socket

3.3. Virtual Load Balancer Definition (VirtualLoadBalancer.kat)

OPTIONAL. Only relevant for Workloads of type: Service. Defines networking endpoints and health criteria for load balancing and ingress.

apiVersion: kat.dws.rip/v1alpha1
kind: VirtualLoadBalancer # Identifies this Quadlet file's purpose
spec:
  ports: # REQUIRED if this file exists. List of ports to expose/balance.
    - name: string # OPTIONAL. Informational name (e.g., "web", "grpc").
      containerPort: int # REQUIRED. Port the application listens on inside container.
      protocol: string # OPTIONAL. TCP | UDP. Default TCP.
  healthCheck: # OPTIONAL. Used for readiness in rollouts and LB target selection.
               # If omitted, container running status implies health.
    exec:
      command: [string] # REQUIRED. Exit 0 = healthy.
    initialDelaySeconds: int # OPTIONAL. Default 0.
    periodSeconds: int # OPTIONAL. Default 10.
    timeoutSeconds: int # OPTIONAL. Default 1.
    successThreshold: int # OPTIONAL. Default 1.
    failureThreshold: int # OPTIONAL. Default 3.
  ingress: # OPTIONAL. Hints for external ingress controllers (like the Traefik recipe).
    - host: string # REQUIRED. External hostname.
      path: string # OPTIONAL. Path prefix. Default "/".
      servicePortName: string # OPTIONAL. Name of port from spec.ports to target.
      servicePort: int # OPTIONAL. Port number from spec.ports. Overrides name.
                       # One of servicePortName or servicePort MUST be provided if ports exist.
      tls: bool # OPTIONAL. If true, signal ingress controller to manage TLS via ACME.

3.4. Job Definition (job.kat)

REQUIRED if spec.type in workload.kat is Job.

apiVersion: kat.dws.rip/v1alpha1
kind: JobDefinition # Identifies this Quadlet file's purpose
spec:
  schedule: string # OPTIONAL. Cron schedule string.
  completions: int # OPTIONAL. Desired successful completions. Default 1.
  parallelism: int # OPTIONAL. Max concurrent instances. Default 1.
  activeDeadlineSeconds: int # OPTIONAL. Timeout for the job run.
  backoffLimit: int # OPTIONAL. Max failed instance restarts before job fails. Default 3.

3.5. Build Definition (build.kat)

REQUIRED if spec.source.git is specified in workload.kat.

apiVersion: kat.dws.rip/v1alpha1
kind: BuildDefinition
spec:
  buildContext: string # OPTIONAL. Path relative to repo root. Defaults to ".".
  dockerfilePath: string # OPTIONAL. Path relative to buildContext. Defaults to "./Dockerfile".
  buildArgs: # OPTIONAL. Map for build arguments.
    map[string]string # e.g., {"VERSION": "1.2.3"}
  targetStage: string # OPTIONAL. Target stage name for multi-stage builds.
  platform: string # OPTIONAL. Target platform (e.g., "linux/arm64").
  cache: # OPTIONAL. Defines build caching strategy.
    registryPath: string # OPTIONAL. Registry path (e.g., "myreg.com/cache/myapp").
                         # Agent tags cache image with commit SHA.

3.6. Volume Definition (volume.kat)

DEPRECATED in favor of defining volumes directly within workload.kat -> spec.volumes. This enhances Locality of Behavior. Section 3.2 reflects this change. This file kind is reserved for potential future use with cluster-wide persistent volumes.

3.7. Namespace Definition (namespace.kat)

REQUIRED for defining non-default namespaces.

apiVersion: kat.dws.rip/v1alpha1
kind: Namespace
metadata:
  name: string # REQUIRED. Name of the namespace.
  # labels: map[string]string # OPTIONAL.

3.8. Node Resource (Internal)

Represents node state managed by the Leader, queryable via API. Not defined by user Quadlets. Contains fields like name, status, addresses, capacity, allocatable, labels, taints.

3.9. Cluster Configuration (cluster.kat)

Used only during kat-agent init via a flag (e.g., --config cluster.kat). Defines immutable cluster-wide parameters.

apiVersion: kat.dws.rip/v1alpha1
kind: ClusterConfiguration
metadata:
  name: string # REQUIRED. Informational name for the cluster.
spec:
  clusterCIDR: string # REQUIRED. CIDR for overlay network IPs (e.g., "10.100.0.0/16").
  serviceCIDR: string # REQUIRED. CIDR for internal virtual IPs (used by future internal proxy/LB).
                      # Not directly used by containers in V1 networking model.
  nodeSubnetBits: int # OPTIONAL. Number of bits for node subnets within clusterCIDR.
                      # Default 7 (yielding /23 subnets if clusterCIDR=/16).
  clusterDomain: string # OPTIONAL. DNS domain suffix. Default "kat.cluster.local".
  # --- Port configurations ---
  agentPort: int # OPTIONAL. Port agent listens on (internal). Default 9116.
  apiPort: int # OPTIONAL. Port leader listens on for API. Default 9115.
  etcdPeerPort: int # OPTIONAL. Default 2380.
  etcdClientPort: int # OPTIONAL. Default 2379.
  # --- Path configurations ---
  volumeBasePath: string # OPTIONAL. Agent base path for SimpleClusterStorage. Default "/var/lib/kat/volumes".
  backupPath: string # OPTIONAL. Path on Leader for etcd backups. Default "/var/lib/kat/backups".
  # --- Interval configurations ---
  backupIntervalMinutes: int # OPTIONAL. Frequency of etcd backups. Default 30.
  agentTickSeconds: int # OPTIONAL. Agent heartbeat interval. Default 15.
  nodeLossTimeoutSeconds: int # OPTIONAL. Time before marking node NotReady. Default 60.

4. Core Operations and Lifecycle Management

This section details the operational logic, state transitions, and management processes within the KAT system, from cluster initialization to workload execution and node dynamics.

4.1. System Bootstrapping and Node Lifecycle

4.1.1. Initial Leader Setup

The first KAT Node is initialized to become the Leader and establish the cluster.

  1. Command: The administrator executes kat-agent init --config <path_to_cluster.kat> on the designated initial node. The cluster.kat file (see Section 3.9) provides essential cluster-wide parameters.
  2. Action:
    • The kat-agent process starts.
    • It parses the cluster.kat file to obtain parameters like ClusterCIDR, ServiceCIDR, domain, agent/API ports, etcd ports, volume paths, and backup settings.
    • It generates a new internal Certificate Authority (CA) key and certificate (for the PKI, see Section 10.6) if one doesn't already exist at a predefined path.
    • It initializes and starts an embedded single-node etcd server, using the configured etcd ports. The etcd data directory is created.
    • The agent campaigns for leadership via etcd's election mechanism (Section 5.3) and, being the only member, becomes the Leader. It writes its identity (e.g., its advertise IP and API port) to a well-known key in etcd (e.g., /kat/config/leader_endpoint).
    • The Leader initializes its IPAM module (Section 7.2) for the defined ClusterCIDR.
    • It generates its own WireGuard key pair, stores the private key securely, and publishes its public key and overlay endpoint (external IP and WireGuard port) to etcd.
    • It sets up its local kat0 WireGuard interface using its assigned overlay IP (the first available from its own initial subnet).
    • It starts the API server on the configured API port.
    • It starts its local DNS resolver (Section 7.3).
    • The kat-core and default Namespaces are created in etcd if they do not exist.

4.1.2. Agent Node Join

Subsequent Nodes join an existing KAT cluster to act as workers (and potential future etcd quorum members or leaders if so configured, though V1 focuses on a static initial quorum).

  1. Command: kat-agent join --leader-api <leader_api_ip:port> --advertise-address <ip_or_interface_name> [--etcd-peer] (The --etcd-peer flag indicates this node should attempt to join the etcd quorum).
  2. Action:
    • The kat-agent process starts.
    • It generates a WireGuard key pair.
    • It contacts the specified Leader API endpoint to request joining the cluster, sending its intended advertise-address (for inter-node WireGuard communication) and its WireGuard public key. It also sends a Certificate Signing Request (CSR) for its mTLS client/server certificate.
    • The Leader, upon validating the join request (V1 has no strong token validation and relies on network trust):
      • Assigns a unique Node Name (if not provided by agent, Leader generates one) and a Node Subnet from the ClusterCIDR (Section 7.2).
      • Signs the Agent's CSR using the cluster CA, returning the signed certificate and the CA certificate.
      • Records the new Node's name, advertise address, WireGuard public key, and assigned subnet in etcd (e.g., under /kat/nodes/registration/{nodeName}).
      • If --etcd-peer was requested and the quorum has capacity, the Leader MAY instruct the node to join the etcd quorum by providing current peer URLs. (For V1, etcd peer addition post-init is considered an advanced operation; the default is a static initial quorum.)
      • Provides the new Agent with the list of all current Nodes' WireGuard public keys, overlay endpoint addresses, and their assigned overlay subnets (for AllowedIPs).
    • The joining Agent:
      • Stores the received mTLS certificate and CA certificate.
      • Configures its local kat0 WireGuard interface with an IP from its assigned subnet (typically the .1 address) and sets up peers for all other known nodes.
      • If instructed to join etcd quorum, configures and starts its embedded etcd as a peer.
      • Registers itself formally with the Leader via a status update.
      • Starts its local DNS resolver and begins syncing DNS state from etcd.
      • Becomes Ready and available for scheduling workloads.

4.1.3. Node Heartbeat and Status Reporting

Each Agent Node (including the Leader acting as an Agent for its local workloads) MUST periodically send a status update to the active Leader.

  • Interval: Configurable agentTickSeconds (from cluster.kat, e.g., default 15 seconds).
  • Content: The payload is a JSON object reflecting the Node's current state:
    • nodeName: string (its unique identifier)
    • nodeUID: string (a persistent unique ID for the node instance)
    • timestamp: int64 (Unix epoch seconds)
    • resources:
      • capacity: {"cpu": "2000m", "memory": "4096Mi"}
      • allocatable: {"cpu": "1800m", "memory": "3800Mi"} (capacity minus system overhead)
    • workloadInstances: Array of objects, each detailing a locally managed container:
      • workloadName: string
      • namespace: string
      • instanceID: string (unique ID for this replica/run of the workload)
      • containerID: string (from Podman)
      • imageID: string (from Podman)
      • state: string ("running", "exited", "paused", "unknown")
      • exitCode: int (if exited)
      • healthStatus: string ("healthy", "unhealthy", "pending_check") (from VirtualLoadBalancer.kat health check)
      • restarts: int (count of Agent-initiated restarts for this instance)
    • overlayNetwork: {"status": "connected", "lastPeerSync": "timestamp"}
  • Protocol: HTTP POST to Leader's /v1alpha1/nodes/{nodeName}/status endpoint, authenticated via mTLS. The Leader updates the Node's actual state in etcd (e.g., /kat/nodes/actual/{nodeName}/status).
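
The heartbeat payload above maps naturally onto Go types such as the following. The struct and field names are illustrative only; the JSON keys follow the list above.

package agent

// NodeStatus is an illustrative representation of the heartbeat payload.
type NodeStatus struct {
	NodeName          string                  `json:"nodeName"`
	NodeUID           string                  `json:"nodeUID"`
	Timestamp         int64                   `json:"timestamp"` // Unix epoch seconds
	Resources         NodeResources           `json:"resources"`
	WorkloadInstances []WorkloadInstanceState `json:"workloadInstances"`
	OverlayNetwork    OverlayStatus           `json:"overlayNetwork"`
}

type NodeResources struct {
	Capacity    map[string]string `json:"capacity"`    // e.g. {"cpu": "2000m", "memory": "4096Mi"}
	Allocatable map[string]string `json:"allocatable"` // capacity minus system overhead
}

type WorkloadInstanceState struct {
	WorkloadName string `json:"workloadName"`
	Namespace    string `json:"namespace"`
	InstanceID   string `json:"instanceID"`
	ContainerID  string `json:"containerID"`
	ImageID      string `json:"imageID"`
	State        string `json:"state"` // "running", "exited", "paused", "unknown"
	ExitCode     int    `json:"exitCode"`
	HealthStatus string `json:"healthStatus"` // "healthy", "unhealthy", "pending_check"
	Restarts     int    `json:"restarts"`
}

type OverlayStatus struct {
	Status       string `json:"status"`       // e.g. "connected"
	LastPeerSync string `json:"lastPeerSync"` // timestamp of last WireGuard peer sync
}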

4.1.4. Node Departure and Failure Detection
  • Graceful Departure:
    1. Admin action: katcall drain <nodeName>. This sets a NoSchedule Taint on the Node object in etcd and marks its desired state as "Draining".
    2. Leader reconciliation loop:
      • Stops scheduling new workloads to the Node.
      • For existing Service and DaemonService instances on the draining node, it initiates a process to reschedule them to other eligible nodes (respecting update strategies where applicable, e.g., not violating maxUnavailable for the service cluster-wide).
      • For Job instances, they are allowed to complete. If a Job is actively running and the node is drained, KAT V1's behavior is to let it finish; more sophisticated preemption is future work.
    3. Once all managed workload instances are terminated or rescheduled, the Agent MAY send a final "departing" message, and the admin can decommission the node. The Leader eventually removes the Node object from etcd after a timeout if it stops heartbeating.
  • Failure Detection:
    1. The Leader monitors Agent heartbeats. If an Agent misses nodeLossTimeoutSeconds (from cluster.kat, e.g., 3 * agentTickSeconds), the Leader marks the Node's status in etcd as NotReady.
    2. Reconciliation Loop for NotReady Node:
      • For Service instances previously on the NotReady node: The Leader attempts to schedule replacement instances on other Ready eligible nodes to maintain spec.replicas.
      • For DaemonService instances: No action, as the node is not eligible.
      • For Job instances: If the job has a restart policy allowing it, the Leader MAY attempt to reschedule the failed job instance on another eligible node.
    3. If the Node rejoins (starts heartbeating again): The Leader marks it Ready. The reconciliation loop will then assess if any workloads should be on this node (e.g., DaemonServices, or if it's now the best fit for some pending Services). Any instances that were rescheduled off this node and are now redundant (because the original instance might still be running locally on the rejoined node if it only had a network partition) will be identified. The Leader will instruct the rejoined Agent to stop any such zombie/duplicate containers based on instanceID tracking.

4.2. Workload Deployment and Source Management

Workloads are the primary units of deployment, defined by Quadlet directories.

  1. Submission:
    • Client (e.g., katcall apply -f ./my-workload-dir/) archives the Quadlet directory (e.g., my-workload-dir/) into a tar.gz file.
    • Client sends an HTTP POST (create) to /v1alpha1/n/{namespace}/workloads, with the workload name taken from metadata.name in workload.kat, or an HTTP PUT (update) to /v1alpha1/n/{namespace}/workloads/{workloadName}. The body is the tar.gz archive.
    • Leader validates the metadata.name in workload.kat against the URL path for PUT.
  2. Validation & Storage:
    • Leader unpacks the archive.
    • It validates each .kat file against its known schema (e.g., Workload, VirtualLoadBalancer, BuildDefinition, JobDefinition).
    • Cross-Quadlet file consistency is checked (e.g., referenced port names in VirtualLoadBalancer.kat -> spec.ingress exist in VirtualLoadBalancer.kat -> spec.ports).
    • If valid, Leader persists each Quadlet file's content into etcd under /kat/workloads/desired/{namespace}/{workloadName}/{fileName}. The metadata.generation for the workload is incremented on spec changes.
    • Leader responds 201 Created or 200 OK with the workload's metadata.
  3. Source Handling Precedence by Agent (upon receiving deployment command):
    1. If workload.kat -> spec.source.git is defined:
      a. If workload.kat -> spec.source.cacheImage is also defined, the Agent first attempts to pull this cacheImage (see Section 4.3). If the pull succeeds and the image hash matches the expected value (e.g., when a git commit is specified and used to tag the cache), this image is used and the local build MAY be skipped.
      b. If no cache image is specified, or the cache pull fails or mismatches, the Agent proceeds to the Git build (Section 4.3) and uses the resulting locally built image.
    2. Else if workload.kat -> spec.source.image is defined (and no git source): Agent pulls this image (Section 4.6.1).
    3. If neither git nor image is specified, it's a validation error by the Leader.

4.3. Git-Native Build Process

Triggered when an Agent is instructed to run a Workload instance with spec.source.git.

  1. Setup: Agent creates a temporary, isolated build directory.
  2. Cloning: git clone --depth 1 --branch <branch_or_tag_or_default> <repository_url> . (or git fetch origin <commit> && git checkout <commit>).
  3. Context & Dockerfile Path: Agent uses buildContext and dockerfilePath from build.kat (defaults to . and ./Dockerfile respectively).
  4. Build Execution:
    • Construct podman build command with:
      • -t <internal_image_tag> (e.g., kat-local/{namespace}_{workloadName}:{git_commit_sha_short})
      • -f {dockerfilePath} within the {buildContext}.
      • --build-arg for each from build.kat -> spec.buildArgs.
      • --target {targetStage} if specified.
      • --platform {platform} if specified (else Podman defaults).
      • The build context path.
    • Execute as the Agent's rootless user or a dedicated build user for that workload.
  5. Build Caching (build.kat -> spec.cache.registryPath):
    • Pre-Build Pull (Cache Hit): Before Step 2 (Cloning), Agent constructs a tag based on registryPath and the specific Git commit SHA (if available, else latest of branch/tag). Attempts podman pull. If successful, uses this image and skips local build steps.
    • Post-Build Push (Cache Miss/New Build): After successful local build, Agent tags the new image with {registryPath}:{git_commit_sha_short} and attempts podman push. Registry credentials MUST be configured locally on the Agent (e.g., in Podman's auth file for the build user). KATv1 does not manage these credentials centrally.
  6. Outcome: Agent reports build success (with internal image tag) or failure to Leader.
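
For illustration, the argument assembly for step 4 might look like the sketch below, reusing the BuildOptions type from Section 6.1. The mapping is a non-normative example; only the flags listed above are used.

package runtime

import (
	"fmt"
	"os/exec"
)

// podmanBuildArgs assembles the "podman build" arguments described in step 4.
func podmanBuildArgs(opts BuildOptions) []string {
	args := []string{"build", "-t", opts.ImageTag, "-f", opts.DockerfilePath}
	for k, v := range opts.BuildArgs {
		args = append(args, "--build-arg", fmt.Sprintf("%s=%s", k, v))
	}
	if opts.TargetStage != "" {
		args = append(args, "--target", opts.TargetStage)
	}
	if opts.Platform != "" {
		args = append(args, "--platform", opts.Platform)
	}
	// The build context path is the final positional argument.
	return append(args, opts.ContextDir)
}

// runBuild executes the build as the Agent's rootless (or dedicated build) user.
func runBuild(opts BuildOptions) ([]byte, error) {
	return exec.Command("podman", podmanBuildArgs(opts)...).CombinedOutput()
}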

4.4. Scheduling

The Leader performs scheduling in its reconciliation loop for new or rescheduled Workload instances.

  1. Filter Nodes - Resource Requests:
    • Identify spec.container.resources.requests (CPU, memory).
    • Filter out Nodes whose status.allocatable resources are less than requested.
  2. Filter Nodes - nodeSelector:
    • If spec.nodeSelector is present, filter out Nodes whose labels do not match all key-value pairs in the selector.
  3. Filter Nodes - Taints and Tolerations:
    • For each remaining Node, check its taints.
    • A Workload instance is repelled if the Node has a taint with effect=NoSchedule that is not tolerated by spec.tolerations.
    • (Nodes with PreferNoSchedule taints not tolerated are kept but deprioritized in scoring).
  4. Filter Nodes - GPU Requirements:
    • If spec.container.resources.gpu is specified:
      • Filter out Nodes that do not report matching GPU capabilities (e.g., gpu.nvidia.present=true based on driver request).
      • Filter out Nodes whose reported available VRAM (a node-level attribute, potentially dynamically tracked by agent) is less than minVRAM_MB.
  5. Score Nodes ("Most Empty" Proportional):
    • For each remaining candidate Node:
      • cpu_used_percent = (node_total_cpu_requested_by_workloads / node_allocatable_cpu) * 100
      • mem_used_percent = (node_total_mem_requested_by_workloads / node_allocatable_mem) * 100
      • score = (100 - cpu_used_percent) + (100 - mem_used_percent). Higher is better and rewards nodes with balanced free resources. Implementations MAY instead use score = 100 - max(cpu_used_percent, mem_used_percent).
  6. Select Node:
    • Prioritize nodes not having untolerated PreferNoSchedule taints.
    • Among those (or all, if all preferred are full), pick the Node with the highest score.
    • If multiple nodes tie for the highest score, pick one randomly.
  7. Replica Spreading (Services/DaemonServices): For multi-instance workloads, when choosing among equally scored nodes, the scheduler MAY prefer nodes currently running fewer instances of the same workload to achieve basic anti-affinity. For DaemonService, it schedules one instance on every eligible node identified after filtering.
  8. If no suitable node is found, the instance remains Pending.
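
A minimal sketch of the scoring formula in step 5, using the first (summed) form; input names and units (e.g., millicores and bytes) are illustrative.

package scheduler

// nodeScore implements score = (100 - cpu_used_percent) + (100 - mem_used_percent).
// Higher scores indicate emptier nodes.
func nodeScore(cpuRequested, cpuAllocatable, memRequested, memAllocatable int64) float64 {
	cpuUsedPercent := float64(cpuRequested) / float64(cpuAllocatable) * 100
	memUsedPercent := float64(memRequested) / float64(memAllocatable) * 100
	return (100 - cpuUsedPercent) + (100 - memUsedPercent)
}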

4.5. Workload Updates and Rollouts

Triggered by PUT to Workload API endpoint with changed Quadlet specs. Leader compares new desiredSpecHash with status.observedSpecHash.

  • Simultaneous Strategy (spec.updateStrategy.type):
    1. Leader instructs Agents to stop and remove all old-version instances.
    2. Once confirmed (or timeout), Leader schedules all new-version instances as per Section 4.4. This causes downtime.
  • Rolling Strategy (spec.updateStrategy.type):
    1. max_surge_val = calculate_absolute(spec.updateStrategy.rolling.maxSurge, new_replicas_count)
    2. Total allowed instances = new_replicas_count + max_surge_val.
    3. The Leader updates instances incrementally:
      a. Scale up by launching new-version instances until total_running_instances reaches new_replicas_count or old_replicas_count + max_surge_val, whichever is smaller while still making progress. New instances use the updated Quadlet spec.
      b. Once a new-version instance becomes Healthy (passes VirtualLoadBalancer.kat health checks, or simply starts if no checks are defined), an old-version instance is selected and terminated.
      c. The process continues until all instances are new-version and new_replicas_count of them are healthy.
      d. If new_replicas_count < old_replicas_count, surplus old instances are terminated first, respecting a conceptual availability limit (not explicitly defined in V1; max_surge_val effectively acts as maxUnavailable).
  • Rollbacks (Manual):
    1. Leader stores the Quadlet files of the previous successfully deployed version in etcd (e.g., at /kat/workloads/archive/{namespace}/{workloadName}/{generation-1}/).
    2. User command: katcall rollback workload {namespace}/{name}.
    3. Leader retrieves archived Quadlets, treats them as a new desired state, and applies the workload's configured updateStrategy to revert.
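
The calculate_absolute step in the Rolling strategy above resolves maxSurge, which may be an integer or a string. The sketch below treats the string form as a percentage of the desired replica count and rounds up; both choices are assumptions of this example, not requirements of the specification.

package reconcile

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// calculateAbsolute resolves an int-or-percentage maxSurge value against the
// desired replica count.
func calculateAbsolute(maxSurge string, replicas int) (int, error) {
	if strings.HasSuffix(maxSurge, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(maxSurge, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid maxSurge percentage %q: %w", maxSurge, err)
		}
		return int(math.Ceil(float64(replicas) * float64(pct) / 100.0)), nil
	}
	return strconv.Atoi(maxSurge)
}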

4.6. Container Lifecycle Management

Managed by the Agent based on Leader commands and local policies.

  1. Image Pull/Availability: Before creating, Agent ensures the target image (from Git build, cache, or direct ref) is locally available, pulling if necessary.
  2. Creation & Start: Agent uses ContainerRuntime to create and start the container with parameters derived from workload.kat -> spec.container and VirtualLoadBalancer.kat -> spec.ports (translated to runtime port mappings). Node-allocated IP is assigned.
  3. Health Checks (for Services with VirtualLoadBalancer.kat): Agent periodically runs spec.healthCheck.exec.command inside the container after initialDelaySeconds. Status (Healthy/Unhealthy) based on successThreshold/failureThreshold is reported in heartbeats.
  4. Restart Policy (workload.kat -> spec.restartPolicy):
    • Never: No automatic restart by Agent. Leader reschedules for Services/DaemonServices.
    • Always: Agent always restarts on exit, with exponential backoff.
    • MaxCount: Agent restarts on non-zero exit, up to maxRestarts times. If resetSeconds elapse since the first restart in a series without maxRestarts being reached, the restart count for that series resets. Exhausting maxRestarts within the resetSeconds window causes the instance to be marked Failed by the Agent; the Leader acts accordingly.
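
The MaxCount semantics above can be sketched as follows; the series-tracking structure and method names are illustrative only.

package agent

import "time"

// restartSeries tracks Agent-initiated restarts for one container instance.
type restartSeries struct {
	count int
	first time.Time // time of the first restart in the current series
}

// shouldRestart applies the MaxCount policy: restart on non-zero exit, reset
// the series once resetSeconds elapse without maxRestarts being reached, and
// stop restarting (instance Failed) once maxRestarts is exhausted.
func (s *restartSeries) shouldRestart(exitCode, maxRestarts, resetSeconds int, now time.Time) bool {
	if exitCode == 0 {
		return false // MaxCount only restarts on non-zero exit
	}
	if s.count > 0 && now.Sub(s.first) > time.Duration(resetSeconds)*time.Second {
		s.count = 0 // window elapsed without exhausting maxRestarts: reset the series
	}
	if s.count >= maxRestarts {
		return false // exhausted: Agent marks the instance Failed
	}
	if s.count == 0 {
		s.first = now
	}
	s.count++
	return true
}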

4.7. Volume Lifecycle Management

Defined in workload.kat -> spec.volumes and mounted via spec.container.volumeMounts.

  • Agent Responsibility: Before container start, Agent ensures specified volumes are available:
    • SimpleClusterStorage: Creates directory {agent.volumeBasePath}/{namespace}/{workloadName}/{volumeName} if it doesn't exist. Permissions should allow container user access.
    • HostMount: Validates hostPath exists. If ensureType is DirectoryOrCreate or FileOrCreate, attempts creation. Mounts into container.
  • Persistence: Data in SimpleClusterStorage on a node persists across container restarts on that same node. If the underlying agent.volumeBasePath is on network storage (user-managed), it's cluster-persistent. HostMount data persists with the host path.
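
A sketch of the Agent-side volume preparation described above; directory permissions and error handling are simplified assumptions.

package agent

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureSimpleClusterStorage creates {volumeBasePath}/{namespace}/{workloadName}/{volumeName}
// if it does not exist and returns the resolved host path to bind-mount.
func ensureSimpleClusterStorage(volumeBasePath, namespace, workloadName, volumeName string) (string, error) {
	dir := filepath.Join(volumeBasePath, namespace, workloadName, volumeName)
	// 0o770 is an assumption; the container user must be able to access the path.
	if err := os.MkdirAll(dir, 0o770); err != nil {
		return "", err
	}
	return dir, nil
}

// ensureHostMount validates the host path and, for the *OrCreate ensureType
// values, creates it.
func ensureHostMount(hostPath, ensureType string) error {
	_, err := os.Stat(hostPath)
	switch {
	case err == nil:
		return nil
	case os.IsNotExist(err) && ensureType == "DirectoryOrCreate":
		return os.MkdirAll(hostPath, 0o770)
	case os.IsNotExist(err) && ensureType == "FileOrCreate":
		f, cerr := os.OpenFile(hostPath, os.O_CREATE, 0o660)
		if cerr != nil {
			return cerr
		}
		return f.Close()
	default:
		return fmt.Errorf("hostPath %s unavailable: %w", hostPath, err)
	}
}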

4.8. Job Execution Lifecycle

Defined by workload.kat -> spec.type: Job and job.kat.

  1. Leader schedules Job instances based on schedule, completions, parallelism.
  2. Agent runs container. On exit:
    • Exit code 0: Instance Succeeded.
    • Non-zero: Instance Failed. Agent applies restartPolicy up to job.kat -> spec.backoffLimit for the Job instance (distinct from container restarts).
  3. Leader tracks completions and activeDeadlineSeconds.

4.9. Detached Node Operation and Rejoin

Revised mechanism for dynamic nodes (e.g., laptops):

  1. Configuration: Agents have --parent-cluster-name and --node-type (e.g., laptop, stable).
  2. Detached Mode: If Agent cannot reach parent Leader after nodeLossTimeoutSeconds, it sets an internal detached=true flag.
  3. Local Leadership: Agent becomes its own single-node Leader (trivial election).
  4. Local Operations:
    • Continues running pre-detachment workloads.
    • New workloads submitted to its local API get an automatic nodeSelector constraint: kat.dws.rip/nodeName: <current_node_name>.
  5. Rejoin Attempt: Periodically multicasts (REJOIN_REQUEST, <parent_cluster_name>, ...) on local LAN.
  6. Parent Response & Rejoin: Parent Leader responds. Detached Agent clears flag, submits its locally-created (nodeSelector-constrained) workloads to parent Leader API, then performs standard Agent join.
  7. Parent Reconciliation: Parent Leader accepts new workloads, respecting their nodeSelector.

5. State Management

5.1. State Store Interface (Go)

KAT components interact with etcd via a Go interface for abstraction.

package store

import (
	"context"
)

type KV struct { Key string; Value []byte; Version int64 /* etcd ModRevision */ }
type WatchEvent struct { Type EventType; KV KV; PrevKV *KV }
type EventType int
const ( EventTypePut EventType = iota; EventTypeDelete )

type StateStore interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) (*KV, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string) ([]KV, error)
	Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error) // Added startRevision
	Close() error
	Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (leadershipCtx context.Context, err error) // Returns context cancelled on leadership loss
	Resign(ctx context.Context) error // Uses context from Campaign to manage lease
	GetLeader(ctx context.Context) (leaderID string, err error)
	DoTransaction(ctx context.Context, checks []Compare, onSuccess []Op, onFailure []Op) (committed bool, err error) // For CAS operations
}
type Compare struct { Key string; ExpectedVersion int64 /* 0 for key not exists */ }
type Op struct { Type OpType; Key string; Value []byte /* for Put */ }
type OpType int
const ( OpPut OpType = iota; OpDelete; OpGet /* not typically used in Txn success/fail ops */)

The Campaign method returns a context that is cancelled when leadership is lost or Resign is called, simplifying leadership management. DoTransaction enables conditional writes for atomicity.
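
An illustrative usage sketch of this interface: campaign for leadership, perform a conditional write, and block until leadership is lost. The key and value used are examples only.

package store

import (
	"context"
	"log"
)

// runLeaderLoop sketches how a kat-agent might drive the StateStore interface.
func runLeaderLoop(ctx context.Context, s StateStore, nodeName string) error {
	leadershipCtx, err := s.Campaign(ctx, nodeName, 15 /* leaseTTLSeconds */)
	if err != nil {
		return err
	}
	// Conditional write: create the key only if it does not already exist
	// (ExpectedVersion 0 means "key absent", per the Compare definition above).
	committed, err := s.DoTransaction(leadershipCtx,
		[]Compare{{Key: "/kat/config/cluster_uid", ExpectedVersion: 0}},
		[]Op{{Type: OpPut, Key: "/kat/config/cluster_uid", Value: []byte("example-uid")}},
		nil)
	if err != nil {
		return err
	}
	log.Printf("cluster_uid created: %v", committed)

	<-leadershipCtx.Done() // returns when leadership is lost or Resign is called
	return leadershipCtx.Err()
}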

5.2. etcd Implementation Details

  • Client: Uses go.etcd.io/etcd/client/v3.
  • Embedded Server: Uses go.etcd.io/etcd/server/v3/embed within kat-agent on quorum nodes. Configuration (data-dir, peer/client URLs) from cluster.kat and agent flags.
  • Key Schema Examples:
    • /kat/schema_version: v1.0
    • /kat/config/cluster_uid: UUID generated at init.
    • /kat/config/leader_endpoint: Current Leader's API endpoint.
    • /kat/nodes/registration/{nodeName}: Node's static registration info (UID, WireGuard pubkey, advertise addr).
    • /kat/nodes/status/{nodeName}: Node's dynamic status (heartbeat timestamp, resources, local instances). Leased by agent.
    • /kat/workloads/desired/{namespace}/{workloadName}/manifest/{fileName}: Content of each Quadlet file.
    • /kat/workloads/desired/{namespace}/{workloadName}/meta: Workload metadata (generation, overall spec hash).
    • /kat/workloads/status/{namespace}/{workloadName}: Leader-maintained status of the workload.
    • /kat/network/config/overlay_cidr: ClusterCIDR.
    • /kat/network/nodes/{nodeName}/subnet: Assigned overlay subnet.
    • /kat/network/allocations/{instanceID}/ip: Assigned container overlay IP. Leased by agent managing instance.
    • /kat/dns/{namespace}/{workloadName}/{recordType}/{value}: Flattened DNS records.
    • /kat/leader_election/ (etcd prefix): Used by clientv3/concurrency/election.

5.3. Leader Election

Utilizes go.etcd.io/etcd/client/v3/concurrency#NewElection and Campaign. All agents configured as potential quorum members participate. The elected Leader renews its lease continuously. If the lease expires (e.g., Leader crashes), other candidates campaign.

5.4. State Backup (Leader Responsibility)

The active Leader periodically performs an etcd snapshot.

  1. Interval: backupIntervalMinutes from cluster.kat.
  2. Action: Executes etcdctl snapshot save {backupPath}/{timestamped_filename.db} against its own embedded etcd member.
  3. Path: backupPath from cluster.kat.
  4. Rotation: Leader maintains the last N snapshots locally (e.g., N=5, configurable), deleting older ones.
  5. User Responsibility: These are local snapshots on the Leader node. Users MUST implement external mechanisms to copy these snapshots to secure, off-node storage.
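
A sketch of the rotation step (item 4); it assumes timestamped filenames so that lexical order matches chronological order.

package backup

import (
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// pruneSnapshots keeps the `keep` most recent *.db snapshots in backupPath
// and removes the rest.
func pruneSnapshots(backupPath string, keep int) error {
	entries, err := os.ReadDir(backupPath)
	if err != nil {
		return err
	}
	var snaps []string
	for _, e := range entries {
		if !e.IsDir() && strings.HasSuffix(e.Name(), ".db") {
			snaps = append(snaps, e.Name())
		}
	}
	sort.Sort(sort.Reverse(sort.StringSlice(snaps))) // newest first for timestamped names
	for i, name := range snaps {
		if i >= keep {
			if err := os.Remove(filepath.Join(backupPath, name)); err != nil {
				return err
			}
		}
	}
	return nil
}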

5.5. State Restore Procedure

For disaster recovery (total cluster loss or etcd quorum corruption):

  1. STOP all kat-agent processes on all nodes.
  2. Identify the desired etcd snapshot file (.db).
  3. On one designated node (intended to be the first new Leader):
    • Clear its old etcd data directory (--data-dir for etcd).
    • Restore the snapshot: etcdctl snapshot restore <snapshot.db> --name <member_name> --initial-cluster <member_name>=http://<node_ip>:<etcdPeerPort> --initial-cluster-token <new_token> --data-dir <new_data_dir_path>
    • Modify the kat-agent startup for this node to use the new_data_dir_path and configure it as if initializing a new cluster but pointing to this restored data (specific flags for etcd embed).
  4. Start the kat-agent on this restored node. It will become Leader of a new single-member cluster with the restored state.
  5. On all other KAT nodes:
    • Clear their old etcd data directories.
    • Clear any KAT agent local state (e.g., WireGuard configs, runtime state).
    • Join them to the new Leader using kat-agent join as if joining a fresh cluster.
  6. The Leader's reconciliation loop will then redeploy workloads according to the restored desired state. In-flight data or states not captured in the last etcd snapshot will be lost.

6. Container Runtime Interface

6.1. Runtime Interface Definition (Go)

Defines the abstraction KAT uses to manage containers.

package runtime

import (
	"context"
	"io"
	"time"
)

type ImageSummary struct { ID string; Tags []string; Size int64 }
type ContainerState string
const (
	ContainerStateRunning    ContainerState = "running"
	ContainerStateExited     ContainerState = "exited"
	ContainerStateCreated    ContainerState = "created"
	ContainerStatePaused     ContainerState = "paused"
	ContainerStateRemoving   ContainerState = "removing"
	ContainerStateUnknown    ContainerState = "unknown"
)
type HealthState string
const (
    HealthStateHealthy      HealthState = "healthy"
    HealthStateUnhealthy    HealthState = "unhealthy"
    HealthStatePending      HealthState = "pending_check" // Health check defined but not yet run
    HealthStateNotApplicable HealthState = "not_applicable" // No health check defined
)
type ContainerStatus struct {
	ID         string
	ImageID    string
	ImageName  string // Image used to create container
	State      ContainerState
	ExitCode   int
	StartedAt  time.Time
	FinishedAt time.Time
	Health     HealthState
	Restarts   int // Number of times runtime restarted this specific container instance
	OverlayIP  string
}
type BuildOptions struct { // From Section 3.5, expanded
	ContextDir     string
	DockerfilePath string
	ImageTag       string // Target tag for the build
	BuildArgs      map[string]string
	TargetStage    string
	Platform       string
	CacheTo        []string // e.g., ["type=registry,ref=myreg.com/cache/img:latest"]
	CacheFrom      []string // e.g., ["type=registry,ref=myreg.com/cache/img:latest"]
	NoCache        bool
	Pull           bool // Whether to attempt to pull base images
}
type PortMapping struct { HostPort int; ContainerPort int; Protocol string /* TCP, UDP */; HostIP string /* 0.0.0.0 default */}
type VolumeMount struct {
    Name        string // User-defined name of the volume from workload.spec.volumes
	Type        string // "hostMount", "simpleClusterStorage" (translated to "bind" for Podman)
	Source      string // Resolved host path for the volume
	Destination string // Mount path inside container
	ReadOnly    bool
	// SELinuxLabel, Propagation options if needed later
}
type GPUOptions struct { DeviceIDs []string /* e.g., ["0", "1"] or ["all"] */; Capabilities [][]string /* e.g., [["gpu"], ["compute","utility"]] */}
type ResourceSpec struct {
	CPUShares  int64 // Relative weight
	CPUQuota   int64 // Microseconds per period (e.g., 50000 for 0.5 CPU with 100000 period)
	CPUPeriod  int64 // Microseconds (e.g., 100000)
	MemoryLimitBytes int64
	GPUSpec    *GPUOptions // If GPU requested
}
type ContainerCreateOptions struct {
	WorkloadName  string
	Namespace     string
	InstanceID    string // KAT-generated unique ID for this replica/run
	ImageName     string // Image to run (after pull/build)
	Hostname      string
	Command       []string
	Args          []string
	Env           map[string]string
	Labels        map[string]string // Include KAT ownership labels
	RestartPolicy string // "no", "on-failure", "always" (Podman specific values)
	Resources     ResourceSpec
	Ports         []PortMapping
	Volumes       []VolumeMount
	NetworkName   string // Name of Podman network to join (e.g., for overlay)
	IPAddress     string // Static IP within Podman network, if assigned by KAT IPAM
	User          string // User to run as inside container (e.g., "1000:1000")
	CapAdd        []string
	CapDrop       []string
	SecurityOpt   []string
	HealthCheck   *ContainerHealthCheck // Podman native healthcheck config
	Systemd       bool // Run container with systemd as init
}
type ContainerHealthCheck struct {
    Test        []string // e.g., ["CMD", "curl", "-f", "http://localhost/health"]
    Interval    time.Duration
    Timeout     time.Duration
    Retries     int
    StartPeriod time.Duration
}

type ContainerRuntime interface {
	BuildImage(ctx context.Context, opts BuildOptions) (imageID string, err error)
	PullImage(ctx context.Context, imageName string, platform string) (imageID string, err error)
	PushImage(ctx context.Context, imageName string, destinationRegistry string) error
	CreateContainer(ctx context.Context, opts ContainerCreateOptions) (containerID string, err error)
	StartContainer(ctx context.Context, containerID string) error
	StopContainer(ctx context.Context, containerID string, timeoutSeconds uint) error
	RemoveContainer(ctx context.Context, containerID string, force bool, removeVolumes bool) error
	GetContainerStatus(ctx context.Context, containerOrName string) (*ContainerStatus, error)
	StreamContainerLogs(ctx context.Context, containerID string, follow bool, since time.Time, stdout io.Writer, stderr io.Writer) error
	PruneAllStoppedContainers(ctx context.Context) (reclaimedSpace int64, err error)
	PruneAllUnusedImages(ctx context.Context) (reclaimedSpace int64, err error)
	EnsureNetworkExists(ctx context.Context, networkName string, driver string, subnet string, gateway string, options map[string]string) error
    RemoveNetwork(ctx context.Context, networkName string) error
	ListManagedContainers(ctx context.Context) ([]ContainerStatus, error) // Lists containers labelled by KAT
}

6.2. Default Implementation: Podman

The default and only supported ContainerRuntime for KAT v1.0 is Podman. The implementation will primarily shell out to the podman CLI, using appropriate JSON output flags for parsing. It assumes podman is installed and correctly configured for rootless operation on Agent nodes. Key commands used: podman build, podman pull, podman push, podman create, podman start, podman stop, podman rm, podman inspect, podman logs, podman system prune, podman network create/rm/inspect.

6.3. Rootless Execution Strategy

KAT Agents MUST orchestrate container workloads rootlessly. The PREFERRED strategy is:

  1. Dedicated User per Workload/Namespace: The kat-agent (running as root, or with specific sudo rights for useradd, loginctl, systemctl --user) creates a dedicated, unprivileged system user account (e.g., kat_wl_mywebapp) when a workload is first scheduled to the node, or uses a pre-existing user from a pool.
  2. Enable Linger: loginctl enable-linger <username>.
  3. Generate Systemd Unit: The Agent translates the KAT workload definition into container create options, creates the container with podman create --name {instanceID} ... {imageName} {command...}, and then runs podman generate systemd --new --files --time 10 --name {instanceID} to produce a .service unit file. This unit includes environment variables, volume mounts, port mappings (if host-mapped), resource limits, etc. The Restart= directive in the systemd unit is set according to workload.kat -> spec.restartPolicy.
  4. Place and Manage Unit: The unit file is placed in /etc/systemd/user/ (if agent is root, enabling it for the target user) or ~{username}/.config/systemd/user/. The Agent then uses systemctl --user --machine={username}@.host daemon-reload, systemctl --user --machine={username}@.host enable --now {service_name}.service to start and manage it.
  5. Status and Logs: Agent queries systemctl --user --machine... status and journalctl --user-unit ... for status and logs.

This leverages systemd's robust process supervision and cgroup management for rootless containers.
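
The sequence above can be driven from the Agent roughly as sketched below. This is illustrative only: error handling is minimal, unit file placement is left to podman generate systemd --files, and running the podman steps inside the target user's session (rather than as root) is elided.

package agent

import (
	"fmt"
	"os/exec"
)

// run executes a command and surfaces its combined output on error.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %w: %s", name, args, err, out)
	}
	return nil
}

// startRootlessInstance sketches steps 2-4 for an already-created workload user.
// createArgs are the translated "podman create" arguments (env, mounts, ports, image, command, ...).
// NOTE: the podman commands must run in the workload user's context; that plumbing is omitted here.
func startRootlessInstance(username, instanceID string, createArgs []string) error {
	machine := fmt.Sprintf("--machine=%s@.host", username)
	steps := [][]string{
		{"loginctl", "enable-linger", username},
		append([]string{"podman", "create", "--name", instanceID}, createArgs...),
		{"podman", "generate", "systemd", "--new", "--files", "--time", "10", "--name", instanceID},
		{"systemctl", "--user", machine, "daemon-reload"},
		{"systemctl", "--user", machine, "enable", "--now", "container-" + instanceID + ".service"},
	}
	for _, step := range steps {
		if err := run(step[0], step[1:]...); err != nil {
			return err
		}
	}
	return nil
}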

7. Networking

7.1. Integrated Overlay Network

KAT v1.0 implements a mandatory, simple, encrypted Layer 3 overlay network connecting all Nodes using WireGuard.

  1. Configuration: Defined by cluster.kat -> spec.clusterCIDR.
  2. Key Management:
    • Each Agent generates a WireGuard key pair locally upon first start/join. Private key is stored securely (e.g., /etc/kat/wg_private.key, mode 0600). Public key is reported to the Leader during registration.
    • Leader stores all registered Node public keys and their external advertise IPs (for WireGuard endpoint) in etcd under /kat/network/nodes/{nodeName}/wg_pubkey and /kat/network/nodes/{nodeName}/wg_endpoint.
  3. Peer Configuration: Each Agent watches /kat/network/nodes/ in etcd. When a new node joins or an existing node's WireGuard info changes, the Agent updates its local WireGuard configuration (e.g., for interface kat0):
    • Adds/updates a [Peer] section for every other node.
    • PublicKey = {peer_public_key}
    • Endpoint = {peer_advertise_ip}:{configured_wg_port}
    • AllowedIPs = {peer_assigned_overlay_subnet_cidr} (see IPAM below).
    • PersistentKeepalive MAY be used if nodes are behind NAT.
  4. Interface Setup: Agent ensures kat0 interface is up with its assigned overlay IP. Standard OS routing rules handle traffic for the clusterCIDR via kat0.

7.2. IP Address Management (IPAM)

The Leader manages IP allocation for the overlay network.

  1. Node Subnets: From clusterCIDR and nodeSubnetBits (from cluster.kat), the Leader carves out a distinct subnet for each Node that joins (e.g., if clusterCIDR is 10.100.0.0/16 and nodeSubnetBits is 7, each node gets a /23, like 10.100.0.0/23, 10.100.2.0/23, etc.). This Node-to-Subnet mapping is stored in etcd.
  2. Container IPs: When the Leader schedules a Workload instance to a Node, it allocates the next available IP address from that Node's assigned subnet. This instanceID -> containerIP mapping is stored in etcd, possibly with a lease. The Agent is informed of this IP to pass to podman create --ip ....
  3. Maximum Instances: The size of the node subnet implicitly limits the number of container instances per node.
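
The subnet-carving arithmetic in item 1 can be illustrated by the following IPv4-only Go sketch; the function name and error handling are illustrative, not normative.

  // Non-normative, IPv4-only sketch of the node subnet calculation in item 1.
  package ipam

  import (
    "encoding/binary"
    "fmt"
    "net/netip"
  )

  // NodeSubnet returns the index-th node subnet carved from clusterCIDR.
  // Example: NodeSubnet("10.100.0.0/16", 7, 1) -> 10.100.2.0/23.
  func NodeSubnet(clusterCIDR string, nodeSubnetBits, index int) (netip.Prefix, error) {
    base, err := netip.ParsePrefix(clusterCIDR)
    if err != nil || !base.Addr().Is4() {
      return netip.Prefix{}, fmt.Errorf("invalid IPv4 cluster CIDR %q", clusterCIDR)
    }
    nodeBits := base.Bits() + nodeSubnetBits // e.g. 16 + 7 = 23
    if nodeBits > 30 {
      return netip.Prefix{}, fmt.Errorf("node subnet /%d leaves too few addresses", nodeBits)
    }
    addrsPerNode := uint32(1) << uint(32-nodeBits) // host addresses per node subnet
    a4 := base.Addr().As4()
    start := binary.BigEndian.Uint32(a4[:]) + uint32(index)*addrsPerNode
    var out [4]byte
    binary.BigEndian.PutUint32(out[:], start)
    sub := netip.PrefixFrom(netip.AddrFrom4(out), nodeBits)
    if !base.Contains(sub.Addr()) {
      return netip.Prefix{}, fmt.Errorf("node index %d exhausts cluster CIDR %s", index, clusterCIDR)
    }
    return sub, nil
  }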

7.3. Distributed Agent DNS and Service Discovery

Each KAT Agent runs an embedded DNS resolver, synchronized via etcd, providing service discovery.

  1. DNS Server Implementation: Agents use github.com/miekg/dns to run a DNS server goroutine, listening on their kat0 overlay IP (port 53). A minimal sketch of such a server follows this list.
  2. Record Source:
    • When a Workload instance (especially Service or DaemonService) with an assigned overlay IP becomes healthy (or starts, if no health check), the Leader writes DNS A records to etcd:
      • A <instanceID>.<workloadName>.<namespace>.<clusterDomain> -> <containerOverlayIP>
      • For Services with VirtualLoadBalancer.kat -> spec.ports: A <workloadName>.<namespace>.<clusterDomain> -> <containerOverlayIP> (multiple A records for different healthy instances are created).
    • The etcd key structure might be /kat/dns/{clusterDomain}/{namespace}/{workloadName}/{instanceID_or_service_A}.
  3. Agent DNS Sync: Each Agent's DNS server Watches the /kat/dns/ prefix in etcd. On changes, it updates its in-memory DNS zone data.
  4. Container Configuration: Agents configure the /etc/resolv.conf of all managed containers to use the Agent's own kat0 overlay IP as the sole nameserver.
  5. Query Handling:
    • The local Agent DNS resolver first attempts to resolve queries based on the source container's namespace (e.g., app from ns-foo tries app.ns-foo.kat.cluster.local).
    • If not found there, it tries the fully qualified name as-is.
    • It implements basic negative caching (NXDOMAIN with a short TTL) to reduce load.
    • It does NOT forward queries for KAT domain names to upstream resolvers. For external names, the agent DNS MAY forward in future versions; in V1 it performs no upstream forwarding (for simplicity), so containers needing external resolution must rely on a separately configured upstream resolver.
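
The following minimal, non-normative Go sketch illustrates items 1 and 3: a server built on github.com/miekg/dns answering A queries from an in-memory map kept in sync with the /kat/dns/ prefix. The map structure, TTL, and locking are assumptions.

  // Non-normative sketch of the Agent DNS server (items 1 and 3) built on
  // github.com/miekg/dns. The zone map, TTL, and locking are assumptions.
  package agentdns

  import (
    "net"
    "strings"
    "sync"

    "github.com/miekg/dns"
  )

  type Server struct {
    mu      sync.RWMutex
    records map[string][]net.IP // FQDN (with trailing dot) -> healthy instance IPs
  }

  func (s *Server) handle(w dns.ResponseWriter, r *dns.Msg) {
    m := new(dns.Msg)
    m.SetReply(r)
    m.Authoritative = true
    for _, q := range r.Question {
      if q.Qtype != dns.TypeA {
        continue
      }
      s.mu.RLock()
      ips := s.records[strings.ToLower(q.Name)]
      s.mu.RUnlock()
      if len(ips) == 0 {
        m.SetRcode(r, dns.RcodeNameError) // NXDOMAIN; pair with a short negative-cache TTL
        continue
      }
      for _, ip := range ips {
        m.Answer = append(m.Answer, &dns.A{
          Hdr: dns.RR_Header{Name: q.Name, Rrtype: dns.TypeA, Class: dns.ClassINET, Ttl: 30},
          A:   ip,
        })
      }
    }
    _ = w.WriteMsg(m)
  }

  // ListenAndServe starts the resolver on the Agent's kat0 overlay IP, port 53.
  func (s *Server) ListenAndServe(overlayIP string) error {
    mux := dns.NewServeMux()
    mux.HandleFunc(".", s.handle)
    srv := &dns.Server{Addr: net.JoinHostPort(overlayIP, "53"), Net: "udp", Handler: mux}
    return srv.ListenAndServe()
  }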

7.4. Ingress (Opinionated Recipe via Traefik)

KAT provides a standardized way to deploy Traefik for ingress.

  1. Ingress Node Designation: Admins label Nodes intended for ingress with kat.dws.rip/role=ingress.
  2. kat-traefik-ingress Quadlet: DWS LLC provides standard Quadlet files:
    • workload.kat: Deploys Traefik as a DaemonService with a nodeSelector for kat.dws.rip/role=ingress. Includes the kat-ingress-updater container.
    • VirtualLoadBalancer.kat: Exposes Traefik's ports (80, 443) via HostPort on the ingress Nodes. Specifies health checks for Traefik itself.
    • volume.kat: Mounts host paths for /etc/traefik/traefik.yaml (static config), /data/traefik/dynamic_conf/ (for kat-ingress-updater), and /data/traefik/acme/ (for Let's Encrypt certs).
  3. kat-ingress-updater Container:
    • Runs alongside Traefik. Watches KAT API for VirtualLoadBalancer Quadlets with spec.ingress stanzas.
    • Generates Traefik dynamic configuration files (routers, services) mapping external host/path to internal KAT service FQDNs (e.g., <service>.<namespace>.kat.cluster.local:<port>); this translation step is sketched after this list.
    • Configures Traefik certResolver for Let's Encrypt for services requesting TLS.
    • Traefik watches its dynamic configuration directory.
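
As a purely illustrative sketch of the translation performed by kat-ingress-updater (item 3): the IngressRule fields, file layout, and certResolver name below are hypothetical assumptions, not normative.

  // Hypothetical sketch of the kat-ingress-updater translation: one ingress
  // rule becomes one Traefik router/service pair in a dynamic-config file.
  // The IngressRule fields, file layout, and certResolver name are assumptions.
  package ingressupdater

  import (
    "os"
    "path/filepath"
    "text/template"
  )

  // IngressRule is an illustrative reduction of a spec.ingress entry.
  type IngressRule struct {
    Name       string // unique rule name, e.g. "<namespace>-<workload>"
    Host       string // external hostname to route
    BackendURL string // e.g. "http://<service>.<namespace>.kat.cluster.local:<port>"
    TLS        bool   // request a certificate via Traefik's certResolver
  }

  const fragment = `http:
    routers:
      {{ .Name }}:
        rule: "Host(` + "`{{ .Host }}`" + `)"
        service: {{ .Name }}
  {{- if .TLS }}
        tls:
          certResolver: letsencrypt
  {{- end }}
    services:
      {{ .Name }}:
        loadBalancer:
          servers:
            - url: "{{ .BackendURL }}"
  `

  var tmpl = template.Must(template.New("traefik").Parse(fragment))

  // WriteFragment renders the rule into Traefik's watched dynamic-config directory.
  func WriteFragment(dir string, r IngressRule) error {
    f, err := os.Create(filepath.Join(dir, r.Name+".yaml"))
    if err != nil {
      return err
    }
    defer f.Close()
    return tmpl.Execute(f, r)
  }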

8. API Specification (KAT v1.0 Alpha)

8.1. General Principles and Authentication

  • Protocol: HTTP/1.1 or HTTP/2. Mandatory mTLS for Agent-Leader and CLI-Leader.
  • Data Format: Request/Response bodies MUST be JSON.
  • Versioning: Endpoints prefixed with /v1alpha1.
  • Authentication: Static Bearer Token in the Authorization header for CLI/external API clients. For KAT v1, this token grants full cluster admin rights. Agent-to-Leader mTLS serves as agent authentication. (A client-side sketch follows this list.)
  • Error Reporting: Standard HTTP status codes. JSON body for errors: {"error": "code", "message": "details"}.
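
A non-normative client-side sketch of these conventions follows; the helper name is hypothetical, and the http.Client is assumed to already be configured for mTLS with the cluster CA.

  // Non-normative client-side sketch of the conventions above. The http.Client
  // is assumed to already carry the cluster CA and client certificate for mTLS.
  package katclient

  import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
  )

  // apiError mirrors the JSON error body {"error": "code", "message": "details"}.
  type apiError struct {
    Code    string `json:"error"`
    Message string `json:"message"`
  }

  // Get performs an authenticated GET, e.g. Get(c, leaderURL, token, "/v1alpha1/nodes").
  func Get(c *http.Client, baseURL, token, path string) ([]byte, error) {
    req, err := http.NewRequest(http.MethodGet, baseURL+path, nil)
    if err != nil {
      return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+token)
    resp, err := c.Do(req)
    if err != nil {
      return nil, err
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    if resp.StatusCode >= 400 {
      var e apiError
      if json.Unmarshal(body, &e) == nil && e.Code != "" {
        return nil, fmt.Errorf("%s: %s (%s)", e.Code, e.Message, resp.Status)
      }
      return nil, fmt.Errorf("API error: %s", resp.Status)
    }
    return body, nil
  }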

8.2. Resource Representation (Proto3 & JSON)

All API resources (Workloads, Namespaces, Nodes, etc., and their Quadlet file contents) are defined using Protocol Buffer v3 messages. The HTTP API transports these as JSON. Common metadata (name, namespace, uid, generation, resourceVersion, creationTimestamp) and status structures are standardized.
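
As a non-normative illustration, the common metadata could map to JSON along these lines; the Go struct and tags are assumptions about the wire shape, and the proto3 messages remain authoritative.

  // Non-normative Go mirror of the common metadata as it appears in JSON
  // payloads; the authoritative definitions are the proto3 messages.
  package api

  import "time"

  type ObjectMeta struct {
    Name              string    `json:"name"`
    Namespace         string    `json:"namespace,omitempty"`
    UID               string    `json:"uid,omitempty"`
    Generation        int64     `json:"generation,omitempty"`
    ResourceVersion   string    `json:"resourceVersion,omitempty"`
    CreationTimestamp time.Time `json:"creationTimestamp"`
  }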

8.3. Core API Endpoints

The core v1alpha1 API provides the following endpoint groups:

  • Namespace CRUD.
  • Workload CRUD: POST/PUT accepts tar.gz of Quadlet dir. GET returns metadata+status. Endpoints for individual Quadlet file content (.../files/{fileName}). Endpoint for logs (.../instances/{instanceID}/logs). Endpoint for rollback (.../rollback).
  • Node read endpoints: GET /nodes, GET /nodes/{name}. Agent status update: POST /nodes/{name}/status. Admin taint update: PUT /nodes/{name}/taints.
  • Event query endpoint: GET /events.
  • ClusterConfiguration read endpoint: GET /config/cluster (shows the sanitized running configuration). There is no separate top-level Volume API in KAT v1; volumes are defined within workloads.

9. Observability

9.1. Logging

  • Container Logs: Agents capture container stdout/stderr, make it available via the podman logs mechanism, and stream it via the API to katcall logs. Logs are rotated locally on the agent node.
  • Agent Logs: The kat-agent process logs to the systemd journal or to local files.
  • API Audit (Basic): Leader logs API requests (method, path, source IP, user if distinguishable) at a configurable level.

9.2. Metrics

  • Agent Metrics: Node resource usage (CPU, memory, disk, network), container resource usage. Included in heartbeats.
  • Leader Metrics: API request latencies/counts, scheduling attempts/successes/failures, etcd health.
  • Exposure (V1): Minimal exposure via a /metrics JSON endpoint on the Leader and each Agent; not yet in Prometheus exposition format.
  • Future: Standardized Prometheus exposition format.

9.3. Events

The Leader records significant cluster events (Workload create/update/delete, instance schedule/fail/health_change, Node ready/not_ready/join/leave, build success/fail, detached/rejoin actions) into a capped, time-series-like structure in etcd.

  • API: GET /v1alpha1/events?[resourceType=X][&resourceName=Y][&namespace=Z]
  • Fields per event: Timestamp, Type, Reason, InvolvedObject (kind, name, ns, uid), Message.
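
A non-normative Go illustration of this event shape follows; type names and the example values in comments are assumptions, not a normative enumeration.

  // Non-normative Go illustration of the event fields above; example values
  // in comments are assumptions, not a normative enumeration.
  package api

  import "time"

  type ObjectReference struct {
    Kind      string `json:"kind"`
    Name      string `json:"name"`
    Namespace string `json:"namespace,omitempty"`
    UID       string `json:"uid,omitempty"`
  }

  type Event struct {
    Timestamp      time.Time       `json:"timestamp"`
    Type           string          `json:"type"`   // e.g. "Normal" or "Warning"
    Reason         string          `json:"reason"` // e.g. "Scheduled", "BuildFailed", "NodeNotReady"
    InvolvedObject ObjectReference `json:"involvedObject"`
    Message        string          `json:"message"`
  }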

10. Security Considerations

10.1. API Security

  • mTLS REQUIRED for all inter-KAT component communication (Agent-Leader).
  • Bearer token for external API clients (e.g., katcall). V1: single admin token. No granular RBAC.
  • API server should implement rate limiting.

10.2. Rootless Execution

Rootless execution is a core design principle: Agents execute workloads via Podman in rootless mode, leveraging systemd user sessions for enhanced isolation. This minimizes the impact of a container escape.

10.3. Build Security

  • Building arbitrary Git repositories on Agent nodes is a potential risk.
  • Builds run as unprivileged users via rootless Podman.
  • Network access during build MAY be restricted in future (V1: unrestricted).
  • Users are responsible for trusting Git sources. cacheImage provides a way to use pre-vetted images.

10.4. Network Security

  • WireGuard overlay provides inter-node and inter-container encryption.
  • Host firewalls are the user's responsibility. nodePort or Ingress exposure requires careful firewall configuration.
  • API/Agent communication ports should be firewalled from public access.

10.5. Secrets Management

  • KAT v1 has NO dedicated secret management.
  • Sensitive data passed via environment variables in workload.kat -> spec.container.env is stored in plaintext in etcd. This is NOT secure for production secrets.
  • Registry credentials for cacheImage push/pull are local Agent configuration.
  • Recommendation: For sensitive data, users should use application-level encryption or sidecars that fetch from external secret stores (e.g., Vault), outside KAT's direct management in V1.

10.6. Internal PKI

  1. Initialization (kat-agent init):
    • Generates a self-signed CA key (ca.key) and CA certificate (ca.crt). Stored securely on the initial Leader node (e.g., /var/lib/kat/pki/).
    • Generates a Leader server key/cert signed by this CA for its API and Agent communication endpoints.
    • Generates a Leader client key/cert signed by this CA for authenticating to etcd and Agents.
  2. Node Join (kat-agent join):
    • Agent generates a keypair and a CSR.
    • Sends CSR to Leader over an initial (potentially untrusted, or token-protected if implemented later) channel.
    • Leader signs the Agent's CSR using the CA key, and returns the signed Agent certificate together with the CA certificate (this signing step is sketched after this list).
    • Agent stores its key, its signed certificate, and the CA cert for mTLS.
  3. mTLS Usage: All Agent-Leader and Leader-Agent (for commands) communications use mTLS, validating peer certificates against the cluster CA.
  4. Certificate Lifespan & Rotation: For V1, certificates might have a long lifespan (e.g., 1-10 years). Automated rotation is deferred. Manual regeneration/redistribution would be needed.
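
A non-normative sketch of the CSR-signing step in item 2, using the Go standard library; the validity period, key usages, and serial handling are illustrative choices.

  // Non-normative sketch of the Leader's CSR-signing step (item 2) using the
  // Go standard library. Validity, key usages, and serial handling are examples.
  package pki

  import (
    "crypto"
    "crypto/rand"
    "crypto/x509"
    "encoding/pem"
    "errors"
    "math/big"
    "time"
  )

  // SignAgentCSR signs an Agent CSR with the cluster CA and returns PEM bytes.
  func SignAgentCSR(csrPEM []byte, caCert *x509.Certificate, caKey crypto.Signer) ([]byte, error) {
    block, _ := pem.Decode(csrPEM)
    if block == nil || block.Type != "CERTIFICATE REQUEST" {
      return nil, errors.New("invalid CSR PEM")
    }
    csr, err := x509.ParseCertificateRequest(block.Bytes)
    if err != nil {
      return nil, err
    }
    if err := csr.CheckSignature(); err != nil {
      return nil, err
    }
    serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
    if err != nil {
      return nil, err
    }
    tmpl := &x509.Certificate{
      SerialNumber: serial,
      Subject:      csr.Subject,
      NotBefore:    time.Now().Add(-5 * time.Minute),
      NotAfter:     time.Now().AddDate(10, 0, 0), // long-lived per item 4; rotation is manual in V1
      KeyUsage:     x509.KeyUsageDigitalSignature,
      ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
      DNSNames:     csr.DNSNames,
      IPAddresses:  csr.IPAddresses,
    }
    der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, csr.PublicKey, caKey)
    if err != nil {
      return nil, err
    }
    return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
  }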

13. Acknowledgements

The KAT system design, while aiming for novel simplicity, stands on the shoulders of giants. Its architecture and concepts draw inspiration and incorporate lessons learned from numerous preceding systems and bodies of work in distributed computing and container orchestration. We specifically acknowledge the influence of:

  • Kubernetes: For establishing many of the core concepts and terminology in modern container orchestration, even as KAT diverges in implementation complexity and API specifics.
  • k3s and MicroK8s: For demonstrating the demand and feasibility of lightweight Kubernetes distributions, validating the need KAT aims to fill more radically.
  • Podman & Quadlets: For pioneering robust rootless containerization and providing the direct inspiration for KAT's declarative Quadlet configuration model and systemd user service execution strategy.
  • Docker Compose: For setting the standard in single-host multi-container application definition simplicity.
  • HashiCorp Nomad: For demonstrating an alternative, successful approach to simplified, flexible orchestration beyond the Kubernetes paradigm, particularly its use of HCL and clear deployment primitives.
  • Google Borg: For concepts in large-scale cluster management, scheduling, and the importance of introspection, as documented in their published research.
  • The "Hints for Computer System Design" (Butler Lampson): For principles regarding simplicity, abstraction, performance trade-offs, and fault tolerance that heavily influenced KAT's philosophy.
  • "A Note on Distributed Computing" (Waldo et al.): For articulating the fundamental differences between local and distributed computing that KAT attempts to manage pragmatically, rather than hide entirely.
  • The Grug Brained Developer: For the essential reminder to relentlessly fight complexity and prioritize understandability.
  • Open Source Community: For countless libraries, tools, discussions, and prior art that make a project like KAT feasible.

Finally, thanks to Simba, my cat, for providing naming inspiration.


14. Author's Address

Tanishq Dubey
DWS LLC
Email: [email protected]
URI: https://www.dws.rip