Refine controller template and probe listeners

2026-04-27 00:28:25 +08:00
parent 8fae920fc8
commit d7c2dac944
20 changed files with 780 additions and 217 deletions

# OS OTA Upgrades

MonoK8s upgrades are driven through two custom resources:

- `OSUpgrade`: the user-facing upgrade request.
- `OSUpgradeProgress`: the per-node upgrade state watched and executed by the node agent.
The node agent does the actual upgrade work. It watches `OSUpgradeProgress` resources assigned to its node, downloads the selected image, writes it to the inactive rootfs partition, updates status, and reboots when ready.

The controller is optional but strongly recommended. It watches `OSUpgrade` resources and creates the matching `OSUpgradeProgress` resources for the target nodes.

## Install the controller

By default, each managed node only runs the node agent. The node agent does **not** watch `OSUpgrade` directly; it only watches `OSUpgradeProgress`.

You can create `OSUpgradeProgress` resources by hand, but normal users should not need to. Install the controller instead, then create `OSUpgrade` resources.

Install the controller from the existing node-agent image:
```bash
kubectl exec -i -n mono-system ds/node-agent -- \
ctl create controller --image REPO/IMAGE:TAG | kubectl apply -f -
```
### `--image`

`--image` is optional.

If omitted, the generated Deployment uses the local controller image that is already shipped with managed nodes. In that mode, the controller Deployment is scheduled only onto managed nodes, because the image is expected to exist locally.

If provided, the generated Deployment uses that image directly. This is useful when you host the controller image in your own registry.

There is no official public image repository yet, so external controller images must currently be managed by the operator.
## Create an upgrade

Create an `OSUpgrade` resource to request an upgrade:

```bash
kubectl apply -f upgrade.yaml
```

Example:
```yaml
apiVersion: monok8s.io/v1alpha1
kind: OSUpgrade
metadata:
  name: my-upgrade-2
spec:
  version: v1.35.3
  nodeSelector: {}
  catalog:
    inline: |
      # ...
      images:
        - version: v1.35.1
          url: http://localhost:8000/rootfs.ext4.zst
          checksum: sha256:99af82a263deca44ad91d21d684f0fa944d5d0456a1da540f1c644f8aa59b14b
          size: 1858076672 # expanded image size in bytes; check with: zstd -lv image.zst
      blocked:
        - v1.34.0
```
### `spec.version`

`spec.version` is the requested target version. It may be either:

- an explicit version, such as `v1.35.3`
- `stable`, if the catalog defines a `stable` version
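As a sketch, a request that tracks the catalog's `stable` entry could look like this (the resource name here is hypothetical; the other fields follow the example above):

```yaml
apiVersion: monok8s.io/v1alpha1
kind: OSUpgrade
metadata:
  name: track-stable        # hypothetical name
spec:
  version: stable           # resolved against the catalog's "stable" entry
  nodeSelector: {}
  catalog:
    url: https://example.com/images.yaml
```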
### `spec.nodeSelector`

`spec.nodeSelector` selects the nodes that should receive the upgrade. An empty selector means all eligible managed nodes.
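If the selector follows the usual Kubernetes label-matching convention, targeting a labeled subset of nodes would look like this (the label key is a hypothetical example):

```yaml
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""   # upgrade only nodes carrying this label
```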
### `spec.catalog`

The catalog tells the agent where to find available OS images.

The catalog can be provided inline:
```yaml
catalog:
  inline: |
    stable: v1.35.1
    images:
      - version: v1.35.1
        url: https://example.invalid/images/monok8s-v1.35.1.img.zst
        checksum: sha256:abc
        size: 1858076672
```
It can also be loaded from a URL:

```yaml
catalog:
  url: https://example.com/images.yaml
```

Or from a ConfigMap:

```yaml
catalog:
  configMap: images-cm
```
ConfigMap catalogs require extra RBAC. This permission is not enabled by default. To use a ConfigMap catalog, edit the relevant ClusterRole and allow `get` on `configmaps`.
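The extra rule is standard Kubernetes RBAC. A sketch of the entry to add under the ClusterRole's `rules:` (the role name and its existing rules depend on your install):

```yaml
# Hypothetical addition to the node-agent ClusterRole
- apiGroups: [""]           # core API group, where ConfigMaps live
  resources: ["configmaps"]
  verbs: ["get"]
```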
Catalog content should look like this:
```yaml
stable: v1.35.1
images:
  # ...
  - version: v1.35.1
    url: http://localhost:8000/rootfs.ext4.zst
    checksum: sha256:99af82a263deca44ad91d21d684f0fa944d5d0456a1da540f1c644f8aa59b14b
    size: 1858076672 # expanded image size in bytes; check with: zstd -lv image.zst
blocked:
  - v1.34.0
```
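The `checksum` and `size` fields can be generated from the image archive itself. A minimal sketch, using a throwaway demo file; point the commands at your real `rootfs.ext4.zst` instead:

```shell
# Compute the catalog "checksum" field for an image file.
IMG=/tmp/rootfs-demo.img          # stand-in for the real image archive
printf 'hello\n' > "$IMG"
CHECKSUM="sha256:$(sha256sum "$IMG" | cut -d' ' -f1)"
echo "checksum: $CHECKSUM"

# For the "size" field (expanded size in bytes of a .zst archive), use:
#   zstd -lv rootfs.ext4.zst
```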
## Monitor upgrades

List upgrade requests:

```bash
kubectl get osupgrades
```

Example output:

```text
NAME             DESIRED   RESOLVED   PHASE
my-upgrade-3     stable    v1.35.4    Pending
my-upgrade-2     v1.35.3   v1.35.3    Accepted
my-downgrade-1   v1.33.2   v1.33.2    Rejected
```
List per-node progress:

```bash
kubectl get osupgradeprogresses
```

Example output:

```text
NAME                NODE     SOURCE         CURRENT   TARGET    STATUS
osupgrade-abc123f   node-1   my-upgrade-2   v1.34.1   v1.35.3   Downloading
osupgrade-cde456g   node-2   my-upgrade-2   v1.35.3   v1.35.3   Completed
```
Inspect one node's progress:

```bash
kubectl describe osupgradeprogress osupgrade-abc123f
```
Example resource:

```yaml
apiVersion: monok8s.io/v1alpha1
kind: OSUpgradeProgress
metadata:
  name: osupgrade-abc123f
spec:
  sourceRef:
    name: my-upgrade-2
  nodeName: node-1
status:
  currentVersion: v1.34.1
  targetVersion: v1.35.3
  phase: Downloading
  startedAt: null
  completedAt: null
  lastUpdatedAt: null
  retryCount: 0
  inactivePartition: B
  failureReason: ""
  message: ""
```
## Retry a failed upgrade

If an upgrade fails, for example because the image download failed, edit `spec.retryNonce` on the affected `OSUpgradeProgress` resource.

Any changed value is enough. The field is only used to tell the node agent that the user intentionally requested a retry.

Example:

```bash
kubectl patch osupgradeprogress osupgrade-abc123f \
  --type merge \
  -p '{"spec":{"retryNonce":"retry-1"}}'
```

If the same node fails again and you want to retry again, change the nonce to a new value:

```bash
kubectl patch osupgradeprogress osupgrade-abc123f \
  --type merge \
  -p '{"spec":{"retryNonce":"retry-2"}}'
```
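To avoid having to remember which nonce values were already used, a timestamp-based nonce is always fresh (the nonce format is arbitrary; only a changed value matters):

```shell
# Build a unique retry nonce; any string different from the previous one works.
NONCE="retry-$(date +%s)"
echo "$NONCE"

# Then apply it with the same patch as above:
#   kubectl patch osupgradeprogress osupgrade-abc123f \
#     --type merge -p "{\"spec\":{\"retryNonce\":\"$NONCE\"}}"
```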
## Development notes

### Flash an image manually into partition B

Use nmap's `ncat`. Other tools may work, but they are more likely to cause annoying stream or connection behavior.

On the sending machine:

```bash
pv out/rootfs.ext4.zst | ncat 10.0.0.10 1234 --send-only
```

On the receiving machine:

```bash
ncat -l 1234 --recv-only | \
  zstd -d -c | \
  dd of=/dev/sda3 bs=4M status=progress && \
  sync && \
  echo "SUCCESS"
```

Be careful with the target partition. The example writes to `/dev/sda3`, which is assumed to be rootfs B in that setup. Verify the partition layout before running this on real hardware.
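After flashing, the written partition can be sanity-checked by comparing checksums of the expanded image and the first `size` bytes of the target device. The sketch below demonstrates the idea on regular files; on real hardware you would substitute the decompressed image and the partition device such as `/dev/sda3`:

```shell
# Demo of post-flash verification, using regular files as stand-ins.
IMG=/tmp/demo-rootfs.img        # stands in for the expanded image
DEV=/tmp/demo-partition.img     # stands in for e.g. /dev/sda3
printf 'rootfs-bytes' > "$IMG"
cp "$IMG" "$DEV"                # "flash" the image

SIZE=$(stat -c %s "$IMG")       # on real hardware: the catalog "size" field
WRITTEN=$(head -c "$SIZE" "$DEV" | sha256sum | cut -d' ' -f1)
EXPECTED=$(sha256sum "$IMG" | cut -d' ' -f1)
[ "$WRITTEN" = "$EXPECTED" ] && echo "flash verified"
```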