SLSA for ML in 2025: Signed Datasets, Reproducible Training, and Attested Inference
Trustworthy machine learning in 2025 isn’t only about better models; it’s about verifiable supply chains. The software world has largely embraced SLSA (Supply-chain Levels for Software Artifacts) for code and binaries. ML needs the same discipline—adapted to data, training pipelines, and inference environments—because the threats and failure modes are broader: data poisoning, model tampering, runaway dependencies, nondeterministic training, and silent drift.
This article offers a practical, opinionated blueprint for implementing SLSA-inspired controls across the ML lifecycle. The core tenets:
- Sign datasets and model weights.
- Capture provenance with in-toto/SLSA attestations.
- Make training runs reproducible and hermetic.
- Store models as OCI artifacts, not ad hoc blobs.
- Gate releases on attestations and policy.
- Verify signatures and policy at inference time.
I’ll lean on proven building blocks—Sigstore/cosign, in-toto attestations, Tekton Chains, ORAS/OCI registries, OPA/Kyverno, MLflow/KServe/Argo/Kubeflow—and show code-level examples you can deploy today.
Why SLSA for ML now
- Attack surface: ML adds data and training infrastructure to the classical build chain. Data poisoning, prompt/embedding injection, tampered checkpoints, malicious fine-tunes—all are supply-chain threats.
- Regulatory tailwinds: AI assurance and transparency requirements are rising globally (e.g., NIST AI RMF 1.0, EU AI Act implementation phases, ISO/IEC 42001). Provenance and traceability won’t be optional in many contexts.
- Ecosystem maturity: SLSA v1, Sigstore, in-toto, Tekton Chains, and OCI Artifacts have matured. K8s-native policy (OPA Gatekeeper, Kyverno) and model serving stacks (KServe, Triton) make enforcement practical.
You don’t need to solve everything to get value. Even SLSA Level 2/3 equivalents for ML (build provenance, reproducible pipelines, verified signatures) dramatically raise the bar.
A quick refresher: SLSA, adapted to ML
SLSA’s goals—integrity, provenance, and trust in artifacts—map naturally to ML when you treat data and models as first-class build materials and outputs.
- SLSA Level 1: Documented build process. For ML: track training scripts, datasets, and hyperparameters.
- SLSA Level 2: Build service and provenance. For ML: use a controlled training pipeline that emits in-toto/SLSA provenance linking data digests, code commit, container, and model artifact.
- SLSA Level 3: Ephemeral, isolated builds with non-falsifiable provenance. For ML: ephemeral jobs on a secured cluster emit attestations via a trusted builder (e.g., Tekton Chains, GitHub Actions OIDC, Buildkite + Sigstore), with materials pinned.
- SLSA Level 4: Two-person review, hermetic and reproducible. For ML: hermetic training environments, pinned dependencies, deterministic algorithms where possible, and independent re-builds verifying the same model (within tolerance if bitwise deterministic training is infeasible).
A caveat: bitwise-reproducible training on GPUs is still hard at scale. Aim for “controlled nondeterminism” plus attestations and statistical checks when needed.
The blueprint at a glance
- Data layer: version datasets, compute content digests (Merkle manifests), and sign them with Sigstore; store in an auditable system (e.g., DVC, lakeFS, Delta Lake) and optionally mirror as OCI artifacts.
- Training layer: run training in a reproducible, hermetic environment; log seeds, hardware, and hyperparameters; emit SLSA provenance with in-toto attestations; sign checkpoints/weights.
- Packaging layer: publish models as OCI artifacts (ORAS) with annotations linking dataset digests, provenance URIs, SBOMs, and licenses; sign the OCI artifact.
- Release governance: enforce policy in CI/CD; gate promotions on required attestations and signatures.
- Inference layer: verify signatures at startup (or dynamically), validate policy (e.g., trusted builders, allowed datasets), and attest runtime environment if needed.
The rest of this article details how to implement each step with concrete tools and examples.
1) Sign your datasets (and make them content-addressable)
If you can’t trust your data, you can’t trust your model. Start by making datasets content-addressable and auditable.
Recommended approaches:
- Use a dataset versioning layer: DVC, lakeFS, Delta Lake, or Dolt. Treat dataset versions like commits. Avoid implicit “latest” pointers.
- Generate a canonical manifest with file paths, sizes, and cryptographic hashes (SHA-256). For large collections, use a Merkle tree manifest to avoid recomputing monolithic tarball hashes (see the sketch after this list).
- Sign the manifest with Sigstore (cosign) to produce a verifiable signature and store the signature separately from the data.
- Optionally, publish datasets (or their manifests) as OCI artifacts in your registry. This harmonizes distribution and verification with your model supply chain.
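A Merkle-style manifest can be as simple as per-file SHA-256 digests plus a root digest over the sorted entries. A minimal sketch in Python (the schema name and layout here are this article's convention, not a standard):

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict:
    # Sort by path so the manifest (and its root digest) is deterministic
    entries = sorted(
        (
            {"path": str(p.relative_to(root)), "size": p.stat().st_size, "sha256": file_sha256(p)}
            for p in root.rglob("*.parquet")
        ),
        key=lambda e: e["path"],
    )
    # Root digest over the canonical entry list; a full Merkle tree would hash
    # pairwise, but a flat hash-of-hashes already gives tamper evidence
    leaf_bytes = "\n".join(f'{e["sha256"]}  {e["path"]}' for e in entries).encode()
    root_digest = hashlib.sha256(leaf_bytes).hexdigest()
    return {"schema": "example.dataset.manifest/v1", "root_sha256": root_digest, "files": entries}


if __name__ == "__main__":
    manifest = build_manifest(Path("data"))
    Path("dataset.manifest.json").write_text(json.dumps(manifest, indent=2, sort_keys=True))
```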
Example: create and sign a dataset manifest
```bash
# 1) Create a deterministic manifest of your dataset
# Example: all *.parquet files under data/ with their SHA-256 digests
find data -type f -name "*.parquet" -print0 \
  | sort -z \
  | xargs -0 sha256sum \
  > dataset.files.sha256

# Convert to a minimal JSON manifest (optional) using jq or a small script
# Example manifest path: dataset.manifest.json

# 2) Sign the manifest using Sigstore cosign (OIDC keyless)
COSIGN_EXPERIMENTAL=1 cosign sign-blob \
  --output-signature dataset.manifest.sig \
  --output-certificate dataset.manifest.cert \
  dataset.manifest.json

# 3) Verify later
COSIGN_EXPERIMENTAL=1 cosign verify-blob \
  --certificate dataset.manifest.cert \
  --signature dataset.manifest.sig \
  --certificate-identity-regexp ".*@yourcompany.com" \
  --certificate-oidc-issuer https://accounts.google.com \
  dataset.manifest.json
```
To distribute via OCI:
```bash
# Push dataset manifest and attach signature using ORAS (OCI Registry As Storage)
oras push ghcr.io/yourorg/datasets/census:2025-03 \
  --artifact-type application/vnd.example.dataset.manifest.v1+json \
  --annotation org.opencontainers.image.title="census-2025-03" \
  --annotation org.opencontainers.image.description="Census dataset snapshot with signed manifest" \
  ./dataset.manifest.json:application/json \
  ./dataset.manifest.sig:application/vnd.dev.sigstore.signature \
  ./dataset.manifest.cert:application/vnd.dev.sigstore.certificate
```
Consider in-toto attestations for datasets
- Attach an in-toto Statement that records dataset creation/curation steps (e.g., deduplication, filtering, PII removal), with materials (source data URIs and digests) and predicates (pipeline metadata).
- Store the attestation alongside the manifest in your registry or object store. Sigstore’s Rekor transparency log provides discoverability and auditability of signatures.
This gives you a cryptographic chain from data collection through processing to training consumption.
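For illustration, a minimal in-toto Statement for a curated dataset snapshot might look like the following; the subject is the signed manifest, and the predicate type and fields are placeholders for whatever your curation pipeline actually records:

```json
{
  "_type": "https://in-toto.io/Statement/v1",
  "subject": [
    {
      "name": "dataset.manifest.json",
      "digest": { "sha256": "abc123..." }
    }
  ],
  "predicateType": "https://example.com/attestations/dataset-curation/v1",
  "predicate": {
    "steps": ["deduplication", "language-filtering", "pii-removal"],
    "materials": [
      { "uri": "s3://raw-bucket/census/2025-03/", "digest": { "sha256": "def456..." } }
    ],
    "pipeline": { "repo": "github.com/yourorg/data-pipeline", "commit": "c0ffee1" }
  }
}
```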
2) Make training reproducible (or controlled) and attest it
Reproducibility is a spectrum. The practical target is an environment where the same inputs and configuration produce the same model checksum (best case) or the same metrics within tight bounds (realistic for large GPU workloads). Either way, attest everything.
Minimum viable reproducibility:
- Pin everything: container images, CUDA/cuDNN versions, Python packages (use a lockfile like pip-tools/uv/Poetry), dataset commit/digest, code commit, and config.
- Hermetic builds: disallow network access during training except for artifact/material fetching from pinned URIs.
- Seed and log: seed PRNGs in every relevant library (Python’s random, NumPy, framework RNGs) and log the seeds.
- Deterministic algorithm settings:
- PyTorch: call torch.use_deterministic_algorithms(True), set torch.backends.cudnn.deterministic=True and benchmark=False, and set the CUBLAS_WORKSPACE_CONFIG environment variable so cuBLAS GEMM kernels run deterministically. Note: not all ops have deterministic implementations; consult the PyTorch docs for your version.
- TensorFlow: call tf.random.set_seed(seed) and enable deterministic ops via tf.config.experimental.enable_op_determinism() where available.
- JAX/Flax: manage PRNG keys explicitly; pin XLA/CUDA versions.
- Distributed training caveat: floating-point reductions across GPUs can be nondeterministic due to operation order. If bitwise determinism is mandatory, prefer CPU or single-GPU runs or frameworks/algorithms designed for determinism; otherwise, treat it as controlled nondeterminism and attest settings.
Example: deterministic knobs for PyTorch (2025-era)
```python
import os
import random

import numpy as np
import torch


def set_deterministic(seed: int):
    # ":16:8" or ":4096:8"; required for cuBLAS GEMM determinism
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Optional: limit threads to reduce nondeterminism in parallelism
    torch.set_num_threads(1)
    torch.set_num_interop_threads(1)


set_deterministic(424242)
```
Hermetic, pinned environments:
- Prefer a minimal base image with a pinned CUDA/cuDNN runtime (e.g., nvidia/cuda:12.4.1-cudnn8-runtime-ubuntu22.04) and lock your Python dependencies:
```bash
# Generate a lockfile for exact reproducibility
uv pip compile -q pyproject.toml -o requirements.lock.txt
# Then build with a Dockerfile that installs from the lock
```
- For stronger guarantees, build with Nix/Guix or Bazel rules that fetch exact toolchain and ensure repeatable environments. Nix flakes work well for training nodes:
```nix
{
  description = "Reproducible ML training env";

  inputs.nixpkgs.url = "github:nixos/nixpkgs/nixos-24.05";

  outputs = { self, nixpkgs }:
    let
      pkgs = import nixpkgs { system = "x86_64-linux"; };
    in {
      devShells.x86_64-linux.default = pkgs.mkShell {
        packages = with pkgs; [
          python311
          python311Packages.pip
          python311Packages.numpy
          cudatoolkit
          cudnn
          git
        ];
        shellHook = ''
          export PYTHONPATH=.
        '';
      };
    };
}
```
Attest training with in-toto/SLSA
Have your training pipeline emit provenance automatically.
- Use Tekton Chains or GitHub Actions OIDC with cosign to generate SLSA provenance.
- The provenance should include: builder identity, training recipe (container, entrypoint), materials (code commit, dataset manifest digest, lockfiles), parameters (hyperparameters, seeds), and the resulting model artifact digest.
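Abridged, such an attestation might look like the following (a sketch of the SLSA v1 provenance layout; URIs, digests, and parameter names are placeholders):

```json
{
  "_type": "https://in-toto.io/Statement/v1",
  "subject": [
    { "name": "model.safetensors", "digest": { "sha256": "f00d..." } }
  ],
  "predicateType": "https://slsa.dev/provenance/v1",
  "predicate": {
    "buildDefinition": {
      "buildType": "https://example.com/ml-training/v1",
      "externalParameters": {
        "hyperparameters": { "lr": 0.001, "epochs": 10, "seed": 424242 }
      },
      "resolvedDependencies": [
        { "uri": "git+https://github.com/yourorg/ml-pipeline@c0ff33", "digest": { "gitCommit": "c0ff33..." } },
        { "uri": "ghcr.io/yourorg/datasets/census:2025-03", "digest": { "sha256": "abc123..." } },
        { "uri": "ghcr.io/yourorg/trainers/pytorch", "digest": { "sha256": "deadbeef..." } }
      ]
    },
    "runDetails": {
      "builder": { "id": "https://github.com/actions/runner" },
      "metadata": { "invocationId": "train-2025-03-30", "startedOn": "2025-03-30T02:14:00Z" }
    }
  }
}
```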
Example: verifying a model against SLSA provenance
```bash
# Suppose training emitted model.pt and provenance.intoto.jsonl
slsa-verifier verify-artifact \
  --provenance-path provenance.intoto.jsonl \
  --source-uri github.com/yourorg/ml-pipeline \
  --builder-id https://github.com/actions/runner \
  --source-tag refs/tags/train-2025-03-30 \
  model.pt
```
Key point: your provenance must link the model artifact digest to the exact dataset manifest digest and code commit. If not, you cannot prove what was trained.
3) Store models as OCI artifacts with rich metadata
Stop treating models as opaque blobs in a bucket. OCI registries give you:
- Content-addressable storage with digests
- Versioned tags
- First-class signing and attestations
- Cross-cloud distribution and caching (same infra as containers)
A pragmatic layout for a model artifact in OCI:
- Artifact type: application/vnd.example.model.v1+tar (or an OCI index when you publish multiple variants; see below)
- Files:
- weights: model.safetensors (or .bin, .gguf, etc.)
- config: config.json
- tokenizer: tokenizer.json / vocab files
- metadata: model-card.md, LICENSE, SBOM (SPDX or CycloneDX for training deps)
- provenance: provenance.intoto.jsonl
- Annotations:
- org.ml.dataset.digest: sha256:...
- org.ml.dataset.oci-ref: ghcr.io/yourorg/datasets/census:2025-03
- org.ml.training.code: git+https://github.com/yourorg/ml-pipeline@<commit>
- org.ml.training.container: ghcr.io/yourorg/trainers/pytorch@sha256:...
- org.ml.hyperparams: JSON-encoded subset, or link to a config file
Publish with ORAS and sign with cosign:
```bash
oras push ghcr.io/yourorg/models/census-income:2025-03 \
  --artifact-type application/vnd.example.model.v1+tar \
  --annotation org.ml.dataset.digest="sha256:abc123..." \
  --annotation org.ml.dataset.oci-ref="ghcr.io/yourorg/datasets/census:2025-03" \
  --annotation org.ml.training.code="github.com/yourorg/ml-pipeline@c0ff33" \
  --annotation org.ml.training.container="ghcr.io/yourorg/trainers/pytorch@sha256:deadbeef..." \
  ./model.safetensors:application/octet-stream \
  ./config.json:application/json \
  ./tokenizer.json:application/json \
  ./provenance.intoto.jsonl:application/vnd.in-toto+jsonl \
  ./SBOM.spdx.json:application/spdx+json

# Keyless signing: cosign obtains an OIDC identity token interactively
# or from your CI environment (no long-lived keys to manage)
COSIGN_EXPERIMENTAL=1 cosign sign \
  ghcr.io/yourorg/models/census-income:2025-03
```
For multi-variant models (quantizations, architectures, hardware targets), use an OCI index (manifest list) with per-variant annotations; clients can select variants based on policy.
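A sketch of such an index (digests, sizes, and the org.ml.* annotation keys are illustrative, following this article's naming convention):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:aaaa...",
      "size": 1234,
      "annotations": { "org.ml.variant": "fp16", "org.ml.target": "gpu-a100" }
    },
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:bbbb...",
      "size": 1234,
      "annotations": { "org.ml.variant": "int4-gguf", "org.ml.target": "cpu" }
    }
  ]
}
```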
4) Gate releases on attestations and policy
Signatures are necessary but not sufficient. You need policy enforcing:
- Only models built by trusted builders (identities) can be promoted.
- Required attestations exist: dataset signature, training provenance, SBOM.
- Inputs meet allowlists: allowed datasets, allowed base containers, allowed code repos.
- Optional: metric thresholds, fairness audits, or evaluation attestations.
Examples of enforcement points:
- CI/CD: before pushing prod tags in the model registry.
- Kubernetes admission: before serving in KServe/Triton.
- Edge update infrastructure: before distributing to devices.
OPA Gatekeeper/Rego policy example (conceptual):
```rego
package model.release

violation["untrusted_builder"] {
  input.artifact.annotations["cosign.sigstore.dev/bundle"].issuer != "https://token.actions.githubusercontent.com"
}

violation["missing_dataset_sig"] {
  not input.artifact.annotations["org.ml.dataset.digest"]
}

violation["dataset_not_allowlisted"] {
  d := input.artifact.annotations["org.ml.dataset.oci-ref"]
  not startswith(d, "ghcr.io/yourorg/datasets/")
}

violation["provenance_missing"] {
  not input.has_attestation_type["https://slsa.dev/provenance/v1"]
}
```
In Kubernetes, use Ratify + Gatekeeper or Kyverno policies to verify OCI signatures and required attestations before pods mount models.
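For example, a Kyverno ClusterPolicy along these lines could require keyless signatures on model images pulled from your registry (a sketch; adjust the image references, issuer, and subject patterns to your identity provider and org):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-model-artifacts
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-signed-models
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "ghcr.io/yourorg/models/*"
          attestors:
            - entries:
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "https://github.com/yourorg/*"
                    rekor:
                      url: "https://rekor.sigstore.dev"
```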
CI/CD example gating step:
```bash
# Fail the pipeline if verification fails
cosign verify ghcr.io/yourorg/models/census-income:2025-03 \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --certificate-identity-regexp "https://github.com/yourorg/.+" \
  --rekor-url https://rekor.sigstore.dev

# Verify SLSA provenance attached to the artifact
slsa-verifier verify-image \
  ghcr.io/yourorg/models/census-income:2025-03 \
  --source-uri github.com/yourorg/ml-pipeline \
  --builder-id https://github.com/actions/runner
```
If you use Tekton, Tekton Chains can auto-attach in-toto/SLSA attestations to OCI artifacts; Ratify can be configured to require them.
5) Verify signatures at inference (attested inference)
Verification in CI/CD reduces risk, but runtime verification prevents configuration drift and unauthorized hot-patching.
Patterns to consider:
- Cold-start verification: the serving container verifies the model’s OCI signature and provenance before loading weights.
- Continuous policy: an admission controller (e.g., Ratify with Gatekeeper) or an init container/sidecar enforces policy and pins specific digests.
- Secure caching: store verified models in a local cache with their signatures; only load from cache.
Python example: verify a model blob with sigstore-python before loading
```python
# NOTE: the Verifier API has changed across sigstore-python releases. The call
# below follows the older detached-signature style to match the cosign
# sign-blob outputs above; newer releases verify Sigstore bundles via
# Verifier.verify_artifact(). Consult the sigstore-python docs for your version.
from sigstore.verify import Verifier

# model_path: downloaded weights
# sig_path, cert_path: detached signature and cert from the OCI artifact
verifier = Verifier.production()  # uses Fulcio + Rekor prod endpoints

with open("model.safetensors", "rb") as f_blob, \
     open("model.safetensors.sig", "rb") as f_sig, \
     open("model.safetensors.cert", "rb") as f_cert:
    ok = verifier.verify(
        input_=f_blob.read(),
        signature=f_sig.read(),
        certificate=f_cert.read(),
        offline=False,
    )

if not ok:
    raise RuntimeError("Model signature verification failed")

# Continue to load the model
# from safetensors.torch import load_file
# state_dict = load_file("model.safetensors")
```
KServe + Ratify example (conceptual):
- Install Ratify in your cluster; configure a policy that requires cosign signatures and SLSA provenance for artifacts with type application/vnd.example.model.v1+tar.
- Configure your InferenceService to reference the model by OCI digest (not tag), e.g., ghcr.io/yourorg/models/census-income@sha256:... (see the sketch below).
- Admission control blocks deployments if verification fails.
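A sketch of such an InferenceService, assuming a KServe version that supports OCI model storage via oci:// storage URIs:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: census-income
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      # Reference the model by immutable digest, not a mutable tag
      storageUri: oci://ghcr.io/yourorg/models/census-income@sha256:...
```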
Edge devices:
- Verify signatures in the updater before swapping models.
- Pin digests and enforce transparency log inclusion (Rekor) for traceability.
- Consider TUF/Uptane for robust update channels, with cosign used to sign the payloads.
Handling nondeterminism: thoughtful compromises
GPUs, parallelism, and mixed precision make bitwise determinism hard. Reasonable compromises include:
- Controlled nondeterminism: fix seeds, enable deterministic ops where possible, and pin software/hardware stacks. Accept small numerical variance; define a tolerance in metrics.
- Statistical reproducibility: store evaluation metrics with confidence intervals; require re-trains to fall within thresholds (see the sketch after this list).
- Snapshot RNG states: record PRNG states after data splitting and before training to replay splits.
- Attest everything: even if bitwise differs, the provenance ties the run to the same inputs and environment config.
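A minimal sketch of such a metric-tolerance gate (metric names, thresholds, and the baseline/candidate file names are illustrative; point it at whatever your evaluation step emits):

```python
import json
import sys
from pathlib import Path

# Per-metric absolute tolerances; illustrative values
TOLERANCES = {"accuracy": 0.002, "auroc": 0.003, "loss": 0.01}


def within_tolerance(baseline_path: str, candidate_path: str) -> bool:
    baseline = json.loads(Path(baseline_path).read_text())
    candidate = json.loads(Path(candidate_path).read_text())
    ok = True
    for metric, tol in TOLERANCES.items():
        delta = abs(candidate[metric] - baseline[metric])
        status = "OK" if delta <= tol else "FAIL"
        print(f"{metric}: baseline={baseline[metric]:.4f} "
              f"candidate={candidate[metric]:.4f} delta={delta:.4f} tol={tol} [{status}]")
        ok = ok and delta <= tol
    return ok


if __name__ == "__main__":
    # Usage: python check_metrics.py baseline_metrics.json candidate_metrics.json
    sys.exit(0 if within_tolerance(sys.argv[1], sys.argv[2]) else 1)
```

Run it as a CI step after re-training; a nonzero exit blocks promotion, just like a failed signature check.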
If you truly need bitwise reproducibility:
- Single-GPU or CPU-only training.
- Disable mixed precision, use FP32.
- Avoid nondeterministic ops and layers; consult framework determinism lists.
- Accept performance trade-offs.
End-to-end example pipeline
Assume:
- Datasets in lakeFS (or DVC) with a commit per snapshot; you generate a signed manifest and push it as an OCI artifact.
- Training on GitHub Actions self-hosted runners that submit jobs to a secured Kubernetes cluster (Tekton/Kubeflow). Tekton Chains emits in-toto/SLSA provenance.
- Models published as OCI artifacts to GHCR and signed with cosign keyless using OIDC.
- CI gates promotions; KServe + Ratify verifies at deployment.
Sketch: GitHub Actions workflow for training + signing
```yaml
name: train-and-publish

on:
  workflow_dispatch:
  push:
    tags:
      - "train-*"

jobs:
  train:
    runs-on: ubuntu-22.04
    permissions:
      id-token: write   # required for keyless signatures
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Fetch dataset manifest
        run: |
          oras pull ghcr.io/yourorg/datasets/census:2025-03 -o dataset/
          cosign verify-blob \
            --certificate dataset/dataset.manifest.cert \
            --signature dataset/dataset.manifest.sig \
            --certificate-identity-regexp ".*@yourcompany.com" \
            --certificate-oidc-issuer https://accounts.google.com \
            dataset/dataset.manifest.json

      - name: Build training image
        run: |
          docker build -t ghcr.io/yourorg/trainers/pytorch:train-${{ github.sha }} .
          # Push so the registry assigns a digest we can record as provenance material
          docker push ghcr.io/yourorg/trainers/pytorch:train-${{ github.sha }}
          echo "IMAGE_DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/yourorg/trainers/pytorch:train-${{ github.sha }})" >> $GITHUB_ENV

      - name: Launch training
        run: |
          # Submit to your cluster; ensure Tekton Chains is enabled to emit provenance
          kubectl apply -f k8s/train-job.yaml
          # Wait for completion and copy artifacts back: model.safetensors, provenance.intoto.jsonl

      - name: Package model as OCI
        run: |
          oras push ghcr.io/yourorg/models/census-income:2025-03 \
            --artifact-type application/vnd.example.model.v1+tar \
            --annotation org.ml.dataset.oci-ref="ghcr.io/yourorg/datasets/census:2025-03" \
            --annotation org.ml.training.container="${{ env.IMAGE_DIGEST }}" \
            --annotation org.ml.training.code="github.com/yourorg/ml-pipeline@${{ github.sha }}" \
            ./model.safetensors:application/octet-stream \
            ./config.json:application/json \
            ./tokenizer.json:application/json \
            ./provenance.intoto.jsonl:application/vnd.in-toto+jsonl

      - name: Sign model (keyless)
        env:
          COSIGN_EXPERIMENTAL: "1"
        run: |
          cosign sign ghcr.io/yourorg/models/census-income:2025-03

      - name: Gate on provenance
        run: |
          slsa-verifier verify-image \
            ghcr.io/yourorg/models/census-income:2025-03 \
            --source-uri github.com/yourorg/ml-pipeline \
            --builder-id https://github.com/actions/runner
```
Kubernetes deployment with Ratify policy (conceptual):
- Install Ratify and configure a policy requiring cosign signature and in-toto SLSA provenance type.
- InferenceService references the image by digest; Ratify blocks if verification fails.
Operational concerns and best practices
Key management and identity:
- Prefer keyless signing via OIDC (Sigstore Fulcio) tied to your CI identity. It reduces key sprawl and anchors trust in your IdP.
- For offline/airgapped scenarios, use KMS-backed keys (HSM/Cloud KMS) and rotate regularly.
- Enforce identity pinning in verification: specify issuer and subject patterns.
Revocation and transparency:
- Use Rekor transparency log to audit signatures; mirror the log for resilience if needed.
- If a dataset or model is compromised, publish a security advisory and update policy to reject that digest; optionally distribute a revocation list that your inference layer checks.
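A revocation list can be as simple as a Rego data document that your admission or inference policy consults (the digests below are placeholders):

```rego
package model.revocation

# Digests of known-compromised datasets or models; distribute with your policy bundle
revoked_digests := {
  "sha256:deadbeef...",
  "sha256:badc0de...",
}

violation["revoked_artifact"] {
  revoked_digests[input.artifact.digest]
}
```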
SBOMs and licenses:
- Generate SPDX/CycloneDX SBOMs for the training environment. Attach them to the OCI artifact as auxiliary blobs.
- Include license files for datasets and models; annotate permitted uses to assist downstream compliance.
Performance and cost:
- Signature verification is fast; the dominant costs are generating manifests and container builds. Parallelize manifest hashing and cache layers.
- For very large models, use an immutable local cache of verified artifacts to avoid re-downloading and re-verification on every restart.
Human factors:
- Make provenance human-readable in dashboards. Engineers will use it if it helps debugging and auditing.
- Automate everything; if signing or attestation is manual, it will be skipped under pressure.
What to avoid
- “Latest” tags for datasets or models in production.
- Mutable buckets without versioning or access logs.
- Unpinned dependencies in training containers.
- Attestations stored only in ephemeral CI storage; publish to your registry and/or artifact repo.
- Blind trust in metrics without verifying the inputs.
A pragmatic maturity model
- Phase 1 (2–4 weeks): sign dataset manifests, sign model artifacts, store both in OCI, verify in CI. Pin dependencies and seeds.
- Phase 2 (4–8 weeks): emit in-toto/SLSA provenance automatically (Tekton Chains or GitHub Actions + slsa-framework). Gate releases on provenance; add SBOMs.
- Phase 3 (ongoing): deterministic training where feasible; admission control on clusters (Ratify + Gatekeeper); edge verification; statistical reproducibility checks.
Each increment yields immediate security and reliability benefits.
Quick checklist
- Datasets
- Content-addressable manifests with SHA-256 (Merkle tree for large sets)
- Sigstore signatures and in-toto dataset attestations
- Stored in versioned systems (lakeFS/DVC/Delta) and optionally OCI
- Training
- Hermetic, pinned environments (lockfiles, pinned CUDA/cuDNN)
- Seeds and deterministic knobs enabled and logged
- Provenance: builder, materials (dataset digest, code commit, container), params, outputs
- Packaging
- Model as OCI artifact with annotations, SBOM, license, model card
- Cosign signature (keyless preferred)
- Governance
- CI gating on signatures and SLSA provenance
- Admission control in serving clusters (Ratify + policy)
- Inference
- Verify signatures at startup or via init/sidecar
- Cache verified artifacts; pin digests
- Optional: runtime attestations of serving environment
References and tooling
- SLSA v1: https://slsa.dev
- in-toto attestations: https://in-toto.io
- Sigstore (cosign, Fulcio, Rekor): https://www.sigstore.dev
- ORAS (OCI Artifacts): https://oras.land
- OCI Distribution Spec: https://github.com/opencontainers/distribution-spec
- Tekton Chains (provenance): https://tekton.dev/docs/chains
- SLSA Verifier: https://github.com/slsa-framework/slsa-verifier
- OPA Gatekeeper: https://github.com/open-policy-agent/gatekeeper
- Kyverno: https://kyverno.io
- Ratify (artifact verification in K8s): https://github.com/deislabs/ratify
- lakeFS: https://lakefs.io
- DVC: https://dvc.org
- Delta Lake: https://delta.io
- PyTorch determinism docs: https://pytorch.org/docs/stable/notes/randomness.html
- TensorFlow determinism: https://www.tensorflow.org/guide/random_numbers
Closing opinion
In 2025, trustworthy ML means treating your data and models like critical software artifacts. That means signatures, provenance, policy, and reproducibility—even if you have to accept controlled nondeterminism in GPU-heavy workflows. The good news: the building blocks exist, and they interoperate. Start by signing datasets and models, attach SLSA provenance from your training pipeline, ship models as OCI artifacts, and verify at inference. Your future incidents, audits, and customers will thank you.