
RFC-0040: Formalization of SWAR Operations for Deterministic Ternary Computing

Summary

This RFC formalizes the SWAR (SIMD Within A Register) operations currently implemented in the experimental PackedTritVector system, promoting them from experimental to stable within the Deterministic Core Profile (DCP). It establishes SWAR as a fundamental building block for deterministic ternary operations, providing bit-exact results while maintaining cross-platform reproducibility.

Motivation

The current SWAR implementation resides in experimental/packed_trit_vector.hpp and serves as a critical performance optimization for small-to-medium sized trit vectors. However, its experimental status prevents broader adoption across the T81 ecosystem. Formalizing SWAR operations will:

  1. Enable Deterministic Performance: Provide predictable performance characteristics for trit-wise operations below SIMD thresholds
  2. Ensure Cross-Platform Consistency: Guarantee bit-exact results across x86_64, ARM64, and future architectures
  3. Facilitate JIT Integration: Allow the Trace-JIT to emit SWAR operations for optimized code generation
  4. Support Ecosystem Growth: Enable external tools and language bindings to rely on stable SWAR primitives

Proposal

Technical Details

1. SWAR Operation Specification

SWAR operations process multiple trits simultaneously using 64-bit word-level parallelism. The implementation uses 2-bit trit encoding (4 trits per byte) for optimal word alignment.

Trit Encoding Mapping:

| Binary | Trit | Description |
|--------|------|-------------|
| 00 | 0 | Zero value |
| 01 | +1 | Positive one |
| 11 | -1 | Negative one |
| 10 | (none) | Invalid (error detection) |

Trit Density:

With the 2-bit encoding, each byte packs 4 trits and each 64-bit word processes 32 trits in parallel; the 64-byte dispatch threshold below therefore corresponds to roughly 256 trits. This density sets clear expectations for when SWAR provides benefits over scalar operations.

2. Core SWAR Operations

Ternary Logic Semantics

T81 uses the mathematically natural balanced ternary conventions, defining TAnd as min and TOr as max:

Truth Tables:

| a | b | TAnd (min) | TOr (max) |
|----|----|------------|-----------|
| -1 | -1 | -1 | -1 |
| -1 | 0 | -1 | 0 |
| -1 | +1 | -1 | +1 |
| 0 | -1 | -1 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | +1 | 0 | +1 |
| +1 | -1 | -1 | +1 |
| +1 | 0 | 0 | +1 |
| +1 | +1 | +1 | +1 |

This convention preserves algebraic properties (idempotence, absorption, distributivity) and aligns with balanced ternary arithmetic used in T81’s mathematical foundations.

Encoding Duality Note: With this 2-bit encoding, the high bit acts as a “sign-like” discriminator (0 for the non-negative values 0/+1, 1 for negative -1), while the low bit distinguishes 0 from non-zero. This makes negation a single step: XOR each pair's low bit into its high bit, which swaps 01 (+1) and 11 (-1) while leaving 00 (0) fixed.

TNot (Ternary Negation)

static void kernel_not_swar(const uint8_t* src, uint8_t* dst, size_t n);

Algorithm: Extract the low bit of each 2-bit pair, shift it left by one, and XOR it back into the word. This flips the high bit of every non-zero trit, mapping 01 (+1) to 11 (-1) and back while leaving 00 (0) unchanged.
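A word-level sketch of this algorithm, assuming the 2-bit encoding from the table above (the helper name and loop structure are illustrative, not the shipped t81::swar kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative TNot kernel. Encoding: 00 = 0, 01 = +1, 11 = -1.
// Negation maps 01 <-> 11 and fixes 00, so new_high = high XOR low:
// XOR the word with its low bits shifted left by one position.
inline void kernel_not_swar_sketch(const uint8_t* src, uint8_t* dst, size_t n) {
    constexpr uint64_t LO = 0x5555555555555555ULL;  // low bit of each 2-bit pair
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {   // 64-bit lanes process 32 trits at once
        uint64_t w;
        std::memcpy(&w, src + i, 8);
        w ^= (w & LO) << 1;        // flip the high bit of every non-zero trit
        std::memcpy(dst + i, &w, 8);
    }
    for (; i < n; ++i)             // scalar tail for the remaining bytes
        dst[i] = static_cast<uint8_t>(src[i] ^ ((src[i] & 0x55u) << 1));
}
```

For example, the byte 0b11010001, packing the trits (-1, +1, 0, +1) from the high pair down, maps to 0b01110011, i.e. (+1, -1, 0, -1).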

TAnd (Ternary Conjunction)

static void kernel_and_swar(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t n);

Algorithm: Compute result high bits as the OR of the operand high bits (the result is -1 whenever either operand is -1) and result low bits as the AND of the operand low bits, then force the low bit on wherever the high bit is set so every -1 result is the valid pattern 11. This realizes min(a, b) semantics.
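A sketch of this masking under the same encoding assumptions (the helper name is illustrative, not the shipped kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative TAnd (min) kernel. The result is -1 whenever either
// operand is -1 (OR of high bits); otherwise both operands are in
// {0, +1} and min is the AND of the low bits. A -1 result (11) must
// also carry its low bit, hence the final OR with the high bits.
inline void kernel_and_swar_sketch(const uint8_t* a, const uint8_t* b,
                                   uint8_t* dst, size_t n) {
    constexpr uint64_t LO = 0x5555555555555555ULL;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        uint64_t hi = ((wa | wb) >> 1) & LO;  // high bits: OR (either is -1)
        uint64_t lo = (wa & wb & LO) | hi;    // low bits: AND, forced on for -1
        uint64_t w  = (hi << 1) | lo;
        std::memcpy(dst + i, &w, 8);
    }
    for (; i < n; ++i) {                      // scalar tail, same masking per byte
        uint8_t hi = static_cast<uint8_t>(((a[i] | b[i]) >> 1) & 0x55u);
        uint8_t lo = static_cast<uint8_t>((a[i] & b[i] & 0x55u) | hi);
        dst[i] = static_cast<uint8_t>((hi << 1) | lo);
    }
}
```

For example, 0b11010001 (-1, +1, 0, +1) combined with 0b01110000 (+1, -1, 0, 0) yields 0b11110000 (-1, -1, 0, 0), i.e. the pairwise minimum.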

TOr (Ternary Disjunction)

static void kernel_or_swar(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t n);

Algorithm: Compute result high bits as the AND of the operand high bits (the result is -1 only when both operands are -1) and set result low bits wherever either operand is +1 or both operands are -1. This realizes max(a, b) semantics.
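The dual sketch for max(a, b), again with an illustrative helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative TOr (max) kernel. The result is -1 only when both
// operands are -1 (AND of high bits); the result is non-zero, i.e. has
// its low bit set, when either operand is +1 (low set, high clear) or
// when both operands are -1.
inline void kernel_or_swar_sketch(const uint8_t* a, const uint8_t* b,
                                  uint8_t* dst, size_t n) {
    constexpr uint64_t LO = 0x5555555555555555ULL;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        uint64_t ah = (wa >> 1) & LO, al = wa & LO;
        uint64_t bh = (wb >> 1) & LO, bl = wb & LO;
        uint64_t hi = ah & bh;                       // -1 only if both are -1
        uint64_t lo = (al & ~ah) | (bl & ~bh) | hi;  // +1 in either, or both -1
        uint64_t w  = (hi << 1) | lo;
        std::memcpy(dst + i, &w, 8);
    }
    for (; i < n; ++i) {                             // scalar tail, same masking
        uint8_t ah = static_cast<uint8_t>((a[i] >> 1) & 0x55u), al = a[i] & 0x55u;
        uint8_t bh = static_cast<uint8_t>((b[i] >> 1) & 0x55u), bl = b[i] & 0x55u;
        uint8_t hi = static_cast<uint8_t>(ah & bh);
        uint8_t lo = static_cast<uint8_t>((al & ~ah) | (bl & ~bh) | hi);
        dst[i] = static_cast<uint8_t>((hi << 1) | lo);
    }
}
```

With the operands from the TAnd example, 0b11010001 (-1, +1, 0, +1) and 0b01110000 (+1, -1, 0, 0), the result is 0b01010001 (+1, +1, 0, +1). Since max(a, b) = -min(-a, -b), TOr could also be derived from TAnd and TNot by De Morgan duality; the direct form above saves the extra negation passes.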

3. Threshold-Based Dispatch

SWAR operations are automatically selected based on data size thresholds:

static constexpr size_t AVX2_THRESHOLD_BYTES = 64;  // ~256 trits
static constexpr size_t NEON_THRESHOLD_BYTES = 64;  // ~256 trits

Dispatch logic:
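The size test itself can be sketched as follows (a hedged illustration: select_kernel_path and KernelPath are hypothetical names, and the real selection may also consult runtime CPU feature detection):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical dispatch sketch using the threshold above. Inputs below
// the threshold stay on the word-level SWAR path, avoiding SIMD setup
// overhead; larger inputs hand off to the platform vector kernels.
static constexpr size_t AVX2_THRESHOLD_BYTES = 64;  // ~256 trits

enum class KernelPath { Swar, Simd };

inline KernelPath select_kernel_path(size_t n_bytes) {
    return n_bytes < AVX2_THRESHOLD_BYTES ? KernelPath::Swar : KernelPath::Simd;
}
```

On ARM64 the same shape applies with NEON_THRESHOLD_BYTES, which the RFC sets to the same 64-byte value.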

4. API Surface

Public API (Stable)

namespace t81::swar {
    // Primary operations
    Result<ComputeTritVector> t_not(const ComputeTritVector& input);
    Result<ComputeTritVector> t_and(const ComputeTritVector& a, const ComputeTritVector& b);
    Result<ComputeTritVector> t_or(const ComputeTritVector& a, const ComputeTritVector& b);
    
    // In-place variants for zero-allocation scenarios
    Result<bool> t_not_inplace(ComputeTritVector& input);
    Result<bool> t_and_inplace(ComputeTritVector& a, const ComputeTritVector& b);
    Result<bool> t_or_inplace(ComputeTritVector& a, const ComputeTritVector& b);
    
    // Explicit SWAR selection (for testing/benchmarking)
    Result<ComputeTritVector> t_not_swar(const ComputeTritVector& input);
    Result<ComputeTritVector> t_and_swar(const ComputeTritVector& a, const ComputeTritVector& b);
    Result<ComputeTritVector> t_or_swar(const ComputeTritVector& a, const ComputeTritVector& b);
}

Internal Kernel API

namespace t81::swar::kernel {
    void t_not(const uint8_t* src, uint8_t* dst, size_t len);
    void t_and(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t len);
    void t_or(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t len);
}

5. Integration Points

VM Integration

CanonFS Integration

Corner Cases

Invalid Trit Patterns
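The encoding table reserves the pair 10 for error detection, so invalid trits can be screened with the same word-level parallelism: a pair is invalid exactly when its high bit is set and its low bit is clear. A sketch with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative validity scan: valid pairs are 00, 01, 11; the reserved
// pattern 10 has high bit set with low bit clear, so bad = high & ~low.
inline bool all_trits_valid(const uint8_t* src, size_t n) {
    constexpr uint64_t LO = 0x5555555555555555ULL;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {          // check 32 trits per 64-bit word
        uint64_t w;
        std::memcpy(&w, src + i, 8);
        uint64_t lo = w & LO;
        uint64_t hi = (w >> 1) & LO;
        if (hi & ~lo) return false;       // some pair is the invalid 10
    }
    for (; i < n; ++i) {                  // scalar tail
        uint8_t lo = src[i] & 0x55u;
        uint8_t hi = static_cast<uint8_t>((src[i] >> 1) & 0x55u);
        if (hi & ~lo) return false;
    }
    return true;
}
```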

Unaligned Data

Size Mismatches

Impact

Backward Compatibility

Breaking Changes:

Non-Breaking Changes:

Performance

Expected Improvements:

Benchmark Results (Current):

Throughput measured on x86_64 @ 3.2 GHz; ARM64 shows similar relative performance. Figures assume the packed 2-bit encoding (4 trits/byte), so effective trit bandwidth is 4× the byte throughput.

Security

Determinism Guarantees:

Memory Safety:

Alternatives Considered

Pure Scalar Implementation

SIMD-Only Approach

Hardware-Specific Optimizations

Lookup Table (LUT) Approach

4-Trit Shuffle LUT Approach

Implementation Roadmap

Phase 1: API Stabilization (Week 1-2)

Phase 2: VM Integration (Week 3-4)

Phase 3: JIT Integration (Week 5-6)

Phase 4: Documentation & Migration (Week 7-8)

Current Implementation Status

Implemented as of 2026-03-18:

Still open:

Status 2026-03-22: accepted in-repo. ARM64 evidence is current (see RFC_0041_SIMD_EVIDENCE_2026-03-22.md §Benchmark Results for refreshed numbers). Deprecation wording for t81/experimental/packed_trit_vector.hpp is now in place (#pragma message guard, 2026-03-22). The sole remaining stable-promotion item is the x86_64 evidence refresh.

Future Operations Roadmap

After core SWAR stabilization, priority extensions for ternary ML/AI workloads:

  1. Ternary ADD (with carry trit — critical for MAC operations)
  2. Ternary MUL (implemented via lookup or shift-add patterns)
  3. Compare/Clip/Abs operations for neural network activation functions
  4. Population count and bitmask extraction for sparse tensor operations
  5. Reduction operations (sum, product, consensus) for vector operations

Acceptance Criteria

| ID | Criterion | Status |
|----|-----------|--------|
| [A-0040-01] | All SWAR operations produce bit-exact results across x86_64 and ARM64 | Accepted in-repo; local ARM64 backend/JIT/VM coverage is in place, but refreshed cross-architecture evidence is still pending for the next status transition |
| [A-0040-02] | Performance benchmarks meet or exceed current implementation | Met for the promoted reference baseline: docs/records/status-history/RFC_0040_SWAR_EVIDENCE_2026-03-18.md shows SWAR materially ahead of the Phase 2A reference path on ARM64; LUT remains competitive on small local cases but is not the promoted VM/JIT path |
| [A-0040-03] | VM integration passes full conformance test suite | Met: t81_vm_rfc0040_swar_test, tisc_opcode_matrix_test, t81_vm_tisc_v04_extensions_test |
| [A-0040-04] | JIT integration maintains determinism invariants | Met: jit_trace_equivalence_test, jit_repro_oracle_test, jit_canonfs_cache_test, including SWAR policy enforcement coverage |
| [A-0040-05] | Backward compatibility maintained through deprecation cycle | Met: stable API plus deprecated compatibility shim in experimental/packed_trit_vector.hpp |
| [A-0040-06] | Documentation and migration guide complete | Met: spec updates plus docs/process/migration/RFC_0040_SWAR_MIGRATION.md, docs/explanation/performance-strategy.md, and the evidence snapshot |

References