This RFC formalizes the SWAR (SIMD Within A Register) operations currently implemented in the experimental PackedTritVector system, promoting them from experimental to stable within the Deterministic Core Profile (DCP). It establishes SWAR as a fundamental building block for deterministic ternary operations, providing bit-exact results while maintaining cross-platform reproducibility.
The current SWAR implementation resides in experimental/packed_trit_vector.hpp and serves as a critical performance optimization for small-to-medium sized trit vectors. However, its experimental status prevents broader adoption across the T81 ecosystem. Formalizing SWAR operations will:
SWAR operations process multiple trits simultaneously using 64-bit word-level parallelism. The implementation uses 2-bit trit encoding (4 trits per byte) for optimal word alignment.
Trit Encoding Mapping:
| Binary | Trit | Description |
|---|---|---|
| 00 | 0 | Zero value |
| 01 | 1 | Positive one |
| 11 | -1 | Negative one |
| 10 | n/a | Invalid (error detection) |
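The mapping above can be exercised with a small pack/unpack sketch. The function names and the bit order inside a byte (trit `i` at bit offset `2 * (i % 4)`) are assumptions made for illustration; the text here does not fix them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 2-bit codes from the table: 00 -> 0, 01 -> +1, 11 -> -1 (10 is invalid).
inline std::uint8_t encode_trit(int t) {
    assert(t >= -1 && t <= 1);
    return t == 0 ? 0b00 : (t > 0 ? 0b01 : 0b11);
}

// Decodes one 2-bit code into `out`; returns false for the invalid pattern 10.
inline bool decode_trit(std::uint8_t code, int& out) {
    switch (code & 0b11) {
        case 0b00: out = 0;  return true;
        case 0b01: out = 1;  return true;
        case 0b11: out = -1; return true;
        default:   return false;  // 0b10
    }
}

// Packs trits four per byte. ASSUMPTION: trit i sits at bit offset 2 * (i % 4).
inline std::vector<std::uint8_t> pack_trits(const std::vector<int>& trits) {
    std::vector<std::uint8_t> bytes((trits.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < trits.size(); ++i) {
        const unsigned shift = 2 * static_cast<unsigned>(i % 4);
        bytes[i / 4] |= static_cast<std::uint8_t>(encode_trit(trits[i]) << shift);
    }
    return bytes;
}

// Reads trit i back out of a packed buffer; returns false on an invalid code.
inline bool unpack_trit(const std::vector<std::uint8_t>& bytes, std::size_t i, int& out) {
    const unsigned shift = 2 * static_cast<unsigned>(i % 4);
    return decode_trit(static_cast<std::uint8_t>(bytes[i / 4] >> shift), out);
}
```

Under this assumed layout, five trits occupy two bytes, and the unused pairs in the last byte stay `00`.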
Trit Density:
This density sets clear expectations for when SWAR provides benefits over scalar operations.
Ternary Logic Semantics
T81 uses the mathematically natural balanced ternary conventions:
- TAnd: `min(a,b)` (consensus/minimum)
- TOr: `max(a,b)` (any/maximum)
- TNot: `-a` (negation)

Truth Tables:
| a | b | TAnd (min) | TOr (max) |
|---|---|---|---|
| -1 | -1 | -1 | -1 |
| -1 | 0 | -1 | 0 |
| -1 | +1 | -1 | +1 |
| 0 | -1 | -1 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | +1 | 0 | +1 |
| +1 | -1 | -1 | +1 |
| +1 | 0 | 0 | +1 |
| +1 | +1 | +1 | +1 |
This convention preserves algebraic properties (idempotence, absorption, distributivity) and aligns with balanced ternary arithmetic used in T81’s mathematical foundations.
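The algebraic properties named above are easy to confirm exhaustively, since there are only three trit values. The following is a minimal scalar sketch; the `*_ref` names and `check_lattice_laws` are hypothetical helpers, not part of the T81 API.

```cpp
#include <algorithm>
#include <cassert>

// Scalar reference semantics on trits in {-1, 0, +1}.
inline int t_and_ref(int a, int b) {
    assert(a >= -1 && a <= 1 && b >= -1 && b <= 1);
    return std::min(a, b);
}
inline int t_or_ref(int a, int b) {
    assert(a >= -1 && a <= 1 && b >= -1 && b <= 1);
    return std::max(a, b);
}
inline int t_not_ref(int a) {
    assert(a >= -1 && a <= 1);
    return -a;
}

// Checks idempotence, absorption, and distributivity over all trit triples.
inline bool check_lattice_laws() {
    const int trits[] = {-1, 0, 1};
    for (int a : trits) {
        // Idempotence: a AND a == a, a OR a == a.
        if (t_and_ref(a, a) != a || t_or_ref(a, a) != a) return false;
        for (int b : trits) {
            // Absorption: a AND (a OR b) == a, a OR (a AND b) == a.
            if (t_and_ref(a, t_or_ref(a, b)) != a) return false;
            if (t_or_ref(a, t_and_ref(a, b)) != a) return false;
            for (int c : trits) {
                // Distributivity in both directions.
                if (t_and_ref(a, t_or_ref(b, c)) !=
                    t_or_ref(t_and_ref(a, b), t_and_ref(a, c))) return false;
                if (t_or_ref(a, t_and_ref(b, c)) !=
                    t_and_ref(t_or_ref(a, b), t_or_ref(a, c))) return false;
            }
        }
    }
    return true;
}
```

A scalar reference of this kind is also a convenient oracle for testing the packed kernels lane by lane.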
Encoding Duality Note: With this 2-bit encoding, the high bit acts as a “sign-like” discriminator (0 for non-negative 0/+1, 1 for negative -1), while the low bit distinguishes 0 from non-zero. Negation therefore reduces to flipping the high bit of every pair whose low bit is set.
TNot (Ternary Negation)
static void kernel_not_swar(const uint8_t* src, uint8_t* dst, size_t n);
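Algorithm: Extract the low bit of each 2-bit pair (mask `0x55...55`), shift it left by one position, and XOR it into the original word. This flips the high bit of every non-zero trit.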
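A minimal sketch of this step, assuming validated input (no `10` pairs) and a length given in bytes. The names `t_not_word` and `t_not_bytes` are hypothetical and do not reproduce the body of `kernel_not_swar`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint64_t kLowBits = 0x5555555555555555ULL;  // low bit of every pair

// Negates 32 packed trits at once: swaps 01 and 11, leaves 00 unchanged.
inline std::uint64_t t_not_word(std::uint64_t w) {
    // Precondition (assumed): no pair is 10, i.e. high bit set with low bit clear.
    assert(((w >> 1) & ~w & kLowBits) == 0);
    const std::uint64_t low = w & kLowBits;  // 1 where the trit is non-zero
    return w ^ (low << 1);                   // flip the high bit of non-zero trits
}

// Applies the word kernel over a byte buffer, with a byte-wise tail.
inline void t_not_bytes(const std::uint8_t* src, std::uint8_t* dst, std::size_t len_bytes) {
    std::size_t i = 0;
    for (; i + 8 <= len_bytes; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, src + i, 8);  // alignment-safe load
        w = t_not_word(w);
        std::memcpy(dst + i, &w, 8);
    }
    for (; i < len_bytes; ++i) {      // remaining 1 to 7 bytes, same per-pair rule
        const unsigned b = src[i];
        dst[i] = static_cast<std::uint8_t>(b ^ ((b & 0x55u) << 1));
    }
}
```

Because the shift moves a bit only from the low to the high position of the same pair, no bit crosses a byte boundary, so the result does not depend on host endianness.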
- `01` ↔ `11`, leaves `00` unchanged
- `10` (invalid pattern)
- `0x55...55` mask after validation

TAnd (Ternary Conjunction)
static void kernel_and_swar(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t n);
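Algorithm: Compute high bits via OR (the result is -1 if either operand is -1) and low bits via AND (the result is +1 only if both operands are +1), then mask and combine so that every -1 result also has its low bit set, giving min(a,b) semantics.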
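One way to realize this on a single 64-bit word is sketched below. It assumes both inputs are already validated; `t_and_word` is a hypothetical name and is not taken from `kernel_and_swar`.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kLow  = 0x5555555555555555ULL;  // low bit of every pair
constexpr std::uint64_t kHigh = 0xAAAAAAAAAAAAAAAAULL;  // high bit of every pair

// True if any pair in w is the invalid pattern 10.
inline bool has_invalid_pair(std::uint64_t w) {
    return ((w >> 1) & ~w & kLow) != 0;
}

// Lane-wise min(a, b) over 32 packed trits (00 = 0, 01 = +1, 11 = -1).
inline std::uint64_t t_and_word(std::uint64_t a, std::uint64_t b) {
    assert(!has_invalid_pair(a) && !has_invalid_pair(b));
    // High bit: the result is -1 when either operand is -1.
    const std::uint64_t high = (a | b) & kHigh;
    // Low bit: set when both operands are non-zero, and forced on for every
    // -1 result so the output is 11 rather than the invalid 10.
    const std::uint64_t low = ((a & b) & kLow) | (high >> 1);
    return high | low;
}
```

The `high >> 1` term is what keeps a case such as min(-1, 0) from producing `10`.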
TOr (Ternary Disjunction)
static void kernel_or_swar(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t n);
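Algorithm: Compute high bits via AND (the result is -1 only if both operands are -1) and low bits via OR, then mask and combine so that the low bit is set only when an operand is +1 or both operands are -1, giving max(a,b) semantics.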
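A word-level sketch of this rule follows, again assuming validated inputs; `t_or_word` is a hypothetical name rather than the body of `kernel_or_swar`. Note that a plain OR of the low bits would be wrong for max(-1, 0), so the sketch first isolates the lanes that hold +1.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kLow  = 0x5555555555555555ULL;  // low bit of every pair
constexpr std::uint64_t kHigh = 0xAAAAAAAAAAAAAAAAULL;  // high bit of every pair

// True if any pair in w is the invalid pattern 10.
inline bool has_invalid_pair(std::uint64_t w) {
    return ((w >> 1) & ~w & kLow) != 0;
}

// Lane-wise max(a, b) over 32 packed trits (00 = 0, 01 = +1, 11 = -1).
inline std::uint64_t t_or_word(std::uint64_t a, std::uint64_t b) {
    assert(!has_invalid_pair(a) && !has_invalid_pair(b));
    // High bit: the result is -1 only when both operands are -1.
    const std::uint64_t high = a & b & kHigh;
    // Lanes holding +1: low bit set, high bit clear.
    const std::uint64_t pos_a = a & ~(a >> 1) & kLow;
    const std::uint64_t pos_b = b & ~(b >> 1) & kLow;
    // Low bit: set when either operand is +1, or when the result is -1.
    const std::uint64_t low = pos_a | pos_b | (high >> 1);
    return high | low;
}
```

An equivalent formulation would negate both inputs, apply the min kernel, and negate the result, at the cost of three extra negations per word.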
SWAR operations are automatically selected based on data size thresholds:
static constexpr size_t AVX2_THRESHOLD_BYTES = 64; // ~256 trits
static constexpr size_t NEON_THRESHOLD_BYTES = 64; // ~256 trits
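As a rough illustration of how a byte threshold like this could drive path selection, the sketch below maps a buffer size to one of the tiers listed under the dispatch logic. The `Path` enum and `select_path` are hypothetical, and the sketch assumes the tiers are tested in order, smallest first.

```cpp
#include <cstddef>

// Hypothetical labels for the dispatch tiers.
enum class Path { Fastpath64, Fastpath128, SwarKernel, Simd };

// Mirrors AVX2_THRESHOLD_BYTES / NEON_THRESHOLD_BYTES (64 bytes = 256 trits).
constexpr std::size_t kSimdThresholdBytes = 64;

// ASSUMPTION: tiers are checked in ascending order, so the generic SWAR
// kernel effectively covers 17 to 63 bytes.
inline Path select_path(std::size_t bytes) {
    if (bytes <= 8)  return Path::Fastpath64;    // one 64-bit operation
    if (bytes <= 16) return Path::Fastpath128;   // two 64-bit operations
    if (bytes < kSimdThresholdBytes) return Path::SwarKernel;
    return Path::Simd;                           // SWAR still handles the tail
}
```

Whichever path is chosen, the determinism requirement means all of them must produce identical bytes for identical inputs.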
Dispatch logic:
- ≤ 8 bytes: Fastpath with direct 64-bit operations
- ≤ 16 bytes: Small fastpath with two 64-bit operations
- 1–63 bytes: SWAR kernels (SWAR handles 1–63 bytes)
- ≥ 64 bytes: SIMD (AVX2/NEON) with SWAR fallback for tails

Public API (Stable)
namespace t81::swar {
// Primary operations
Result<ComputeTritVector> t_not(const ComputeTritVector& input);
Result<ComputeTritVector> t_and(const ComputeTritVector& a, const ComputeTritVector& b);
Result<ComputeTritVector> t_or(const ComputeTritVector& a, const ComputeTritVector& b);
// In-place variants for zero-allocation scenarios
Result<bool> t_not_inplace(ComputeTritVector& input);
Result<bool> t_and_inplace(ComputeTritVector& a, const ComputeTritVector& b);
Result<bool> t_or_inplace(ComputeTritVector& a, const ComputeTritVector& b);
// Explicit SWAR selection (for testing/benchmarking)
Result<ComputeTritVector> t_not_swar(const ComputeTritVector& input);
Result<ComputeTritVector> t_and_swar(const ComputeTritVector& a, const ComputeTritVector& b);
Result<ComputeTritVector> t_or_swar(const ComputeTritVector& a, const ComputeTritVector& b);
}
Internal Kernel API
namespace t81::swar::kernel {
void t_not(const uint8_t* src, uint8_t* dst, size_t len);
void t_and(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t len);
void t_or(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t len);
}
VM Integration
- `TNOT_SWAR`, `TAND_SWAR`, `TOR_SWAR`

CanonFS Integration
Invalid Trit Patterns
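A check for the invalid `10` pattern has to flag pairs whose high bit is set while the low bit is clear, because the valid `11` encoding of -1 also has its high bit set. The following is a minimal sketch with hypothetical helper names. It OR-accumulates across the buffer instead of returning early, so the work done does not depend on where an invalid pair occurs.

```cpp
#include <cstddef>
#include <cstdint>

// True if any 2-bit pair in the byte is 10 (high bit set, low bit clear).
inline bool byte_has_invalid_trit(std::uint8_t byte) {
    const unsigned high = byte & 0xAAu;         // high bit of each pair
    const unsigned low  = (byte & 0x55u) << 1;  // low bit moved to the high position
    return (high & ~low) != 0;
}

// Scans a packed buffer; returns true if any byte contains a 10 pair.
inline bool buffer_has_invalid_trit(const std::uint8_t* data, std::size_t len) {
    unsigned acc = 0;
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned b = data[i];
        acc |= (b & 0xAAu) & ~((b & 0x55u) << 1);
    }
    return acc != 0;
}
```

A caller would typically run such a check before the kernels and map a `true` result to the failure value of the public API.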
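- `Result<T>::failure` for invalid 2-bit patterns
- Invalid patterns (`10`) trigger deterministic error handling
- `(byte & 0xAA) & ~((byte & 0x55) << 1)` across all bytes; fail if result ≠ 0 (detects any pair with the high bit set and the low bit clear; `byte & 0xAA` alone would also flag the valid `11` encoding of -1)
- `10` patterns

Unaligned Data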
- `mask_trailing()` ensures partial bytes are properly masked

Size Mismatches
Breaking Changes:
- `experimental/packed_trit_vector.hpp` SWAR methods will be deprecated

Non-Breaking Changes:
Expected Improvements:
Benchmark Results (Current):
- Throughput measured on x86_64 @ 3.2 GHz; ARM64 shows similar relative performance
- Assumes packed 2-bit encoding (4 trits/byte); effective trit bandwidth is 4× byte throughput
Determinism Guarantees:
Memory Safety:
Pure Scalar Implementation
SIMD-Only Approach
Hardware-Specific Optimizations
Lookup Table (LUT) Approach
4-Trit Shuffle LUT Approach
- `include/t81/swar/swar.hpp` with stable API

Implemented as of 2026-03-18:
- `include/t81/swar/swar.hpp`
- `include/t81/experimental/packed_trit_vector.hpp`
- `TNOT_SWAR`, `TAND_SWAR`, `TOR_SWAR`
- `ExactTrit` tensor handles with shape/type faulting
- `spec/tisc-spec.md` and `spec/tisc/opcode-*.md`
- `docs/process/migration/RFC_0040_SWAR_MIGRATION.md`
- `docs/explanation/performance-strategy.md`
- `docs/records/status-history/RFC_0040_SWAR_EVIDENCE_2026-03-18.md`

Still open:
Status 2026-03-22: accepted in-repo. ARM64 evidence is current (see
RFC_0041_SIMD_EVIDENCE_2026-03-22.md §Benchmark Results for refreshed numbers).
Deprecation wording for t81/experimental/packed_trit_vector.hpp is now in place
(#pragma message guard, 2026-03-22). The sole remaining stable-promotion item is
the x86_64 evidence refresh.
After core SWAR stabilization, priority extensions for ternary ML/AI workloads:
| ID | Criterion | Status |
|---|---|---|
| [A-0040-01] | All SWAR operations produce bit-exact results across x86_64 and ARM64 | Accepted in-repo; refreshed cross-architecture evidence is still pending for the next status transition, while local ARM64 backend/JIT/VM coverage is already in place |
| [A-0040-02] | Performance benchmarks meet or exceed current implementation | Met for promoted reference baseline: docs/records/status-history/RFC_0040_SWAR_EVIDENCE_2026-03-18.md shows SWAR materially ahead of the Phase 2A reference path on ARM64; LUT remains competitive on small local cases but is not the promoted VM/JIT path |
| [A-0040-03] | VM integration passes full conformance test suite | Met: t81_vm_rfc0040_swar_test, tisc_opcode_matrix_test, t81_vm_tisc_v04_extensions_test |
| [A-0040-04] | JIT integration maintains determinism invariants | Met: jit_trace_equivalence_test, jit_repro_oracle_test, jit_canonfs_cache_test, including SWAR policy enforcement coverage |
| [A-0040-05] | Backward compatibility maintained through deprecation cycle | Met: stable API plus deprecated compatibility shim in experimental/packed_trit_vector.hpp |
| [A-0040-06] | Documentation and migration guide complete | Met: spec updates plus docs/process/migration/RFC_0040_SWAR_MIGRATION.md, docs/explanation/performance-strategy.md, and the evidence snapshot |
- `include/t81/experimental/packed_trit_vector.hpp` (current implementation)
- `benchmarks/BM_PackedTritVector.cpp` (performance validation)
- `tests/cpp/test_packed_trit_vector.cpp` (existing test coverage)