This RFC formalizes the SIMD (Single Instruction Multiple Data) operations currently implemented in the experimental PackedTritVector system, promoting them from experimental to stable within the Deterministic Core Profile (DCP). It establishes SIMD as the highest-performance tier in the scalar → SWAR → SIMD optimization hierarchy, providing maximum throughput for large trit vectors while maintaining strict cross-platform determinism.
The core SIMD implementation already exists in code and is not greenfield:

- AVX2 kernels for TNot, TAnd, and TOr
- NEON kernels for TNot, TAnd, and TOr
- SWAR fallback and boundary-case coverage (63/64/65/127/129-class cases)

This RFC therefore focuses on promotion and surface formalization:

- Promoting the API surface out of `experimental`

It does not currently claim a SIMD-specific TISC opcode family or SIMD-specific Trace-JIT lowering. Existing VM/JIT tritwise integration remains the RFC-0040 SWAR path.
Status 2026-03-18: accepted in-repo. The remaining evidence and compatibility
work below is stable-promotion hardening, not an accepted blocker.
The current SIMD implementation resides in experimental/packed_trit_vector.hpp and serves as the critical performance path for large-scale ternary operations (256+ trits). However, its experimental status prevents broader adoption and creates uncertainty about cross-architecture consistency. Formalizing the SIMD operations removes both obstacles.
**Primary Architectures:**

- x86_64 with AVX2 (256-bit vectors)
- ARM64 with NEON (128-bit vectors)
**Future Extensions:**
**2-Bit Trit Encoding (Consistent with SWAR):**

| Trit | Binary Pattern | Description |
|---|---|---|
| -1 | 00 | Negative one |
| 0 | 01 | Zero |
| +1 | 11 | Positive one |
| – | 10 | Invalid (error detection) |

Note: the 10 pattern is reserved for explicit error marking; validation rejects vectors containing any 10 pair.
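To make the encoding concrete, here is a minimal packing-and-validation sketch. The helper names (`encode_trit`, `pack4`, `contains_invalid_pair`) are illustrative, not the repo's API; it assumes four trits per byte, LSB-first, matching the worked examples below.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only (not the repo's API).
// Encode one trit: -1 -> 00, 0 -> 01, +1 -> 11.
static inline uint8_t encode_trit(int t) {
    return t < 0 ? 0b00 : (t == 0 ? 0b01 : 0b11);
}

// Pack 4 trits into one byte, LSB-first (trit 0 occupies bits 1:0).
static inline uint8_t pack4(const int t[4]) {
    uint8_t b = 0;
    for (int k = 0; k < 4; ++k) b |= encode_trit(t[k]) << (2 * k);
    return b;
}

// Validation: flag any 10 pair (high bit set, low bit clear).
static inline bool contains_invalid_pair(uint8_t b) {
    return ((b >> 1) & ~b & 0x55) != 0;
}
```

For example, `pack4` applied to [-1, 0, +1, -1] yields 0x34, the trits 0-3 byte of the worked examples below.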
Vector Width Configuration:
| Architecture | Vector Width | Bytes/Op | Trits/Op | Processing Stride |
|---|---|---|---|---|
| AVX2 | 256-bit | 32 | 128 | 32-byte loop |
| NEON | 128-bit | 16 | 64 | 16-byte loop |
**Trit Density Summary:** each byte packs 4 trits (2 bits per trit), so one AVX2 op covers 128 trits per 32-byte load and one NEON op covers 64 trits per 16-byte load.
#### 1. TNot (Ternary Negation)
```cpp
// AVX2 implementation using an XOR mask (balanced ternary negation).
// A valid trit is non-zero exactly when its two bits agree:
//   -1 (00) -> +1 (11)   flip both bits
//    0 (01) ->  0 (01)   unchanged
//   +1 (11) -> -1 (00)   flip both bits
__m256i mask55 = _mm256_set1_epi8(0x55);
for (; i + 32 <= n; i += 32) {
    __m256i x = _mm256_loadu_si256((const __m256i*)(src + i));
    // Align each pair's high bit with its low bit; cross-byte spill is masked off
    __m256i high = _mm256_and_si256(_mm256_srli_epi64(x, 1), mask55);
    __m256i low  = _mm256_and_si256(x, mask55);
    // eq = 1 in the low position of every pair whose bits agree (non-zero trit)
    __m256i eq = _mm256_andnot_si256(_mm256_xor_si256(high, low), mask55);
    // Widen eq to cover both bits of the pair, then flip those pairs
    __m256i flip = _mm256_or_si256(eq, _mm256_slli_epi64(eq, 1));
    _mm256_storeu_si256((__m256i*)(dst + i), _mm256_xor_si256(x, flip));
}
```
**NEON Implementation (TNot):**

```cpp
uint8x16_t mask55 = vdupq_n_u8(0x55);
for (; i + 16 <= n; i += 16) {
    uint8x16_t x = vld1q_u8(src + i);
    // Align each pair's high bit with its low bit
    uint8x16_t high = vandq_u8(vshrq_n_u8(x, 1), mask55);
    uint8x16_t low  = vandq_u8(x, mask55);
    // eq = 1 in the low position of every pair whose bits agree (non-zero trit)
    uint8x16_t eq = vbicq_u8(mask55, veorq_u8(high, low));  // mask55 & ~(high ^ low)
    // Widen eq to cover both bits of the pair, then flip those pairs
    uint8x16_t flip = vorrq_u8(eq, vshlq_n_u8(eq, 1));
    vst1q_u8(dst + i, veorq_u8(x, flip));
}
```
**Worked Example (TNot):**

```
Input trits: [-1, 0, +1, -1, 0, +1, 0, -1]
Packed hex:  00 01 11 00 01 11 01 00 = 0x1D 0x34
Negation:    [+1, 0, -1, +1, 0, -1, 0, +1]
Result hex:  11 01 00 11 01 00 01 11 = 0xD1 0xC7
```

(Bit pairs are listed in trit order; each byte packs four trits LSB-first, with the byte holding trits 4-7 shown first, so trits 0-3 land in 0x34.)
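For cross-checking, a hypothetical scalar per-byte reference of the same negation (not the repo's SWAR kernel) reproduces the example above:

```cpp
#include <cstdint>

// Hypothetical scalar reference for one packed byte (4 trits).
static uint8_t ref_tnot_byte(uint8_t x) {
    uint8_t low  = x & 0x55;              // low bit of each pair
    uint8_t high = (x >> 1) & 0x55;       // high bit, aligned down
    uint8_t eq   = ~(high ^ low) & 0x55;  // 1 where bits agree (non-zero trit)
    return x ^ (uint8_t)(eq | (eq << 1)); // flip both bits of non-zero trits
}
// ref_tnot_byte(0x34) == 0xC7 and ref_tnot_byte(0x1D) == 0xD1,
// matching the worked example.
```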
#### 2. TAnd (Ternary Conjunction)
```cpp
// AVX2 implementation with min(a,b) semantics.
// On validated inputs the encodings form a chain under the bitwise order
// (00 < 01 < 11), so per-trit min is exactly bitwise AND of the packed bytes.
for (; i + 32 <= n; i += 32) {
    __m256i va = _mm256_loadu_si256((const __m256i*)(src_a + i));
    __m256i vb = _mm256_loadu_si256((const __m256i*)(src_b + i));
    _mm256_storeu_si256((__m256i*)(dst + i), _mm256_and_si256(va, vb));
}
```
**NEON Implementation (TAnd):**

```cpp
// Same chain property: per-trit min is bitwise AND on validated inputs
for (; i + 16 <= n; i += 16) {
    uint8x16_t va = vld1q_u8(src_a + i);
    uint8x16_t vb = vld1q_u8(src_b + i);
    vst1q_u8(dst + i, vandq_u8(va, vb));
}
```
**Worked Example (TAnd):**

```
Input A:    [-1, 0, +1, -1, 0, +1, 0, -1]
Input B:    [+1, 0, -1, -1, +1, 0, 0, +1]
Packed A:   00 01 11 00 01 11 01 00 = 0x1D 0x34
Packed B:   11 01 00 00 11 01 01 11 = 0xD7 0x07
TAnd (min): [-1, 0, -1, -1, 0, 0, 0, -1]
Result hex: 00 01 00 00 01 01 01 00 = 0x15 0x04
```
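Since TAnd on valid encodings reduces to bitwise AND, the packed result above can be checked in one line; the assertion below is purely illustrative:

```cpp
static_assert((0x1D34 & 0xD707) == 0x1504,
              "TAnd (min) equals bitwise AND on valid packed trits");
```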
#### 3. TOr (Ternary Disjunction)
```cpp
// AVX2 implementation with max(a,b) semantics.
// Dual of TAnd: under the chain 00 < 01 < 11, per-trit max is bitwise OR.
for (; i + 32 <= n; i += 32) {
    __m256i va = _mm256_loadu_si256((const __m256i*)(src_a + i));
    __m256i vb = _mm256_loadu_si256((const __m256i*)(src_b + i));
    _mm256_storeu_si256((__m256i*)(dst + i), _mm256_or_si256(va, vb));
}
```
**NEON Implementation (TOr):**

```cpp
// Per-trit max is bitwise OR on validated inputs
for (; i + 16 <= n; i += 16) {
    uint8x16_t va = vld1q_u8(src_a + i);
    uint8x16_t vb = vld1q_u8(src_b + i);
    vst1q_u8(dst + i, vorrq_u8(va, vb));
}
```
**Worked Example (TOr):**

```
Input A:    [-1, 0, +1, -1, 0, +1, 0, -1]
Input B:    [+1, 0, -1, -1, +1, 0, 0, +1]
Packed A:   00 01 11 00 01 11 01 00 = 0x1D 0x34
Packed B:   11 01 00 00 11 01 01 11 = 0xD7 0x07
TOr (max):  [+1, 0, +1, -1, +1, +1, 0, +1]
Result hex: 11 01 11 00 11 11 01 11 = 0xDF 0x37
```
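The dual one-line check, again purely illustrative:

```cpp
static_assert((0x1D34 | 0xD707) == 0xDF37,
              "TOr (max) equals bitwise OR on valid packed trits");
```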
#### 4. Threshold-Based Dispatch
**Dispatch Logic:**
```cpp
static constexpr size_t AVX2_THRESHOLD_BYTES = 64; // ~256 trits
static constexpr size_t NEON_THRESHOLD_BYTES = 64; // ~256 trits
// Runtime tunable thresholds (optional)
static size_t get_avx2_threshold() {
const char* env = std::getenv("T81_AVX2_THRESHOLD");
return env ? std::stoul(env) : AVX2_THRESHOLD_BYTES;
}
```
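As a sketch of how the pieces fit together, a dispatcher can combine the threshold with runtime feature detection. The wrapper below reuses names that appear elsewhere in this RFC (`get_avx2_threshold`, `is_avx2_available`, `kernel::t_not_avx2`, `kernel_not_swar`) and is illustrative, not the shipped dispatcher:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative dispatcher: below the threshold, or without AVX2, stay on the
// RFC-0040 SWAR path; otherwise use the SIMD kernel (which SWAR-handles its tail).
void t_not_dispatch(const uint8_t* src, uint8_t* dst, size_t len) {
    if (len >= get_avx2_threshold() && t81::simd::is_avx2_available()) {
        t81::simd::kernel::t_not_avx2(src, dst, len);
    } else {
        kernel_not_swar(src, dst, len);
    }
}
```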
**Threshold Rationale:** below roughly 64 packed bytes (~256 trits), dispatch and setup overhead outweigh the wider datapath, so the SWAR path remains the default for small vectors.

**Alignment Requirements:**

- Unaligned loads/stores (`_mm256_loadu_si256` / `vld1q_u8`) for compatibility
- `_mm_prefetch` for large vectors (>1KB)

**Access Patterns:**
```cpp
// Vectorized loop with 32-byte stride (AVX2)
for (; i + 32 <= n; i += 32) {
    // Process 128 trits per iteration
    __m256i chunk = _mm256_loadu_si256((const __m256i*)(src + i));
    // ... SIMD operations ...
    _mm256_storeu_si256((__m256i*)(dst + i), result);
}
// SWAR fallback for the tail
if (i < n) {
    kernel_not_swar(src + i, dst + i, n - i);
}
```
**Bit-Exact Guarantees:**

**Validation Strategy:**
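As a hedged illustration of the strategy, a property check can combine the involution and De Morgan identities of balanced ternary, here using the hypothetical `ref_tnot_byte` helper sketched in the TNot section:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative property checks (not the repo's test suite); inputs are
// assumed pre-validated (no 10 pairs).
void check_packed_byte_properties(uint8_t a, uint8_t b) {
    // Involution: negating twice restores the original trits
    assert(ref_tnot_byte(ref_tnot_byte(a)) == a);
    // De Morgan for balanced ternary: -min(a, b) == max(-a, -b)
    assert(ref_tnot_byte(a & b) == (ref_tnot_byte(a) | ref_tnot_byte(b)));
}
```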
Public API (Stable)
```cpp
namespace t81::simd {
// Primary operations with automatic dispatch
Result<ComputeTritVector> t_not(const ComputeTritVector& input);
Result<ComputeTritVector> t_and(const ComputeTritVector& a, const ComputeTritVector& b);
Result<ComputeTritVector> t_or(const ComputeTritVector& a, const ComputeTritVector& b);

// Semantic aliases (ternary logic operations)
Result<ComputeTritVector> t_neg(const ComputeTritVector& input); // Alias for t_not
Result<ComputeTritVector> t_min(const ComputeTritVector& a, const ComputeTritVector& b); // Alias for t_and
Result<ComputeTritVector> t_max(const ComputeTritVector& a, const ComputeTritVector& b); // Alias for t_or

// In-place variants for zero-allocation scenarios
Result<bool> t_not_inplace(ComputeTritVector& input);
Result<bool> t_and_inplace(ComputeTritVector& a, const ComputeTritVector& b);
Result<bool> t_or_inplace(ComputeTritVector& a, const ComputeTritVector& b);

// Architecture-specific operations (for advanced use)
Result<ComputeTritVector> t_not_avx2(const ComputeTritVector& input);
Result<ComputeTritVector> t_not_neon(const ComputeTritVector& input);

// Configuration and introspection
bool is_avx2_available();
bool is_neon_available();
size_t get_optimal_threshold();
void set_threshold_override(size_t bytes);
}
```
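A usage sketch of the promoted surface follows; the `Result<>` accessors `has_value()`/`value()` and the exception choice are assumptions made for illustration, not the project's confirmed error-handling idiom:

```cpp
#include <stdexcept>
#include <t81/simd/simd.hpp>

// Trit-wise max of two vectors. t_max is the semantic alias for t_or and
// dispatches to AVX2, NEON, or SWAR automatically.
ComputeTritVector upper_envelope(const ComputeTritVector& a,
                                 const ComputeTritVector& b) {
    auto r = t81::simd::t_max(a, b);
    if (!r.has_value()) {
        // e.g., length mismatch or an invalid 10 pair rejected by validation
        throw std::runtime_error("t_max failed");
    }
    return r.value();
}
```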
Internal Kernel API
```cpp
namespace t81::simd::kernel {
void t_not_avx2(const uint8_t* src, uint8_t* dst, size_t len);
void t_and_avx2(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t len);
void t_or_avx2(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t len);
void t_not_neon(const uint8_t* src, uint8_t* dst, size_t len);
void t_and_neon(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t len);
void t_or_neon(const uint8_t* src_a, const uint8_t* src_b, uint8_t* dst, size_t len);
}
```
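In the spirit of [A-0041-05], a minimal differential check compares a SIMD kernel byte-for-byte against the SWAR fallback (`kernel_not_swar`, as in the tail-handling snippet above); the harness itself is a sketch:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Differential sketch: SIMD and SWAR must agree bit-exactly on every input
void diff_check_tnot(const std::vector<uint8_t>& src) {
    std::vector<uint8_t> simd_out(src.size()), swar_out(src.size());
    t81::simd::kernel::t_not_avx2(src.data(), simd_out.data(), src.size());
    kernel_not_swar(src.data(), swar_out.data(), src.size());
    assert(std::memcmp(simd_out.data(), swar_out.data(), src.size()) == 0);
}
```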
**Unsupported Architectures:**

- Runtime detection (e.g., `__builtin_cpu_supports`) routes hosts without AVX2/NEON to the SWAR path

**Alignment and Memory Constraints**

**Tail Handling**

**Invalid Trit Patterns**
**Breaking Changes:**

- experimental/packed_trit_vector.hpp SIMD methods will be deprecated

**Non-Breaking Changes:**
Expected Improvements:
Benchmark Targets (per ternary operation):
Throughput assumes packed 2-bit encoding; effective trit bandwidth is 4× byte throughput
Determinism Guarantees:
Memory Safety:
SWAR-Only Approach
Hardware-Specific SIMD Only
Lookup Table (LUT) SIMD
Custom Ternary Hardware Instructions
- include/t81/simd/simd.hpp with stable API wrappers over experimental::ComputeTritVector
- docs/records/status-history/RFC_0041_SIMD_EVIDENCE_2026-03-18.md
- docs/records/status-history/RFC_0041_SIMD_EVIDENCE_2026-03-22.md (NEON enabled for TOr; SWAR default for TAnd/TNot on Neoverse-class hosts)
- Deprecation path for direct experimental usage
- Guard against direct experimental inclusion
- t81/experimental/packed_trit_vector.hpp emits #pragma message on direct include
- T81_PACKED_TRIT_VECTOR_STABLE_INCLUDE guard set by the stable header
- docs/records/status-history/RFC_0041_SIMD_EVIDENCE_2026-03-22.md

| ID | Criterion | Status |
|---|---|---|
| [A-0041-01] | All SIMD operations produce bit-exact results across x86_64 and ARM64 | Accepted in-repo: test coverage exists and ARM64 evidence is recorded; refreshed x86_64 evidence still needs to be recorded under this RFC for the next promotion step |
| [A-0041-02] | Performance benchmarks meet or exceed targets (≥2x SWAR speedup) | Accepted in-repo with bounded caveat: benchmark coverage exists, but current ARM64 evidence is mixed; the tuned implementation treats only TOr as a clear NEON candidate on this host class |
| [A-0041-03] | Stable public SIMD API exists outside experimental | Met: include/t81/simd/simd.hpp now exposes the promoted surface |
| [A-0041-04] | Backward compatibility maintained through a compatibility period | Met: stable API is currently a promotion wrapper over the existing implementation |
| [A-0041-05] | Cross-platform differential/property tests remain in CI-visible test targets | Met |
| [A-0041-06] | Boundary condition tests (63,64,65,127,128,129 bytes) pass | Met |
| [A-0041-07] | Documentation and migration guide complete | Met: migration guide, deprecation wording (#pragma message), and the 2026-03-22 evidence note are all in place; refreshed x86_64 evidence is the sole remaining item for the next promotion step |
| [A-0041-08] | SIMD-specific VM/JIT scope is either implemented or explicitly deferred | Met: explicitly deferred in this RFC revision |
- include/t81/experimental/packed_trit_vector.hpp (current implementation)
- benchmarks/BM_PackedTritVector.cpp (performance validation)
- tests/cpp/test_packed_trit_vector.cpp (existing test coverage)