Live demo: a 3-trillion-dimension manifold processed on an RTX 4060 Laptop GPU using pure phase-slicing with zero allocations, stable at only 1.87 GB of VRAM

Building on the 3-trillion-dimension video I just posted, I ran a full autonomous discovery sweep to find the absolute sweet spots on the RTX 4060 Laptop GPU.

All tests used pure FP16 Tensor Core matmuls with auto-calibrated row counts targeting safe VRAM usage.
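The post doesn't say how the row counts are auto-calibrated, so here is a minimal sketch of one plausible model: pick the largest row count whose FP16 buffers (a rows × dim input, a dim × dim operand, and a rows × dim output) fit a VRAM budget. The function name, the buffer layout, and the row alignment are all assumptions of mine, not the author's method.

```python
# Hypothetical sketch: largest row count whose FP16 matmul buffers
# (A: rows x dim, B: dim x dim, C: rows x dim) fit a VRAM budget.
FP16_BYTES = 2

def calibrate_rows(dim, vram_budget_bytes, align=4096):
    weight_bytes = dim * dim * FP16_BYTES      # square dim x dim operand
    per_row_bytes = 2 * dim * FP16_BYTES       # one input row + one output row
    free = vram_budget_bytes - weight_bytes
    if free <= 0:
        return 0
    rows = free // per_row_bytes
    return (rows // align) * align             # round down to a tile-friendly multiple

# Example: a 3.5 GB budget at dim = 4096
print(calibrate_rows(4096, int(3.5 * 1024**3)))  # → 225280
```

This ignores workspace and framework overhead, which is presumably why the post's calibrated row counts differ from this toy model.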

Apex Discovery Results:

Dimensions      Rows    TFLOPS    Bandwidth
     4,096   327,680     16.73    8.22 GB/s
     5,120   262,144     20.30    8.01 GB/s
     6,144   218,432     18.56    6.13 GB/s
     7,168   187,232     15.35    4.37 GB/s
     8,192   163,840     17.20    4.30 GB/s
    12,288   109,216     20.91    3.60 GB/s
    16,384    81,920     20.80    2.79 GB/s
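For what it's worth, the TFLOPS and bandwidth columns are mutually consistent if each run is modeled as a single FP16 matmul of shape (rows × dim) @ (dim × dim), reading both operands and writing the result once. That shape is an assumption of mine (the post doesn't state it), but under it the first table row checks out:

```python
# Sketch: derive the implied bandwidth from the TFLOPS column, assuming
# each run is one FP16 matmul (rows x dim) @ (dim x dim).
def implied_bandwidth_gbs(dim, rows, tflops):
    flops = 2 * rows * dim * dim                              # multiply-add = 2 FLOPs
    seconds = flops / (tflops * 1e12)
    bytes_moved = 2 * (rows * dim + dim * dim + rows * dim)   # A, B, C in FP16
    return bytes_moved / seconds / 1e9                        # decimal GB/s

print(round(implied_bandwidth_gbs(4096, 327_680, 16.73), 2))  # → 8.22, matching the table
```

The later rows land within ~0.01 GB/s of the table under the same model, consistent with rounding in the reported TFLOPS.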

Key takeaways:

  • Peak performance: 20.91 TFLOPS at 12,288 dimensions
  • Very strong resonance band around 12k–16k dimensions
  • All tests stayed well under 4 GB VRAM thanks to the same zero-allocation / phase-slicing technique shown in the 3T video

This is the exact same core that scales cleanly to 3 trillion dimensions at only 1.87 GB of VRAM.
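The post doesn't show the phase-slicing core itself, so here is a minimal CPU-side sketch of the general idea: walk a huge dimension in fixed-width slices, reusing preallocated buffers so memory stays flat no matter how large the total dimension grows. `SLICE`, `sliced_dot`, and the fill callbacks are hypothetical names of mine; this is NumPy, not the author's CUDA code.

```python
import numpy as np

# Hypothetical sketch of phase-slicing with zero per-iteration allocations:
# all buffers are preallocated once and refilled in place each phase.
SLICE = 4096  # assumed slice width

def sliced_dot(total_dim, fill_x, fill_w, out_dim=1):
    acc = np.zeros((1, out_dim), dtype=np.float32)   # running partial result
    tmp = np.empty((1, out_dim), dtype=np.float32)   # reused matmul output
    x = np.empty((1, SLICE), dtype=np.float32)       # reused input slice
    w = np.empty((SLICE, out_dim), dtype=np.float32) # reused weight slice
    for start in range(0, total_dim, SLICE):
        fill_x(start, x)                 # caller fills the buffers in place
        fill_w(start, w)
        np.matmul(x, w, out=tmp)         # writes into tmp, no new allocation
        acc += tmp
    return acc

# With constant all-ones inputs, the dot product over 2**20 dims is exact:
result = sliced_dot(1 << 20, lambda s, b: b.fill(1.0), lambda s, b: b.fill(1.0))
print(float(result[0, 0]))  # → 1048576.0
```

Memory use is fixed by `SLICE` and `out_dim`, not by `total_dim`, which is the property that would let the same loop structure scale to arbitrarily large dimensions within a small VRAM footprint.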

Would love to hear from other CUDA devs — what dimensions / batch sizes do you see as sweet spots on your hardware?

#VeilSight #PhaseSlicing #TensorCore