Building on the 3-trillion-dimension video I just posted, I ran a full autonomous discovery sweep to find the absolute sweet spots on the RTX 4060 Laptop GPU.
All tests used pure FP16 Tensor Core matmuls, with row counts auto-calibrated to stay inside a safe VRAM budget.
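For anyone curious what that auto-calibration might look like, here's a minimal sketch in pure Python (no GPU needed). The byte accounting — three FP16 buffers of shape (rows, dim) and rounding to a multiple of 256 for tile alignment — is my assumption, not the exact logic from the video:

```python
def calibrate_rows(dim: int, vram_budget_bytes: int,
                   bytes_per_element: int = 2,   # FP16
                   n_buffers: int = 3) -> int:
    """Largest row count whose working set fits the VRAM budget.

    Assumes the working set is roughly n_buffers matrices of
    shape (rows, dim) in FP16: input, weight slice, and output.
    """
    bytes_per_row = dim * bytes_per_element * n_buffers
    rows = vram_budget_bytes // bytes_per_row
    # Round down to a multiple of 256 so Tensor Core tiles stay aligned.
    return (rows // 256) * 256

# Example: a 4 GB budget at dim=4096
print(calibrate_rows(4096, 4 * 1024**3))  # → 174592
```

With different buffer-count assumptions you'd land on different row counts, which is why the calibrated values in the table below don't follow one obvious formula.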
Apex Discovery Results:
| Dimensions | Rows | TFLOPS | Bandwidth (GB/s) |
|---|---|---|---|
| 4,096 | 327,680 | 16.73 | 8.22 |
| 5,120 | 262,144 | 20.30 | 8.01 |
| 6,144 | 218,432 | 18.56 | 6.13 |
| 7,168 | 187,232 | 15.35 | 4.37 |
| 8,192 | 163,840 | 17.20 | 4.30 |
| 12,288 | 109,216 | 20.91 | 3.60 |
| 16,384 | 81,920 | 20.80 | 2.79 |
Key takeaways:
- Peak performance: 20.91 TFLOPS at 12,288 dimensions
- Very strong resonance band around 12k–16k dimensions (both runs above 20.8 TFLOPS)
- All tests stayed well under 4 GB VRAM thanks to the same zero-allocation / phase-slicing technique shown in the 3T video
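For reference, TFLOPS numbers like these are usually derived from the standard 2·M·N·K FLOP count for a matmul divided by elapsed time. A minimal sketch — the shape convention (rows × dim) @ (dim × dim) is my assumption about how the benchmark is set up:

```python
def matmul_tflops(rows: int, dim: int, elapsed_s: float) -> float:
    """TFLOPS for a (rows x dim) @ (dim x dim) matmul.

    Each output element costs dim multiplies and dim adds,
    so total FLOPs = 2 * rows * dim * dim.
    """
    flops = 2 * rows * dim * dim
    return flops / elapsed_s / 1e12

# Example: under this shape assumption, the 12,288-dim run at
# 20.91 TFLOPS implies roughly 1.58 s of kernel time.
print(round(matmul_tflops(109_216, 12_288, 1.577), 2))  # → 20.91
```

Running the same function over your own timings makes results directly comparable across GPUs, regardless of batch shape.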
This is the exact same core that scales cleanly to 3-trillion dimensions at only 1.87 GB VRAM.
Would love to hear from other CUDA devs — what dimensions / batch sizes do you see as sweet spots on your hardware?
#VeilSight #PhaseSlicing #TensorCore