Building on the 3-trillion-dimension video I just posted, I ran a full autonomous discovery sweep to find the absolute sweet spots on the RTX 4060 Laptop GPU.
All tests used pure FP16 Tensor Core matmuls, with row counts auto-calibrated to stay inside a safe VRAM budget.
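For anyone curious what that auto-calibration might look like, here's a minimal sketch in pure Python (no GPU needed). The byte accounting — three FP16 buffers of shape (rows, dim) and rounding to a multiple of 256 for tile alignment — is my assumption, not the exact logic from the video:

```python
def calibrate_rows(dim: int, vram_budget_bytes: int,
                   bytes_per_element: int = 2,   # FP16
                   n_buffers: int = 3) -> int:
    """Largest row count whose working set fits the VRAM budget.

    Assumes the working set is roughly n_buffers matrices of
    shape (rows, dim) in FP16: input, weight slice, and output.
    """
    bytes_per_row = dim * bytes_per_element * n_buffers
    rows = vram_budget_bytes // bytes_per_row
    # Round down to a multiple of 256 so Tensor Core tiles stay aligned.
    return (rows // 256) * 256

# Example: a 4 GB budget at dim=4096
print(calibrate_rows(4096, 4 * 1024**3))  # → 174592
```

With different buffer-count assumptions you'd land on different row counts, which is why the calibrated values in the table below don't follow one obvious formula.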
Apex Discovery Results:
| Dimensions | Rows | TFLOPS | Bandwidth (GB/s) |
|---|---|---|---|
| 4,096 | 327,680 | 16.73 | 8.22 |
| 5,120 | 262,144 | 20.30 | 8.01 |
| 6,144 | 218,432 | 18.56 | 6.13 |
| 7,168 | 187,232 | 15.35 | 4.37 |
| 8,192 | 163,840 | 17.20 | 4.30 |
| 12,288 | 109,216 | 20.91 | 3.60 |
| 16,384 | 81,920 | 20.80 | 2.79 |
Key takeaways:
- Peak performance: 20.91 TFLOPS at 12,288 dimensions
- Very strong resonance band around 12k–16k dimensions (both runs above 20.8 TFLOPS)
- All tests stayed well under 4 GB VRAM thanks to the same zero-allocation / phase-slicing technique shown in the 3T video
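For reference, TFLOPS numbers like these are usually derived from the standard 2·M·N·K FLOP count for a matmul divided by elapsed time. A minimal sketch — the shape convention (rows × dim) @ (dim × dim) is my assumption about how the benchmark is set up:

```python
def matmul_tflops(rows: int, dim: int, elapsed_s: float) -> float:
    """TFLOPS for a (rows x dim) @ (dim x dim) matmul.

    Each output element costs dim multiplies and dim adds,
    so total FLOPs = 2 * rows * dim * dim.
    """
    flops = 2 * rows * dim * dim
    return flops / elapsed_s / 1e12

# Example: under this shape assumption, the 12,288-dim run at
# 20.91 TFLOPS implies roughly 1.58 s of kernel time.
print(round(matmul_tflops(109_216, 12_288, 1.577), 2))  # → 20.91
```

Running the same function over your own timings makes results directly comparable across GPUs, regardless of batch shape.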
This is the exact same core that scales cleanly to 3-trillion dimensions at only 1.87 GB VRAM.
Would love to hear from other CUDA devs — what dimensions / batch sizes do you see as sweet spots on your hardware?
#VeilSight #PhaseSlicing #TensorCore