I’m trying to optimize my kernel for some basic rejection calculations, and the slowest portion appears to be the access to global memory and the conversion from float to double. From reading a number of articles about optimizing memory access, I think I am coalescing memory properly, but I would love some advice on speeding up my initial load from memory.
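For reference, here is a stripped-down, hypothetical sketch of the access pattern I mean (the names are placeholders, not my actual kernel): consecutive threads read consecutive floats and immediately widen them to double.

// Hypothetical, simplified version of the load-and-widen pattern (not the real kernel).
// Consecutive threads read consecutive floats, so each warp's loads should coalesce
// into a small number of memory transactions.
__global__ void rejectionKernel(const float* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = (double)in[i];   // float -> double widening happens here
        // ... rejection math in double precision goes here ...
        out[i] = x;
    }
}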
Attached is a screenshot of my Nsight capture. Tex throttling is almost entirely the reason all of my warps are waiting to execute.
I’m running an RTX 3060; below is my deviceQuery information:
Device 0: "NVIDIA GeForce RTX 3060"
CUDA Driver Version / Runtime Version 11.7 / 11.7
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 12288 MBytes (12884377600 bytes)
(028) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1807 MHz (1.81 GHz)
Memory Clock rate: 7501 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 2359296 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 43 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.7, CUDA Runtime Version = 11.7, NumDevs = 1
My FP64 Pipeline is at 87.3% utilization.
The data in global memory is stored as float and then converted to double for mathematical precision when performing the calculations. I’m not sure whether storing it as double to begin with, to avoid the type conversion, would help relax some of the pressure on the FP64 pipeline.
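For context, the two storage options I’m weighing look roughly like this (simplified, hypothetical kernels, with the real rejection math replaced by a stand-in):

// Variant A (current): store float, widen to double in the kernel.
// Half the global-memory traffic, but every element pays a float -> double conversion.
__global__ void calcFromFloat(const float* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = (double)in[i];
        out[i] = x * x;             // stand-in for the real rejection math
    }
}

// Variant B (considered): store double directly, so no conversion,
// but twice the bytes loaded from global memory per element.
__global__ void calcFromDouble(const double* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = in[i];
        out[i] = x * x;             // stand-in for the real rejection math
    }
}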
I’ve read that table a few times in the past, but it finally made sense looking at it now. I think you are right, and that is my main problem. I guess the only answer to getting better performance is finding a way to reduce my dependence on the double-precision pipeline. I’ll go over my math to see if there is a way I can leverage the single-precision pipeline instead.
Do you know if there is an option to perform mixed-precision math, so that I start with floats and end with a double? I don’t know if that would be faster, or if that feature is even available outside of the Tensor Cores.
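To make that concrete, this is the kind of thing I have in mind (purely a hypothetical sketch; I don’t know whether it actually helps or whether the precision would still be acceptable): keep the intermediate work in float and only promote the final result to double.

// Hypothetical mixed-precision sketch: intermediate terms stay in single precision,
// and only the end result is widened to double. Whether this keeps enough precision
// depends entirely on the actual math, which is what I need to check.
__global__ void mixedPrecisionSketch(const float* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];
        float t = fmaf(a, a, 1.0f);   // intermediate work in float (placeholder math)
        out[i] = (double)t;           // promote only the final result
    }
}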
I’m sorry, I have minimal experience with floating point. If you’re able to outline your problem and mention the hardware limitation you have in a post on the CUDA Programming and Performance forum, there are a number of people there who may be able to help.
njuffa, a frequent poster there, recently posted something addressing this, but I can’t find it at the moment.