Global Memory Access Optimization, tex throttling

I’m trying to optimize a kernel that performs some basic rejection calculations, and the slowest portion appears to be the access to global memory and the conversion from float to double. From reading a number of articles on optimizing memory access, I believe I am coalescing memory properly, but I would appreciate advice on speeding up my initial load from global memory.

Attached is a screenshot of my Nsight capture. Tex throttling is almost entirely the reason all of my warps are stalled.

What GPU are you using, and what is the FP64 pipeline utilization?

I’m running an RTX 3060; my deviceQuery output is below.

Device 0: "NVIDIA GeForce RTX 3060"
  CUDA Driver Version / Runtime Version          11.7 / 11.7
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 12288 MBytes (12884377600 bytes)
  (028) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
  GPU Max Clock rate:                            1807 MHz (1.81 GHz)
  Memory Clock rate:                             7501 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 2359296 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 43 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.7, CUDA Runtime Version = 11.7, NumDevs = 1

My FP64 pipeline is at 87.3% utilization.

The data in global memory is stored as float and is converted to double for mathematical precision when performing the calculations. I’m not sure whether switching to storing doubles, to avoid the type conversion, would relax some of the pressure on the FP64 pipeline.

Your float-to-double conversions are being executed on the FP64 unit, which on consumer-oriented GPUs has a much reduced throughput - see here.

This is very likely the cause for your stalls, see Greg’s reply here.

I’ve read that table a few times in the past, but it finally made sense looking at it now. I think you are right, and that is my main problem. I guess the only way to get better performance is to reduce my dependence on the double-precision pipeline. I’ll go over my math to see whether I can leverage the single-precision pipeline instead.
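For example, since these are rejection calculations, I might be able to do a coarse rejection test in float and only fall back to double for borderline candidates. A rough sketch of what I mean (the function name, threshold, and margin are all made up for illustration):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical filter-then-refine pattern: run the rejection test in float
// (FP32 pipeline) with a tolerance widened by a safety margin, and only run
// the exact double-precision test on the few candidates that survive.
std::vector<std::size_t> reject_then_refine(const float* data, std::size_t n,
                                            double threshold) {
    const float margin = 1e-3f;  // must exceed the worst-case FP32 error
    const float coarse = static_cast<float>(threshold) + margin;
    std::vector<std::size_t> accepted;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > coarse) continue;  // cheap FP32 reject, no FP64 work
        if (static_cast<double>(data[i]) <= threshold)  // exact FP64 check
            accepted.push_back(i);
    }
    return accepted;
}
```

If most elements are rejected by the coarse test, the FP64 pipeline only sees the survivors.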

Do you know if there is a way to perform mixed-precision math, so that I start with floats and end with a double? I don’t know whether that would be faster, or whether that capability is available outside of the tensor cores.

I’m sorry, I have minimal experience with floating point. If you outline your problem and the hardware limitation you’ve hit in a post on the CUDA Programming and Performance forum, there are a number of people there who may be able to help.

njuffa, a frequent poster there, recently posted something addressing this, but I can’t find it at the moment.