I’m trying to optimize my kernel for some basic rejection calculations, and the slowest portion appears to be the access to global memory and the conversion from float to double. From reading a number of articles about optimizing memory access, I think I am coalescing memory properly, but I would love some advice on speeding up my initial load from memory.
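For reference, here is a stripped-down, hypothetical sketch of the access pattern I mean (the names are placeholders, not my actual kernel): consecutive threads read consecutive floats and immediately widen them to double.

// Hypothetical, simplified version of the load-and-widen pattern (not the real kernel).
// Consecutive threads read consecutive floats, so each warp's loads should coalesce
// into a small number of memory transactions.
__global__ void rejectionKernel(const float* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = (double)in[i];   // float -> double widening happens here
        // ... rejection math in double precision goes here ...
        out[i] = x;
    }
}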
Attached is a screenshot of my Nsight capture. Tex throttling is almost entirely the reason all of my warps are waiting to execute.
I’m running an RTX 3060; below is my deviceQuery information:
Device 0: "NVIDIA GeForce RTX 3060"
CUDA Driver Version / Runtime Version 11.7 / 11.7
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 12288 MBytes (12884377600 bytes)
(028) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1807 MHz (1.81 GHz)
Memory Clock rate: 7501 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 2359296 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 43 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.7, CUDA Runtime Version = 11.7, NumDevs = 1
My FP64 Pipeline is at 87.3% utilization.
The data in global memory is stored as float and then converted to double for mathematical precision when performing the calculations. I’m not sure whether storing it as double to begin with, to avoid the type conversion, would help relax some of the pressure on the FP64 pipeline.
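For context, the two storage options I’m weighing look roughly like this (simplified, hypothetical kernels, with the real rejection math replaced by a stand-in):

// Variant A (current): store float, widen to double in the kernel.
// Half the global-memory traffic, but every element pays a float -> double conversion.
__global__ void calcFromFloat(const float* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = (double)in[i];
        out[i] = x * x;             // stand-in for the real rejection math
    }
}

// Variant B (considered): store double directly, so no conversion,
// but twice the bytes loaded from global memory per element.
__global__ void calcFromDouble(const double* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = in[i];
        out[i] = x * x;             // stand-in for the real rejection math
    }
}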
I’ve read that table a few times in the past, but it finally made sense looking at it now. I think you are right, and that is my main problem. I guess the only answer to getting better performance is finding a way to reduce my dependence on the double-precision pipeline. I’ll go over my math to see if there is a way I can leverage the single-precision pipeline instead.
Do you know if there is an option to perform mixed-precision math, so that I start with floats and end with a double? I don’t know if that would be faster, or if that feature is even available outside of the Tensor Cores.
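To make that concrete, this is the kind of thing I have in mind (purely a hypothetical sketch; I don’t know whether it actually helps or whether the precision would still be acceptable): keep the intermediate work in float and only promote the final result to double.

// Hypothetical mixed-precision sketch: intermediate terms stay in single precision,
// and only the end result is widened to double. Whether this keeps enough precision
// depends entirely on the actual math, which is what I need to check.
__global__ void mixedPrecisionSketch(const float* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];
        float t = fmaf(a, a, 1.0f);   // intermediate work in float (placeholder math)
        out[i] = (double)t;           // promote only the final result
    }
}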
I’m sorry, I have minimal experience with floating point. If you’re able to outline your problem and mention the hardware limitation you have in a post on the CUDA Programming and Performance forum, there are a number of people there who may be able to help.
njuffa, a frequent poster there, recently posted something addressing this, but I can’t find it at the moment.