CUDA memory optimization for Jetson TX1

I have a kernel launched with a grid of 81 blocks and 256 threads per block.
The first 201 threads in each block read the same part of global memory.

Example: thread 0 in block 0 reads the global memory array at index 0
thread 1 in block 0 reads the global memory array at index 1

thread 0 in block 1 reads the global memory array at index 0
thread 1 in block 1 reads the global memory array at index 1
…and so on (a sketch of the pattern is below)
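
Roughly, each block does something like this (the kernel and array names are just placeholders to illustrate the pattern, not my real code):

    __global__ void exampleKernel(const float *data, float *out)
    {
        // The first 201 threads of every block read data[0..200],
        // so the same global addresses are fetched once per block.
        if (threadIdx.x < 201) {
            out[blockIdx.x * 201 + threadIdx.x] = data[threadIdx.x];
        }
    }

launched roughly as exampleKernel<<<81, 256>>>(d_data, d_out);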

So I have a lot of reads of the same global memory addresses across blocks.
Is there a better option for this? Shared memory is not a solution here, I think, since the reuse is across blocks rather than within a block.
Texture memory, perhaps?

The L2 cache will already cover most of the redundant reads.

The L1 cache and the texture cache are both local to each SM and fall back to L2 on a miss, as far as I know. So each SM will still have to fetch the same data redundantly.

You could declare the global memory pointer as const __restrict__ in the kernel signature. That lets the compiler issue the reads through the read-only (texture) data cache (alternatively, use the __ldg() intrinsic).
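
Something along these lines (a minimal sketch; kernel and parameter names are placeholders):

    __global__ void exampleKernelRO(const float * __restrict__ in,
                                    float * __restrict__ out)
    {
        if (threadIdx.x < 201) {
            // const + __restrict__ lets the compiler route this load
            // through the read-only (texture) cache; __ldg() requests
            // the same thing explicitly.
            float v = __ldg(&in[threadIdx.x]);
            out[blockIdx.x * 201 + threadIdx.x] = v;
        }
    }

__ldg() needs compute capability 3.5 or higher, which the TX1 (Maxwell) has.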

I doubt it would provide a big speedup.

Christian