About compute accuracy

Really check whether the input data to mma is the same. You should have found the bug with the double indirection, and perhaps there is another bug.

I’m still confused about this. I know access to global memory should be coalesced to 32B/64B/128B, but will the three choices lead to a significant difference in efficiency?

If you just copy from global memory to shared memory, there should not be much difference, with slightly better efficiency for the larger sizes.
(16-bit accesses, on the other hand, really are slower, or at least cost more L1 bandwidth.)
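
A minimal sketch (hypothetical kernel names and tile size) contrasting a 32-bit and a 128-bit global-to-shared copy. Per warp, both widths coalesce into full cache-line transactions, so the copy itself performs about the same; the wider load just issues fewer instructions:

```cpp
#include <cuda_fp16.h>

__global__ void copy_32bit(const __half2* __restrict__ src, float* sink)
{
    __shared__ __half2 tile[1024];                 // 2048 halves = 4 KiB
    int t = threadIdx.x;
    for (int i = t; i < 1024; i += blockDim.x)
        tile[i] = src[blockIdx.x * 1024 + i];      // 4-byte load per thread
    __syncthreads();
    if (t == 0) sink[blockIdx.x] = __low2float(tile[0]);  // keep the copy live
}

__global__ void copy_128bit(const float4* __restrict__ src, float* sink)
{
    __shared__ float4 tile[256];                   // same 4 KiB tile
    int t = threadIdx.x;
    for (int i = t; i < 256; i += blockDim.x)
        tile[i] = src[blockIdx.x * 256 + i];       // 16-byte load per thread
    __syncthreads();
    if (t == 0) sink[blockIdx.x] = tile[0].x;      // keep the copy live
}
```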

However, if you want to reorder the data with a kernel before storing to shared memory, it can make a difference, as the data is read by different threads in each of the three cases:

The lowest coalescing size is 32 bytes.
E.g. you can use 128-bit accesses so that each pair of neighbouring threads loads 32 consecutive bytes (16 bytes, i.e. 128 bits, per thread), and then do a shuffle between the two threads to move the data from thread 1 to thread 0. That way consecutive memory locations end up in the same thread, which can be helpful for combining two specific half values into one 32-bit value before storing to shared memory, as sketched below.
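
A minimal sketch (hypothetical layout, not tied to any particular mma fragment) of that pairing idea: each pair of neighbouring lanes loads 32 consecutive bytes with two 128-bit accesses, then `__shfl_xor_sync` with lane mask 1 exchanges the halves so each lane also sees its partner's 16 bytes and can pack two chosen half values into one 32-bit word before writing it to shared memory. The index arithmetic is purely illustrative:

```cpp
#include <cuda_fp16.h>

__global__ void reorder_pairs(const __half* __restrict__ src, unsigned* dst)
{
    __shared__ unsigned smem[32];                       // one packed word per lane

    int lane = threadIdx.x & 31;

    // 128-bit load: each lane grabs 8 consecutive halves (16 bytes).
    // Lanes 2k and 2k+1 together cover 32 consecutive bytes.
    int4 mine = reinterpret_cast<const int4*>(src)[blockIdx.x * 32 + lane];

    // Exchange with the neighbouring lane (lane ^ 1). Afterwards each lane
    // holds both its own 16 bytes and its partner's 16 bytes.
    int4 partner;
    partner.x = __shfl_xor_sync(0xffffffffu, mine.x, 1);
    partner.y = __shfl_xor_sync(0xffffffffu, mine.y, 1);
    partner.z = __shfl_xor_sync(0xffffffffu, mine.z, 1);
    partner.w = __shfl_xor_sync(0xffffffffu, mine.w, 1);

    // Example recombination: take the first half from "mine" and the first
    // half from "partner" (16 bytes further along in memory, for even lanes)
    // and pack them into one 32-bit value.
    __half a = reinterpret_cast<const __half*>(&mine)[0];
    __half b = reinterpret_cast<const __half*>(&partner)[0];
    __half2 packed = __halves2half2(a, b);

    smem[lane] = *reinterpret_cast<unsigned*>(&packed);
    __syncthreads();

    if (lane == 0) dst[blockIdx.x] = smem[0];           // keep the result live
}
```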