How to Access Global Memory Efficiently in CUDA C/C++ Kernels

Originally published at:

In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA C/C++ series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different…

In the post, you state that "Arrays allocated in device memory are aligned to 256-byte memory segments by the CUDA driver." Why are arrays aligned to 256-byte segments? Is this a limitation of the CUDA driver? Are other alignments possible?

Minor mistake:

The post is tagged "CUDA C++", while the rest of the post refers to "CUDA C/C++".

Fixed. Thanks!

Hi Mark,

Could you explain this statement?
"For the C870 or any other device with a compute capability of 1.0, any misaligned access by a half warp of threads (or aligned access where the threads of the half warp do not access memory in sequence) results in 16 separate 32-byte transactions."

And then why were there 4 bytes requested per 32-byte transaction?

Thank you!

If I'm not mistaken:
* 4 bytes are requested per 32-byte transaction because the tests presented use single precision, so each element occupies 4 bytes.
* On the C870, if accesses are not aligned, each access causes a new transaction, and I suppose the hardware chooses, from the available 32-, 64-, and 128-byte transactions, the smallest one that satisfies the requested element (here 4 bytes, so 32 bytes is more than enough).

Are all the observations and best practices explained in the post still valid today, 5 years after the original post? Does this article apply equally to newer cards with, say, compute capability 6.0? has information about compute capability of 6.0 and above.