Global Load and Store in C for CUDA


Just a small question: what exactly do we mean by global load and store with respect to NVIDIA GPUs, and how does this feature enable a C-like language?


This sounds like a student assignment to me!

Global memory refers to the main memory of the GPU, which is accessible by all threads (as opposed to shared memory, which is shared only between the threads in a block). Global load and store are the instructions that move data between this memory and registers. Previous GPU architectures (using pixel shaders) could only write to a fixed location in memory for each thread (the current pixel), although they could read any address through textures. So it is global store (to arbitrary addresses, with byte addressing) that allows C-like array access and pointers in CUDA.
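To make the point about arbitrary addresses concrete, here is a minimal sketch of a scatter kernel. The names scatter and scatter_on_gpu are made up for this illustration, and error checking is left out to keep it short. Each thread does one global load and then one global store to an address it computes at run time, which is what a pixel shader could not do.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: out[dst[i]] = in[i] for every i < n.
__global__ void scatter(const int *in, const int *dst, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int value = in[i];    // global load: global memory -> register
        out[dst[i]] = value;  // global store to an address computed at run time
    }
}

// Hypothetical host helper: copies inputs to the GPU, runs the kernel, copies back.
void scatter_on_gpu(const int *h_in, const int *h_dst, int *h_out, int n)
{
    int *d_in, *d_dst, *d_out;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_dst, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dst, h_dst, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scatter<<<blocks, threads>>>(d_in, d_dst, d_out, n);

    // This copy runs after the kernel has finished.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_dst);
    cudaFree(d_out);
}

int main()
{
    const int n = 8;
    int in[n], dst[n], out[n];
    for (int i = 0; i < n; ++i) {
        in[i] = 10 * i;
        dst[i] = n - 1 - i;  // a permutation, so no two threads write the same address
    }
    scatter_on_gpu(in, dst, out, n);
    for (int i = 0; i < n; ++i) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```

The destination indices form a permutation here on purpose: if two threads stored to the same address, which value ends up in memory would not be defined.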

CPU :: RAM :: Loads and stores (data = *p, *p = data)

GPU :: Global memory :: Loads and stores

CPU :: L1, L2, L3 caches

GPU :: Shared memory (on-chip, roughly comparable to an L1 cache, but managed by the programmer)

CPU :: Applications :: Cache-unaware (no CPU instruction directly addresses cache entries; caching happens transparently)

GPU :: Kernels :: Cache-aware (shared memory can be addressed with pointers; the kernel stages data there explicitly)

Thanks Simon and Sarnath for your explanations of the concept. Can you tell me how to declare a pointer to shared memory in CUDA?

Hi !!!

I have a question. When you store a matrix in global memory, is it stored in column-major or row-major order?

Thank You

Jose Antonio

Use a __shared__ declaration inside the device function or kernel. Refer to the CUDA Programming Guide for more details on this…
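As a rough sketch of what that looks like in practice: the kernel below declares a __shared__ array and takes an ordinary pointer to it. The names reverse_tile, reverse_tiles_on_gpu and TILE are invented for this example, the array length is assumed to be a multiple of TILE, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 64  // hypothetical block size chosen for this sketch

// Reverses each TILE-sized chunk of data in place, staging it in shared memory.
__global__ void reverse_tile(int *data)
{
    __shared__ int tile[TILE];         // one copy per thread block, on chip
    int *p = tile;                     // ordinary pointer that points into shared memory

    int t = threadIdx.x;
    int base = blockIdx.x * TILE;
    p[t] = data[base + t];             // global load, shared store
    __syncthreads();                   // wait until the whole tile is staged
    data[base + t] = p[TILE - 1 - t];  // shared load, global store
}

// Hypothetical host helper; n is assumed to be a multiple of TILE.
void reverse_tiles_on_gpu(int *h_data, int n)
{
    int *d_data;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    reverse_tile<<<n / TILE, TILE>>>(d_data);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}

int main()
{
    const int n = 2 * TILE;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;
    reverse_tiles_on_gpu(h, n);
    printf("h[0] = %d, h[%d] = %d\n", h[0], TILE, h[TILE]);
    return 0;
}
```

The pointer variable p itself lives in a register; only the memory it points to is shared. If the size is not known at compile time, the alternative is an unsized extern __shared__ array whose byte count is passed as the third kernel launch parameter.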

Since the memory allocated is linear (if you are using cudaMalloc), the layout is whatever your indexing makes it. With the usual C convention, wouldn't that be row-major? :)
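A small sketch of the indexing involved, assuming a plain cudaMalloc buffer: cudaMalloc only hands back a flat range of bytes, so the matrix order is decided by the index formula the code uses. The helpers idx_row_major, idx_col_major and fill_on_gpu are hypothetical names for this example, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Row-major (C convention): elements of one row are adjacent in memory.
__host__ __device__ inline int idx_row_major(int r, int c, int cols)
{
    return r * cols + c;
}

// Column-major (Fortran convention, also used by cuBLAS): columns are contiguous.
__host__ __device__ inline int idx_col_major(int r, int c, int rows)
{
    return c * rows + r;
}

// Writes 100 * r + c into element (r, c) of a row-major matrix.
__global__ void fill(int *m, int rows, int cols)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        m[idx_row_major(r, c, cols)] = 100 * r + c;
}

// Hypothetical host helper: fills a rows x cols matrix on the GPU and copies it back.
void fill_on_gpu(int *h_m, int rows, int cols)
{
    int *d_m;
    size_t bytes = (size_t)rows * cols * sizeof(int);
    cudaMalloc(&d_m, bytes);  // flat, linear allocation with no built-in 2D layout
    dim3 block(4, 4);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_m, rows, cols);
    cudaMemcpy(h_m, d_m, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_m);
}

int main()
{
    const int rows = 3, cols = 4;
    int m[rows * cols];
    fill_on_gpu(m, rows, cols);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c)
            printf("%4d", m[idx_row_major(r, c, cols)]);
        printf("\n");
    }
    return 0;
}
```

Swapping idx_row_major for idx_col_major in both the kernel and the host loop would store the same matrix column by column in the same buffer. Whichever formula is used, it has to be the same on both sides.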