Basic doubts about CUDA


I’m new to CUDA.

  1. Please tell me whether I have understood this correctly: CUDA arrays are allocated in texture memory, and hence they are read-only. Is that right?
  2. Can someone give me an example of texturing from linear memory? And where is this linear memory allocated (in texture or global memory)?

Thanks in advance.

  1. Correct. But it’s all the same memory. CUDA arrays are just stored in a special 2D layout that is necessary for efficient texture lookups. Current hardware can’t write directly to this format, so yes, they are read-only.

  2. You can bind a special kind of 1D texture to any linear range of global memory. This means you can effectively write to this kind of texture, although you should be careful not to write to the same region you’re reading from, since the texture cache is not kept coherent with global memory writes made during the same kernel launch.

Here’s some example code:

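// create a texture reference (must be declared at file scope)
texture<float4, 1, cudaReadModeElementType> tex;

// allocate global memory (size is in bytes)
float4 *dData;
cudaMalloc((void **)&dData, size);

// bind texture to global memory
cudaBindTexture(0, tex, dData, tex.channelDesc, size);

// inside a kernel: fetch element `index` through the texture cache
float4 data = tex1Dfetch(tex, index);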
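
For anyone who wants something they can compile end to end, here is a minimal sketch of the same idea. Note the assumptions: it uses the texture object API (cudaCreateTextureObject) rather than the texture reference shown above, because texture references are deprecated in recent toolkits and, as far as I know, removed as of CUDA 12. The names copyThroughTexture and sumThroughTexture are made up for illustration, and it uses float instead of float4 to keep it short.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Each thread fetches one element through the texture path and copies it out.
__global__ void copyThroughTexture(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);
}

// Hypothetical helper: uploads 0, 1, 2, ..., n-1 to global memory, reads the
// values back through a texture object bound to that linear memory, and
// returns their sum. Returns -1.0f if a CUDA call fails.
float sumThroughTexture(int n)
{
    std::vector<float> input(n), result(n, 0.0f);
    for (int i = 0; i < n; ++i) input[i] = (float)i;

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc((void **)&dIn, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void **)&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1.0f; }
    cudaMemcpy(dIn, input.data(), bytes, cudaMemcpyHostToDevice);

    // Describe the resource: a plain linear range of global memory.
    cudaResourceDesc resDesc;
    std::memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = dIn;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = bytes;

    // Return the stored element type unchanged (no normalization).
    cudaTextureDesc texDesc;
    std::memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr) != cudaSuccess) {
        cudaFree(dIn); cudaFree(dOut);
        return -1.0f;
    }

    copyThroughTexture<<<(n + 255) / 256, 256>>>(tex, dOut, n);
    cudaError_t err = cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1.0f;

    float sum = 0.0f;
    for (float v : result) sum += v;
    return sum;
}
```

Call sumThroughTexture(1024) from your own main() to try it. The kernel only reads from the texture and writes to a separate buffer, which avoids the read/write overlap mentioned above.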

Thank you very much :)

  1. This means that there is no difference between global and texture memory; they are the same. But if data is read through a texture fetch, it is cached. Is that correct?

  2. I have some doubts about local memory:
    a) The programming guide shows local memory in the memory model (section 2.3), but it doesn’t appear in the hardware model (section 3.1), nor is it mentioned in the general specifications (section A.1). Where is the local memory?
    b) If local memory does exist, then the local variables in a kernel are allocated in local memory. Is that correct?
    c) Is there any difference in speed between local and global memory? Is local memory located on chip?

a.) Local memory is the same as global memory, only allocated per thread.
b.) Local variables may be placed in local memory. The compiler tries to keep local variables in registers when possible (they are much faster). But the number of registers is limited and registers are not addressable, so the compiler sometimes stores local variables in local memory.
c.) No, they’re the same speed (slow), and local memory is not on chip. Try to avoid using it when possible.
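
To make point b.) concrete, here is a small sketch of the kind of code that usually ends up in local memory: a per-thread array indexed with a value only known at run time. The kernel and helper names (dynamicIndex, localArrayDemo) are invented for this example, and whether the array really lands in local memory depends on the compiler version and target, so treat it as an illustration rather than a guarantee. Compiling with nvcc -Xptxas -v should print the per-kernel register and local memory (stack frame) usage so you can check.

```cuda
#include <vector>
#include <cuda_runtime.h>

// The index into `scratch` comes from memory at run time. Registers cannot be
// indexed dynamically, so the compiler will typically place the array in
// per-thread local memory instead of registers.
__global__ void dynamicIndex(const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float scratch[32];
    for (int k = 0; k < 32; ++k)
        scratch[k] = 2.0f * k + i;   // values depend on the thread index

    out[i] = scratch[idx[i] & 31];   // run-time index into the array
}

// Hypothetical host helper: runs the kernel with idx[i] = i % 32 and returns
// out[probe], or -1.0f if a CUDA call fails.
float localArrayDemo(int n, int probe)
{
    std::vector<int> hIdx(n);
    for (int i = 0; i < n; ++i) hIdx[i] = i % 32;
    std::vector<float> hOut(n, 0.0f);

    int *dIdx = nullptr;
    float *dOut = nullptr;
    if (cudaMalloc((void **)&dIdx, n * sizeof(int)) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void **)&dOut, n * sizeof(float)) != cudaSuccess) {
        cudaFree(dIdx);
        return -1.0f;
    }
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    dynamicIndex<<<(n + 127) / 128, 128>>>(dIdx, dOut, n);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIdx);
    cudaFree(dOut);
    return err == cudaSuccess ? hOut[probe] : -1.0f;
}
```

If you replace the run-time index with a compile-time constant, or fully unroll the access pattern, the compiler is usually able to keep everything in registers.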

Local memory is a special region of global memory per thread. If you end up using it, you’re generally doing something wrong. Well actually there is nothing special about it, it’s just very slow.

If you happen to use local memory, try changing your sinf to __sinf, your expf to __expf, etc. It has helped me to get rid of local memory, or at least lower the amount used. It also brought down the register count of my kernel.

Why is completely unknown to me. The fact that I was using local memory was in my opinion a compiler bug in the first place as I was not using that many registers.

The texture memory space is a read-only region of device memory, and it is cached. You can get several benefits from using texture memory, as described in section 5.4.


In sections it says:


It is said that local memory accesses are as fast as global memory accesses, because section 3.1 says:

Hope this helps a bit.


Thank you all very much for clearing my confusion and for the quick replies :)

sinf() expands to a block of code that does argument reduction before calling __sinf(), the hardware function, which is only accurate over [-pi, pi]. That extra code uses some registers. If you were already spilling to local memory, then reducing the number of registers with __sinf() probably helped the register allocator.

It’s even worse than that, unfortunately. The block of code it expands to uses a local array, which always ends up in local memory because current GPU hardware cannot index into thread registers.
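
If you want to check what you give up by switching from sinf() to __sinf(), a quick sketch like the one below compares the two over [-pi, pi], the range where __sinf() is said above to be accurate. The names compareSin and maxSinDifference are made up for this example. Note that if you compile with nvcc's -use_fast_math option, sinf() is itself mapped to __sinf(), so the difference would come out as zero.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread evaluates both versions on one input and stores |difference|.
__global__ void compareSin(const float *x, float *diff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        diff[i] = fabsf(sinf(x[i]) - __sinf(x[i]));
}

// Hypothetical host helper: samples n >= 2 evenly spaced points in [-pi, pi]
// and returns the largest |sinf(x) - __sinf(x)| seen, or -1.0f on a CUDA error.
float maxSinDifference(int n)
{
    const float pi = 3.14159265f;
    std::vector<float> hX(n), hDiff(n, 0.0f);
    for (int i = 0; i < n; ++i)
        hX[i] = -pi + 2.0f * pi * (float)i / (float)(n - 1);

    size_t bytes = n * sizeof(float);
    float *dX = nullptr, *dDiff = nullptr;
    if (cudaMalloc((void **)&dX, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void **)&dDiff, bytes) != cudaSuccess) {
        cudaFree(dX);
        return -1.0f;
    }
    cudaMemcpy(dX, hX.data(), bytes, cudaMemcpyHostToDevice);

    compareSin<<<(n + 255) / 256, 256>>>(dX, dDiff, n);
    cudaError_t err = cudaMemcpy(hDiff.data(), dDiff, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dDiff);
    if (err != cudaSuccess) return -1.0f;

    float worst = 0.0f;
    for (float d : hDiff)
        if (d > worst) worst = d;
    return worst;
}
```

Outside [-pi, pi] the gap can grow, so it is worth widening the sampled range to match the inputs your kernel actually sees before committing to the intrinsic.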