Constant memory

It seems that out of the 1 GB of memory on the card, only 64 KB is reserved for constant memory. While I'm still trying to digest the 16 KB limit on shared memory, this one I can't understand at all. I have 1 MB of read-only data which is constantly in use. Is there any memory faster than global (preferably cached) that I can allocate it in?

What can I put in 64 KB anyway, my name?!
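For reference, this is roughly what I mean; everything declared __constant__ has to squeeze into that one 64 KB bank (a minimal sketch, with a made-up table name and size):

#include <cuda_runtime.h>

// Everything declared __constant__ must fit in the 64 KB bank,
// so a 1 MB read-only table simply cannot live here.
__constant__ float table[8 * 1024];   // 32 KB, already half the bank

__global__ void useTable(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * table[i & (8 * 1024 - 1)];  // served by the constant cache
}

// Host side, done once before launching the kernel:
//   cudaMemcpyToSymbol(table, hostTable, sizeof(table));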

Use a texture. It is also much faster for non-uniform access, where different threads of a warp access different memory locations.

In theory, it is twice as fast as global memory.

Thanks, I'll try the texture memory.
BTW, I've seen that memory for textures is allocated through cudaMallocArray, and then I discovered cudaMallocPitch. So I was wondering: if cudaMallocArray isn't exclusive to textures, how different is it from cudaMallocPitch? And if both of them involve automatic byte alignment, why shouldn't I always use them instead of cudaMalloc?

The primary difference is that cudaMallocArray allocates 2D memory in the form of a CUDA array, while cudaMallocPitch allocates pitched linear memory. If you're using 2D memory, you can get better cache coherence by binding the texture to a CUDA array. If you're using 1D memory, I don't think arrays offer any performance benefit, and you'll be working with the opaque cudaArray type instead of raw data (i.e., a float *).
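For illustration, the two allocations look roughly like this (just a sketch; the 1024x1024 size and variable names are arbitrary):

#include <cuda_runtime.h>

int main(void)
{
    const int width = 1024, height = 1024;

    // cudaMallocPitch: pitched *linear* memory; each row is padded so that
    // rows start on aligned addresses. You can dereference this pointer in a kernel.
    float *pitched = 0;
    size_t pitch = 0;   // row stride in bytes, returned by the call
    cudaMallocPitch((void **)&pitched, &pitch, width * sizeof(float), height);

    // cudaMallocArray: an opaque CUDA array, usable only through texture/surface access.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *arr = 0;
    cudaMallocArray(&arr, &desc, width, height);

    cudaFreeArray(arr);
    cudaFree(pitched);
    return 0;
}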

Hope that made sense.

Again, you discussed why cudaMallocArray would be better for binding textures. I was asking whether it would be better for "daily use" (just ordinary data types), as far as the byte-alignment issue goes.

Another question: if I use texture memory only to speed up access to regular data types (to benefit from caching), and I don't need interpolation or the other texture features, do I still have to access the memory with tex2D(), or can I use an ordinary float* pointer to the memory from cudaMallocArray?

CUDA arrays (as in, the data type cudaArray) are opaque data types that cannot be accessed outside of the texture/surface functions. They reorder the data in memory in a proprietary way to improve locality of access for 2D (and 3D) spatial patterns.
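Since the array itself can't be dereferenced, access goes through a texture read, roughly like this (a sketch using the legacy texture-reference API from this era; the sizes, names, and kernel are placeholders):

#include <cuda_runtime.h>

texture<float, 2, cudaReadModeElementType> tex;   // file-scope texture reference

__global__ void copyFromTexture(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(tex, x + 0.5f, y + 0.5f);  // must go through tex2D, not a pointer
}

int main(void)
{
    const int width = 256, height = 256;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    cudaArray *arr = 0;
    cudaMallocArray(&arr, &desc, width, height);
    // ... fill the array with cudaMemcpy2DToArray ...

    cudaBindTextureToArray(tex, arr, desc);   // the cudaArray has no raw pointer to hand out

    // launch copyFromTexture<<<grid, block>>>(d_out, width, height), then:
    cudaUnbindTexture(tex);
    cudaFreeArray(arr);
    return 0;
}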

You can bind a texture to 1D linear memory (e.g., a normal float pointer), but to get the caching benefit you have to access the memory with the tex1Dfetch() function rather than just dereferencing the pointer.
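That case looks something like this (again just a sketch with the legacy texture-reference API; the array size and names are made up):

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> linTex;   // bound to plain linear memory

__global__ void cachedRead(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(linTex, i);   // cached read; dereferencing the raw pointer would skip the texture cache
}

int main(void)
{
    const int n = 1 << 20;
    float *devPtr = 0;
    cudaMalloc((void **)&devPtr, n * sizeof(float));   // ordinary cudaMalloc'd memory

    cudaBindTexture(0, linTex, devPtr, n * sizeof(float));

    // launch cachedRead<<<(n + 255) / 256, 256>>>(d_out, n), then:
    cudaUnbindTexture(linTex);
    cudaFree(devPtr);
    return 0;
}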

Sounds reasonable, thanks.