First of all, I’m using CUDA 2.3 on Windows 7 (VS2008).
I have two questions:
1. Is there a way to check how many registers are used to execute a kernel?
2. I store read-only data in textures. What is the most efficient way to store data on the device? Should I use cudaArrays for my data, or is it enough to allocate some memory (__device__ float *d_dataPtr)?
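Add --ptxas-options=-v to the nvcc compilation flags. ptxas then prints, at compile time, the number of registers (along with shared and constant memory) used by each kernel.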
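To make that concrete, here is a minimal sketch. The file name kernel.cu, the kernel scale_kernel, and the helper registers_per_thread are placeholders I made up for illustration. The run-time query assumes your toolkit provides cudaFuncGetAttributes; if it does not, the compile-time flag alone is enough.

```cuda
// Compile with verbose ptxas output to see per-kernel resource usage:
//   nvcc --ptxas-options=-v -c kernel.cu     (kernel.cu is a placeholder name)
#include <cuda_runtime.h>

// Placeholder kernel; any __global__ function is reported the same way.
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: asks the runtime how many registers each thread of
// scale_kernel uses. Returns -1 if the query fails.
int registers_per_thread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale_kernel) != cudaSuccess) return -1;
    return attr.numRegs;
}
```

The number depends on the compiler version and the target architecture, so treat it as something to measure on your own setup rather than a fixed property of the kernel.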
It depends on your access pattern. If you access data in linear order, then a normal device pointer is fine (and in fact, you might not need the texture at all). If your data represents a 2D array and you access it in a spatially-local way, like a moving 16x16 block, then a cudaArray is better. The cudaArray holds the data in some kind of space-filling curve that makes spatially localized access more efficient than the standard row-major or column-major storage ordering.
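For reference, the two storage options look roughly like this on the host side. This is only a sketch: the helper name upload_both_ways and the image dimensions are made up, and binding either allocation to a texture is left out on purpose.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: uploads a width x height float image both ways and
// returns true if both copies read back unchanged.
bool upload_both_ways(int width, int height)
{
    const size_t row_bytes = width * sizeof(float);
    const size_t count = (size_t)width * height;
    std::vector<float> h_src(count), h_lin(count), h_arr(count);
    for (size_t i = 0; i < count; ++i) h_src[i] = (float)i;

    // Option 1: plain linear device memory, suited to linear access.
    float *d_dataPtr = 0;
    if (cudaMalloc((void **)&d_dataPtr, row_bytes * height) != cudaSuccess)
        return false;
    cudaMemcpy(d_dataPtr, &h_src[0], row_bytes * height, cudaMemcpyHostToDevice);
    cudaMemcpy(&h_lin[0], d_dataPtr, row_bytes * height, cudaMemcpyDeviceToHost);

    // Option 2: a 2D cudaArray, suited to spatially local 2D texture fetches.
    // Its internal layout is opaque, so it is filled with a dedicated copy.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *d_array = 0;
    if (cudaMallocArray(&d_array, &desc, width, height) != cudaSuccess) {
        cudaFree(d_dataPtr);
        return false;
    }
    cudaMemcpy2DToArray(d_array, 0, 0, &h_src[0], row_bytes,
                        row_bytes, height, cudaMemcpyHostToDevice);
    cudaMemcpy2DFromArray(&h_arr[0], row_bytes, d_array, 0, 0,
                          row_bytes, height, cudaMemcpyDeviceToHost);

    cudaFreeArray(d_array);
    cudaFree(d_dataPtr);
    return h_lin == h_src && h_arr == h_src;
}
```

Note that a kernel can index d_dataPtr directly, while the contents of a cudaArray are only reachable through texture fetches or the array copy functions.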
You can try it and see if it helps. The texture cache is only 6-8 kB per multiprocessor, so performance will be poor for completely random access no matter what you do.
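If you want to compare the variants, timing the kernel with CUDA events is one simple way to do it. The sketch below times a plain linear copy; copy_linear and time_copy_linear are made-up names, and you would substitute your own kernel in place of the copy.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel with a purely linear access pattern.
__global__ void copy_linear(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Hypothetical helper: returns the time of one kernel launch in milliseconds,
// or -1.0f if device memory could not be allocated.
float time_copy_linear(int n)
{
    float *d_in = 0, *d_out = 0;
    if (cudaMalloc((void **)&d_in, n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    if (cudaMalloc((void **)&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    copy_linear<<<blocks, threads>>>(d_in, d_out, n);  // warm-up, not timed

    cudaEventRecord(start, 0);
    copy_linear<<<blocks, threads>>>(d_in, d_out, n);  // timed launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the timed launch has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ms;
}
```

A single launch is noisy, so in practice you would average over several runs before drawing conclusions.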
__device__ functions don’t have to be inlined, but in practice they almost always are (and it is automatic). The programming guide mentions a __noinline__ qualifier that suggests that the compiler not inline the function. (The compiler is still free to ignore the suggestion.)
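As a small illustration of the qualifier, here is a sketch with one function marked __noinline__ and one left to the compiler. All function names are invented for the example, and whether the call really stays out of line is up to the compiler.

```cuda
#include <cuda_runtime.h>

// Hint that the compiler should keep this as a real call; it may still inline it.
__device__ __noinline__ float square_plus_one(float x)
{
    return x * x + 1.0f;
}

// No qualifier: normally inlined automatically.
__device__ float twice(float x)
{
    return 2.0f * x;
}

__global__ void apply(float *out, float x)
{
    out[0] = twice(square_plus_one(x));
}

// Hypothetical host helper: runs the kernel on one value and returns the
// result, or -1.0f if device memory could not be allocated.
float run_apply(float x)
{
    float *d_out = 0;
    float h_out = -1.0f;
    if (cudaMalloc((void **)&d_out, sizeof(float)) != cudaSuccess)
        return h_out;
    apply<<<1, 1>>>(d_out, x);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

The result is the same either way; the qualifier only influences how the code is generated, which can show up in the register count reported by ptxas.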