Hello all.
I’m writing an application using CUDA (compute 1.3) that involves repeatedly reading from a float array in global memory. The reads are fully coalesced (to my understanding): thread0 reads index 64, thread1 reads index 65, and so on. On the next iteration of the loop, the index decreases by one, so thread0 reads index 63, thread1 reads index 64, etc. Like so:
int index = threadIdx.x + blockIdx.x * blockDim.x;
float sum = 0.0f;
// size can be large, into the thousands
for (int j = 0; j < size; j++)
{
    sum += constant_mem_array[j] * global_mem_array[index - j];
    // because each thread reads the same value from constant_mem_array, I plan to make
    // constant_mem_array a shared memory array to hopefully increase speed
}
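The shared-memory version I have in mind would look roughly like this (just a sketch with placeholder names, assuming size * sizeof(float) fits in the 16 KB of shared memory on compute 1.3, and that index - j never goes negative, e.g. because the input is padded):

```cuda
// Sketch of the planned shared-memory version (placeholder names).
__global__ void shifted_dot(const float *coeffs, const float *global_mem_array,
                            float *out, int size)
{
    extern __shared__ float s_coeffs[];   // size floats, given at launch time

    // Cooperatively stage the coefficients once per block.
    for (int j = threadIdx.x; j < size; j += blockDim.x)
        s_coeffs[j] = coeffs[j];
    __syncthreads();

    int index = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0.0f;
    for (int j = 0; j < size; j++)
        sum += s_coeffs[j] * global_mem_array[index - j];
    out[index] = sum;
}

// Launch with the dynamic shared-memory size as the third configuration parameter:
// shifted_dot<<<grid, block, size * sizeof(float)>>>(d_coeffs, d_in, d_out, size);
```

(If size ever exceeds shared memory, the staging loop would have to process the coefficients in tiles instead.)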
From my understanding, reading from global_mem_array in this fashion results in fully coalesced accesses. Am I correct? I also tried binding the global array to a texture, but that increased the execution time by a lot, from about 155 ms to 2200 ms. Why is this? I figured that, since on every iteration all threads but one read a value that was already fetched on the previous iteration (only thread0 reads a value that has not been touched yet), this code would benefit from the caching that texture memory provides. Is there something I am missing?
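For reference, the texture path I tried looks roughly like this (the compute 1.x texture reference API; names are placeholders):

```cuda
// Texture-reference version of the same read (compute 1.x API, placeholder names).
texture<float, 1, cudaReadModeElementType> tex_data;

__global__ void shifted_dot_tex(const float *coeffs, float *out, int size)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0.0f;
    for (int j = 0; j < size; j++)
        sum += coeffs[j] * tex1Dfetch(tex_data, index - j);
    out[index] = sum;
}

// Host side, before the launch:
// cudaBindTexture(0, tex_data, d_global_mem_array, num_elements * sizeof(float));
```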
Also, do the built-in vector types (especially float4) help with reads from global memory? The programming guide mentions that 128-bit reads are compiled into a single load instruction. So if each thread in the previous example read 4 floats instead of 1 while unrolling the loop 4 times, would that increase performance (I believe this would correspond to 4 uncoalesced reads)? Or do wide reads only gain performance when the 128 bytes are spread out perfectly over a warp? I tried searching for information about the built-in vector types, but I have found just about nothing. For example, if you were to multiply a float4 by a float4, is that the same as
float4 a = <init>;
float4 b = <init>;
float4 c;
c.x = a.x * b.x;
c.y = a.y * b.y;
// same for the z and w fields
I guess what I am trying to ask is: will the compiler be able to optimize float4 operations (such as multiplies or adds) or memory accesses? Are there instructions specifically for 128-bit quantities (analogous to SIMD instructions on CPUs, where two floats occupying a 64-bit register can be multiplied at the same time)?
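To make the question concrete, the float4 variant I am picturing looks like this (a sketch with placeholder names; as far as I can tell there is no built-in operator* for float4 in plain CUDA — the SDK's cutil_math.h adds such overloads — so I have written the components out):

```cuda
// Sketch of a float4 read (placeholder names). A 16-byte-aligned float4
// access can compile to a single 128-bit load; the arithmetic, however,
// is written out per component.
__global__ void scale4(const float4 *in, float4 *out, float s, int n4)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n4)
    {
        float4 v = in[i];        // one 16-byte load per thread
        v.x *= s;  v.y *= s;     // four separate scalar multiplies
        v.z *= s;  v.w *= s;
        out[i] = v;              // one 16-byte store per thread
    }
}
```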
If anyone knows this information or of a place where I could find this information, I would greatly appreciate it.
Thank you all for your help.