Texture Memory vs. Global Memory and float4

Hello all.

I’m writing an application using CUDA (compute 1.3) that involves reading data from a float array from global memory quite a bit. The memory reads are fully coalesced (from my understanding) such that if thread0 reads from index 64, thread1 reads index 65, … The next iteration of the loop sees the index decrease by one, so that thread0 would read index 63, thread1 would read index 64, … etc. Like so:

index = threadIdx.x + blockIdx.x * blockDim.x;

//size can be large, into the thousands

for(j = 0; j < size; j++)

{

	sum += constant_mem_array[j] * global_mem_array[index - j];

	//because each thread reads the same value from constant_mem_array, I plan to make constant_mem_array a shared memory array to hopefully increase speed

}

From my understanding, reading from global_mem_array in this fashion results in fully coalesced accesses. Am I correct? I tried making the global array bound to a texture, but that increased the execution time (by a lot… from about 155ms to 2200ms). Why is this? I figured that because every thread but one will read a value already read from memory in the previous iteration (thread0 would be reading a value that has not been read yet), this code would benefit from the caching that texture memory provides. Is there something that I am missing?

Also, does using a vector type (especially float4) benefit in regards to reads from global memory? In the programming guide, it mentions that 128-byte reads will be compiled into one read from global memory. So if each thread in the previous example read in 4 floats instead of 1 while unrolling the loop 4 times, would that increase performance (I believe this would correspond to 4 uncoalesced reads)? Or does the 128-byte reads only gain any performance when the 128 bytes are spread out perfectly over a warp? I tried searching around for information about the built-in vector types, but I have found just about nothing. Such as, if you were to multiply a float4 by a float4, is that the same as

float4 a = <init>

float4 b = <init>

float4 c;

c.x = a.x*b.x;

c.y = a.y*b.y;

//same for z and w fields

I guess what I am trying to ask is will the compiler be able to optimize float4 operations (such as multiplies or adds) or memory accesses? Are there instructions specifically for 128-bit quantities (analogous to SIMD instructions on CPUs where 2 floats occupying a 64-bit register can be multiplied at the same time)?

If anyone knows this information or of a place where I could find this information, I would greatly appreciate it.

Thank you all for your help.

Hello all.

I’m writing an application using CUDA (compute 1.3) that involves reading data from a float array from global memory quite a bit. The memory reads are fully coalesced (from my understanding) such that if thread0 reads from index 64, thread1 reads index 65, … The next iteration of the loop sees the index decrease by one, so that thread0 would read index 63, thread1 would read index 64, … etc. Like so:

index = threadIdx.x + blockIdx.x * blockDim.x;

//size can be large, into the thousands

for(j = 0; j < size; j++)

{

	sum += constant_mem_array[j] * global_mem_array[index - j];

	//because each thread reads the same value from constant_mem_array, I plan to make constant_mem_array a shared memory array to hopefully increase speed

}

From my understanding, reading from global_mem_array in this fashion results in fully coalesced accesses. Am I correct? I tried making the global array bound to a texture, but that increased the execution time (by a lot… from about 155ms to 2200ms). Why is this? I figured that because every thread but one will read a value already read from memory in the previous iteration (thread0 would be reading a value that has not been read yet), this code would benefit from the caching that texture memory provides. Is there something that I am missing?

Also, does using a vector type (especially float4) benefit in regards to reads from global memory? In the programming guide, it mentions that 128-byte reads will be compiled into one read from global memory. So if each thread in the previous example read in 4 floats instead of 1 while unrolling the loop 4 times, would that increase performance (I believe this would correspond to 4 uncoalesced reads)? Or does the 128-byte reads only gain any performance when the 128 bytes are spread out perfectly over a warp? I tried searching around for information about the built-in vector types, but I have found just about nothing. Such as, if you were to multiply a float4 by a float4, is that the same as

float4 a = <init>

float4 b = <init>

float4 c;

c.x = a.x*b.x;

c.y = a.y*b.y;

//same for z and w fields

I guess what I am trying to ask is will the compiler be able to optimize float4 operations (such as multiplies or adds) or memory accesses? Are there instructions specifically for 128-bit quantities (analogous to SIMD instructions on CPUs where 2 floats occupying a 64-bit register can be multiplied at the same time)?

If anyone knows this information or of a place where I could find this information, I would greatly appreciate it.

Thank you all for your help.

Hi,
I’ll try to answer few of the questions and go on from there.

A. Access seems coallesced.
B. You should see some benefits from using textures. How did you use the textures in code? also textures are not magic its
a cache and it can get polluted and in-efficient.
C. You might want to use shared memory. i.e. if ‘size’ is not too big you can read everything into shared memory and then use that
data instead of going to gmem/textures. Even if ‘size’ is big you can do it in chunks.
D. Use visual profiler to see what your code does.
E. float2 should be the fastest if I remember correctly - but this might change based on your hardware.
See here: http://forums.nvidia.com/index.php?showtop…mp;#entry976583

Hope that helps,
Eyal

Hi,
I’ll try to answer few of the questions and go on from there.

A. Access seems coallesced.
B. You should see some benefits from using textures. How did you use the textures in code? also textures are not magic its
a cache and it can get polluted and in-efficient.
C. You might want to use shared memory. i.e. if ‘size’ is not too big you can read everything into shared memory and then use that
data instead of going to gmem/textures. Even if ‘size’ is big you can do it in chunks.
D. Use visual profiler to see what your code does.
E. float2 should be the fastest if I remember correctly - but this might change based on your hardware.
See here: http://forums.nvidia.com/index.php?showtop…mp;#entry976583

Hope that helps,
Eyal

Thanks for the reply!

B: I used a 1D texture of float using the default template for float (declaration: texture texture;), and in place of global_mem_array[index - j] I used tex1Dfetch(texture, index - j); I bound it as: cudaBindTexture(0, texture, dev_data, sizeof(float) * host_data);

C: That’s actually my next step in the optimization. Since size can be in the thousands (and even possibly hundreds of thousands), I plan on not using the constant memory and instead using tiles of shared memory (since accesses to constant_mem_array are not coalesced and global_mem_array are, I would get more speedup using shared memory for constant_mem_array since there won’t be bank conflicts). I tried to mention that in the code in a comment, but it is easily skipped :)

E: Thanks! That’s the first real data I’ve seen on float vectors. Do you know how arithmetic operations are done on float vectors or know of a place that describes it? Or if the compiler could optimize more using float2 or float4 by using less instructions?

Thanks!

Thanks for the reply!

B: I used a 1D texture of float using the default template for float (declaration: texture texture;), and in place of global_mem_array[index - j] I used tex1Dfetch(texture, index - j); I bound it as: cudaBindTexture(0, texture, dev_data, sizeof(float) * host_data);

C: That’s actually my next step in the optimization. Since size can be in the thousands (and even possibly hundreds of thousands), I plan on not using the constant memory and instead using tiles of shared memory (since accesses to constant_mem_array are not coalesced and global_mem_array are, I would get more speedup using shared memory for constant_mem_array since there won’t be bank conflicts). I tried to mention that in the code in a comment, but it is easily skipped :)

E: Thanks! That’s the first real data I’ve seen on float vectors. Do you know how arithmetic operations are done on float vectors or know of a place that describes it? Or if the compiler could optimize more using float2 or float4 by using less instructions?

Thanks!