Internal bandwidth: how to read at full speed

Hey guys,

I was trying to learn how to read at maximum speed. I wrote a very simple kernel (see below). Although everything looks fine and the profiler reports 100% occupancy, I only get about 40 GB/s. The board is advertised at twice that … what am I doing wrong? I also get about 20 GFlops in sdot for large vector sizes, which shows that at least 80 GB/s is indeed possible …

I have 256 threads per block and 2048 blocks. The total number of elements I read is 256*2048, so each block reads 256 elements, one per thread.

To see whether a texture read would be faster, I also tried one (see the commented-out line). To bind the texture, I added cudaBindTexture(texRef, d_data, memSize, 0) on the host side, as I saw in the manual. I must have done something wrong here as well, since all I read in the kernel are 0s …
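For comparison, here is a minimal sketch of a linear-memory texture read that returns the expected values. Note that it does not use the texture reference and `cudaBindTexture` call from this post. It swaps in the texture object API (`cudaCreateTextureObject` plus `tex1Dfetch<float>`), which assumes a more recent CUDA toolkit and a device of compute capability 3.0 or higher. The kernel name `tex_read_kernel` and the test data are made up for illustration.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Each thread fetches one float through the texture path,
// indexed by its global element number rather than a constant.
__global__ void tex_read_kernel(cudaTextureObject_t texObj, float *d_res, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_res[i] = tex1Dfetch<float>(texObj, i);
}

int main()
{
    const int n = 256 * 2048;                  // same element count as in the post
    const size_t memSize = n * sizeof(float);

    float *h_data = new float[n];
    float *h_res  = new float[n];
    for (int i = 0; i < n; ++i) h_data[i] = static_cast<float>(i);

    float *d_data = nullptr, *d_res = nullptr;
    cudaMalloc(&d_data, memSize);
    cudaMalloc(&d_res, memSize);
    cudaMemcpy(d_data, h_data, memSize, cudaMemcpyHostToDevice);

    // Describe the linear device buffer that backs the texture.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = memSize;

    // Return raw element values, with no normalization.
    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t texObj = 0;
    cudaError_t err = cudaCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);
    if (err != cudaSuccess) {
        printf("cudaCreateTextureObject failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    tex_read_kernel<<<2048, 256>>>(texObj, d_res, n);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Compare what came back through the texture with the original data.
    cudaMemcpy(h_res, d_res, memSize, cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_res[i] != h_data[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaDestroyTextureObject(texObj);
    cudaFree(d_data);
    cudaFree(d_res);
    delete[] h_data;
    delete[] h_res;
    return mismatches == 0 ? 0 : 1;
}
```

Checking the return code of the texture setup call is the main point here: a failed or mismatched binding is one plausible way to end up reading only zeros, though the sketch cannot confirm that this is what happened in the original code.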

Help would be greatly appreciated.



#define tx threadIdx.x

#define bx blockIdx.x

texture<float, 1, cudaReadModeElementType> tex;

__global__ void data_kernel(const float *d_data, float *d_res)
{

      /* Alloc shared memory */

      __shared__ float  buffer[NUM_THREADS];   


      /* Starting point for this block */

      const int iStart = NUM_THREADS * bx; 


      /* Read the data for all rows using all threads */

      buffer[tx] = d_data[iStart + tx];  

      //buffer[tx] = texfetch(tex, 10000);