Hey guys,
I am trying to learn how to read memory at maximum speed. I wrote a very simple kernel (see below). Although everything looks fine and the profiler reports 100% occupancy, I only get about 40 GB/s. The board is advertised at twice that … what am I doing wrong? I also get about 20 GFlops in sdot for large vector sizes, which shows that at least 80 GB/s is indeed possible …
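The step from 20 GFlops to 80 GB/s can be sketched as below. It assumes sdot performs one multiply and one add per pair of elements and has to read two 4-byte floats for that pair. The function name sdot_bandwidth_gb_per_s is made up for illustration, and this is only back-of-the-envelope arithmetic (host code only), not a measurement.

```cuda
#include <cstdio>

// Hypothetical helper: convert an sdot rate in GFlops into the memory
// bandwidth (GB/s) needed to sustain it.
// Assumption: per element pair, sdot does 2 flops (multiply + add)
// and reads 2 floats (one from each input vector).
double sdot_bandwidth_gb_per_s(double gflops)
{
    const double flops_per_pair = 2.0;
    const double bytes_per_pair = 2.0 * sizeof(float);  // 8 bytes
    return gflops / flops_per_pair * bytes_per_pair;
}

int main()
{
    // 20 GFlops -> 10 G pairs/s -> 80 GB/s
    printf("%.1f GB/s\n", sdot_bandwidth_gb_per_s(20.0));  // prints "80.0 GB/s"
    return 0;
}
```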
I launch 2048 blocks of 256 threads each. The total number of elements read is 256*2048, so each block reads 256 elements, one per thread.
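For reference, a bandwidth figure for this launch configuration could be measured roughly as in the sketch below, which times repeated launches with CUDA events. Several things here are assumptions and not taken from the post: the helper gb_per_s, the kernel name copy_kernel, the repetition count, and above all the store to d_res, which is added so that each load has a visible effect. Because of that store, every launch moves twice the buffer size (one read plus one write).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS 256   // threads per block (from the post)
#define NUM_BLOCKS  2048  // blocks in the grid (from the post)

// Hypothetical helper: bytes moved in `ms` milliseconds -> GB/s (1 GB = 1e9 bytes).
double gb_per_s(double bytes, double ms) { return bytes / (ms * 1.0e6); }

// Assumption: each thread copies one element to d_res so the read is not dead code.
__global__ void copy_kernel(const float *d_data, float *d_res)
{
    const int i = NUM_THREADS * blockIdx.x + threadIdx.x;
    d_res[i] = d_data[i];
}

int main()
{
    const size_t n = (size_t)NUM_THREADS * NUM_BLOCKS;
    const size_t memSize = n * sizeof(float);

    float *d_data = NULL, *d_res = NULL;
    if (cudaMalloc(&d_data, memSize) != cudaSuccess ||
        cudaMalloc(&d_res, memSize) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, memSize);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, not timed.
    copy_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(d_data, d_res);

    const int reps = 100;  // assumption: average over several launches
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        copy_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(d_data, d_res);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each launch reads memSize bytes and writes memSize bytes.
    printf("effective bandwidth: %.2f GB/s\n",
           gb_per_s(2.0 * (double)memSize * reps, ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    cudaFree(d_res);
    return 0;
}
```

With only 2 MB per launch, the launch overhead may be a noticeable part of the measured time, so a larger buffer would probably give a figure closer to the sustained rate.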
To see whether a texture read would be faster, I also tried a texture fetch (see the commented-out line). To bind the texture, I added cudaBindTexture(texRef, d_data, memSize, 0) on the host side, as shown in the manual. I must have done something wrong here as well, since all the kernel reads is 0s …
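As a point of comparison, here is a minimal sketch of reading linear device memory through a texture. Note that it uses the texture object API (cudaCreateTextureObject, tex1Dfetch) from newer CUDA toolkits, not the texture reference and cudaBindTexture call described above, so it is a swapped-in technique and not a corrected version of that call. The names tex_kernel and count_mismatches, the per-thread index, and the store to d_res are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

#define NUM_THREADS 256   // threads per block (from the post)
#define NUM_BLOCKS  2048  // blocks in the grid (from the post)

// Hypothetical helper: number of positions where two host vectors differ.
int count_mismatches(const std::vector<float> &a, const std::vector<float> &b)
{
    int bad = 0;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (a[i] != b[i]) ++bad;
    return bad;
}

// Assumption: each thread fetches its own element (not a fixed index)
// and stores it to d_res so the result can be checked on the host.
__global__ void tex_kernel(cudaTextureObject_t texObj, float *d_res)
{
    const int i = NUM_THREADS * blockIdx.x + threadIdx.x;
    d_res[i] = tex1Dfetch<float>(texObj, i);
}

int main()
{
    const size_t n = (size_t)NUM_THREADS * NUM_BLOCKS;
    const size_t memSize = n * sizeof(float);

    std::vector<float> h_data(n), h_res(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data = NULL, *d_res = NULL;
    cudaMalloc(&d_data, memSize);
    cudaMalloc(&d_res, memSize);
    cudaMemcpy(d_data, h_data.data(), memSize, cudaMemcpyHostToDevice);

    // Describe the linear device buffer that the texture will read from.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = memSize;

    // Return the stored floats unchanged (no normalization).
    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t texObj = 0;
    cudaError_t err = cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaCreateTextureObject: %s\n", cudaGetErrorString(err));
        return 1;
    }

    tex_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(texObj, d_res);
    cudaMemcpy(h_res.data(), d_res, memSize, cudaMemcpyDeviceToHost);

    printf("mismatches: %d\n", count_mismatches(h_data, h_res));

    cudaDestroyTextureObject(texObj);
    cudaFree(d_data);
    cudaFree(d_res);
    return 0;
}
```

Checking the return value of the texture setup call and comparing the fetched values against known input is a cheap way to tell a binding problem apart from a kernel problem.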
Help would be greatly appreciated.
Regards,
Serban
#define NUM_THREADS 256   /* threads per block, as described above */
#define tx threadIdx.x
#define bx blockIdx.x

texture<float, 1, cudaReadModeElementType> tex;

__global__ void data_kernel(const float *d_data, float *d_res)
{
    /* Allocate shared memory: one float per thread */
    __shared__ float buffer[NUM_THREADS];

    /* Starting element for this block */
    const int iStart = NUM_THREADS * bx;

    /* Read this block's data using all threads, one element per thread */
    buffer[tx] = d_data[iStart + tx];
    //buffer[tx] = texfetch(tex, 10000);   /* texture variant */
}