Hi all,
I am a little confused regarding the Fermi Tuning guide. Here is the paragraph that confuses me:
If I understand correctly, global memory reads go through the L1 cache and can therefore be faster than texture memory. Reading some other posts on the subject, it seems this is not always the case.
I work with an NVIDIA Quadro 600 (compute capability 2.1), which has a maximum memory bandwidth of 25.6 GB/s (cf. http://en.wikipedia…ng_units#Quadro).
Does this bandwidth correspond to the L1 memory bandwidth, or is it something different? Also, is there a way to know the theoretical texture memory bandwidth attainable?
Using the Pitch Linear Texture sample from the NVIDIA SDK, the bandwidth I obtain for pitch-linear memory is 7.92 GB/s. That is quite a drop from the maximum possible on my card; I suppose this is the kind of limitation the paragraph was talking about.
I have a kernel that does a reduction over blocks of 16x16 elements of a 1024x1024 int array (it is equivalent to a sum of 256 elements per block). It seems to be faster than when I use texture fetches. Here is the kernel code:
__global__ void BlockReductionKernel(int* pIn, size_t pitch_In_el, int* pOut, size_t pitch_Out_el)
{
    uint i   = blockIdx.x*16 + threadIdx.x;          // Row index (threadIdx.x in [0,4))
    uint j   = blockIdx.y*16 + threadIdx.y;          // Column index (threadIdx.y in [0,16))
    uint tid = threadIdx.x*blockDim.y + threadIdx.y; // Thread index within the block

    __shared__ int smem[64];

    // Load into shared memory: each thread sums 4 elements of its 16x16 block
    //smem[tid] = pIn[i*pitch_In_el + j] + pIn[(i+4)*pitch_In_el + j] + pIn[(i+8)*pitch_In_el + j] + pIn[(i+12)*pitch_In_el + j];
    smem[tid] = tex2D(pInTex, j + 0.5f, i +      0.5f)
              + tex2D(pInTex, j + 0.5f, i +  4 + 0.5f)
              + tex2D(pInTex, j + 0.5f, i +  8 + 0.5f)
              + tex2D(pInTex, j + 0.5f, i + 12 + 0.5f);
    __syncthreads();

    // First 16 threads (all in warp 0) reduce the 64 partial sums down to 16
    if(tid < 16) smem[tid] += smem[tid + 16] + smem[tid + 32] + smem[tid + 48];

    // Thread 0 finishes the reduction and writes the block's sum
    if(tid == 0)
    {
        #pragma unroll
        for(uint x = 1 ; x < 16 ; ++x)
            smem[0] += smem[x];
        pOut[blockIdx.x*pitch_Out_el + blockIdx.y] = smem[0];
    }
}
The kernel is launched this way:
BlockReductionKernel<<<dim3(1024/16, 1024/16, 1), dim3(4, 16, 1)>>>(pIn, pitch_In_el, pOut, pitch_In_el);
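For reference, the host-side texture setup looks roughly like this (simplified sketch, error checking omitted; the buffer sizes and names match my actual code, but I have trimmed the rest):

```cpp
// pInTex is the file-scope texture reference the kernel fetches from
texture<int, 2, cudaReadModeElementType> pInTex;

void Setup(const int* pIn_host)
{
    int*   pIn;      // device pitched buffer
    size_t pitch_B;  // pitch in bytes
    cudaMallocPitch((void**)&pIn, &pitch_B, 1024 * sizeof(int), 1024);
    cudaMemcpy2D(pIn, pitch_B, pIn_host, 1024 * sizeof(int),
                 1024 * sizeof(int), 1024, cudaMemcpyHostToDevice);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<int>();
    cudaBindTexture2D(NULL, pInTex, pIn, desc, 1024, 1024, pitch_B);

    size_t pitch_In_el = pitch_B / sizeof(int);  // pitch in elements, as the kernel expects
    // ... allocate pOut with cudaMallocPitch, then launch BlockReductionKernel ...
}
```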
This is the fastest code I could write, and I get a bandwidth of 9.321 GB/s (450 us for an array of 1024x1024 pixels). It is far from the performance obtained with the reduction kernel from the SDK, but that kernel reduces the whole array, whereas I need to keep a block size of 16x16 elements for my application. I suppose that in my case the performance is limited by the texture bandwidth. Am I right?
So, would it be possible to get a better bandwidth in my case by using global memory instead?
Thanks in advance for your help.