Hi guys,
Could you please help me analyze the output of the Visual Profiler?
The kernel's goal is simply to sum each consecutive pair of floats in the input array and scale the result:
template<unsigned int blockStride>
__global__ void collapse1_2(float2* g_idata, float* g_odata, int length, float divider, unsigned int gridSize)
{
    extern __shared__ float2 sdata[];   // declared but not used in this kernel
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;
    unsigned int i_idx1;
    float2 pair1;

    // Index of the float2 pair handled by this thread
    i_idx1 = num_threads * blockIdx.x * blockStride + tid;
    if (i_idx1 < length) {
        pair1 = g_idata[i_idx1];                            // one 8-byte load per thread
        g_odata[i_idx1] = (pair1.x + pair1.y) * divider;    // one 4-byte store per thread
    }
}
void aggregateGPU(float* g_idata, float* g_odata, unsigned int agg_step, int length)
{
    float divider = 1 / (float)agg_step;      // computed but not passed to the kernel below
    const int coarse_agg_threshold = 256;     // unused here
    int threads = 256;
    const int blockStride = 1;
    // Block count is derived from length (in floats); the kernel receives length>>1 (in pairs)
    int block_count = (length + threads*blockStride - 1) / (threads*blockStride);
    int blocks = min(MAX_BLOCK_DIM_SIZE, block_count);
    unsigned int gridSize = blocks * threads * blockStride;
    collapse1_2<blockStride><<<blocks, threads, 0>>>((float2*)g_idata, g_odata, length>>1, 1/2.0f, gridSize);
}
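For checking the kernel's output independently of the profiler, here is a minimal host-side sketch of the same pairwise aggregation. The helper name collapsePairsReference is hypothetical and not part of my code above; it only mirrors what the kernel computes for each pair.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference: out[i] = (in[2*i] + in[2*i + 1]) * divider.
// collapsePairsReference is a hypothetical helper name used for illustration only.
std::vector<float> collapsePairsReference(const std::vector<float>& in, float divider)
{
    // A trailing unpaired element is ignored, matching the length>>1 passed to the kernel.
    const std::size_t pairs = in.size() / 2;
    std::vector<float> out(pairs);
    for (std::size_t i = 0; i < pairs; ++i) {
        out[i] = (in[2 * i] + in[2 * i + 1]) * divider;
    }
    return out;
}
```

Calling collapsePairsReference(input, 0.5f) on a copy of the host data and comparing against the values copied back from g_odata would be one way to confirm the kernel itself is correct before looking at the counters.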
g_idata holds vectors of floats (365144030 floats in total).
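To make the launch arithmetic in aggregateGPU concrete, here is a small sketch that reproduces the block-count calculation on the host. It assumes the whole 365144030-float buffer is passed in a single call and that MAX_BLOCK_DIM_SIZE is 65535; neither is stated above, so treat both as assumptions. The struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical summary of one launch of collapse1_2, for illustration only.
struct LaunchCoverage {
    std::int64_t blocksRequested;  // block_count before clamping
    std::int64_t blocksLaunched;   // after min(maxBlocks, block_count)
    std::int64_t threadsLaunched;  // blocksLaunched * threads
    std::int64_t pairsToProcess;   // length>>1, the bound checked inside the kernel
};

// Mirrors the integer arithmetic in aggregateGPU.
// maxBlocks stands in for MAX_BLOCK_DIM_SIZE, whose value is an assumption here.
LaunchCoverage launchCoverage(std::int64_t lengthFloats, std::int64_t threads,
                              std::int64_t blockStride, std::int64_t maxBlocks)
{
    LaunchCoverage c;
    c.blocksRequested = (lengthFloats + threads * blockStride - 1) / (threads * blockStride);
    c.blocksLaunched = std::min(maxBlocks, c.blocksRequested);
    c.threadsLaunched = c.blocksLaunched * threads;
    c.pairsToProcess = lengthFloats >> 1;
    return c;
}
```

Under those assumptions, launchCoverage(365144030, 256, 1, 65535) gives 1426344 requested blocks, 65535 launched blocks, 16776960 threads and 182572015 pairs. Since the kernel has no loop, each thread handles at most one pair, so in that scenario the clamped grid would cover only part of the input. If the data is actually processed in smaller chunks per call, the numbers change accordingly.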
I have also attached the CUDA profiler output for 10 kernel runs. It shows many gld_128b and gst_64b memory transactions, which is good and as expected.
But kernel execution still appears to be bound by short gst_requests, e.g. in run 5:
gld_128b 49280
gst_32b 0
gst_64b 49280
gst_128b 0
gld_request 8216
gst_request 596132
Note the huge gst_request count. I cannot understand where these requests are coming from, given that all stores to global memory are indexed by thread ID.
Attachment: kernelrun.bmp