Where are gst_requests coming from?

Hi guys,

could you please help me analyze the output of the Visual Profiler?

The kernel's goal is simply to aggregate each consecutive pair of floats in the input array:

[codebox]template<unsigned int blockStride>
__global__ void collapse1_2(float2* g_idata, float* g_odata, int length, float divider, unsigned int gridSize)
{
	extern __shared__ float2 sdata[];

	const unsigned int tid = threadIdx.x;
	const unsigned int num_threads = blockDim.x;

	unsigned int i_idx1 = num_threads * blockIdx.x * blockStride + tid;

	if (i_idx1 < length) {
		float2 pair1 = g_idata[i_idx1];
		g_odata[i_idx1] = (pair1.x + pair1.y) * divider;
	}
}[/codebox]

[codebox]void aggregateGPU(float* g_idata, float* g_odata, unsigned int agg_step, int length)
{
	float divider = 1 / (float)agg_step;
	const int coarse_agg_threshold = 256;
	int threads = 256;
	const int blockStride = 1;

	int block_count = (length + threads * blockStride - 1) / (threads * blockStride);
	int blocks = min(MAX_BLOCK_DIM_SIZE, block_count);
	unsigned int gridSize = blocks * threads * blockStride;

	collapse1_2<blockStride><<<blocks, threads, 0>>>((float2*)g_idata, g_odata, length >> 1, 1 / 2.0f, gridSize);
}[/codebox]

g_idata is an array of floats (365144030 floats in total).

I have also attached the CUDA profiler output for 10 kernel runs. It shows plenty of gld_128b and gst_64b memory requests, which is good and exactly what I expected.

But kernel execution still seems bound by short gst_requests; e.g. in run 5:

gld_128b 49280

gst_32b 0

gst_64b 49280

gst_128b 0

gld_request 8216

gst_request 596132

Note the terrible count of gst_requests. I can't understand where they are coming from, since every store to global memory is indexed by thread id. [attachment=21494:kernelrun.bmp]


Hi!

I am also seeing weird behavior in the number of gst requests.

I can't figure out why there is such a huge difference in gst requests between these two very simple samples:

SAMPLE 1: 22402 gst_64b (block size (16,16), (sizeX,sizeY) = (800,500))

[codebox]__global__ void test(float* output, int sizeX, int sizeY)
{
	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
	unsigned int i = y * sizeX + x;
	float acc = 0.0f;

	if ((x < sizeX) && (y < sizeY))
	{
		output[i] = acc;
	}
}[/codebox]

SAMPLE 2: 1.28162e+08 gst_64b!!! (block size (16,16), (sizeX,sizeY) = (800,500))

[codebox]__global__ void test(float* output, int sizeX, int sizeY)
{
	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
	unsigned int i = y * sizeX + x;
	float acc = 0.0f;

	if ((x < sizeX) && (y < sizeY))
	{
		for (int ind_i = 0; ind_i < 200; ind_i++)
		{
			for (int ind_j = 0; ind_j < 200; ind_j++)
			{
				acc += 1.0f;
			}
		}
		output[i] = acc;
	}
}[/codebox]

As far as I know, I am not writing to global memory inside the loops!?

Here is the PTX code for sample 2:

[codebox].entry _Z4testPfiii (
	.param .u32 __cudaparm__Z4testPfiii_output,
	.param .s32 __cudaparm__Z4testPfiii_sizeX,
	.param .s32 __cudaparm__Z4testPfiii_sizeY,
	.param .s32 __cudaparm__Z4testPfiii_N)
{
	.reg .u16 %rh<6>;
	.reg .u32 %r<25>;
	.reg .f32 %f<4>;
	.reg .pred %p<5>;
	.loc	15	771	0
$LBB1__Z4testPfiii:
	mov.u16 	%rh1, %ctaid.x;
	mov.u16 	%rh2, %ntid.x;
	mul.wide.u16 	%r1, %rh1, %rh2;
	mov.u16 	%rh3, %ctaid.y;
	mov.u16 	%rh4, %ntid.y;
	mul.wide.u16 	%r2, %rh3, %rh4;
	cvt.u32.u16 	%r3, %tid.x;
	add.u32 	%r4, %r3, %r1;
	cvt.u32.u16 	%r5, %tid.y;
	add.u32 	%r6, %r5, %r2;
	ld.param.u32 	%r7, [__cudaparm__Z4testPfiii_sizeX];
	ld.param.u32 	%r8, [__cudaparm__Z4testPfiii_sizeY];
	set.gt.u32.u32 	%r9, %r8, %r6;
	neg.s32 	%r10, %r9;
	set.lt.u32.u32 	%r11, %r4, %r7;
	neg.s32 	%r12, %r11;
	and.b32 	%r13, %r10, %r12;
	mov.u32 	%r14, 0;
	setp.eq.s32 	%p1, %r13, %r14;
	@%p1 bra 	$Lt_11_3330;
	mov.s32 	%r15, 0;
	mov.f32 	%f1, 0f00000000;	// 0
$Lt_11_4354:
	// Loop body line 771, nesting depth: 1, iterations: 200
	mov.s32 	%r16, 0;
$Lt_11_5122:
	// Loop body line 771, nesting depth: 2, iterations: 200
	.loc	15	785	0
	mov.f32 	%f2, 0f3f800000;	// 1
	add.f32 	%f1, %f1, %f2;
	add.s32 	%r16, %r16, 1;
	mov.u32 	%r17, 200;
	setp.ne.s32 	%p2, %r16, %r17;
	@%p2 bra 	$Lt_11_5122;
	// Part of loop body line 771, head labeled $Lt_11_4354
	add.s32 	%r15, %r15, 1;
	mov.u32 	%r18, 200;
	setp.ne.s32 	%p3, %r15, %r18;
	@%p3 bra 	$Lt_11_4354;
	.loc	15	788	0
	ld.param.u32 	%r19, [__cudaparm__Z4testPfiii_output];
	.loc	15	771	0
	ld.param.u32 	%r7, [__cudaparm__Z4testPfiii_sizeX];
	.loc	15	788	0
	mul.lo.u32 	%r20, %r6, %r7;
	add.u32 	%r21, %r4, %r20;
	mul.lo.u32 	%r22, %r21, 4;
	add.u32 	%r23, %r19, %r22;
	st.global.f32 	[%r23+0], %f1;
$Lt_11_3330:
	.loc	15	792	0
	exit;
$LDWend__Z4testPfiii:
} // _Z4testPfiii
[/codebox]

Any ideas?

Thanks,

Larry

Hi,

today I started analyzing my CUDA code using the Visual Profiler. I have a rather complex kernel, which reads a lot of data but stores only a single int value per block. For my tests I use 511*511 blocks with 4*4 threads per block, so the number of store transactions should somehow correlate with 511*511 = 261121. But the Visual Profiler tells me that I have gst_64b = 4080291064, which is even much more than the amount of data I read from global memory. I analyzed the PTX code and found only a single global-store instruction, located at the end of my code (surrounded by an if (threadIdx.x == 0)). So I have absolutely no idea how to interpret the gst values I get from the Visual Profiler.

I even surrounded the global store with an if-clause that can never evaluate to true, so there should be no global write at all. But the gst_64b number stays approximately the same.

Due to the complexity of the kernel, it does not make sense to post the source code.

Please help me figure out how to interpret the gst values.

Thanks,
porst17

Any ideas? Or at least, is there a misunderstanding on my part about how gst_requests are counted?
Thank you
Larry
