Warp serialization problem: help me

Hi,

I still have a problem writing to shared memory although, as far as I can tell, I should be avoiding bank conflicts.

I defined a shared memory array in the following way:

__shared__ int pre_histox[16][128];

The first index is data dependent (it is a histogram bin) while the second one is thread dependent. In my understanding this should avoid the conflicts (128 being a multiple of 16).

Let me show four pieces of code I tried, with the kernel execution time and whether the profiler reports warp serialization:

time=5300, warp serialization

[codebox] int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

if (ndisx<16) {

  int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 

  pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

  }[/codebox]

time=1700, NO warp serialization

[codebox] int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

    ndisx=10;

if (ndisx<16) {

  int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 

  pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

  }[/codebox]

time=5000, warp serialization

[codebox] int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

//if (ndisx<16) {

  int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 

  pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

  //}[/codebox]

time=1700, NO warp serialization

[codebox] int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

    ndisx = 10;

// if (ndisx<16) {

  int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 

  pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

  //}[/codebox]

According to the results shown above, it seems that the serialization is introduced by the ndisx index when it is actually computed from the data, and not by the divergent if structure.

Anyway, I can’t understand what happens: if there are no bank conflicts, why do I get warp serialization?

Thx very much for any suggestion, however small.

gianluca

int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));		 

ndisx=10;	

if (ndisx<16) {	  

	int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 	  

	pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;	  

}

is equivalent to

int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));		 

ndisx=10;	

	int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 	  

	pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

What’s the definition of the variable Idx_rel? I think it is related to threadIdx.

thx for your answer,

yes, they are the same… I’m just showing the steps I’m taking to narrow down the problem.

As you supposed,

Idx_rel = threadIdx.x-32

thank you again,

Gianluca

What’s the formula for disx? It depends on threadIdx!

float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

where:

    float xhit = hits_s[idx].x;

    float yhit = hits_s[idx].y;

and

  float xcenterx = PMCenterPosition[Idx_centerpmx].x;

  float ycenterx = PMCenterPosition[Idx_centerpmx].y;

xhit and yhit are in shared memory (written and read without conflicts) while xcenterx and ycenterx are in constant memory.

the index Idx_centerpmx depends on threadIdx.x:

for(int ipm=0;ipm<PM_X_THREAD;ipm++) {

int Idx_centerpmx = ceilf(Idx_rel/16)*PM_X_THREAD*16+threadIdx.x%16+16*ipm;

}

while idx is constant within a thread (it comes from a loop).

I don’t want to bore you with the full kernel, but if you think it could be useful to better understand my procedure, let me know and I will post it.

thx again,

Gianluca

I combined your code segments as

float xcenterx = PMCenterPosition[Idx_centerpmx].x;

float ycenterx = PMCenterPosition[Idx_centerpmx].y;

float xhit = hits_s[idx].x;

float yhit = hits_s[idx].y;

float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

//time=5300, warp serialization

int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS)); 	

if (ndisx<16) {	  

	int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 	  

	pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;	  

}

I think that the warp serialization comes from the value of ndisx, just as you say:

“serialization is introduced by the ndisx index when correctly computed and not by the divergent if structure.”

but why?

Suppose we consider first half-warp, say

threadIdx.x = 0, 1, 2, …, 15

and only three among them satisfy condition (ndisx<16), let us assume

threadIdx.x = 1, 7, 12 satisfy condition (ndisx<16).

Then I wonder how the hardware fetches the shared memory location “pre_histox[ndisx][Idx_rel]”:

in one transaction or in three transactions?

We hope that one transaction for pre_histox[ndisx][Idx_rel] is done, with masking

so that only threadIdx.x = 1, 7, 12 obtain the value,

but I am not sure that the hardware will do this.

I think you can write a simple program to test it.

for example, let only the first 8 threads of a half-warp satisfy the if-condition and see what happens.

I will try your suggestion as soon as possible;
but, as you can see in my first post, if I comment out the if structure (hoping that ndisx<16) the result improves without warp serialization: only for this reason I’m assuming that the serialization isn’t introduced by the branching.
Am I wrong?
thx,
g.

Ok, I see your point: from

time=5000, warp serialization

int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS)); 	

//if (ndisx<16) {	 

	 int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 	  

	 pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;	  

//}

It seems that the warp serialization is independent of the branch.

However you define

Idx_rel = threadIdx.x-32;

if threadIdx.x < 32, then Idx_rel < 0, and you will have a problem when accessing pre_histox[0][Idx_rel].

How do you avoid Idx_rel < 0 ?

If you don’t avoid Idx_rel < 0, then I suggest that you should try

int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));		 

ndisx=0;	

if (ndisx<16) {	  

	int tmp_pre_histox = pre_histox[ndisx][Idx_rel]; 	  

	pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;	  

}

and check if warp-serialization occurs.

Actually I avoid Idx_rel < 0 by requiring threadIdx.x >= 32 to enter this piece of code, with something like:

if (threadIdx.x < 32) {

  (shared memory initialization)

}

__syncthreads();

if (threadIdx.x >= 32) {

  Idx_rel = threadIdx.x - 32;

  (the code above)

}

__syncthreads();

I think the __syncthreads() could be avoided thanks to the warp-level structure of my work assignment (is that true?).

The code you are suggesting isn’t clear to me:

why do you think the result will change using ndisx=0 with respect to

ndisx=10, which I have already tried (obtaining no warp serialization)?

thanks a lot for your interest in my problem,

g.

I wrote a simple program to test your question:

// no warp serialization

static __global__ void  fotone( int *A, int* B, int *C )

{

	__shared__ int pre_histox[16][128];

	

#pragma unroll

	for( int j = 0; j < 16; j++){

		pre_histox[j][ threadIdx.x ] = A[ 128 * j + threadIdx.x ]; 

	}

	__syncthreads();

	 

	int ndisx = B[threadIdx.x];

	int tmp_pre_histox = pre_histox[ndisx][threadIdx.x]; 	  

	pre_histox[ndisx][threadIdx.x] = tmp_pre_histox + 1;

	 

#pragma unroll

	for( int j = 0; j < 16; j++){

		C[ 128 * j + threadIdx.x ] = pre_histox[j][ threadIdx.x ]; 

	}	

}	

void fotone_device( int *A, int* B, int *C )

{

	dim3 threads( 128,1,1 );

	dim3 grid( 30, 1);

	fotone<<<grid, threads>>>( A, B, C );		

}

and driver

void profile_fotone( void )

{

	int *h_A, *h_B, *h_C;

	int num_bin = 16;

	int num_thread = 128;

	int size_A = num_bin * num_thread * sizeof( int );

	int size_B = size_A;

	int size_C = size_A;

	h_A = (int*) malloc( size_A ); assert(h_A);

	h_B = (int*) malloc( size_B ); assert(h_B);

	h_C = (int*) malloc( size_C ); assert(h_C);

	for( int i = 0; i < num_bin * num_thread; i++){

		h_A[i] = rand();

		h_B[i] = rand() % num_bin;

	}

	int *d_A, *d_B, *d_C;

	CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, size_A));

	CUDA_SAFE_CALL(cudaMalloc((void**) &d_B, size_B));	

	CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, size_C));	

	CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, size_A, cudaMemcpyHostToDevice) );

	CUDA_SAFE_CALL(cudaMemcpy(d_B, h_B, size_B, cudaMemcpyHostToDevice) );

	fotone_device( d_A, d_B, d_C );

	cutilSafeCall( cudaMemcpy( h_C, d_C, size_C, cudaMemcpyDeviceToHost) );

	free( h_A ); free( h_B ); free( h_C );

  cutilSafeCall(cudaFree(d_A));

  cutilSafeCall(cudaFree(d_B));

	cutilSafeCall(cudaFree(d_C));

}

platform: Tesla C1060, driver 109.38, winxp64, cuda 2.3

no warp serialization when using visual profiler.

of course if you try

// warp serialization = 52

static __global__ void  fotone( int *A, int* B, int *C )

{

	__shared__ int pre_histox[16][128];

	

#pragma unroll

	for( int j = 0; j < 16; j++){

		pre_histox[j][ threadIdx.x ] = A[ 128 * j + threadIdx.x ]; 

	}

	__syncthreads();

	 

	int ndisx = B[threadIdx.x];

	int tmp_pre_histox = pre_histox[threadIdx.x % 16][ndisx];

	pre_histox[threadIdx.x % 16][ndisx] = tmp_pre_histox + 1;

	 

#pragma unroll

	for( int j = 0; j < 16; j++){

		C[ 128 * j + threadIdx.x ] = pre_histox[j][ threadIdx.x ]; 

	}	

}

Then it has 52 warp serialization.

I think that the value of ndisx should not be related to the warp serialization.

Thanks a lot for your kind interest!

I tried your code and obtained the same results as you: the first kernel without serialization and the second one (which is wrong) with serialization. So I agree with your conclusion about ndisx.

I also tried to put

if (ndisx>5)

and I obtained no serialization (I have some branch divergence, but no warp serialization, and the total time increases by just 1 us)

I’m a little bit puzzled at this stage: let me show my kernel code; if you could find some time to have a look, it would be a very big help for me:

[codebox]__global__

void ring_find_kernel(struct PMPosition *hits_d, struct _Resuln *candy_d, int *nn_evt_d)

{

__shared__ int histos[1024];

__shared__ int hh_step2[32];

__shared__ int pre_histox[16][128];

__shared__ PMPosition hits_s[32];

int Idx_evt = blockIdx.x;

int nnevt = nn_evt_d[Idx_evt];

// shared memory initialization: the first warp is used to load the hit positions

// from global to shared memory and to initialize the histos vector in shared memory

if( threadIdx.x < 32) {

if(threadIdx.x < nnevt) {

  hits_s[threadIdx.x].x = hits_d[Idx_evt*32+threadIdx.x].x;

  hits_s[threadIdx.x].y = hits_d[Idx_evt*32+threadIdx.x].y;

}

for (int i = 0; i < CALC_THREAD*PM_X_THREAD; i+=32)

{

  if (i + threadIdx.x <  CALC_THREAD*PM_X_THREAD )

    histos[i + threadIdx.x] = 0;

}

}

__syncthreads();

// a total of 128 (CALC_THREAD) threads are used to make calculation.

// each thread loops on 8 (PM_X_THREAD) points (defined by different Idx_centerpm)

// the thread result is stored in the histos vector

if(threadIdx.x >= 32 && threadIdx.x < CALC_THREAD+32) {

int Idx_rel = (threadIdx.x - 32);

//#pragma unroll

for(int ipm=0;ipm<PM_X_THREAD;ipm++) {

  int Idx_centerpmx = ceilf(Idx_rel/16)*PM_X_THREAD*16+threadIdx.x%16+16*ipm;

float xcenterx = PMCenterPosition[Idx_centerpmx].x;

  float ycenterx = PMCenterPosition[Idx_centerpmx].y;

#pragma unroll

  for (int i=0;i<NUM_BINS;i++) {

    pre_histox[i][Idx_rel]=0;

  }

#pragma unroll

  for (int idx=0;idx<nnevt;idx++) {

    float xhit = hits_s[idx].x;

    float yhit = hits_s[idx].y;

    float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

// ndisx=10;

    if (ndisx<16) {

      int tmp_pre_histox = pre_histox[ndisx][Idx_rel];

      pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

      }

  }

int maxhx = 0;

int maxnx = 0;

#pragma unroll

for (int i=0;i<16;i++) {

  if (pre_histox[i][Idx_rel]>maxhx) {

    maxhx = pre_histox[i][Idx_rel];

    maxnx = i;

  }

}

int dispx = ceilf(Idx_rel/16)*PM_X_THREAD*16+threadIdx.x%16+ipm*16;

histos[dispx] = ((maxnx<<16) & 0XFFFF0000) | (maxhx & 0X0000FFFF);

}

}

__syncthreads();

// a warp + 1 thread are used to decide the “best” histogram

if(threadIdx.x >= CALC_THREAD+32 && threadIdx.x < CALC_THREAD+32+32) {

 int idx_sm = threadIdx.x-(CALC_THREAD+32);

 int num_banks = (CALC_THREAD*PM_X_THREAD)/32;

 int hh_cont_max = 0;

 int hh_addr_max = 0;

#pragma unroll

 for(int i=0;i< num_banks; i++) {

   int ii = idx_sm+i*32;

   int hh_tmp = histos[ii];

   int hh_cont = hh_tmp & 0x0000FFFF;

   if (hh_cont > hh_cont_max) {

     hh_cont_max = hh_cont;

     hh_addr_max = ii;

   }

 }

 hh_step2[idx_sm] = hh_addr_max;

}

__syncthreads();

if(threadIdx.x == CALC_THREAD+32+32) {

 int hh_addr_max3 = 0;

 int hh_cont_max3 = 0;

#pragma unroll

 for (int i=0;i<32;i++) {

   int idx_step3 = hh_step2[i];

   int hh_tmp3 = histos[idx_step3];

   int hh_cont3 = hh_tmp3 & 0x0000FFFF;

   if (hh_cont3 > hh_cont_max3) {

   }

 }

 int hh_candy = histos[hh_addr_max3];

 int candy_bin = (hh_candy & 0xFFFF0000)>>16;

 int candy_center = hh_addr_max3;

 (candy_d+blockIdx.x)->nb_max =  hh_candy & 0x0000FFFF;

 (candy_d+blockIdx.x)->nb_center =  candy_center;

 (candy_d+blockIdx.x)->nb_radius =  candy_bin;

}

__syncthreads();

return;

}

[/codebox]

the general purpose of this code is to decide whether 32 hits in a matrix of 1000 points lie on a circle or not.

thank you again,

g.

I have several questions

  1. I guess that your configuration should be
#define  CALC_THREAD  128

#define  PM_X_THREAD  8

#define  NUM_BINS	 16

struct PMPosition

{

	float x;

	float y;

};

struct _Resuln

{

	int hh_candy;

	int candy_center;

	int candy_bin;

};
2. Where do you define the array PMCenterPosition?
for(int ipm=0;ipm<PM_X_THREAD;ipm++) {

	  int Idx_centerpmx = ceilf(Idx_rel/16)*PM_X_THREAD*16 + threadIdx.x%16 + 16*ipm;

// ??? where do you define array PMCenterPosition (I think its type is struct PMPosition) ?

	  float xcenterx = PMCenterPosition[Idx_centerpmx].x;

	  float ycenterx = PMCenterPosition[Idx_centerpmx].y;

	  

	  #pragma unroll

	  for (int i=0;i < NUM_BINS;i++) {

		pre_histox[i][Idx_rel]=0;

	  }

	  .....
3. A 2-way bank conflict occurs when writing to hits_s.x and hits_s.y
if( threadIdx.x < 32) {

	if(threadIdx.x < nnevt) {

//[comment]

//this should introduce 2-way bank conflict due to interleave property of component x and y.	

//[/comment]	

	  hits_s[threadIdx.x].x = hits_d[Idx_evt*32+threadIdx.x].x;

	  hits_s[threadIdx.x].y = hits_d[Idx_evt*32+threadIdx.x].y;

	}

	

		for (int i = 0; i < (CALC_THREAD*PM_X_THREAD - threadIdx.x) ; i+=32){

		histos[i + threadIdx.x] = 0;

	}

  }// if (threadIdx.x < 32)
4. A 2-way bank conflict occurs when reading hits_s.x and hits_s.y
#pragma unroll

	  for (int idx=0;idx<nnevt;idx++) {

//[comment]

//this should introduce 2-way bank conflict due to interleave property of component x and y.	

//[/comment]	   

		float xhit = hits_s[idx].x;

		float yhit = hits_s[idx].y;

		float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

		int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

		if (ndisx<16) {

		  int tmp_pre_histox = pre_histox[ndisx][Idx_rel];

		  pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

		}

	  }// for idx
5. What is the content inside the if-clause?
if(threadIdx.x == CALC_THREAD+32+32) {

	 int hh_addr_max3 = 0;

	 int hh_cont_max3 = 0;

	 #pragma unroll

	 for (int i=0;i<32;i++) {

	   int idx_step3 = hh_step2[i];

	   int hh_tmp3 = histos[idx_step3];

	   int hh_cont3 = hh_tmp3 & 0x0000FFFF;

	   if (hh_cont3 > hh_cont_max3) {

// ??? what is content inside if-clause, should relate to  hh_addr_max3			 

	   }

	 }// for i

	 ....
6. You don’t need “__syncthreads()” before “return”
// this synchronization is no need   

//   __syncthreads(); 

	return;

I apologize, but the code I sent you is full of “work in progress”, so some of your questions can be answered with that:

in particular, n. 6 was an attempt, and for n. 5 the content of the if-clause is:

hh_cont_max3 = hh_cont3;

hh_addr_max3 = idx_step3;

it was just commented out in the source code I used for the post (and for this reason I skipped it when copying).

The struct type definitions are exactly the ones you are imagining (question 1) and PMCenterPosition is defined in constant memory (question 2), with

__device__ __constant__ PMpos PMCenterPosition[1024];

The data are loaded from a file and copied in the constant memory with cudaMemcpyToSymbol.

Concerning questions 3 and 4, I think there are no conflicts when writing (am I wrong?), because Idx_evt is blockIdx.x (a constant within the thread block) and the threadIdx.x index should guarantee strided access.

When reading hits_s.x and hits_s.y you are right about the conflicts, but I’m assuming the shared-memory read is broadcast in this case (several threads can access the same location at the same time).

In any case, prompted by your questions, I tried fixing hits_s.x and hits_s.y to a given value: nothing changed, so that isn’t the source of the warp serialization.

thx again,

G.

sorry, reading is conflict-free since it is a broadcast

#pragma unroll

	  for (int idx=0;idx<nnevt;idx++) {

//[comment]

// reading hits_s[idx].x and hits_s[idx].y is a broadcast (idx is uniform across the half-warp): conflict-free

//[/comment]

		float xhit = hits_s[idx].x;

		float yhit = hits_s[idx].y;

		float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

		int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

		if (ndisx<16) {

		  int tmp_pre_histox = pre_histox[ndisx][Idx_rel];

		  pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

		}

	  }// for idx

but the writing is indeed a 2-way bank conflict

if( threadIdx.x < 32) {

	if(threadIdx.x < nnevt) {

//[comment]

//this should introduce 2-way bank conflict due to interleave property of component x and y.	

//[/comment]	

	  hits_s[threadIdx.x].x = hits_d[Idx_evt*32+threadIdx.x].x;

	  hits_s[threadIdx.x].y = hits_d[Idx_evt*32+threadIdx.x].y;

	}

	

		for (int i = 0; i < (CALC_THREAD*PM_X_THREAD - threadIdx.x); i+=32){

		histos[i + threadIdx.x] = 0;

	}

  }// if (threadIdx.x < 32)

I modified your program to show this

(I am sorry that I don’t know the details, so I commented some code and changed some parameters)

(1) kernel

__global__  void ring_find_kernel(struct PMPosition *hits_d, struct _Resuln *candy_d, int* nn_evt_d)

{

  __shared__ int histos[1024];

  __shared__ int hh_step2[32];

  __shared__ int pre_histox[16][128];

  __shared__ PMPosition hits_s[32];

	int Idx_evt = blockIdx.x;

	int nnevt = 1;  

// shared memory initialization: the first warp is used to load the hit positions

// from global to shared memory and to initialize the histos vector in shared memory

  if( threadIdx.x < 32) {

//[comment]

//this should introduce 2-way bank conflict due to interleave property of component x and y.	

//[/comment]	

	 hits_s[threadIdx.x].x = hits_d[Idx_evt*32+threadIdx.x].x;

	 hits_s[threadIdx.x].y = hits_d[Idx_evt*32+threadIdx.x].y;

	

		for (int i = 0; i < (CALC_THREAD*PM_X_THREAD - threadIdx.x) ; i+=32){

		histos[i + threadIdx.x] = 0;

	}

  }// if (threadIdx.x < 32)

__syncthreads();

// a total of 128 (CALC_THREAD) threads are used to make calculation.

// each thread loops on 8 (PM_X_THREAD) points (defined by different Idx_centerpm)

// the thread result is stored in the histos vector

  if( threadIdx.x >= 32 && threadIdx.x < CALC_THREAD+32) {

	int Idx_rel = (threadIdx.x - 32);

	

	for(int ipm=0;ipm<PM_X_THREAD;ipm++) {

	  int Idx_centerpmx = ceilf(Idx_rel/16)*PM_X_THREAD*16 + threadIdx.x%16 + 16*ipm;

	  float xcenterx = 0;

	  float ycenterx = 0;

	  #pragma unroll

	  for (int i=0;i < NUM_BINS;i++) {

		pre_histox[i][Idx_rel]=0;

	  }

	  for (int idx=0;idx<nnevt;idx++) {   

		float xhit = hits_s[idx].x;

		float yhit = hits_s[idx].y;

		float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

		int ndisx = (100 - 1)/16; //(H_MAX-H_MIN) / NUM_BINS;

		ndisx = ceilf( disx/( (float)ndisx ) );

		if (ndisx<16) {

		  int tmp_pre_histox = pre_histox[ndisx][Idx_rel];

		  pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

		}

	  }// for idx

		int maxhx = 0;

		int maxnx = 0;

		#pragma unroll

		for (int i=0;i<16;i++) {

		  if (pre_histox[i][Idx_rel]>maxhx) {

				maxhx = pre_histox[i][Idx_rel];

				maxnx = i;

		  }// endif

		}// for i

		int dispx = ceilf(Idx_rel/16)*PM_X_THREAD*16+threadIdx.x%16+ipm*16;

		histos[dispx] = ((maxnx<<16) & 0XFFFF0000) | (maxhx & 0X0000FFFF);

	}// for ipm

  }// if threadIdx

__syncthreads();

// a warp + 1 thread are used to decide the "best" histogram

   if(threadIdx.x >= CALC_THREAD+32 && threadIdx.x < CALC_THREAD+32+32) {

	 int idx_sm = threadIdx.x-(CALC_THREAD+32);

	 int num_banks = (CALC_THREAD*PM_X_THREAD)/32;

	 int hh_cont_max = 0;

	 int hh_addr_max = 0;

	 #pragma unroll

	 for(int i=0;i< num_banks; i++) {

	   int ii = idx_sm+i*32;

	   int hh_tmp = histos[ii];

	   int hh_cont = hh_tmp & 0x0000FFFF;

	   if (hh_cont > hh_cont_max) {

		 hh_cont_max = hh_cont;

		 hh_addr_max = ii;

	   }

	 }

	 hh_step2[idx_sm] = hh_addr_max;

   }

   __syncthreads();

   if(threadIdx.x == CALC_THREAD+32+32) {

	 int hh_addr_max3 = 0;

	 int hh_candy = histos[hh_addr_max3];

	 int candy_bin = (hh_candy & 0xFFFF0000)>>16;

	 int candy_center = hh_addr_max3;

	 (candy_d+blockIdx.x)->nb_max =  hh_candy & 0x0000FFFF;

	 (candy_d+blockIdx.x)->nb_center =  candy_center;

	 (candy_d+blockIdx.x)->nb_radius =  candy_bin;

   }

	return;

}

(2) C-wrapper

#define  CALC_THREAD  128

#define  PM_X_THREAD  8

#define  NUM_BINS	 16

struct PMPosition

{

	float x;

	float y;

};

struct _Resuln

{

	int nb_max;

	int nb_center;

	int nb_radius;

};

void ring_find_kernel_device(struct PMPosition *hits_d, struct _Resuln *candy_d, int* nn_evt_d) 

{

	dim3 threads( 128,1,1 );

	dim3 grid( 30, 1);

	ring_find_kernel<<<grid, threads>>>( hits_d, candy_d, nn_evt_d );		

}

(3) driver

void profile_ring_find_kernel( void )

{

	printf("#####  profile_ring_find_kernel \n");

	cudaDeviceProp deviceProp;

	int device = 2;

	cudaGetDeviceProperties(&deviceProp, device);

	cudaSetDevice( device );

	printf("use device %d, name = %s\n", device, deviceProp.name );

	struct PMPosition *hits_h;

	struct _Resuln *candy_h;

	int* nn_evt_h;

	

	int num_bin = 16;

	int num_thread = 128;

	int size_A = num_bin * num_thread * sizeof( struct PMPosition );

	int size_B = num_bin * num_thread * sizeof( struct _Resuln );

	int size_C = num_bin * num_thread * sizeof( int );

	hits_h = (struct PMPosition *) malloc( size_A );

	assert(hits_h);

	candy_h = (struct _Resuln *) malloc( size_B );

	assert(candy_h);

	nn_evt_h = (int*) malloc( size_C );

	assert(nn_evt_h);

	for( int i = 0; i < num_bin * num_thread; i++){

		nn_evt_h[i] = rand() % num_bin;

		hits_h[i].x = rand()/10000.0;

		hits_h[i].y = rand()/10000.0;

	}

	struct PMPosition *hits_d;

	struct _Resuln *candy_d;

	int* nn_evt_d;

	CUDA_SAFE_CALL(cudaMalloc((void**) &hits_d, size_A));

  CUDA_SAFE_CALL(cudaMalloc((void**) &candy_d, size_B));

  CUDA_SAFE_CALL(cudaMalloc((void**) &nn_evt_d, size_C));	

	CUDA_SAFE_CALL(cudaMemcpy(hits_d, hits_h, size_A, cudaMemcpyHostToDevice) );

	CUDA_SAFE_CALL(cudaMemcpy(nn_evt_d, nn_evt_h, size_C, cudaMemcpyHostToDevice) );

	ring_find_kernel_device( hits_d, candy_d, nn_evt_d);

	cutilSafeCall( cudaMemcpy( candy_h, candy_d, size_B, cudaMemcpyDeviceToHost) );

	free( hits_h );

	free( candy_h );

	free( nn_evt_h );

	cutilSafeCall(cudaFree(hits_d));

	cutilSafeCall(cudaFree(candy_d));

	cutilSafeCall(cudaFree(nn_evt_d));

}

platform: Tesla C1060

Case 1: keep writing

hits_s[threadIdx.x].x = hits_d[Idx_evt*32+threadIdx.x].x;

	 hits_s[threadIdx.x].y = hits_d[Idx_evt*32+threadIdx.x].y;

warp serialization = 16

case 2: comment component y

hits_s[threadIdx.x].x = hits_d[Idx_evt*32+threadIdx.x].x;

//	 hits_s[threadIdx.x].y = hits_d[Idx_evt*32+threadIdx.x].y;

warp serialization = 8

case 3: disable writing

//	 hits_s[threadIdx.x].x = hits_d[Idx_evt*32+threadIdx.x].x;

//	 hits_s[threadIdx.x].y = hits_d[Idx_evt*32+threadIdx.x].y;

warp serialization = 0

I apologize for the late answer, but I was on holiday for Christmas! :santa:

I tried to run your kernel (with your wrapper) and got the same result.
Nevertheless, in my original code, commenting or uncommenting the same row doesn’t have the same effect:
this means that I have (at least) two different sources of conflicts.
Going deeper into your code compared with mine (thank you very much for this help) I found where the problem is:
You wrote:
float xcenterx = 0;
float ycenterx = 0;
because you have no way to fill the PMCenterPosition structure, while in my code there is:
float xcenterx = PMCenterPosition[Idx_centerpmx].x;
float ycenterx = PMCenterPosition[Idx_centerpmx].y;
where the structure is in constant memory!!!
This is the biggest source of the warp serialization (there is also the write to shared memory you pointed out, but for some reason it is a smaller one).
My Idx_centerpmx is defined as
int Idx_centerpmx = ceilf(Idx_rel/16)*PM_X_THREAD*16+threadIdx.x%16+16*ipm;
with a structure chosen to avoid conflicts in shared memory, but without any particular care for the constant memory.

Do you have any suggestion about how I should read the constant memory to avoid serialization???

Your help is priceless!

Thx,

Gianluca

I found the answer: the warp serialization can’t be avoided with constant memory if the threads in a half-warp access different addresses!

I moved the array to global memory and finally I have no warp serialization! (apart from a small number due to the hits_s array filling procedure)

Now I have no serialization, but I still have doubts.
In the part of the code where I do the increment for the histogram

     if (ndisx<16) {
      int tmp_pre_histox = pre_histox[ndisx][Idx_rel];
        pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;
      }

I get a factor of 4 better performance (from 4 ms overall to 1 ms) if I comment out the “write” to shared memory, and only a very small improvement if I comment out the “read”.

Is the write procedure really so time consuming compared to the read?

Do you have any idea?

Section 5.1.2.3 of the programming guide says:

" For all threads of a half-warp, reading from the constant cache is as fast as reading

from a register as long as all threads read the same address. The cost scales linearly

with the number of different addresses read by all threads."

I wrote a simple kernel to test “warp serialization” with constant memory:

__constant__ float PMCenterPosition[16][16] = 

{

	1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f, 16.0f,

	2.8f, 2.7f, 3.6f, 4.5f, 5.4f, 6.3f, 7.2f, 8.1f, 9.2f, 10.3f, 11.4f, 12.5f, 13.6f, 14.7f, 15.8f, 16.1f, 

	3.0f, 2.0f, 3.0f, 4.1f, 5.0f, 9.0f, 17.0f, 8.0f, 9.1f, 11.0f, 11.0f, 12.0f, 14.0f, 15.0f, 25.0f, 16.0f,

	4.0f, 3.0f, 4.0f, 5.0f, 6.0f, 6.1f, 8.0f, 8.2f, 9.5f, 10.2f, 11.2f, 12.2f, 13.2f, 14.3f, 15.01f, 16.6f,

	

	1.7f, 2.0f, 3.1f, 4.1f, 5.1f, 6.3f, 7.5f, 8.7f, 9.6f, 41.0f, 11.0f, 12.1f, 23.0f, 14.0f, 11.0f, 16.0f,

	2.9f, 2.1f, 3.6f, 4.2f, 5.2f, 6.2f, 7.1f, 8.1f, 9.2f, 10.3f, 21.4f, 22.5f, 13.6f, 15.7f, 19.8f, 16.0f,

	3.6f, 2.2f, 3.3f, 4.3f, 5.3f, 9.1f, 1.6f, 8.7f, 9.1f, 51.1f, 81.0f, 12.0f, 14.1f, 15.0f, 25.0f, 16.0f,

	4.5f, 3.3f, 4.4f, 5.4f, 6.4f, 6.4f, 8.7f, 8.2f, 9.4f, 10.2f, 91.2f, 13.2f, 13.2f, 14.3f, 17.01f, 16.7f,

	

	3.0f, 2.0f, 3.0f, 4.1f, 5.0f, 9.0f, 17.0f, 8.0f, 9.1f, 11.0f, 11.0f, 12.0f, 14.0f, 15.0f, 25.0f, 16.0f,

	4.0f, 3.0f, 4.0f, 5.0f, 6.0f, 6.1f, 8.0f, 8.2f, 9.5f, 10.2f, 11.2f, 12.2f, 13.2f, 14.3f, 15.01f, 16.5f,

	2.9f, 2.1f, 3.6f, 4.2f, 5.2f, 6.2f, 7.1f, 8.1f, 9.2f, 10.3f, 21.4f, 22.5f, 13.6f, 15.7f, 19.8f, 16.0f,

	3.6f, 2.2f, 3.3f, 4.3f, 5.3f, 9.1f, 1.6f, 8.7f, 9.1f, 51.1f, 81.0f, 12.0f, 14.1f, 15.0f, 25.0f, 16.0f,

	

	2.8f, 2.7f, 3.6f, 4.5f, 5.4f, 6.3f, 7.2f, 8.1f, 9.2f, 10.3f, 11.4f, 12.5f, 13.6f, 14.7f, 15.8f,	 16.0f,

	4.5f, 3.3f, 4.4f, 5.4f, 6.4f, 6.4f, 8.7f, 8.2f, 9.4f, 10.2f, 91.2f, 13.2f, 13.2f, 14.3f, 17.01f, 16.1f,		

	3.8f, 2.7f, 5.6f, 4.5f, 5.4f, 6.3f, 9.2f, 8.1f, 9.2f, 20.3f, 11.4f, 12.5f, 13.6f, 14.7f, 15.8f, 16.0f,	

	4.5f, 3.3f, 4.4f, 5.4f, 6.4f, 6.4f, 8.7f, 8.2f, 9.4f, 10.2f, 91.2f, 13.2f, 13.2f, 14.3f, 17.01f,  16.3f		

};

// case 1: warp serialization = 589

static __global__ void  fotone( float *A, float* B, float *C )

{

	int k1 = threadIdx.x/16;

	int k2 = threadIdx.x % 16;

	

	int index = blockIdx.x * 128 + threadIdx.x;

	C[index] = PMCenterPosition[k1][k2];

}

void cmem_device( float *A, float* B, float *C )

{

	dim3 threads( 128,1,1 );

	dim3 grid( 30, 1);

	fotone<<<grid, threads>>>( A, B, C );		

}

and driver

void profile_cmem( void )

{

	float *h_A, *h_B, *h_C;

	int num_bin = 30;

	int num_thread = 128;

	int size_A = num_bin * num_thread * sizeof( float );

	int size_B = size_A;

	int size_C = size_A;

	h_A = (float*) malloc( size_A ); assert(h_A);

	h_B = (float*) malloc( size_B ); assert(h_B);

	h_C = (float*) malloc( size_C ); assert(h_C);

	for( int i = 0; i < num_bin * num_thread; i++){

		h_A[i] = rand();

		h_B[i] = rand() % num_bin;

	}

	

	float *d_A, *d_B, *d_C;

	CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, size_A));

  CUDA_SAFE_CALL(cudaMalloc((void**) &d_B, size_B));

  CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, size_C));	

	CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, size_A, cudaMemcpyHostToDevice) );

	CUDA_SAFE_CALL(cudaMemcpy(d_B, h_B, size_B, cudaMemcpyHostToDevice) );

	cmem_device( d_A, d_B, d_C );

	cutilSafeCall( cudaMemcpy( h_C, d_C, size_C, cudaMemcpyDeviceToHost) );

	free( h_A ); free( h_B );free( h_C );

	cutilSafeCall(cudaFree(d_A));

	cutilSafeCall(cudaFree(d_B));

	cutilSafeCall(cudaFree(d_C));

}

The kernel of “case 1” simulates your indexing

int Idx_centerpmx = ceilf(Idx_rel/16)*PM_X_THREAD*16+threadIdx.x%16+16*ipm;

However, it does not satisfy “all threads of a half-warp read the same address”,

so it has “warp serialization = 589”.

similarly, if we use

// case 2: warp serialization = 1566

static __global__ void  fotone( float *A, float* B, float *C )

{

	int k2 = threadIdx.x % 16;

	

	int index = blockIdx.x * 128 + threadIdx.x;

	C[index] = PMCenterPosition[k2][0];

}

then it has “warp serialization = 1566” since each thread of a half-warp accesses a different address.

However if we use

// case 3: warp serialization = 0

static __global__ void  fotone( float *A, float* B, float *C )

{

	int k2 = threadIdx.x / 16;

	

	int index = blockIdx.x * 128 + threadIdx.x;

	C[index] = PMCenterPosition[k2][0];

}

then it has 0 warp-serialization since all threads of a half-warp access the same address.

If you comment out the “write” to shared memory, say

#pragma unroll	  

for (int idx=0;idx<nnevt;idx++) {		

	float xhit = hits_s[idx].x;		

	float yhit = hits_s[idx].y;		

	float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));		

	int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

	if (ndisx<16) {		  

		int tmp_pre_histox = pre_histox[ndisx][Idx_rel];		  

//		pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;		  

	}	 

}

then the compiler deletes the whole for-loop, since it produces no output. That is why you see the big improvement.

However, if you comment out the “read” from shared memory, say

#pragma unroll	  

for (int idx=0;idx<nnevt;idx++) {		

	float xhit = hits_s[idx].x;		

	float yhit = hits_s[idx].y;		

	float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));		

	int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

	if (ndisx<16) {		  

//		int tmp_pre_histox = pre_histox[ndisx][Idx_rel];		  

		pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;		  

	}	 

}

then the compiler removes only the “read”, so you see only a slight improvement.

Your interpretation was quite convincing to me! But I tried with this code, just to be sure:

int aa;

#pragma unroll	  

for (int idx=0;idx<nnevt;idx++) {		

	float xhit = hits_s[idx].x;		

	float yhit = hits_s[idx].y;		

	float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));		

	int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

	if (ndisx<16) {		  

		int tmp_pre_histox = pre_histox[ndisx][Idx_rel];		  

//		pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;		

				aa = tmp_pre_histox + 1;  

	}	 

}

so the loop has an output (aa), but the shared-memory write is commented out. In this case I still get the big improvement!

How can I explain this?

thx,

Gianluca

“aa” is just a local variable; if you don’t write its value to an output parameter, it is dead and the compiler removes the loop anyway.

Just try

nn_evt_d[threadIdx.x] = aa;

after the for-loop.