A good idea or not? Need advice

It occurred to me that if only threads of the same warp can execute at the same time, then this might be used to prevent two threads from updating the same shared (or global) data.

So, a simple case:

[codebox]if ( threadIdx.x == 0 || threadIdx.x == 32 )
{
    int targ = threadIdx.x%32;
    sharedData[targ] = sharedData[targ] + somenumber;
}[/codebox]

Will this work, or is it expanded into the following,

[codebox]if ( threadIdx.x == 0 || threadIdx.x == 32 )
{
    int targ = threadIdx.x%32;
    float reg = sharedData[targ];  // step 1
    reg = reg + somenumber;        // step 2
    sharedData[targ] = reg;        // step 3
}[/codebox]

with the possibility that the sequence of operations could become:

warp A step 1
warp A step 2
warp B step 1   ← warp B reads from shared memory before warp A does step 3, so a failure
warp B step 2
warp A step 3
warp B step 3

PS: I am aware this relies on card/compiler behaviour not changing.

I’m trying to solve the same problem, and I concluded you can’t avoid using atomicAdd.

The hardware will most likely switch to another warp after reading sharedData (there is a 24-cycle latency between dependent instructions), so your results won't be atomic.

In the end, using atomicAdd on WARP_SIZE counters will still be a win compared to using atomicAdd on a single counter, due to the reduced contention.
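
For what it's worth, here is a minimal sketch of what I mean. The kernel and variable names are just illustrative, and it assumes a card that supports shared-memory integer atomics (compute capability 1.2 or later):

[codebox]__global__ void countWithPerLaneCounters(int *result, int somenumber)
{
    __shared__ int counters[32];     // one counter per lane instead of one shared counter

    if (threadIdx.x < 32)
        counters[threadIdx.x] = 0;
    __syncthreads();

    // every thread adds into the counter belonging to its lane id,
    // so only threads with the same lane id in *different* warps ever contend
    atomicAdd(&counters[threadIdx.x % 32], somenumber);
    __syncthreads();

    // fold the 32 partial counters into the final total (one thread is enough here)
    if (threadIdx.x == 0)
    {
        int total = 0;
        for (int i = 0; i < 32; ++i)
            total += counters[i];
        result[blockIdx.x] = total;
    }
}[/codebox]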

Thanks :)

I realised there may be a problem and wondered if using only one warp to do all the calculations will work.

My actual problem is a grid where every cell needs to update some of its neighbours (and itself). So all cells will be updated, and several cells may update the same cell.

An update transfers a portion of the 'current' cell's value to random neighbours.

So my plan is to have only 32 threads doing updates, with those threads working on cells that are 4 rows and 4 columns apart.

e.g.

[codebox]int ro = threadIdx.x/8;  // threads 0-7 work on row block 0, threads 8-15 on row block 1, ...
int co = threadIdx.x%8;  // threads 0,8,16,24 work on column block 0, threads 1,9,17,25 on column block 1, ...

if ( threadIdx.x < 32 )
{
    for ( int rr = 0; rr < 4; rr++ )
    {
        int row = rr + ro*4;
        for ( int cc = 0; cc < 4; cc++ )
        {
            int col = cc + co*4;   // columns 4 apart per thread, to match the rows
            // subtract an amount from cell[row][col] and add it to the appropriate neighbour
        }
    }
}[/codebox]

I believe that, as all threads in a warp execute the same instruction, they will always be on the same rr and cc,

so when thread 0 is working on row 0, column 0, thread 1 will be working on row 0, column 4, and they can never clash.

All warps other than warp 0 will do nothing in this part of the code, so they will take very few execution cycles.

However, all warps will take part in copying data to/from the global and shared arrays.
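
To make the overall shape clearer, here is a rough skeleton of what I have in mind. All the names, the tile size and the global indexing are made up for the sketch, halo cells for updates that cross tile boundaries are ignored, and the actual update is left as a comment:

[codebox]#define TILE_ROWS 16
#define TILE_COLS 32

__global__ void updateCells(const float *gridIn, float *gridOut)
{
    __shared__ float cell[TILE_ROWS][TILE_COLS];

    // all warps help copy the tile from global to shared memory
    for (int i = threadIdx.x; i < TILE_ROWS*TILE_COLS; i += blockDim.x)
        cell[i/TILE_COLS][i%TILE_COLS] = gridIn[i];   // global indexing simplified for the sketch
    __syncthreads();

    // only warp 0 does the updates, so no two active threads ever touch the same cell
    if (threadIdx.x < 32)
    {
        int ro = threadIdx.x/8;
        int co = threadIdx.x%8;
        for (int rr = 0; rr < 4; rr++)
        {
            int row = rr + ro*4;
            for (int cc = 0; cc < 4; cc++)
            {
                int col = cc + co*4;
                // subtract an amount from cell[row][col] and add it to the appropriate neighbour
            }
        }
    }
    __syncthreads();

    // all warps help copy the tile back out to global memory
    for (int i = threadIdx.x; i < TILE_ROWS*TILE_COLS; i += blockDim.x)
        gridOut[i] = cell[i/TILE_COLS][i%TILE_COLS];
}[/codebox]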

Am I overlooking something?

Using a single warp should work, but performance will suffer greatly due to lack of concurrency, unless you do something like software pipelining.

The CUDA manual says there's a 24-cycle delay between dependent instructions. If you only have one warp and your instructions have a dependency distance of less than 24 cycles (more if you're reading from global RAM), the processors will idle because they have no other work.

Software pipelining is one solution for increasing concurrency. I've experimented with it and it works great, but it is pretty ugly.

Suppose you want to add all the numbers in an array:

[codebox]float sum = a[0];
for (int i = 1; i < N; ++i)
    sum += a[i];[/codebox]

You pipeline (4 iterations in this example, and only for one thread) to increase the dependence distance:

[codebox]// schedule: load0 load1 load2 load3  add0 load4  add1 load5  add2 load6  add3 load7  add4 ...  addN-4 addN-3 addN-2 addN-1
float scratch[4],
      sum[4] = {0, 0, 0, 0};

// fill pipeline (N is assumed to be a multiple of 4 and at least 4)
scratch[0] = a[0];
scratch[1] = a[1];
scratch[2] = a[2];
scratch[3] = a[3];

int i;
for (i = 4; i < N; )
{
    // precondition: all elements < i have been read
    // scratch is a circular buffer with scratch[0] holding the oldest loaded value and scratch[3] the youngest
    sum[0] += scratch[0];
    scratch[0] = a[i++];
    sum[1] += scratch[1];
    scratch[1] = a[i++];
    sum[2] += scratch[2];
    scratch[2] = a[i++];
    sum[3] += scratch[3];
    scratch[3] = a[i++];
}

// drain pipeline
sum[0] += scratch[0];
sum[1] += scratch[1];
sum[2] += scratch[2];
sum[3] += scratch[3];

float total = sum[0] + sum[1] + sum[2] + sum[3];[/codebox]

However, this is what having 512 or 1024 threads achieves in the first place. The execution schedule for software pipelining with 32 threads is deterministic, but is probably very close to the schedule you get when you use 1024 threads.
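
For comparison, a plain many-thread version of the same sum, which gets that concurrency without the ugliness, would look roughly like the standard shared-memory tree reduction below (the 512-thread block size and the per-block result array are assumptions):

[codebox]__global__ void sumArray(const float *a, float *blockSums, int N)
{
    __shared__ float partial[512];   // one slot per thread; a 512-thread block is assumed

    // each thread accumulates a strided slice of the array;
    // with many warps in flight the 24-cycle latency is hidden automatically
    float mySum = 0.0f;
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < N; i += gridDim.x*blockDim.x)
        mySum += a[i];
    partial[threadIdx.x] = mySum;
    __syncthreads();

    // standard tree reduction in shared memory
    for (int s = blockDim.x/2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = partial[0];  // one partial sum per block, combined on the host or in a second kernel
}[/codebox]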