Hey Everyone,

Appendix B.5 of the 3.2 Programming Guide describes a technique for synchronizing across blocks, that is, having the last block to finish do some final work. I implemented the code, but I don't think it is working properly. I decided to have the last block print out its number: the printf statement fires about 23-24 times, but based on the data it should fire 33 times.

Here is my code

__device__ unsigned int count = 0;

__shared__ bool isLastBlockDone;

__global__ void U(cuDoubleComplex *a, cuDoubleComplex *b, cuDoubleComplex *c, cuDoubleComplex *newX)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;

	int num = N - 1;

	while (num >= 0) {

		if (a[tid].py == num && a[tid].px > num) {

			newX[a[tid].px] = a[tid] * c[a[tid].px];

			__threadfence();  // code in appendix

			unsigned int value = atomicInc(&count, gridDim.x);

			isLastBlockDone = (value == (gridDim.x - 1));

			if (isLastBlockDone) {

				b[num] = SUM(b, c, newX, num);

				c[num] = b[num];

				count = 0;

				printf("Block: %d\n", blockIdx.x);
			}
		}

		num--;
	}
}

I’m calling the kernel:

int nTU = 16;

int numBlocks = (int)ceil((float)ne/nTU); // ne = number of array elements

U<<<numBlocks,nTU>>>(d_a, d_b, d_c, d_newX);

The output is:

Block: 1

Block: 12

Block: 13

Block: 16

Block: 17

Block: 22

Block: 22

Block: 25

Block: 26

Block: 30

Block: 34

Block: 36

Block: 4

Block: 40

Block: 41

Block: 43

Block: 43

Block: 46

Block: 48

Block: 5

Block: 55

Block: 57

Block: 58

Block: 8

Any help would be greatly appreciated.

You’re calling your isLastBlockDone test but then you’re not actually finishing the block! You’ve got that test inside the

while (num >= 0)

loop. So it’s not called when the block is done… the “done” counter is therefore incremented multiple times per block, not once per block, making it meaningless.

Just move your test outside the loop so it’s the last thing the block does.
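As a rough sketch of that restructuring (using the `count` and `isLastBlockDone` declarations from the original post, and ignoring the row-to-row data dependency in the real algorithm, which is a separate issue), the completion test becomes the last thing the kernel does:

```cuda
// Sketch only: same structure as the original kernel, but with the
// completion test moved AFTER the loop, so it runs exactly once per block.
__global__ void U(cuDoubleComplex *a, cuDoubleComplex *b,
                  cuDoubleComplex *c, cuDoubleComplex *newX)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int num = N - 1; num >= 0; num--) {
        if (a[tid].py == num && a[tid].px > num)
            newX[a[tid].px] = a[tid] * c[a[tid].px];
    }

    if (threadIdx.x == 0) {
        __threadfence();                   // flush this block's writes first
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    __syncthreads();                       // all threads then see the flag

    if (isLastBlockDone && threadIdx.x == 0) {
        printf("Block: %d\n", blockIdx.x); // now prints exactly once
        count = 0;                         // reset for the next launch
    }
}
```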

Thanks for the reply (and sorry for the double post). I didn’t think having the test inside of a loop would be a problem. My algorithm takes a sparse matrix, packs it, sends it to the card, and then the GPU is supposed to step through the packed matrix (an array) row by row. Are you saying that, essentially, since I have a while loop, the other blocks will ignore the is-last-block test and just continue on? That’s trouble for me!

I’ve implemented your suggestion, and simply removed the while loop. Instead, I put the kernel call in host code inside of a for loop. I had an unoptimized version that did this and it worked. I’m not getting the correct result, but I feel that might be my code, and not a result of the isLastBlock code.

Thanks again for your reply.

Look at SPWorley’s reply again - it is perfectly valid and explains why you get multiple outputs from the kernel.

And after trying out his suggestion and finding that it fixes the problem you have asked about, how can you possibly write that you think it doesn’t?

Feel free to re-read my reply. I’m not saying that his suggestion didn’t work. I’m saying that I implemented it, and I’m just curious about how CUDA schedules while loops. The appendix led me to believe that the isLastBlockDone code would halt the other blocks. I was wrong, which is why I posted the question.

What I was saying in my reply is that even after implementing the correct code, I’m not getting the correct output. That, however, is probably my code, not the isLastBlockDone code.

Sorry for the confusion.

Then it’s useful to look at the lastBlock example to see how it works. It’s pretty simple. It’s basically just a global counter that’s incremented once for each block as it finishes its work.

Since the increments are atomic, a block even knows if it’s the last block out of the grid to finish, and it can do special cleanup or summarization there if you like.

The interesting detail is the need for the uncommon __threadfence() call, which is used to make sure that a block doesn’t signal its completion before all of its results have been written out to global memory, where other blocks can see them.

Since the test is just so simple (by counting completion number) you can see why the test needs to be done once per block, and why it needs to be the last thing the block does.
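For reference, the pattern looks roughly like this (a sketch along the lines of the Appendix B.5 example, not the guide's exact code; the per-block computation is a stand-in):

```cuda
__device__ unsigned int count = 0;

__global__ void sumPartials(float *partial, float *result)
{
    __shared__ bool isLastBlockDone;

    // Stand-in for the real per-block computation.
    float myPartial = 1.0f;

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = myPartial;

        // Make the write above visible to every block
        // BEFORE announcing that this block is done.
        __threadfence();

        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Every thread waits here so it reads isLastBlockDone
    // only after thread 0 has written it.
    __syncthreads();

    // Exactly one block takes this branch.
    if (isLastBlockDone && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int i = 0; i < gridDim.x; i++)
            total += partial[i];
        *result = total;
        count = 0;  // reset the counter for the next launch
    }
}
```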

You may want to skip such fanciness when first designing your algorithm. The simpler and cleaner way to do a final test is to simply use a second followup kernel call, even with just one block, to do it. This makes ordering very clear and easy, and it’s not even inefficient on Fermi.
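A sketch of that simpler alternative (hypothetical kernel names; the point is only the launch ordering):

```cuda
// Hypothetical kernels standing in for the real work and cleanup.
__global__ void doWork(float *data)  { /* bulk of the computation */ }
__global__ void cleanup(float *data) { /* trivial final step */ }

void run(float *d_data, int numBlocks, int nTU)
{
    doWork<<<numBlocks, nTU>>>(d_data);
    // Kernels launched on the same (default) stream execute in issue
    // order, so cleanup only starts after doWork has fully finished;
    // no counters or fences are needed.
    cleanup<<<1, nTU>>>(d_data);
}
```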

The fancy lastBlock method is just a really clever hack to pull out a bit of performance from tight kernels that need a trivial cleanup step.