isLastBlockDone

fender177 · March 13, 2011, 3:48am

Hey Everyone,

Appendix B.5 (3.2 programming guide) has some literature about synchronizing across blocks - that is, having the last block done do something. I implemented the code, but I don’t think it is working properly. I decided to have the last block print out its number - the printf statement prints out about 23-24 times, but it should be printing 33 times based on the data.

Here is my code

__device__ unsigned int count = 0;

__shared__ bool isLastBlockDone;

__global__ void U(cuDoubleComplex *a, cuDoubleComplex* b, cuDoubleComplex *c, cuDoubleComplex *newX){

	int tid = blockIdx.x * blockDim.x + threadIdx.x;

	int num = N-1;

	while(num >= 0){

		if(a[tid].py==num && a[tid].px > num){

			newX[a[tid].px] = a[tid] * c[a[tid].px];

			__threadfence(); // testing...

		}

		__syncthreads();

		if(threadIdx.x==0){

			__threadfence();  // code in apendix

			unsigned int value = atomicInc(&count, gridDim.x);

			isLastBlockDone = (value == (gridDim.x - 1));	

		}

		__syncthreads();

		if(isLastBlockDone){

			b[num] = SUM(b,c,newX,num);

				

			if(threadIdx.x==0){

				c[num] = b[num];

				count = 0;

				printf("Block: %d\n", blockIdx.x);

			}

		}

		__syncthreads();

		num--;

	}

}

I’m calling the kernel:

int nTU = 16;

int numBlocks = (int)ceil((float)ne/nTU); // ne = number of array elements

U<<<numBlocks,nTU>>>(d_a, d_b, d_c, d_newX);

The output is:

Block: 1

Block: 12

Block: 13

Block: 16

Block: 17

Block: 22

Block: 22

Block: 25

Block: 26

Block: 30

Block: 34

Block: 36

Block: 4

Block: 40

Block: 41

Block: 43

Block: 43

Block: 46

Block: 48

Block: 5

Block: 55

Block: 57

Block: 58

Block: 8

Any help would be greatly appreciated.

SPWorley · March 13, 2011, 4:11am

You’re calling your LastBlockDone test but then you’re not actually finishing the block! You’ve got that test inside the

while (num>0)

loop. So it’s not called when the block is done… the “done” counter is therefore incremented multiple times per block, not once per block, making it meaningless.

Just move your test outside the loop so it’s the last thing the block does.

fender177 · March 13, 2011, 4:32pm

You’re calling your LastBlockDone test but then you’re not actually finishing the block! You’ve got that test inside the
while (num>0)
loop. So it’s not called when the block is done… the “done” counter is therefore incremented multiple times per block, not once per block, making it meaningless.

Just move your test outside the loop so it’s the last thing the block does.

Thanks for the reply (and sorry for the double post). I didn’t think having it inside of a for loop would be a problem - my algorithm takes a sparse matrix, packs it, sends it to the card, and then the GPU is supposed to step through the packed matrix (an array) row by row. Are you saying that essentially, since I have a while loop, that the other blocks will ignore the is last block, and just continue on? That’s trouble for me!

I’ve implemented your suggestion, and simply removed the while loop. Instead, I put the kernel call in host code inside of a for loop. I had an unoptimized version that did this and it worked. I’m not getting the correct result, but I feel that might be my code, and not a result of the isLastBlock code.

Thanks again for your reply.

tera · March 14, 2011, 1:22am

Look at SPWorley’s reply again - it is perfectly valid and explains why you get multiple outputs from the kernel.

And after trying out his suggestion and finding that it fixes the problem you have asked about, how can you possibly write that you think it doesn’t?

fender177 · March 14, 2011, 3:44am

Feel free to re-read my reply. I’m not saying that his suggestion didn’t work. I’m saying that I implemented it, and I’m just curious about how cuda schedules while loops. The appendix led me to believe that the islastblock code would halt the other blocks. I was wrong, which is why I posted the question.

What I was saying in my reply is that even after implementing the correct code, I’m not getting the correct output. That, however, is probably my code,not the islastblock code.

Sorry for the confusion.

SPWorley · March 14, 2011, 4:11am

Then it’s useful to look at the lastBlock example to see how it works. It’s pretty simple. It’s basically just a global counter that’s incremented once for each block as it finishes its work.

Since the increments are atomic, the block even knows if it’s the last block out the the grid to finish, and it can do special cleanup or summarization there if you like.

The interesting detail is the need for the uncommon threadfence() command which is used to make sure that blocks don’t signal their completion before all their results are flushed out of the instruction pipeline.

Since the test is just so simple (by counting completion number) you can see why the test needs to be done once per block, and why it needs to be the last thing the block does.

You may want to skip such fanciness when first designing your algorithm. The simpler and cleaner way to do a final test is to simply use a second followup kernel call, even with just one block, to do it. This makes ordering very clear and easy, and it’s not even inefficient on Fermi.

The fancy lastBlock method is just a really clever hack to pull out a bit of performance from tight kernels that need a trivial cleanup step.

Topic		Replies	Views
Memory fence functions CUDA Programming and Performance	1	872	March 13, 2011
sum up blockresults with last thread-block? CUDA Programming and Performance	2	2412	July 7, 2008
Is there any way to find last thread block that finish kernel execution ? CUDA Programming and Performance	1	814	April 3, 2017
Serializing and after parallelism again CUDA Programming and Performance	8	2565	July 12, 2008
How to write global memory with last thread? CUDA Programming and Performance	3	565	February 17, 2020
CAN I SYNCRONIZE ALL THE THREADS OF A GRID? CUDA Programming and Performance	3	4186	June 25, 2008
Synchronize all blocks in CUDA CUDA Programming and Performance	12	46913	October 25, 2013
do not understand thread/block division CUDA Programming and Performance	10	2905	April 23, 2012
this code resets my computer CUDA Programming and Performance	21	7448	March 29, 2008
Blocks execution problem Unexpectable behavior CUDA Programming and Performance	4	859	February 2, 2011

isLastBlockDone

Related topics