Problem with correct branching within a warp

cvnguyen · May 27, 2009, 9:32pm

No need for any__syncthreads() does not imply no need for any __threadfence_block()

Anyway, I believe that using explicit synchronization is faster than basing on ‘volatile’, whose behavior is not well defined. The point here is that writes to shared/memory always face latency. The programmers always have a better decision than the compiler on if this latency affects the program logic.

Cygnus_X1 · May 27, 2009, 10:03pm

Agreed, but they don’t imply you have to use it either. The way the last sentence I quoted is written suggest (the underlined part) that making it volatile should solve the problem.

Again, agreed that making something volatile don’t have to make the result fast, but should have to make them correct. That is my aim for the moment.

As for my assertion I simply removed first ‘return’ and kept the second one. This one error-reporting thread re-converges with the rest of the warp before quitting, which seems to solve this particular problem of mine.

cvnguyen · May 28, 2009, 6:25am

Thanks for an interesting topic. I am still thinking about a convenient way to “assert” and keep track of errors appearing in kernel modules. I may use somethings like this:

__device__ int error_code, error_line;

#define KERNEL_ERR(err) {error_code = err; error_line = __LINE__;}

...............

__global__ mykernel(....)

{

	...........

	...........

	if (error_appear)

		KERNEL_ERR(err);

	..........

}

extern "C" myhost(........)

{

	..........

	mykernel<<<......>>>(.............);

	check_and_display_error(..........);

	..........

}

The problem with CUDA is that threads are run in serial-parallel-mixed (let’s say there are more than one warps in a block), so the error code may be overwritten by other threads. If you find an efficient/convenient way to manage it, many users here will appreciate your recommendations.

The overhead of error checking code is not a big problem. We can use switches to control those code and disable them when we need speed.

Cygnus_X1 · May 28, 2009, 10:45am

I believe that my CUDA_ASSERT macro solves all the problem, except you have to remove the first ‘return’ (check out posts above). Never thought about putting the LINE macro though :)

Notice, to resolve the problem of multiple threads, I atomically increase the access semaphor and only if it is at 0 I set up the error code and additional values. The semaphor is never decreased. Even if other thread ends up in error state, it will just exit without overwritting anything.

Only problem that may exist is that while one thread wants to report a bug, other may change shared/global variables and thus influence the report. Therefore it is not a remedee to all your problems, but certinly may help :)

And obviously, I give no warranty :P

Edit: It is still buggy due to some strange branch behaviour.

Topic		Replies	Views
if-else WARP divergence WARP divergence CUDA Programming and Performance	17	16815	January 5, 2008
global to shared mem loads and sync CUDA Programming and Performance	26	11441	February 21, 2008
32 thread block doesn't need _syncthreads()? CUDA Programming and Performance	18	13744	January 21, 2024
CUDA Memory Consistency CUDA Programming and Performance	23	55585	March 8, 2007
Global thread barrier CUDA Programming and Performance	78	85692	December 23, 2011
warp synchronization test CUDA Programming and Performance	5	1656	September 2, 2014
Writes in same memory location Cant add numbers from different threads? CUDA Programming and Performance	46	25631	July 5, 2007
help with kernel synchronization? CUDA Programming and Performance	22	13900	August 26, 2010
CUDA Kernel self-suspension ? Can a CUDA Kernel conditionally suspend its execution ? CUDA Programming and Performance	46	45213	April 17, 2011
Race condition within warp CUDA Programming and Performance	9	2987	September 20, 2016

Problem with correct branching within a warp

Related topics