Problem with correct branching within a warp

No need for any__syncthreads() does not imply no need for any __threadfence_block()

Anyway, I believe that using explicit synchronization is faster than basing on ‘volatile’, whose behavior is not well defined. The point here is that writes to shared/memory always face latency. The programmers always have a better decision than the compiler on if this latency affects the program logic.

Agreed, but they don’t imply you have to use it either. The way the last sentence I quoted is written suggest (the underlined part) that making it volatile should solve the problem.

Again, agreed that making something volatile don’t have to make the result fast, but should have to make them correct. That is my aim for the moment.

As for my assertion I simply removed first ‘return’ and kept the second one. This one error-reporting thread re-converges with the rest of the warp before quitting, which seems to solve this particular problem of mine.

Thanks for an interesting topic. I am still thinking about a convenient way to “assert” and keep track of errors appearing in kernel modules. I may use somethings like this:

__device__ int error_code, error_line;

#define KERNEL_ERR(err) {error_code = err; error_line = __LINE__;}

...............

__global__ mykernel(....)

{

	...........

	...........

	if (error_appear)

		KERNEL_ERR(err);

	..........

}

extern "C" myhost(........)

{

	..........

	mykernel<<<......>>>(.............);

	check_and_display_error(..........);

	..........

}

The problem with CUDA is that threads are run in serial-parallel-mixed (let’s say there are more than one warps in a block), so the error code may be overwritten by other threads. If you find an efficient/convenient way to manage it, many users here will appreciate your recommendations.

The overhead of error checking code is not a big problem. We can use switches to control those code and disable them when we need speed.

I believe that my CUDA_ASSERT macro solves all the problem, except you have to remove the first ‘return’ (check out posts above). Never thought about putting the LINE macro though :)

Notice, to resolve the problem of multiple threads, I atomically increase the access semaphor and only if it is at 0 I set up the error code and additional values. The semaphor is never decreased. Even if other thread ends up in error state, it will just exit without overwritting anything.

Only problem that may exist is that while one thread wants to report a bug, other may change shared/global variables and thus influence the report. Therefore it is not a remedee to all your problems, but certinly may help :)

And obviously, I give no warranty :P

Edit: It is still buggy due to some strange branch behaviour.