My code fails, is it a bug in CUDA? 50 lines of code, fails only in 64-bit CUDA 2.1 :(


Could you please check if the following code works on your 64-bit CUDA 2.1 system? For some reason, it fails on my 64-bit Windows XP and Linux. I tried few different drivers including the latest 181.20 and few different GPUs including GTX 280, Tesla C1060 and 8600 GTS. However, it successfully runs in 32-bit Windows XP, with CUDA 2.0, when compiled with -deviceemu option or if I change parameters slightly, e.g. define SY=8. I compile it using simple “nvcc”.


#include <stdio.h>

#include "cuda_runtime.h"

#define Q( condition ) {if( (condition) != cudaSuccess ) { printf( "\n FAILURE in %s, line %d\n", __FILE__, __LINE__ );exit( 1 );}}

#define SX 2048

#define SY 16

#define BX 512

#define TX 256

#define TY 2

__global__ void cudaStencilDev( int *p )


	p += blockIdx.x * BX;

	p += threadIdx.y * SX + threadIdx.x;

	for( int j = 0; j < SY; j += TY )

		for( int i = 0; i < BX; i += TX )

			p[i+j*SX] = 1;


int main( int argc, char **argv )


	int size = sizeof(int)*SX*SY;

	int *cpu = (int*) malloc( size );

	if( !cpu ) 

		return 1;

	dim3 threads( TX, TY );

	int *gpu;

	Q( cudaMalloc( (void**) &gpu, size ) );

	Q( cudaMemset( gpu, 0, size ) );

	cudaStencilDev<<<(SX/BX), threads>>>( gpu );

	Q( cudaMemcpy( cpu, gpu, size, cudaMemcpyDeviceToHost ) );

	Q( cudaFree( gpu ));

	int nerrors = 0;

	for( int i = 0; i < SX*SY; i++ )

		if( cpu[i] != 1 )


	printf( "errors: %d\n", nerrors );

	free( cpu );

	return 0;


Fails on linux x86_64 / 9800GX2 / CUDA 2.1 / 180.22
Passes on mac os x i386 / 8600M GT / CUDA 2.0
Fails on linux x86_64 / S1070 / CUDA 2.1beta / 180.06

Runs without reporting an error on Windows XP 64 bit, Tesla C1060, CUDA 2.0, 180.60

fails on 64 bit linux with cuda 2.0. emulation works fine.

Hi Vasily,
I opened a bug report.
It works on CUDA 2.1 Linux32, it fails on 64bit.

Thanks, Massimiliano.

Thank you to everyone who participated!

btw, Is cudaThreadSynchronize() NOT required before cudaMemcpying??

Have some intelligence been added to cudaMemcpy? If so, check the CUDA version in which it is actually available.

Since you r checking with multiple CUDA versions, may b one of them did not have and hence failed…

Jus my 2 cents.

No. And as I have told you before in many other threads. IT NEVER HAS BEEN

wow, inner loop has a strange behaviour…

if I change it in:

for(int j = 0; j < SY; j += TY )
p[0+jSX] = 1;
SX] = 1;

or, simply,
for(int j = 0; j < SY; j += TY )
#pragma unroll 1
for(int i = 0; i < BX; i += TX )
p[j*SX + i] = 1;

it works… :wacko: (unrolling bug?)

It used to be like that as far as my knowledge goes.

I have seen people running into errors because of this. (u just have to wait and then copy otherwise you copy pre-mature results)

Since kernel launches are async anyway, cudaMemcpy will be executed immediately after QUEing the kernel launch.

I think this behaviour changed in the middle (in some CUDA release). Is that not so?

Well, I can understand your level of irritation with that boldened “NEVER HAS BEEN” :-)

Sometimes, it just happens… Certain notions are difficult to change once they r registered

in a particular way in my brain…

Best Regards,


I, and the CUDA 0.8 beta documentation I still have on my HD say otherwise. It might have been this way in one of the not-for-public betas, but I can’t say since I wasn’t in on those.

The only errors that come up from aysnc launches without cudaThreadSynchronize() are erroneous benchmarking/timing results.

At which point the cudaMemcpy will perform an implicit cudaThreadSynchronize() to wait for previous operations to complete before copying data back to the host (or new data to the device). (note that device->device memcpys are queued asynchronously). This has always been the behavior since the first public beta of CUDA 0.8.

I’m just trying to make it clear to any new developers reading these posts that disinformation has been posted so they don’t get the wrong ideas in their head.

I also had some funky results with earlier versions of CUDA that were solved with a cudaThreadSynchronize(), though now it is not needed

Does that mean that an explicit cudaThreadSynchronize() becomes necessary for device->device, if memcpy is called after a kernel and returned data are involved?

no, because kernel launches are an implicit global synchronization barrier.

please correct me if im wrong, but the only usage for cudaThreadSynchronize() is when you want to time a kernel. other wise cuda will call it when needed implicitly.

nothing’s really occurring to me that won’t cause a global sync when it’s necessary or wouldn’t be handled better with events.

the only real use of cudaThreadSynchronize() is application-level wall-clock timing benchmarks performed in parts of the application that aren’t aware of the CUDA stream so they don’t know where to insert events.