Could you please check if the following code works on your 64-bit CUDA 2.1 system? For some reason, it fails on my 64-bit Windows XP and Linux machines. I tried a few different drivers, including the latest 181.20, and a few different GPUs, including a GTX 280, a Tesla C1060 and an 8600 GTS. However, it runs successfully on 32-bit Windows XP, with CUDA 2.0, when compiled with the -deviceemu option, or if I change the parameters slightly, e.g. define SY=8. I compile it with a simple “nvcc main.cu”.
Vasily
#include <stdio.h>
#include "cuda_runtime.h"
#define Q( condition ) \
{ \
    if( (condition) != cudaSuccess ) \
    { \
        printf( "\n FAILURE in %s, line %d\n", __FILE__, __LINE__ ); \
        exit( 1 ); \
    } \
}
#define SX 2048
#define SY 16
#define BX 512
#define TX 256
#define TY 2
__global__ void cudaStencilDev( int *p )
{
    p += blockIdx.x * BX;
    p += threadIdx.y * SX + threadIdx.x;
    for( int j = 0; j < SY; j += TY )
        for( int i = 0; i < BX; i += TX )
            p[i+j*SX] = 1;
}
int main( int argc, char **argv )
{
    int size = sizeof(int)*SX*SY;
    int *cpu = (int*) malloc( size );
    if( !cpu )
        return 1;
    dim3 threads( TX, TY );
    int *gpu;
    Q( cudaMalloc( (void**) &gpu, size ) );
    Q( cudaMemset( gpu, 0, size ) );
    cudaStencilDev<<<(SX/BX), threads>>>( gpu );
    Q( cudaMemcpy( cpu, gpu, size, cudaMemcpyDeviceToHost ) );
    Q( cudaFree( gpu ) );
    int nerrors = 0;
    for( int i = 0; i < SX*SY; i++ )
        if( cpu[i] != 1 )
            nerrors++;
    printf( "errors: %d\n", nerrors );
    free( cpu );
    return 0;
}
Fails on linux x86_64 / 9800GX2 / CUDA 2.1 / 180.22
Passes on mac os x i386 / 8600M GT / CUDA 2.0
Fails on linux x86_64 / S1070 / CUDA 2.1beta / 180.06
I, and the CUDA 0.8 beta documentation I still have on my HD, say otherwise. It might have been this way in one of the not-for-public betas, but I can’t say, since I wasn’t in on those.
The only errors that come up from async launches without cudaThreadSynchronize() are erroneous benchmarking/timing results.
At which point the cudaMemcpy will perform an implicit cudaThreadSynchronize() to wait for previous operations to complete before copying data back to the host (or new data to the device). (Note that device->device memcpys are queued asynchronously.) This has always been the behavior since the first public beta of CUDA 0.8.
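The implicit synchronization described above can be sketched in a minimal example (the fill kernel and sizes are placeholders, not from the original post; the cudaGetLastError() check is an extra I’ve added to show how launch errors from an async kernel surface without an explicit sync):

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

__global__ void fill( int *p, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
        p[i] = 1;
}

int main()
{
    const int n = 1024;
    int host[n], *dev;
    cudaMalloc( (void**) &dev, n * sizeof(int) );

    fill<<<n/256, 256>>>( dev, n );   // returns immediately (async launch)

    // cudaGetLastError() catches launch-configuration errors right away,
    // even though the kernel itself may still be running.
    printf( "launch: %s\n", cudaGetErrorString( cudaGetLastError() ) );

    // No explicit cudaThreadSynchronize() needed here: the device->host
    // memcpy implicitly waits for the kernel to finish before copying.
    cudaMemcpy( host, dev, n * sizeof(int), cudaMemcpyDeviceToHost );
    printf( "host[0] = %d\n", host[0] );

    cudaFree( dev );
    return 0;
}
```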
I’m just trying to make it clear to any new developers reading these posts that misinformation has been posted, so they don’t get the wrong ideas in their heads.
Does that mean that an explicit cudaThreadSynchronize() becomes necessary for device->device memcpys, if the memcpy is called after a kernel and its results are involved?
Please correct me if I’m wrong, but the only usage for cudaThreadSynchronize() is when you want to time a kernel; otherwise CUDA will call it implicitly when needed.
The only real use of cudaThreadSynchronize() is application-level wall-clock timing benchmarks performed in parts of the application that aren’t aware of the CUDA stream, so they don’t know where to insert events.
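The timing use case mentioned above can be sketched as follows (a minimal example, not from this thread; the work kernel is a placeholder, and cudaThreadSynchronize() is the era-appropriate API for CUDA 2.x):

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

__global__ void work( int *p ) { p[threadIdx.x] = threadIdx.x; }

int main()
{
    int *dev;
    cudaMalloc( (void**) &dev, 256 * sizeof(int) );

    // Event-based timing: the events are queued in the stream around the
    // kernel, so they measure the kernel itself.
    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord( start, 0 );
    work<<<1, 256>>>( dev );
    cudaEventRecord( stop, 0 );

    // Host-side wall-clock timing, by contrast, needs an explicit barrier,
    // since the launch returns before the kernel finishes; this is where
    // cudaThreadSynchronize() comes in.
    cudaThreadSynchronize();

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "kernel: %f ms\n", ms );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    cudaFree( dev );
    return 0;
}
```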