strange behavior with device emulation

Hi there CUDA people!

I’ve got a problem with the following test code:

static int * d_data = NULL;


void cuda_test() {

	// allocate storage for 3 integers

	int size = 3*sizeof(int);

	if (!d_data) {

  CUT_SAFE_MALLOC(cudaMalloc((void**) &d_data, size));

  CUT_CHECK_ERROR("alloc error");




	// fill host array and transfer it to the device

	int * dataBefore = new int[3];

	for (int i = 0; i < 3; ++i) {

  dataBefore[i] = i;

  printf("dataBefore[%d] = %d\n", i, dataBefore[i]);


	cudaMemcpy(d_data, dataBefore, size, cudaMemcpyHostToDevice);

	CUT_CHECK_ERROR("copy error");


	delete [] dataBefore;

	dataBefore = NULL;


	// prepare readback array and initialize with error values (-1234)

	int * dataAfter = new int[3];

	for (int i = 0; i < 3; ++i) {

  dataAfter[i] = -1234;



	// transfer data back (data should not be changed by the device)

	cudaMemcpy(dataAfter, d_data, size, cudaMemcpyDeviceToHost);

	CUT_CHECK_ERROR("readback error");


	for (int i = 0; i < 3; ++i) {

  printf("dataAfter[%d] = %d\n", i, dataAfter[i]);


	delete [] dataAfter;

	dataAfter = NULL;


	if (d_data) {


  CUT_CHECK_ERROR("dealloc error");




If I turn device emulation off I get the expected output:

dataBefore[0] = 0

dataBefore[1] = 1

dataBefore[2] = 2

dataAfter[0] = 0

dataAfter[1] = 1

dataAfter[2] = 2

However, when device emulation is on, the second cudaMemcpy command seems to have no effect:

dataBefore[0] = 0

dataBefore[1] = 1

dataBefore[2] = 2

dataAfter[0] = -1234

dataAfter[1] = -1234

dataAfter[2] = -1234

Can someone confirm this behavior or tell me that I’m doing wrong?

Thanks in advance!

You have CUT_CHECK_ERRORs in there, which is good. But have you compiled in debug mode so that the error checking is enabled. At a glance, I don’t see any problems with your code: I can only guess that there is a CUDA initialization error or something.

Well, you’re not actually calling a kernel, so perhaps you’ve hit an obscure compiler bug that optimizes away the second cudaMemcpy. What version of CUDA are you using, on what platform, and with which card?

Try calling a kernel that does as little as possible: e.g. read the first byte from global memory, then write it back to the same location. (If you do any less, it might be optimized away itself.)

See if that works as expected under device emulation.

Hmm, that could really be a problem since compiler optimizations are turned on. I’ve now inserted a simple kernel…

__global__ void incrementKernel(int * data) {

     data[threadIdx.x] += 1;


…and it’s corresponding launch command:

dim3 dimGrid(1,1);

dim3 dimBlock(3,1);



The dataAfter array still remains unchanged in device emulation mode while all values are incremented without emulation (as expected). I’m using CUDA 1.1 on debian linux 3.1 (32-bit) with a GeForce 8800 GTX.

The compiler will never optimize away a cudaMemcpy.

You never said if you were building in debug mode or not. You could just call cudaMemcpy like so to check for errors:

cudaError_t error;

error = cudaMemcpy(...);

if (error != cudaSuccess)

    printf("Error: %s\n", cudaGetErrorString(error));

The problem occurs in debug as well as release mode…

Indeed, with MisterAnderson42’s helpful error check snippet I get an “invalid argument” error, but I can’t figure out what it means here. Moreover, also a simple cudaThreadSynchronize() produces an “invalid argument” error (where is that invalid argument??). :blink:
Additionally, I wonder why CUT_CHECK_ERROR("…") keep quiet (also in debug mode).

I then installed the error check snippet everywhere in my code and found out that all CUDA API calls in the constructor of my GPGPU class are successful and every future API call fails. Furthermore, if I disable all API calls in the constructor, all future API calls work well…

At least I’m now able to do a work around (postponement of the initializing CUDA API calls).

Thanks very much for your help!!! :thumbup: