static int * d_data = NULL;
...
void cuda_test() {
    // allocate storage for 3 integers
    int size = 3 * sizeof(int);
    if (!d_data) {
        CUT_SAFE_MALLOC(cudaMalloc((void **) &d_data, size));
        CUT_CHECK_ERROR("alloc error");
        cudaThreadSynchronize();
    }
    // fill host array and transfer it to the device
    int * dataBefore = new int[3];
    for (int i = 0; i < 3; ++i) {
        dataBefore[i] = i;
        printf("dataBefore[%d] = %d\n", i, dataBefore[i]);
    }
    cudaMemcpy(d_data, dataBefore, size, cudaMemcpyHostToDevice);
    CUT_CHECK_ERROR("copy error");
    cudaThreadSynchronize();
    delete [] dataBefore;
    dataBefore = NULL;
    // prepare readback array and initialize with error values (-1234)
    int * dataAfter = new int[3];
    for (int i = 0; i < 3; ++i) {
        dataAfter[i] = -1234;
    }
    // transfer data back (data should not be changed by the device)
    cudaMemcpy(dataAfter, d_data, size, cudaMemcpyDeviceToHost);
    CUT_CHECK_ERROR("readback error");
    cudaThreadSynchronize();
    for (int i = 0; i < 3; ++i) {
        printf("dataAfter[%d] = %d\n", i, dataAfter[i]);
    }
    delete [] dataAfter;
    dataAfter = NULL;
    if (d_data) {
        cudaFree(d_data);
        CUT_CHECK_ERROR("dealloc error");
        cudaThreadSynchronize();
    }
}
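As an aside, every CUDA runtime call returns a cudaError_t, so the calls above can also be checked directly instead of relying on the cutil macros. A minimal sketch (the checkCall helper is hypothetical, not part of CUDA or cutil):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA runtime
// call did not return cudaSuccess.
static void checkCall(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Usage, mirroring the calls in cuda_test():
//   checkCall(cudaMalloc((void **) &d_data, size), "alloc error");
//   checkCall(cudaMemcpy(d_data, dataBefore, size, cudaMemcpyHostToDevice), "copy error");
//   checkCall(cudaMemcpy(dataAfter, d_data, size, cudaMemcpyDeviceToHost), "readback error");
```

Unlike CUT_CHECK_ERROR, this reports the failure regardless of build mode.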
If I turn device emulation off I get the expected output:
You have CUT_CHECK_ERRORs in there, which is good. But have you compiled in debug mode so that the error checking is enabled? At a glance, I don’t see any problems with your code; I can only guess that there is a CUDA initialization error or something.
Well, you’re not actually calling a kernel, so perhaps you’ve hit an obscure compiler bug that optimizes away the second cudaMemcpy. What version of CUDA are you using, on what platform, and with which card?
Try calling a kernel that does as little as possible: e.g. read the first byte from global memory, then write it back to the same location. (If you do any less, it might be optimized away itself.)
See if that works as expected under device emulation.
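The suggested near-empty kernel could look like this (the kernel name and launch configuration are my assumptions, not from the thread): it reads the first element of global memory and writes it back to the same location, so the memory access is unlikely to be eliminated.

```cuda
// Minimal "do almost nothing" kernel: read the first element and write it
// back unchanged. A single thread is enough for this test.
__global__ void touchFirst(int *data) {
    data[0] = data[0];
}

// Launched between the two cudaMemcpy calls in cuda_test():
//   touchFirst<<<1, 1>>>(d_data);
//   cudaThreadSynchronize();  // wait for the kernel before reading back
```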
The dataAfter array still remains unchanged in device emulation mode, while without emulation all values are incremented (as expected). I’m using CUDA 1.1 on Debian Linux 3.1 (32-bit) with a GeForce 8800 GTX.
The problem occurs in debug as well as release mode…
Indeed, with MisterAnderson42’s helpful error check snippet I get an “invalid argument” error, but I can’t figure out what it means here. Moreover, even a simple cudaThreadSynchronize() produces an “invalid argument” error (where is that invalid argument??). :blink:
Additionally, I wonder why CUT_CHECK_ERROR(“…”) keeps quiet (even in debug mode).
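For what it’s worth, the cutil macros are only fully active in debug builds. Roughly (paraphrased from memory, not the exact cutil.h source), CUT_CHECK_ERROR has this shape:

```cuda
// Rough shape of cutil's CUT_CHECK_ERROR, paraphrased from memory: the
// check is guarded by _DEBUG, which is why it can stay silent in a
// release build even when an API call has failed.
#ifdef _DEBUG
#  define CUT_CHECK_ERROR(msg)                                          \
      do {                                                              \
          cudaError_t err = cudaGetLastError();                         \
          if (err != cudaSuccess) {                                     \
              fprintf(stderr, "%s: %s in %s:%d\n", (msg),               \
                      cudaGetErrorString(err), __FILE__, __LINE__);     \
              exit(EXIT_FAILURE);                                       \
          }                                                             \
      } while (0)
#else
#  define CUT_CHECK_ERROR(msg)  // compiled out in release builds
#endif
```

So if _DEBUG isn’t defined in your build, the macro expands to nothing, which would explain its silence.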
I then installed the error check snippet everywhere in my code and found that all CUDA API calls in the constructor of my GPGPU class succeed, while every subsequent API call fails. Furthermore, if I disable all API calls in the constructor, all subsequent API calls work fine…
At least I’m now able to work around it (by postponing the initializing CUDA API calls).
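The workaround can be sketched like this (class and member names are hypothetical): keep the constructor free of CUDA API calls and run the initialization lazily before first use.

```cuda
#include <cuda_runtime.h>

// Sketch of the described workaround (all names are hypothetical): no CUDA
// API calls in the constructor; they run lazily on first use instead.
class GPGPU {
public:
    GPGPU() : m_initialized(false), m_dData(NULL) {}  // no CUDA calls here

    void process() {
        ensureInitialized();
        // ... actual CUDA work (memcpy, kernel launches, ...) ...
    }

private:
    void ensureInitialized() {
        if (!m_initialized) {
            // The first CUDA API call happens here, well after construction.
            cudaMalloc((void **) &m_dData, 3 * sizeof(int));
            m_initialized = true;
        }
    }

    bool m_initialized;
    int *m_dData;
};
```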