Code running perfectly on host, put in a kernel, fails for mysterious reasons

I have to port a pre-existing “host-only” backpropagation implementation to CUDA. I think the nature of the algorithm doesn’t matter here, so I won’t give much explanation about the way it works. What I think matter though, is that it uses 3-dimensional arrays, whose all three dimensions are dynamically allocated.
I use VS2010, with CUDA 5.0. And my device is a 2.1. The original host-only code can be downloaded here

Main points of the code:

  1. patterns from are loaded into memory, using the Data structure, present in “pattern.h”.
  2. several multi-dimensional arrays are allocated
  3. the algorithm is ran over the patterns, using the arrays allocated just before.

If you want to try to run the code don’t forget to modify the PATH constant at the beginning of I also advise you to use “2” layers, “5” neurons, and a learning rate of “0.00001”. As you can see, this work perfectly. The “MSE” is improving. For those who have no clue about what does this algorithms, let’s simply say that it learns how to predict a target value, based on 14 variables present in the patterns. The “MSE” decrease, meaning that the algorithm makes less mistakes after each “epoch”.

I spent a really long time trying to run this code on the device. And I’m still unsuccessful. Last attempt was done by simply copying the code initializing the arrays and running the algorithm into a big kernel. Which failed again. This code can be downloaded there

To be precise, here are the differences with the original host-only code:

  • f() and fder(), which are used by the algorithm, become device functions.
  • parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
  • the “w” array is initialized using a fixed value (0.5), not rand() anymore
  • a Data structure is allocated in device’s memory, and the data are sent in device’s memory after they have been loaded from in host’s memory

I think I did the minimal amount of modifications needed to make the code run in a kernel. The “kernel_check_learningData” kernel, show some informations about the patterns loaded in device’s memory, proving the following code, sending the patterns from the host to the device, did work:

Data data;
Data* dev_data;
int* dev_t;
double* dev_x;

input_adult(PathFile, &data);

cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x’s data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);

It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.

I see two possibilities:

  1. the code sending the patterns into device’s memory is bugged, despite the fact it seems to work properly, and provoke a bug way further, when beginning the forward phase.
  2. the CUDA API is not behaving like it should!

I’m desperately searching for my mistake for a very long time. So I wondered if the community could provide me with some help.


Disclaimer: I did not look at the code very much, just giving some basic advice.

I think you need to go step by step on the algorithm translation. First you should verify that the arrays you have copied to device memory are indeed allocated correctly and that you’re able to read and write to several cells correctly without any sort of access violations. – Rereading your post, it seems you have verified that the data is loaded correctly, but you might have not attempted to modify it and verify that you’re not somehow changing an erroneous cell, especially if talking about flattened 2D/3D arrays (more on that later).

Once you have taken care of that, you may use Parallel NSight to debug a single instance of the kernel and make sure that the computations are being done correctly – that will eliminate any chance of any variables being of the incorrect type/translated incorrectly and generating garbage numerical values as results.

Also, make sure you wrap around an error check all your CudaMalloc/Memcopy/etc calls – that way you’ll know what calls are breaking. See for example:

Finally, not sure if you’re using flattened arrays on the GPU side when you talk about your multi-dim arrays that data is organized. The benefit of flattening arrays is speed, especially if you’re using global memory directly. See for example talonmies’ answer here:

The same concept can be extended to 3D arrays –

I know it’s a lot to take in at once, but hopefully it all helps!

The links you gave me about arrays will be very helpful, thank you very much! Actually, with these links you address part of my concerns expressed in another topic I created two weeks ago, about this very same code! It was there ->
Beside, yes I lack informations about debugging tools. For the moment I’m quite attached to my “good” old printf debugging. I’m going to read that too…

Now, about the bug itself. The solution has been found by Robert Crovella on stackoveflow (maybe does he post here too? He sounds quite experienced to me). If you are curious about it, just follow this link :
I assumed the original code was free of bug, because it was from an experienced guy and worked perfectly on the host. But I should have read it with a more critical eye anyway!

Thank you very much for your help.

Yup, the answer and advice you were given over at Stack Overflow makes sense, thanks for that.

And yes, when I first ported an algorithm to CUDA I came across the same question of how to process data in CUDA kernels that was originally saved 2D arrays.

I ended up doing flattening because it was a lot more efficient AND easier to keep track of than doing a bunch of CUDA or Host side mallocs for the 2D initializations… it’s messy and awful. Glad to help :)