How to deal with dynamically allocated 3-dimensional arrays in device memory?


I have to port a pre-existing “host-only” backpropagation implementation to CUDA. Let me briefly introduce the purpose of the array I’m talking about…
We are talking about neural networks. The current “host-only” implementation uses several arrays, but let’s just talk about the one that holds the weights of the connections between neurons.
It looks like the following:

w[layer][i][j]

This array is supposed to give the weight of the connection between neuron “i” of layer “layer” and neuron “j” of layer “layer – 1”. When the program starts, the user chooses a number of layers and, for each layer, a number of neurons. The “w” array is then built dynamically using multiple malloc calls.
The algorithms then update weights in this array during the training phase of the network. Obviously, this array has to be in the device memory when running the CUDA version.
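
To make the layout concrete, here is a minimal sketch of what such a multi-malloc build could look like on the host. It is an illustration under assumptions, not the actual code: the names build_weights, free_weights and neurons are hypothetical, neurons[0] is assumed to be the size of the input layer, and error checking of the inner allocations is left out for brevity.

```c
#include <stdlib.h>

/* Hypothetical helper: builds w so that w[layer][i][j] is the weight of the
 * connection between neuron i of layer `layer` and neuron j of layer
 * `layer - 1`. L is the number of layers without the input layer;
 * neurons[0] is the input layer size, neurons[1..L] the other layers.
 * Inner allocation failures are not checked in this sketch. */
double ***build_weights(int L, const int *neurons)
{
    double ***w = malloc((L + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    w[0] = NULL; /* the input layer has no incoming connections */
    for (int layer = 1; layer <= L; ++layer) {
        /* one row of weights per neuron of this layer */
        w[layer] = malloc(neurons[layer] * sizeof *w[layer]);
        for (int i = 0; i < neurons[layer]; ++i) {
            /* one zero-initialized weight per neuron of the previous layer */
            w[layer][i] = calloc(neurons[layer - 1], sizeof *w[layer][i]);
        }
    }
    return w;
}

/* Hypothetical counterpart: releases everything build_weights allocated. */
void free_weights(double ***w, int L, const int *neurons)
{
    if (w == NULL)
        return;
    for (int layer = 1; layer <= L; ++layer) {
        for (int i = 0; i < neurons[layer]; ++i)
            free(w[layer][i]);
        free(w[layer]);
    }
    free(w);
}
```

With neurons = {3, 4, 2} and L = 2, this performs 1 + 2 + (4 + 2) = 9 separate allocations, which is why mirroring the same pointer structure in device memory from the host means tracking many pointers.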

–About CUDA–

This array has to persist between two calls of the kernel function, and be accessed by the host at the end of a training session.
First, I thought about the classic cudaMalloc function called from host code. But I don’t see how I could easily allocate such an array in device memory from the host. I would have to keep track of several pointers and do numerous cudaMemcpy calls to update the pointers stored in device memory. It does not look like an efficient solution to me.
Then I wondered if I could call a __global__ function that would build the array directly from device code: a __global__ initialization function launched with <<<1,1>>>. Something like this:

typedef struct
{
	int L; // number of layers, not counting the input layer
	double ***w; // w[layer][i][j]: weight between neuron i of layer "layer" and neuron j of layer "layer - 1"
} WorkingData;


WorkingData* dev_workingData; // pointer to device memory
cudaMalloc((void**)&dev_workingData, sizeof(WorkingData));
CUDA_initWorkingData(dev_workingData, L);


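__global__ void initWorkingData(WorkingData* p_workingData, int p_L)
{
	// device-side malloc: one pointer slot per layer, input layer included
	p_workingData->w = (double ***)malloc( (p_L+1) * sizeof(double **) );
}

extern "C" void CUDA_initWorkingData(WorkingData* p_workingData, int p_L)
{
	// single-thread launch, used only to initialize the structure
	initWorkingData<<<1,1>>>(p_workingData, p_L);
}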

Now, the thing is that this code gives me a “calling a __host__ function("malloc") from a __global__ function” error with Visual Studio 2010, despite the “compute_20,sm_20” option in the “Code generation” parameter of my project and CU file (and I have a compute capability 2.1 device).

I also read here and there that calling malloc inside a kernel should be avoided.

Here are my questions:

  • Why do I get such an error with Visual Studio?
  • Why should malloc be avoided inside a kernel?
  • Is it OK to write such a kernel, intended only to be called with <<<1,1>>>?
  • Any hints about a different way of doing what I want to do?

Thank you
(Sorry, but the [CODE] tag interpretation seems to be buggy and messes up all the code if I use more than one instance of it)

Ok, I got the answer to my first question here:

I had to remove sm_10.

As you have probably understood, I’m a beginner. I search for answers on my own first, but I would gladly welcome some help.