Passing an array of structure to kernel

I am creating a neural network that perform computation on GPU. I have done writing the codes in CPU. I have been having trouble passing an array of structure pass to the kernel. Given the following Structure, I want to be able to copy the structure to device, pass the structure to Kernel and perform computation and then copy back to host.

struct neuron {
	double threshold = 0; 
	double weight[2] = {0, 0}; 
	double output = 0; 
	double error = 0; 
	neuralNetworkLayerType layerType; 

I am having trouble calculating the size of 5 neurons, such as neurons[5]. I also want to know whether I should use cudaMemcpy() or cudaMemcpy3D(). Thank you in advance.

Transfering trivially copyable structs to the device is no different than transfering let’s say an array of ints.

The memory required for x neurons is sizeof(neuron) * x. You can use cudaMemcpy.

Thanks striker159. I tried multiple ways. Following is the complete code that I am working with now. I tried to pass a two dimensional array to the Kernel, testNeurons(). Nothing really updates after the Kernel is launched.

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>
#include <cassert>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <algorithm>
#include <vector>

#define LEARNING_RATE 0.25 
#define NUMB_OF_EPOCHS 1000000
#define TD_X 4 // training data in x- dimension
#define TD_Y 2 // training data in y- dimension
#define TD_Z 2 // training data in z- dimension   

double TRAINING_DATA[TD_X][TD_Y][TD_Z] = {{{0,0},{0}},

double applyActivationFunction(double weightedSum) {
	// activation function is a sigmoid function
	return (1.0 / (1 + exp(-1.0 * weightedSum)));  

void _setNeurons_(float *neurons[5]){
	srand((long)time(NULL)); /* initialize rand() */

	for (int i = 0; i < 2; i ++){
		neurons[i][0] = 0.5 - (rand()/(double)RAND_MAX); // threshold
		neurons[i][1] = 0.5 - (rand()/(double)RAND_MAX); // weight 1
		neurons[i][2] = 0.5 - (rand()/(double)RAND_MAX); // weight 2
		neurons[i][3] = 0.0; //output
		neurons[i][4] = 0.0; //error

	for (int i = 2; i < 4; i ++){
		neurons[i][0] = 0.5 - (rand()/(double)RAND_MAX); // threshold
		neurons[i][1] = 0.5 - (rand()/(double)RAND_MAX); // weight 1
		neurons[i][2] = 0.5 - (rand()/(double)RAND_MAX); // weight 2
		neurons[i][3] = 0.0; //output
		neurons[i][4] = 0.0; //error

	neurons[4][0] = 0.5 - (rand()/(double)RAND_MAX); // threshold
	neurons[4][1] = 0.5 - (rand()/(double)RAND_MAX); // weight 1
	neurons[4][2] = 0.5 - (rand()/(double)RAND_MAX); // weight 2
	neurons[4][3] = 0.0; //output
	neurons[4][4] = 0.0; //error

void _printTrainingData_(float *neurons[5]){
	printf("[(I: %.2f), (I: %.2f), ", neurons[0][3], neurons[1][3]); 
	printf("(H: %.2f, %.2f, %.2f, %.5f), ", neurons[2][1], neurons[2][2], neurons[2][0], neurons[2][3]);
	printf("(H: %.2f, %.2f, %.2f, %.5f), ", neurons[3][1], neurons[3][2], neurons[3][0], neurons[3][3]);
	printf("(O: %.2f, %.2f, %.2f, %.5f)]\n ", neurons[4][1], neurons[4][2], neurons[4][0], neurons[4][3]);

void _forwardProp_(double input[], float *neurons[5], const int Nsize) {
	double weightedSum = 0; 
	for( int i = 0; i < Nsize; i++){
		switch (i) {
			case 0: case 1: // input layer
				neurons[i][3] = input[i];  
			case 2: case 3: // hidden layer
				weightedSum = neurons[i][0] + 
								  neurons[i][1] * neurons[0][3] + 
		    		              neurons[i][2] * neurons[1][3];
				neurons[i][3] = applyActivationFunction(weightedSum); 
			case 4: // output layer
				weightedSum = neurons[i][0] + 
	    		                  neurons[i][1] * neurons[2][3] + 
	    		                  neurons[i][2] * neurons[3][3];
		    	neurons[i][3] = applyActivationFunction(weightedSum); 

void _printResult_(double result[]) {
	printf("    Input 1    |    Input 2    | Target Result |  Result    \n");
	for(int i = 0; i < 4; i++ ) {
		for(int j = 0; j < 2; j++) {
			printf("    %.5f    |", TRAINING_DATA[i][0][j]); 
		printf("    %.5f    |   %.5f   \n", TRAINING_DATA[i][1][0], result[i]);

void _printNetworkInfo_(){

	// the number of inputs, hidden layers and output layers are set. 
	// the number of iterations and learning rate can be vary. 
	printf("Number of inputs: %d\n", 2); 
	printf("Number of hidden layers: %d\n", 2); 
	printf("Number of output: %d\n", 1);
	printf("Number of iterations: %d\n", NUMB_OF_EPOCHS);
	printf("Learning Rate: %.2f\n", LEARNING_RATE);


__global__ void testNeurons(float *Neurons[5]){
	printf("start computing in GPU\n");
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	Neurons[i][0] = 1.35; // update some numbers. 
	Neurons[i][1] = 3.46; // update some numbers.  
	Neurons[i][2] = 5.53; // update some numbers. 
	Neurons[i][3] = 2.34    ; // update some numbers. 

	printf("Neurons[%d][3] = %.2f\n", i, Neurons[i][3]);
	printf("done updating in GPU\n");

int main(void){

	// set up device

 	double result[] = {0, 0, 0, 0}; 
 	int N = 5; // number of neurons
 	int V = 5; // number of variables 
 	size_t nBytes = V * sizeof(float); 

 	// declare and initialize neurons as pointers
 	// malloc device global memory
 	float *neurons[V];
 	float *dev_neurons[V]; 

 	for(int i = 0; i < N; i++){
 		neurons[i] = (float *) malloc(nBytes);  
 		cudaMalloc((void**) &dev_neurons[i], nBytes); 

	_setNeurons_(neurons); // initialize neurons 

	// transfer data from host to device

	for(int i = 0; i < TD_X; i++) {   // TD_X - Traning Data Dimension X 
		_forwardProp_(TRAINING_DATA[i][0], neurons, N);
		result[i] = neurons[4][3]; // get output

	cudaMemcpy(dev_neurons, neurons, N * nBytes, cudaMemcpyHostToDevice);  

	// train network from CPU. 
	float GPUtime; 
	cudaEvent_t start, stop; 

	cudaEventRecord(start, 0); 

	testNeurons <<< 1, 5>>> (dev_neurons); 
	cudaMemcpy(neurons, dev_neurons, N * nBytes, cudaMemcpyDeviceToHost); 

	cudaEventRecord(stop, 0); 
	cudaEventElapsedTime(&GPUtime, start, stop); 

	printf("Compute time on GPU: %3.6f ms \n", GPUtime); 


Hi @WolfAtTheGate

As @striker159 said, you can use cudaMemcpy in the same way you are transferring an array of ints. I will provide you a small baseline to start with

#include <stdlib.h>

typedef struct {
  int dummy_int_;
  double dummy_double;
} Neuron;

int main() {
  const kNetSize = 5;

  // Let's suppose you initialise everything first on the CPU
  Neuron * myNetCPU = malloc(sizeof(Neuron) * kNetSize);
  Neuron * myNetGPU;
  cudaMalloc((void**)&myNetGPU, sizeof(Neuron) * kNetSize);

  // Here you initialise and do some stuff needed before transferring to GPU
  cudaMemcpy((void*)myNetGPU, (void*)myNetCPU, kNetSize, cudaMemcpyHostToDevice);
  // Deploy your kernel passing the myNetGPU pointer

  // Copy back the network
  cudaMemcpy((void*)myNetCPU, (void*)myNetGPU, kNetSize, cudaMemcpyDeviceToHost);

  return 0;

If you are familiar with C++ and STL, there are some interesting stuff to do with Thrust, which is the STL for CUDA.

Regarding to your question about CudaMemcpy, you can get more info in: The idea basically is that for 1D arrays, you can use the normal cudaMemcpy, but for more advance structures, where you require some special layout, you can use the cudaMemcpy2D or cudaMemcpy3D. For example, you may want to use them for image processing, where you want some special layout like col major order or row major order. This helps in terms of indexation and allocation, which makes the algorithm implementation a bit more intuitive in that field.


Hi @luis.leon, Thank you for the reply. I tried both your code and @striker159 combined. Please see the code above. Please also note that I also fix the pointer issues.

I correct the cudaMemcpy statements to

cudaMemcpy((void*)dev_neurons, (void*)neurons, N * nBytes, cudaMemcpyHostToDevice);


cudaMemcpy((void*)neurons, (void*)dev_neurons, N * nBytes, cudaMemcpyDeviceToHost);

but my kernel function, __global__ void testNeurons(float *Neurons[5]) is still not updating the neurons.


I think your issue is over here:

float *neurons[V];
float *dev_neurons[V]; 

for(int i = 0; i < N; i++){
  neurons[i] = (float *) malloc(
  cudaMalloc((void**) &dev_neurons[i], nBytes); 

Here, you have a two dimensional array and only one dimension is being allocated. In principle, when you pass the pointer to the kernel, it must be a full GPU pointer. In this case, when declaring:

float * dev_neurons[5];

A CPU array (statically allocated) is also taking place, but on the CPU.

Your struct was really fine actually and more professional. Just trying:

Neuron *neurons;
Neuron *dev_neurons; 

neurons = (Neuron*) malloc(N * sizeof(Neuron));
cudaMalloc((void**) &dev_neurons, N * sizeof(Neuron));

After readapting your functions to the struct, you can recycle your cudaMemcpy’s that you already have in your code and it should work.

I just want to recall that it is better representing your arrays as 1D arrays than using 2D arrays in the way you are coding. For example:

float * neurons;

neurons = (float*) malloc (N * V * sizeof(float));

// Let's set your layout is something like:
for(int neuron_idx = 0; neuron_idx < N * V; neuron_idx += V) {
  neurons[neuron_idx] = 0; // Var 1
  neurons[neuron_idx + 1] = 1; // Var 2
  neurons[neuron_idx + 2] = 2; // Var 3
  neurons[neuron_idx + 3] = 3; // Var 4

Hope this help you.


@luis.leon I tried every single way that I can think of. I checked memory allocation and such. I will try your suggestion as my last resort.

Neuron *neurons;
Neuron *dev_neurons; 

neurons = (Neuron*) malloc(N * sizeof(Neuron));
cudaMalloc((void**) &dev_neurons, N * sizeof(Neuron));

If it doesn’t work, I will change to Neuron structure into a one dimensional array, which I am good at. I kind of wanted to try something new. I noticed that even the advanced CUDA codes from Stanford University that I have seen, which are published on Nvidia website, use one dimensional arrays for computations.

So sad, I kind of want to break the 1D and expand my skills into handling 2D or 3D but I cannot find any strong examples out there. Thank you anyway.