Struct of arrays as a parameter between host and device

I have a problem using a struct of arrays with CUDA.

Here is an example:

#include <stdlib.h>

typedef struct A {
	int b[10];
} A;

__global__ void changeD(A* a){
	int thx = threadIdx.x;
	(a->b)[thx] = 5;
}

int main(){
	A* a = (A*)malloc(sizeof(A));

	A* aD;
	cudaMalloc((void**)&aD, sizeof(A));
	changeD<<<1, 10>>>(aD);
	cudaMemcpy(a, aD, sizeof(A), cudaMemcpyDeviceToHost);

	cudaFree(aD);
	free(a);
	return 0;
}

The problem is that the elements of a->b on the host are not changed to 5. How should this kind of struct parameter be handled? Is it that CUDA does not support more than one level of indirection? Inside the structure there is another array, which can be thought of as a pointer. I checked some examples in the CUDA SDK, and they seem to use only one level of pointer indirection.
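An array member such as int b[10] is not a pointer: the ints are stored inside the struct itself, so sizeof(A) covers all of them and a single flat copy of sizeof(A) bytes (which is what cudaMemcpy does) moves the whole array. There is no second level of indirection involved. A minimal host-side sketch, with std::memcpy standing in for cudaMemcpy so it runs without a GPU; copyWhole is a hypothetical helper for illustration:

```cpp
// Host-only sketch: std::memcpy stands in for cudaMemcpy (same dst, src, bytes order).
#include <cstddef>
#include <cstring>

#define N 10

typedef struct A {
    int b[N];   // N ints stored directly inside A, not a pointer to them
} A;

// If b were a pointer, sizeof(A) would be only the size of that pointer.
// Because the array is inline, the struct is at least N ints long.
static_assert(sizeof(A) >= N * sizeof(int), "b is stored inside the struct");

// Hypothetical helper: one flat byte copy of the whole struct, the same kind
// of copy cudaMemcpy(dst, src, sizeof(A), kind) performs between host and device.
void copyWhole(A* dst, const A* src) {
    std::memcpy(dst, src, sizeof(A));
}
```

Things would be different if the struct held int* b instead: copying the struct would then copy only the pointer value, and the array it points to would need its own allocation and copy.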

Another interesting thing is that the structure can be used for reading: in c = (a->b)[thx], c gets the correct value. But writing to (a->b)[thx] does not seem to work. In other words, you can read the old value of (a->b)[thx], but you cannot change it, even though the code compiles. Any ideas are appreciated.

I think there's no problem with using a struct like this as a kernel parameter. The value of (a->b)[thx] is changed, but since your struct lives on the device, the value is changed in device memory. To see the changed value on the host, you need to copy the struct back.

I did copy it back; see the last statement: cudaMemcpy(a, aD, sizeof(A), cudaMemcpyDeviceToHost);

I didn't initialize aD, but since the result is not 5, it still shows that the write fails.
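Neither malloc nor cudaMalloc clears the memory it returns, so a value that is "not 5" could simply be leftover garbage. One way to make the result unambiguous is to fill the host buffer with a sentinel before the device-to-host copy and then count which elements the copy left untouched. A minimal sketch; fillSentinel, countUntouched and SENTINEL are hypothetical names, and the sentinel is assumed to be a value the kernel never writes:

```cpp
// Hypothetical helpers (not part of CUDA): mark the host buffer before
// cudaMemcpy(..., cudaMemcpyDeviceToHost), then see what the copy overwrote.
#define N 10
#define SENTINEL (-1)   // assumed: a value the kernel never writes

typedef struct A {
    int b[N];
} A;

// Call on the host struct right before the device-to-host copy.
void fillSentinel(A* a) {
    for (int i = 0; i < N; i++) {
        a->b[i] = SENTINEL;
    }
}

// After the copy, an element still equal to SENTINEL was not overwritten
// by the copy (barring a coincidental match with the device contents).
int countUntouched(const A* a) {
    int untouched = 0;
    for (int i = 0; i < N; i++) {
        if (a->b[i] == SENTINEL) untouched++;
    }
    return untouched;
}
```

If countUntouched returns anything other than 0 after a copy of the whole struct, the copy was shorter than the struct.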

Hi search,

I took your code and ran it on my machine. I just added cutilSafeCall to catch any errors. The code works here. Try this on your machine.

#include <stdlib.h>
#include <iostream>
#include <cutil_inline.h>

#define N 10

using namespace std;

typedef struct A {
	int b[N];
} A;

__global__ void changeD(A* a) {
	(a->b)[threadIdx.x] = 50;
}

int main() {
	A* a = (A*)malloc(sizeof(A));
	A* aD;
	cutilSafeCall( cudaMalloc( (void**) &aD, sizeof(A) ) );

	dim3 _grid( 1, 1, 1 );
	dim3 _threads( N, 1, 1 );
	changeD<<< _grid, _threads >>>(aD);
	cutilCheckMsg( "Kernel execution failed" );

	cutilSafeCall( cudaMemcpy(a, aD, sizeof(A), cudaMemcpyDeviceToHost) );

	for( int i = 0; i < N; i++ ) {
		cout << "a[" << i << "] = " << a->b[i] << endl;
	}

	cutilSafeCall( cudaFree( aD ) );
	free( a );
	return 0;
}

The output of this code was:

a[0] = 50
a[1] = 50
a[2] = 50
a[3] = 50
a[4] = 50
a[5] = 50
a[6] = 50
a[7] = 50
a[8] = 50
a[9] = 50

Thank you, tluisrs. I have tried your code, and the values do change! However, when I test the code in my application, the value changes if I keep a function call that follows the assignment, but if I comment that function call out, it stays the same. Could this be related to some memory write latency or an optimization done by CUDA? I will investigate this weird behavior further.

Yes. I think there should be no problem using a struct of arrays as a parameter between host and device.

The mistake I made was in the cudaMemcpy call that copies the data back from the device.

Suppose aD and a are of type A*. Instead of using

cudaMemcpy(a, aD, sizeof(A), cudaMemcpyDeviceToHost);

I used

cudaMemcpy(a, aD, sizeof(a), cudaMemcpyDeviceToHost);

That uses the wrong size, because sizeof(a) is the size of a pointer on the machine, while sizeof(A) is the size of the whole structure!
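The effect of the wrong size can be reproduced on the host alone. In this sketch std::memcpy stands in for cudaMemcpy (both take a destination, a source and a byte count), and copyBackWrong, copyBackRight and countEqual are hypothetical helpers for illustration:

```cpp
// Host-only sketch of the sizeof(a) vs sizeof(A) bug; std::memcpy plays
// the role of cudaMemcpy(a, aD, bytes, cudaMemcpyDeviceToHost).
#include <cstring>

#define N 10

typedef struct A {
    int b[N];
} A;

// Faulty copy-back: sizeof(a) is the size of the pointer (typically 4 or 8
// bytes), so only the first one or two ints are copied.
void copyBackWrong(A* a, const A* aD) {
    std::memcpy(a, aD, sizeof(a));
}

// Correct copy-back: sizeof(*a) equals sizeof(A), the size of the whole struct.
void copyBackRight(A* a, const A* aD) {
    std::memcpy(a, aD, sizeof(*a));
}

// Counts how many elements of b hold `value`, i.e. how much data arrived.
int countEqual(const A* a, int value) {
    int count = 0;
    for (int i = 0; i < N; i++) {
        if (a->b[i] == value) count++;
    }
    return count;
}
```

Writing the size as sizeof(*a) ties it to the pointed-to type, so it stays correct even if the type of a changes later.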

Hope nobody else makes this mistake.

Glad you found the problem.