Issues with memory transfer: 3D arrays, structs, macros, etc.

This code is still in the beginning stages, but I'm running into an issue with values not being reflected. I have a 3D array of structures, and right now all I'm trying to do is verify that values are being changed and transferred properly. I've completely simplified the kernel function to rule out anything in it causing the problem(s), but changes made on the device are still not reflected in the host variables.

#define ROWS 20

#define COLS 50

#define DEPTH 10

#define At(a,r,c,l) ((a) + ((r) * COLS * DEPTH + (c) * DEPTH + (l)))

#define STRUCTMEMBERS 3

__global__ void a(struct ww *m1, struct ww *m2, struct ww *m3, struct ww *result);

struct ww{

	int x;

	int y;

	int z;

};

int main(int argc, char** argv){

	size_t size = ROWS * COLS * DEPTH * sizeof(struct ww);

	struct ww *e1 = (struct ww *)malloc(size);

	struct ww *e2 = (struct ww *)malloc(size);

	struct ww *e3 = (struct ww *)malloc(size);

	struct ww *d_e1;

	struct ww *d_e2;

	struct ww *d_e3;

	struct ww *d_eavg;	/* result buffer for the kernel's fourth argument */

	

	srand(time(NULL));

	int i, j, k;

	[loop using the At macro to fill the arrays with random values]

	printf("%d\n%d\n%d\n\n",At(e1,0,0,0)->x,At(e1,0,0,0)->y,At(e1,0,0,0)->z);

	cudaMalloc(&d_e1, size);

	cudaMalloc(&d_e2, size);

	cudaMalloc(&d_e3, size);

	cudaMalloc(&d_eavg, size);

	cudaMemcpy(d_e1, e1, size, cudaMemcpyHostToDevice);

	cudaMemcpy(d_e2, e2, size, cudaMemcpyHostToDevice);

	cudaMemcpy(d_e3, e3, size, cudaMemcpyHostToDevice);

	/* Kernel to do stuff */

	dim3 dimBlock(20,50,10);

	dim3 dimGrid(1,1);

	a<<<dimGrid,dimBlock>>>(d_e1,d_e2,d_e3,d_eavg);

	cudaMemcpy(e1, d_e1, size, cudaMemcpyDeviceToHost);

	printf("%d\n%d\n%d\n\n",At(e1,0,0,0)->x,At(e1,0,0,0)->y,At(e1,0,0,0)->z);

[free allocated stuff]

	return 0;

}

__global__ void a(struct ww *m1, struct ww *m2, struct ww *m3, struct ww *result){

	At(m1,0,0,0)->x=55;

}

I ran into an issue a few weeks ago with a different program, and the solution ended up being something related to addressing, but my macro works fine in host space (i.e. I can change, edit, print, etc structure members without issue).

Thanks.

Check the return values of all CUDA calls for error codes. Your block size of 20×50×10 = 10,000 threads is larger than the maximum permitted block size (at most 1,024 threads per block on recent GPUs, 512 on older ones), so the kernel never executes.
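A common pattern is to wrap every runtime call in a checking macro. Here is a minimal sketch (the macro name CUDA_CHECK and the trivial kernel are my own illustration, not code from this thread):

```cuda
#include <cstdio>
#include <cstdlib>

/* Wrap any call returning cudaError_t; print the error string and abort on failure. */
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void noop(void) {}

int main(void) {
    dim3 dimBlock(20, 50, 10);   /* 10,000 threads: exceeds the per-block limit */
    dim3 dimGrid(1, 1);
    noop<<<dimGrid, dimBlock>>>();
    /* A kernel launch doesn't return an error code directly; query it explicitly. */
    CUDA_CHECK(cudaGetLastError());      /* typically "invalid configuration argument" here */
    CUDA_CHECK(cudaDeviceSynchronize()); /* catches errors raised during execution */
    return 0;
}
```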

Ah, derp. When you say to check errors in CUDA calls, do you mean using cudaGetLastError() or whatever it is? I tried that before, but something kept going wrong and it wouldn't work (cudaGetLastError, I mean, not the error(s) it should be reporting). What include is needed to get it working?

Monkeyed around with the code and changed the computation to create a lot of blocks instead of a lot of threads, and everything worked fine. However, I'm now working on a new, slightly related piece of code and getting the same problem.
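For reference, the "lots of blocks instead of lots of threads" restructuring can be sketched like this. This is my own minimal sketch, not the poster's actual code: the 256-thread block size, the flat indexing, and the averaging work per element are all assumptions.

```cuda
#define ROWS 20
#define COLS 50
#define DEPTH 10

struct ww { int x; int y; int z; };

/* One thread per element; blockDim stays well under the 1,024-thread limit. */
__global__ void a(struct ww *m1, struct ww *m2, struct ww *m3, struct ww *result) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < ROWS * COLS * DEPTH) {        /* guard: the last block may overhang */
        result[idx].x = (m1[idx].x + m2[idx].x + m3[idx].x) / 3;  /* example work */
    }
}

/* Host-side launch: enough blocks to cover every element.

    int threads = 256;
    int blocks  = (ROWS * COLS * DEPTH + threads - 1) / threads;  // ceiling division
    a<<<blocks, threads>>>(d_e1, d_e2, d_e3, d_eavg);
*/
```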

Basically every CUDA runtime call returns an error code, not just cudaGetLastError(). No includes are needed; they are added implicitly when .cu files are compiled.