cudaDeviceSynchronize returned error code 700 while using dynamically allocated array?

I am dynamically allocating a multi-dimensional array, float tube_array[tube_ID][param]. I first define the following in the data.h file:

typedef struct __align__(16)
{
	float **tube_array;
	int tube_ID;
	int param;
} OPstruct;
Then create the array from the contents of a text file, inside the function void readtubedata(SimulationStruct * simulations)

	for (int i = 0; i < simulations->OP_h.tube_ID; i++)
		for (int j = 0; j < simulations->OP_h.param; j++)
			if (fscanf(file, "%f", &simulations->OP_h.tube_array[i][j]) != 1) {
				printf("\n error");
			}

with SimulationStruct given as

typedef struct __align__(16)
{
	OPstruct OP_h;
	// there are other items here, but only the struct of interest is shown.
} SimulationStruct;

I can ask the CPU to print some value from the array and it does so correctly. For example,

printf("\n %f", simulations->OP_h.tube_array[3][2]);

prints the correct value for this particular entry, and likewise for all the others.

However, when I try the same line of code - modified to

printf("\n %f", OP[0].tube_array[3][2]);

from m-gpu.cuh, the printing results in the error given in the title. Accessing only another member, such as OP[0].param, does not produce the error. I know that this error is given when I make out-of-bounds accesses, and I know the error occurs only when I try to access the dynamically allocated array. Is there a special way to dynamically allocate the array that I’m missing?

You cannot access memory allocated using a host allocator such as new or malloc from device code.

If you want to access that area from device code, one possible method would be to use a managed allocator, e.g. cudaMallocManaged().
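As a rough illustration of that suggestion, the two malloc calls shown later in this thread could be replaced with cudaMallocManaged, so that both host and device code can dereference the same pointers. This is only a sketch; the helper name alloc_tube_array is illustrative, not from the original code, and error handling is abbreviated:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include "data.h"   // for OPstruct

// Sketch: managed allocation of the nested tube_array, so the same
// pointers are valid in host and device code. Returns 0 on success.
int alloc_tube_array(OPstruct *op)
{
    // Outer array of row pointers, in managed memory
    cudaError_t err = cudaMallocManaged(&op->tube_array,
                                        op->tube_ID * sizeof(float *));
    if (err != cudaSuccess) { fprintf(stderr, "no memory\n"); return -1; }

    // One managed allocation per row
    for (int i = 0; i < op->tube_ID; i++) {
        err = cudaMallocManaged(&op->tube_array[i],
                                op->param * sizeof(float));
        if (err != cudaSuccess) { fprintf(stderr, "no memory\n"); return -1; }
    }
    return 0;
}
```

With this change, the fscanf loop and the device-side OP[0].tube_array[3][2] access could both use the same pointers, at the cost of unified-memory migration on access.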

We currently use

	cudaError(cudaMemcpy(HostMem->PAT, DeviceMem->PAT, sizeof(unsigned int), cudaMemcpyDeviceToHost));

to copy the integer counter PAT from device to host. Could this be used to copy a malloc generated array to host memory?

The thing that appears to be needed (to me, anyway) is to copy data to device memory, or at least make it accessible from device code.

Yes, you could create a device allocation using cudaMalloc, and then copy the data from host to device (using cudaMemcpy). For the type of situation you present:

that would involve a deep copy, which requires multiple carefully sequenced steps. There are numerous questions on how to do deep copies in CUDA; although the concept is well within the confines of C++ programming, many folks struggle with it. CUDA programming in general benefits from a crisp facility with pointers, and that is certainly true of a deep copy.
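The sequence can be sketched roughly as follows. The names deep_copy_to_device and h_rowptrs are illustrative, not from the original code, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Sketch: deep copy of a host float** (rows x cols) to the device.
// Returns a device pointer usable as d_array[i][j] in device code.
float **deep_copy_to_device(float **h_array, int rows, int cols)
{
    // 1. Copy each row's data into its own device buffer, collecting
    //    the resulting device pointers in a temporary host array.
    float **h_rowptrs = (float **)malloc(rows * sizeof(float *));
    for (int i = 0; i < rows; i++) {
        cudaMalloc(&h_rowptrs[i], cols * sizeof(float));
        cudaMemcpy(h_rowptrs[i], h_array[i], cols * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    // 2. Copy the array of device row pointers itself to the device.
    float **d_array;
    cudaMalloc(&d_array, rows * sizeof(float *));
    cudaMemcpy(d_array, h_rowptrs, rows * sizeof(float *),
               cudaMemcpyHostToDevice);

    free(h_rowptrs);
    return d_array;
}
```

The key point is step 2: the pointer table that the device dereferences must itself contain device pointers, which is why the copy cannot be done with a single cudaMemcpy.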

You haven’t even shown the allocation of that pointer, so I’m working without full visibility here. Guessing, basically.

Sorry, here is the current allocation:

	simulations->OP_h.tube_array = (float**)malloc(simulations->OP_h.tube_ID * sizeof(float *));
	simulations->OP_h.param = 7; // choice of the user.
	if (simulations->OP_h.tube_array == NULL)
		fprintf(stderr, "no memory\n");
	for (int i = 0; i < simulations->OP_h.tube_ID; i++) {
		simulations->OP_h.tube_array[i] = (float*)malloc(simulations->OP_h.param * sizeof(float));
		if (simulations->OP_h.tube_array[i] == NULL)
			fprintf(stderr, "no memory\n");
	}

Here is the current memory allocation procedure for the struct, OP:

#include <curand_kernel.h>
#include "data.h"

void InitDCMem(SimulationStruct* sim)
{
	cudaError(cudaMemcpyToSymbol(OP, &(sim->OP_h), sizeof(OPstruct)));
}

So, yes, you could do what I said.

A nested allocation like that is generally not optimal for performance.

This question/answer may give you some ideas.

But at any rate, those areas allocated with malloc cannot be directly accessed from CUDA device code; attempting to do so results in an illegal access, which is what produces the 700 error report.

As an aside,

that isn’t a procedure that allocates memory.

Right, thank you. I’ll make some more attempts and update with results.

Thanks for your help, Robert. It seems to be pretty complicated, and since I’m a novice, I’ll move on and try to simplify the problem.

Is it much simpler to use a 1D dynamically allocated array than 2D? Is the process similar?

Generally, I would say it’s simpler. The previous answer I linked mentions the word “flattening” and gives an example.
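For concreteness, a minimal sketch of the flattened approach, using the tube_ID and param names from this thread (the example sizes are made up): a single allocation of tube_ID * param floats, indexed as i * param + j, which can then be moved to the device with one cudaMemcpy.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    int tube_ID = 4, param = 7;   // illustrative sizes only

    // One flat host allocation replaces the nested float** layout.
    float *h_tube = (float *)malloc(tube_ID * param * sizeof(float));
    for (int i = 0; i < tube_ID; i++)
        for (int j = 0; j < param; j++)
            h_tube[i * param + j] = 0.0f;   // filled from fscanf in practice

    // A single allocation and a single copy suffice for the whole array.
    float *d_tube;
    cudaMalloc(&d_tube, tube_ID * param * sizeof(float));
    cudaMemcpy(d_tube, h_tube, tube_ID * param * sizeof(float),
               cudaMemcpyHostToDevice);

    // In device code, d_tube[i * param + j] replaces tube_array[i][j].

    cudaFree(d_tube);
    free(h_tube);
    return 0;
}
```

Besides avoiding the deep copy entirely, the flat layout keeps the data contiguous, which is generally better for coalesced access on the GPU than a table of separately allocated rows.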