Problems with creating an array of CUDA pointers

I am trying to get the following code segment to create an array of CUDA arrays on the host. In device emulation mode the code works as expected. When I tried this approach on the card, the compiler first gave me a warning that it could not determine the object type and was assuming a global variable. When I run the program, all of the resulting matrix calculations are 0. Does anyone know what could cause this problem and how to fix it?


float **d_input = (float **)malloc(NumberM * sizeof(float *));
int i;

// allocate one device array per matrix; d_input itself lives on the host
for (i = 0; i < NumberM; i++) {
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_input[i], MatrixLength * MatrixWidth * sizeof(float)));
}

// copy each host matrix into its device array
for (i = 0; i < NumberM; i++) {
    CUDA_SAFE_CALL(cudaMemcpy(d_input[i], h_input[i], MatrixLength * MatrixWidth * sizeof(float),
                              cudaMemcpyHostToDevice));
}

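Are you passing d_input to the kernel as an array of pointers? The pointer array d_input was allocated with malloc, so it lives in host memory and the kernel cannot dereference it, even though the pointers stored in it are device pointers. You need a second pointer array in device memory, d_d_input (probably pick a better name), allocated with cudaMalloc, and you need to copy the pointers in d_input to it. Pass that device array to the kernel instead.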

About the warning messages, there doesn’t seem to be a way to suppress them or to tell the compiler that the pointer does point to global memory.

I have tried to create the pointer array and then copy it to a device variable, but I have not been able to get it to work. Would you be willing to show me the syntax for making the pointer array on the device?

No error checking, and I probably made typos, but here you go.

float **h_array_list, **d_array_list;

// allocate array lists (h_array_list lives on the host but holds device pointers)
h_array_list = (float **)malloc(num_arrays * sizeof(float *));
cudaMalloc((void **)&d_array_list, num_arrays * sizeof(float *));

// allocate arrays on the device
for (int i = 0; i < num_arrays; i++)
    cudaMalloc((void **)&h_array_list[i], data_size);

// copy array list to the device
cudaMemcpy(d_array_list, h_array_list, num_arrays * sizeof(float *), cudaMemcpyHostToDevice);

// allocate array list on the host
float **array_list;
array_list = (float **)malloc(num_arrays * sizeof(float *));

// allocate arrays on the host
for (int i = 0; i < num_arrays; i++)
    array_list[i] = (float *)malloc(data_size);

// ****fill out data here

// populate data arrays on the device (destination is the device pointer, so host to device)
for (int i = 0; i < num_arrays; i++)
    cudaMemcpy(h_array_list[i], array_list[i], data_size, cudaMemcpyHostToDevice);

As you can see, it is quite a bit of a pain to do this. If at all possible, use 2D structures in memory allocated by cudaMallocPitch. I do use the nasty ** construction at one point in my code where a kernel sums up a variable number of arrays, but I only do so because the memory for each array is allocated separately by different classes. Thinking about it now, I should have written a little memory manager class that would dole out rows of a 2D memory area so that I wouldn’t have to do the ugly ** mess.
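The "little memory manager" idea mentioned above might look roughly like the following. This is only a sketch: RowPool, row_pool_init, row_pool_take and row_pool_destroy are hypothetical names made up for illustration, not part of CUDA or of the poster's code, and the pool never recycles rows.

```cuda
#include <stddef.h>
#include <cuda_runtime.h>

// Hypothetical pool: one pitched 2D allocation, handed out one row at a time.
struct RowPool {
    char  *base;      // start of the pitched device allocation
    size_t pitch;     // bytes between consecutive rows, as returned by cudaMallocPitch
    int    num_rows;  // total rows in the allocation
    int    next_row;  // index of the next row that has not been handed out
};

// Allocates num_rows rows of row_bytes each. Returns 0 on success, 1 on failure.
int row_pool_init(RowPool *pool, size_t row_bytes, int num_rows)
{
    pool->base = NULL;
    pool->pitch = 0;
    pool->num_rows = num_rows;
    pool->next_row = 0;
    cudaError_t err = cudaMallocPitch((void **)&pool->base, &pool->pitch,
                                      row_bytes, num_rows);
    return err == cudaSuccess ? 0 : 1;
}

// Returns a device pointer to the next unused row, or NULL when the pool is exhausted.
float *row_pool_take(RowPool *pool)
{
    if (pool->next_row >= pool->num_rows)
        return NULL;
    char *row = pool->base + (size_t)pool->next_row * pool->pitch;
    pool->next_row++;
    return (float *)row;
}

// Releases the whole pitched allocation at once.
void row_pool_destroy(RowPool *pool)
{
    cudaFree(pool->base);
    pool->base = NULL;
    pool->num_rows = 0;
    pool->next_row = 0;
}
```

Each class would call row_pool_take instead of cudaMalloc. Because every row sits at base + row * pitch, a kernel only needs base, pitch and a row count to reach all of the arrays, so no device-side pointer table is required.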

Thanks, it is finally working. I was also wondering if you might be able to help me with the 2D memory approach, to eliminate all of the pointers. When I tried to switch the code to 2D arrays, the resulting output was incorrect. Is there a good example of using 2D arrays that I could look at to see where I am going wrong?

Just allocate all of the arrays with a single call to cudaMallocPitch, one array per row. Then you can access element i of array j with “array_data[j*width + i]”, where width is the pitch returned from cudaMallocPitch divided by the size of an element of array_data (because the pitch is returned in bytes, not elements).
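The indexing rule just described can be sketched as follows. The names fill_rows and run_pitched_example are hypothetical, the values written by the kernel are arbitrary, and the division of the pitch by sizeof(float) assumes the pitch is a whole number of floats.

```cuda
#include <stddef.h>
#include <cuda_runtime.h>

// Each thread fills one array, i.e. one row of the pitched allocation.
// width is the row stride in elements (pitch in bytes / sizeof(float)).
__global__ void fill_rows(float *array_data, int width, int num_arrays, int num_elems)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // array (row) index
    if (j < num_arrays) {
        for (int i = 0; i < num_elems; i++)
            array_data[j * width + i] = 10.0f * j + i;  // element i of array j
    }
}

// Hypothetical helper: fills num_arrays rows on the device and copies them back,
// tightly packed, into h_out (num_arrays * num_elems floats). Returns 0 on success.
int run_pitched_example(float *h_out, int num_arrays, int num_elems)
{
    float *d_data = NULL;
    size_t pitch = 0;  // set by cudaMallocPitch, in bytes
    size_t row_bytes = num_elems * sizeof(float);

    if (cudaMallocPitch((void **)&d_data, &pitch, row_bytes, num_arrays) != cudaSuccess)
        return 1;

    int width = (int)(pitch / sizeof(float));  // stride in elements, not bytes
    int threads = 64;
    int blocks = (num_arrays + threads - 1) / threads;
    fill_rows<<<blocks, threads>>>(d_data, width, num_arrays, num_elems);

    // cudaMemcpy2D skips the padding at the end of each device row.
    cudaError_t err = cudaMemcpy2D(h_out, row_bytes, d_data, pitch,
                                   row_bytes, num_arrays, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

A common cause of wrong output when switching to pitched memory is indexing with the logical row length (num_elems here) instead of the padded stride, or copying with plain cudaMemcpy, which does not account for the padding.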

Hi, I have the same problem. I tried modifying your code, MisterAnderson, to work with a kernel, but I can’t make it work.

When I read back the result of the kernel I’m getting random values, which means:

1. I'm not writing the values correctly inside the kernel

2. I'm not getting the results from the kernel correctly

The modified code is compiling, and I have 10 arrays with 2 elements each (each element is a float).

The code is this:

#include <stdio.h>

__global__ void kernel(float ** input, int num_arrays, int num_sub_arrays) {
    const int index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index < num_arrays) {
        for (int i = 0; i < num_sub_arrays; i++) {
            input[index][i] = 10 + i;
        }
    }
}

int main() {
    int num_arrays = 10;
    int num_sub_arrays = 2;
    int data_size = num_sub_arrays * sizeof(float);

    float **h_array_list, **d_array_list;

    // allocate array lists
    h_array_list = (float**)malloc(num_arrays * sizeof(float *));
    cudaMalloc((void**)&d_array_list, num_arrays * sizeof(float *));

    // allocate arrays on the device
    for (int i = 0; i < num_arrays; i++)
        cudaMalloc((void**)&h_array_list[i], data_size);

    // copy array list to the device
    cudaMemcpy(d_array_list, h_array_list, num_arrays * sizeof(float*), cudaMemcpyHostToDevice);

    // allocate array list on the host
    float **array_list;
    array_list = (float**)malloc(num_arrays * sizeof(float*));

    // allocate arrays on the host
    for (int i = 0; i < num_arrays; i++)
        array_list[i] = (float*)malloc(data_size);

    // ****fill out data here

    dim3 grid(num_arrays,num_sub_arrays);
    dim3 block(8,8);
    kernel<<<grid, block>>>(d_array_list, num_arrays, num_sub_arrays);

    // populate data arrays on the device
    for (int i = 0; i < num_arrays; i++)
        cudaMemcpy(h_array_list[i], array_list[i], data_size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < num_arrays; i++) {
        float * listSubArrays = array_list[i];
        printf("i = %i\n", i);
        for (int j = 0; j < num_sub_arrays; j++) {
            float theFloat = listSubArrays[j];
            printf("\t myFloat = %f\n", theFloat);
        }
    }
}

I have the same problem. Does someone know a solution for this?

Or is there a better way to do it? I need up to 20 pointers (the count can be hard-coded), each pointing to a data field of variable length.

Thanks.
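One possible explanation for the random values in the 10-array listing above, offered as a reading of the posted code rather than a tested diagnosis: cudaMemcpy takes the destination first and the source second, but the final loop passes the device pointer h_array_list[i] as the destination and the host buffer array_list[i] as the source while requesting cudaMemcpyDeviceToHost. The host buffers are therefore never written, and printf shows whatever malloc returned. The sketch below swaps the two arguments and launches a 1D grid, since the kernel only uses the x index. run_pointer_table and the flat result buffer are hypothetical names introduced for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Same kernel as in the listing above: one thread per array.
__global__ void kernel(float **input, int num_arrays, int num_sub_arrays)
{
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < num_arrays) {
        for (int i = 0; i < num_sub_arrays; i++)
            input[index][i] = 10 + i;
    }
}

// Hypothetical helper: result[a * num_sub_arrays + i] receives element i of array a.
// Returns 0 on success, 1 if any CUDA call reported an error.
int run_pointer_table(float *result, int num_arrays, int num_sub_arrays)
{
    size_t data_size = num_sub_arrays * sizeof(float);
    size_t list_size = num_arrays * sizeof(float *);

    // Host-side table of device pointers, plus its device-side copy.
    float **h_array_list = (float **)calloc(num_arrays, sizeof(float *));
    float **d_array_list = NULL;
    cudaError_t err = cudaMalloc((void **)&d_array_list, list_size);

    for (int a = 0; a < num_arrays && err == cudaSuccess; a++)
        err = cudaMalloc((void **)&h_array_list[a], data_size);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_array_list, h_array_list, list_size, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 64;
        int blocks = (num_arrays + threads - 1) / threads;
        kernel<<<blocks, threads>>>(d_array_list, num_arrays, num_sub_arrays);
        err = cudaGetLastError();  // reports launch failures
    }

    // Read back: destination is the HOST buffer, source is the DEVICE pointer.
    for (int a = 0; a < num_arrays && err == cudaSuccess; a++)
        err = cudaMemcpy(result + a * num_sub_arrays, h_array_list[a],
                         data_size, cudaMemcpyDeviceToHost);

    for (int a = 0; a < num_arrays; a++)
        cudaFree(h_array_list[a]);  // cudaFree(NULL) is a no-op
    cudaFree(d_array_list);
    free(h_array_list);
    return err == cudaSuccess ? 0 : 1;
}
```

With 10 arrays of 2 elements, a successful run should leave each pair in result as 10 and 11. For the "up to 20 pointers" case, the same pattern applies with a separate data_size per array, as long as the kernel is also told each array's length.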