CUDA complex memory allocation problems Problem allocating pointer to pointer me


I'm trying to allocate hierarchical memory on a CUDA device (8800 Ultra), and for some reason it fails.
It works fine in device emulation but fails on the real device.
I have attached a code sample.
Please advise.


__device__ float *** Theta;

CUDA_SAFE_CALL(cudaMalloc((void**)&Theta, (sizeof(float*)*initialModelNum)));
CUT_CHECK_ERROR("**MEME_CUDA_ERROR : can't allocate global memory.");

for (int i = 0 ; i < 10 ; ++i)
    CUDA_SAFE_CALL(cudaMalloc((void**)&(Theta[i]), (sizeof(float*)*4)));
CUT_CHECK_ERROR("**MEME_CUDA_ERROR : can't allocate global mem");

for (int letter = 0; letter < NUCL_ALPHABET_SIZE; ++letter)
CUT_CHECK_ERROR("**MEME_CUDA_ERROR : can't allocate global mem.");

As I described above, it works fine in device emulation, but on the actual device it fails on the second allocation and simply aborts.


What about replacing 'i' with a literal number?

CUDA_SAFE_CALL(cudaMalloc((void**)&(Theta[0]), (sizeof(float*)*4)));

I don't think we can get the device address simply by taking '&Theta[i]' in a host function.

You could try this API function:

cudaGetSymbolAddress(void **devPtr, const char *symbol);
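
A minimal sketch of that suggestion for a pointer declared with __device__: look up the device address of the symbol, then copy the pointer value into it. This assumes a recent CUDA toolkit, where the symbol itself is passed rather than its name as a string (older toolkits accepted the string form quoted above). The names d_table, fill and n are hypothetical and not from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ float *d_table;   // hypothetical device-side global pointer

__global__ void fill(int n)
{
    int i = threadIdx.x;
    if (i < n)
        d_table[i] = 2.0f * i;   // the kernel reads the pointer from the device symbol
}

int main()
{
    const int n = 8;             // assumed element count
    float *buf = NULL;

    // Allocate the buffer; its device address lands in an ordinary host variable.
    if (cudaMalloc((void **)&buf, n * sizeof(float)) != cudaSuccess) return 1;

    // Find where d_table itself lives in device memory ...
    void *symAddr = NULL;
    if (cudaGetSymbolAddress(&symAddr, d_table) != cudaSuccess) return 1;

    // ... and store the buffer address there, so kernels can see it.
    if (cudaMemcpy(symAddr, &buf, sizeof(buf), cudaMemcpyHostToDevice) != cudaSuccess) return 1;

    fill<<<1, n>>>(n);

    // The blocking copy also waits for the kernel on the default stream.
    float out[n];
    if (cudaMemcpy(out, buf, sizeof(out), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    printf("out[3] = %.1f\n", out[3]);   // prints "out[3] = 6.0"

    cudaFree(buf);
    return 0;
}
```

cudaMemcpyToSymbol(d_table, &buf, sizeof(buf)) should do the lookup and the copy in one call. Note that this only sets the top-level pointer; the entries it points to still cannot be filled in from the host with &Theta[i].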

Device emulation does NOT model the parallel hardware correctly.

Secondly, in device emulation host pointers and device pointers are BOTH host pointers, so you won't get a segmentation fault when host code dereferences a device pointer.

Coming to the point:

As David said, cudaMalloc with &Theta[i] does NOT write into the device memory that you allocated first. Evaluating Theta[i] on the host dereferences a device pointer. You need to allocate the device arrays, store the returned pointers in a host array, and then copy that host array into the device array. Hope that makes sense.

What I am saying is:

Say you want to allocate an array of pointers on the device:

  1. First, cudaMalloc(&arrayOfPointers) for the array of pointers itself.

  2. for i = 0 TO N do

    cudaMalloc(&hostArray[i], ....);   (hostArray lives in host memory)

  3. cudaMemcpy(arrayOfPointers, hostArray, …) from host to device
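
The three steps above can be sketched as a small self-contained program. This is only an illustration under assumptions: N, rowLen, the CHECK macro and the fillRows kernel are hypothetical names that do not appear in the original post, and it uses the plain runtime API of a recent toolkit instead of the cutil macros.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Each block fills one row through the device-side pointer table.
__global__ void fillRows(float **rows, int rowLen)
{
    int r = blockIdx.x;
    int c = threadIdx.x;
    if (c < rowLen)
        rows[r][c] = 10.0f * r + c;
}

int main()
{
    const int N = 4;        // number of rows (assumed)
    const int rowLen = 8;   // floats per row (assumed)

    // Step 1: device array that will hold the row pointers.
    float **arrayOfPointers = NULL;
    CHECK(cudaMalloc((void **)&arrayOfPointers, N * sizeof(float *)));

    // Step 2: allocate each row; the returned device addresses
    // are stored in a HOST array, so &hostArray[i] is safe to take.
    float *hostArray[N];
    for (int i = 0; i < N; ++i)
        CHECK(cudaMalloc((void **)&hostArray[i], rowLen * sizeof(float)));

    // Step 3: copy the table of device addresses into the device array.
    CHECK(cudaMemcpy(arrayOfPointers, hostArray, N * sizeof(float *),
                     cudaMemcpyHostToDevice));

    fillRows<<<N, rowLen>>>(arrayOfPointers, rowLen);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Read one row back using the host copy of its device address.
    float row[rowLen];
    CHECK(cudaMemcpy(row, hostArray[2], sizeof(row), cudaMemcpyDeviceToHost));
    printf("row 2, col 3 = %.1f\n", row[3]);   // prints "row 2, col 3 = 23.0"

    // The host array is also what makes cleanup possible.
    for (int i = 0; i < N; ++i)
        CHECK(cudaFree(hostArray[i]));
    CHECK(cudaFree(arrayOfPointers));
    return 0;
}
```

Keeping hostArray around after the copy is a deliberate choice here: the host cannot read the entries of arrayOfPointers directly, so the host copy is the simplest way to reach individual rows later.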

The second allocation tries to dereference a device memory pointer on the host, so it is normal that it crashes. You should allocate the array of pointers in host memory, fill it with values, and then copy it to device memory, i.e. something like this:

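float ***host_Theta;   // host array that stages the device addresses
float ***dev_Theta;    // device address of the top-level array; pass it to kernels as an argument

host_Theta = (float***)malloc(sizeof(float**)*initialModelNum);

CUDA_SAFE_CALL(cudaMalloc((void**)&dev_Theta, (sizeof(float**)*initialModelNum)));
CUT_CHECK_ERROR("**MEME_CUDA_ERROR : can't allocate global memory.");

for (int i = 0; i < 10; ++i)
{
    // host_Theta is host memory, so taking &host_Theta[i] is safe here
    CUDA_SAFE_CALL(cudaMalloc((void**)&(host_Theta[i]), (sizeof(float*)*4)));
    CUT_CHECK_ERROR("**MEME_CUDA_ERROR : can't allocate global mem");

    for (int letter = 0; letter < NUCL_ALPHABET_SIZE; ++letter)
    {
        // allocate the per-letter arrays here, staging their addresses in a host
        // array first: host_Theta[i][letter] is device memory and cannot be
        // written from the host
        CUT_CHECK_ERROR("**MEME_CUDA_ERROR : can't allocate global mem.");
    }
}

// copy the staged device addresses into the device-side array
cudaMemcpy(dev_Theta, host_Theta, sizeof(float**)*initialModelNum, cudaMemcpyHostToDevice);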
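
The innermost (per-letter) level needs the same staging, because the middle arrays also live in device memory. A rough sketch of all three levels follows. It is not the original poster's code: models, letters, leafLen, the CHECK macro and the touch kernel are assumed names and sizes chosen only for illustration, and it uses the plain runtime API of a recent toolkit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// One block per model; threads stride over each leaf array.
__global__ void touch(float ***theta, int letters, int leafLen)
{
    int m = blockIdx.x;
    for (int l = 0; l < letters; ++l)
        for (int k = threadIdx.x; k < leafLen; k += blockDim.x)
            theta[m][l][k] = 100.0f * m + 10.0f * l + k;
}

int main()
{
    const int models = 3, letters = 4, leafLen = 5;   // assumed sizes

    // Host staging arrays; every entry is a DEVICE address.
    float ***hostTop   = (float ***)malloc(models * sizeof(float **));
    float  **hostLeaves = (float **)malloc(models * letters * sizeof(float *));
    if (!hostTop || !hostLeaves) return 1;

    float ***devTheta = NULL;
    CHECK(cudaMalloc((void **)&devTheta, models * sizeof(float **)));

    for (int m = 0; m < models; ++m) {
        // Innermost level: leaf arrays, addresses staged on the host.
        float **mid = hostLeaves + m * letters;
        for (int l = 0; l < letters; ++l)
            CHECK(cudaMalloc((void **)&mid[l], leafLen * sizeof(float)));

        // Middle level: device array of leaf pointers, filled by a copy.
        CHECK(cudaMalloc((void **)&hostTop[m], letters * sizeof(float *)));
        CHECK(cudaMemcpy(hostTop[m], mid, letters * sizeof(float *),
                         cudaMemcpyHostToDevice));
    }

    // Top level: device array of middle-level pointers.
    CHECK(cudaMemcpy(devTheta, hostTop, models * sizeof(float **),
                     cudaMemcpyHostToDevice));

    touch<<<models, 32>>>(devTheta, letters, leafLen);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Read back the leaf for model 2, letter 1 via its staged address.
    float leaf[leafLen];
    CHECK(cudaMemcpy(leaf, hostLeaves[2 * letters + 1], sizeof(leaf),
                     cudaMemcpyDeviceToHost));
    printf("theta[2][1][4] = %.1f\n", leaf[4]);   // prints "theta[2][1][4] = 214.0"

    // Free from the innermost level outwards, using the host copies.
    for (int j = 0; j < models * letters; ++j) CHECK(cudaFree(hostLeaves[j]));
    for (int m = 0; m < models; ++m)           CHECK(cudaFree(hostTop[m]));
    CHECK(cudaFree(devTheta));
    free(hostLeaves);
    free(hostTop);
    return 0;
}
```

The general rule is that each level is built bottom-up: a device array of pointers is only ever filled by cudaMemcpy from a host array, never by writing through it on the host.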