But I am not using cudaMalloc3DArray at all. Look at the code. It is as you suggested: "a 1D linear memory space allocated with cudaMalloc, indexed using a 3D index", but I cannot access more than an index of 29791 (_DSIZE max = 31).
So what was the reference to cudaMalloc3DArray in your previous post for? The last piece of code you posted has several errors that mean that if it works, it is only by accident.
I think I found one reason for the confusion. There is a difference, again, between emulation and release code. I compiled the last code with and without emulation: the emulation build would not run above _DSIZE=31, while the release build ran up to _DSIZE=445. I don’t think this is a memory issue because I still have 1 GB of host memory free.
That code works and I hope it is not by accident!
[codebox]
#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>
// 3D to linear array conversion
#define FTNREF3D(i_index,j_index,k_index,i_size,j_size,i_lb,j_lb,k_lb) \
    ((i_size)*(j_size)*(k_index-k_lb)+(i_size)*(j_index-j_lb)+i_index-i_lb)
#ifndef gpuAssert
#include <stdio.h>
#define gpuAssert( condition ) {if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %d in %s, line %d\n", condition, __FILE__, __LINE__ ); exit( 1 );}}
#endif
#define _DSIZE 445
__device__ float *ad;
__global__ void testkernel2(float *d)
{
unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x + 1 ;
ad[FTNREF3D(idx,idx,idx,_DSIZE,_DSIZE,1,1,1)] = 23.0*idx;
d[idx] =ad[FTNREF3D(idx,idx,idx,_DSIZE,_DSIZE,1,1,1)] ;
}
int main()
{
int device = 0;
cudaSetDevice(device);
float *d, *_d;
float *_a;
assert( !(( d = (float *)malloc(_DSIZE * sizeof(float)) ) == NULL) );
gpuAssert( cudaMalloc( (void**)&_a, _DSIZE*_DSIZE*_DSIZE* sizeof(float) ) );
gpuAssert( cudaMalloc( (void**)&_d, _DSIZE * sizeof(float) ) );
gpuAssert ( cudaMemcpy(_d, d, _DSIZE * sizeof(float), cudaMemcpyHostToDevice) );
gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float )));
testkernel2 <<< 1, _DSIZE >>> (_d);
( cudaThreadSynchronize() );
( cudaMemcpy(d, _d, _DSIZE * sizeof(float), cudaMemcpyDeviceToHost) );
for(int i = 0; i < _DSIZE; i++) {
fprintf(stdout, "%2d %6.1f\n", i, d[i]);
}
cudaFree(_a);
cudaFree(_d);
free(d);
return cudaThreadExit();
}
[/codebox]
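As a sanity check on the index arithmetic (just the FTNREF3D macro above evaluated on the host, assuming the unit lower bounds used in the kernel): a cube of side _DSIZE has _DSIZE^3 elements, so 31 gives 29,791 — the same ceiling I reported earlier — and 445 gives a largest linear index of 445^3 − 1 = 88,121,124, which still fits inside the cudaMalloc’d buffer (roughly 350 MB of device memory).
[codebox]
// Host-side check of the FTNREF3D arithmetic (unit lower bounds assumed).
#include <stdio.h>
#define FTNREF3D(i_index,j_index,k_index,i_size,j_size,i_lb,j_lb,k_lb) \
    ((i_size)*(j_size)*(k_index-k_lb)+(i_size)*(j_index-j_lb)+i_index-i_lb)

int main()
{
    long n = 445;
    long last = FTNREF3D(n, n, n, n, n, 1, 1, 1);   /* 445^3 - 1 = 88,121,124 */
    printf("elements = %ld, last index = %ld, bytes = %ld\n",
           n * n * n, last, n * n * n * (long)sizeof(float));   /* ~352 MB */
    printf("31^3 = %d\n", 31 * 31 * 31);            /* 29,791 */
    return 0;
}
[/codebox]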
I am just about done with this, I am sorry, but this:
gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float )));
shouldn’t even compile according to the documentation, and it has no right to work, even if it does compile. You are copying a pointer to a float, not a float. On 64-bit machines, they are not the same size; on 32-bit machines, they are. Whichever architecture you are working on, if it works, it is accidental. It should be something like this:
gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float *),(size_t)0, cudaMemcpyHostToDevice));
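To make the intent concrete, here is the pattern in context (a minimal sketch, reusing the ad symbol, _a pointer and _DSIZE from your code): the thing being copied into the symbol is the pointer value itself, so the size argument has to be the size of a pointer.
[codebox]
// Minimal sketch: store a device pointer in a __device__ symbol
// (ad, _a and _DSIZE as in the code above).
__device__ float *ad;

float *_a;
cudaMalloc((void **)&_a, _DSIZE * _DSIZE * _DSIZE * sizeof(float));

// Copy the pointer value itself: sizeof(float *) bytes (8 on a 64-bit host).
// Using sizeof(float) (4 bytes) would copy only half of the address.
cudaMemcpyToSymbol(ad, &_a, sizeof(float *), (size_t)0, cudaMemcpyHostToDevice);
[/codebox]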
Ok. I’ve got it. I am on a 64-bit machine:
sizeof(float) = 4, sizeof(float *) = 8
gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float *)));
is OK for both emulation and release code.
I had not noticed that I had changed sizeof(float *) to sizeof(float).
From page 97 of the Programming Guide:
“Intermediate data structures may be created in device
memory, operated on by the device, and destroyed without ever being mapped by
the host or copied to host memory”
How do you create/destroy an array inside the kernel? Can you do cudaMalloc/cudaFree inside the kernel?
That doesn’t mean what you think it means. It simply means that you can allocate and manipulate device memory directly through the API, without requiring any analogous host memory allocation or any mapping of device allocations to host memory (i.e. cudaMalloc doesn’t need any host memory allocation or mapping in order to work). This is in contrast to older programmable shader languages, which required that each device buffer be bound to a host buffer.
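In code, the passage is describing something like this (a minimal sketch; stage1 and stage2 are made-up kernel names, n a made-up element count): the intermediate buffer exists only in device memory and is never copied to or mapped by the host.
[codebox]
// Sketch: an intermediate buffer created, used and destroyed entirely in
// device memory -- no host allocation, mapping, or cudaMemcpy involved.
// stage1/stage2 are hypothetical kernels; input/output are device pointers.
float *intermediate;
cudaMalloc((void **)&intermediate, n * sizeof(float));

stage1<<<blocks, threads>>>(input, intermediate);   // device writes it
stage2<<<blocks, threads>>>(intermediate, output);  // device reads it back

cudaFree(intermediate);   // destroyed without ever touching host memory
[/codebox]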
You can’t, and that text you are quoting doesn’t say that you can.