Global memory's lifespan : Application. How do you do that? How could you allocate a variable to

thanasio · September 28, 2011, 2:35pm

Hi,

I have read in different articles and books (e.g. Kirk, Chapter 5) that global memory’s lifespan can be the whole application. I want to declare a variable that resides in global memory (not texture!) and lasts for the whole application, but I have never seen such an example. Does anybody have prior experience with that? If so, a little code snippet would help loads…

Thanks in advance,
Than

alexish · September 28, 2011, 7:15pm

Hi,
it’s very simple: declare it in your .cu file like any global variable in C, with the __device__ prefix.

Example:

__device__ float myglobal;

__global__
void kernel()
{
…
float x = something*myglobal;
…
}

But be very careful if you need to write to the global variable: every thread of every block sees the same copy, so unsynchronized writes race.
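
A minimal sketch of this idea, assuming a scalar scale factor: the host writes myglobal once with cudaMemcpyToSymbol, and every later launch reads the same value. The kernel scaleByGlobal and the host helpers setMyGlobal and scaleOnDevice are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>

// File-scope variable in device global memory. It is not tied to any
// single kernel launch, so its value survives between launches.
__device__ float myglobal;

// Hypothetical kernel: each thread multiplies one element by myglobal.
__global__ void scaleByGlobal(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] * myglobal;
}

// Hypothetical host helper: write the device variable from the host.
cudaError_t setMyGlobal(float value)
{
    return cudaMemcpyToSymbol(myglobal, &value, sizeof(float));
}

// Hypothetical host helper: copy n floats in, launch once, copy n floats out.
cudaError_t scaleOnDevice(const float *hostIn, float *hostOut, int n)
{
    float *devIn = NULL, *devOut = NULL;
    size_t bytes = n * sizeof(float);

    cudaError_t err = cudaMalloc((void **)&devIn, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&devOut, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        scaleByGlobal<<<(n + 127) / 128, 128>>>(devIn, devOut, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(devIn);   // cudaFree(NULL) is a no-op, so this is safe on failure
    cudaFree(devOut);
    return err;
}
```

Calling setMyGlobal(2.0f) once and then scaleOnDevice(...) several times should scale by 2 on every call, since nothing resets myglobal between launches.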

thanasio · September 29, 2011, 8:20am

Hi Alexish, thanks for the reply. Does that apply to 1D arrays as well? For example:

__device__ float *myglobal;

void feedMyGlobal(c++_array){
cudamalloc(size*float)
cudamemcopy (myglobal, c++_array)
}

__global__
void kernel()
{
....
if(tid<size)
float x = something*myglobal[tid];
.....
}

void freeGlobs(){
cudaFree(myGlobal)
}

So if I call feedMyGlobal(c++_array), the global pointer will be set on the device, and I can then call kernel() without trouble?

Cheers Than

thanasio · September 29, 2011, 9:38am

Hey Alexish,

are you sure that works? I’ve tried to implement it with a pointer for which I allocate memory and copy values using cudaMalloc and cudaMemcpy. Although the code compiled, there was no success; I kept getting:

Cuda error: Kernel execution failed in file ‘gpu_code/bonds_cuda.cu’ in line 205 : unspecified launch failure.

The same code works fine when the variable is declared in the wrapper function, passed as a parameter to the kernel and freed after the kernel…

…strange…

SpongePatoche · September 29, 2011, 10:31am

Can you replace

__device__ float *GlobalArray;

with

static float *GlobalArray;

and pass GlobalArray as a parameter to your kernel?

thanasio · September 29, 2011, 1:32pm

Hey SpongePatoche,

good advice, it works…

However, if I call the same kernel again, I get this runtime error:

Cuda error: Kernel execution failed in file ‘gpu_code/bonds_cuda.cu’ in line 244 : invalid device pointer.

Basically, if I free the global and reallocate it on the device, it is all fine, but if not, i.e. if I leave the global resident on the device across multiple kernel calls, then on the second call I get the error. The same happens with texture1D…

So, where is the application lifetime of the global variable…?

SpongePatoche · September 29, 2011, 2:15pm

How is your app structured?

In my case, I use global static variables in a shared library to keep a pointer to device memory between different calls.

(If your shared library contains both the global static variables and the kernel, I think your app will run fine.)

thanasio · September 29, 2011, 2:33pm

So, you more or less mean you have:

__device__ static float pointer; // or do you actually mean global static float…?

void feedPointer(){
cudamalloc (*pointer)
cudamemcopy (c++_pointer, *pointer)
}

__global__ kernel(){
usesValuesOf(pointer)
}

and then you call feedPointer() once, make as many kernel() calls as you like, and free the pointers before exiting…?

I’ve tried this structure and it doesn’t work for me…

cheers,

Than

SpongePatoche · September 29, 2011, 2:50pm

In my shared library I have something like this:

// Host-side variable that holds the address of the device buffer.
static float *globalArray = NULL;

void
InitDeviceMemory(int size)
{
    // Allocate the device buffer once
    cudaMalloc((void**) &globalArray, size * sizeof(float));
}

void
MyFunction(float *data, int size)
{
    // Copy input data to device memory
    cudaMemcpy(globalArray, data, size * sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel with globalArray as a parameter
    // Kernel <<< nBlock, nThreadPerBlock >>> (globalArray);
}

void
FreeDeviceMemory()
{
    cudaFree(globalArray);
    globalArray = NULL;
}

1 - InitDeviceMemory, once
2 - MyFunction, as many times as needed
3 - FreeDeviceMemory, once

thanasio · September 29, 2011, 3:23pm

That helps…! Your global variable is allocated in CUDA memory, but each time you call your kernel you have to copy values to the global array.

It seems that I got things wrong here… So you can allocate global memory once and use the slot in a series of kernel calls, but values in memory only last for the kernel. Therefore, we have to copy values to the same memory location each time before we call a kernel…

…my perception was that the variable could live on the device for the whole application and keep its values… something like a static variable in C/C++…

Anyway… thanks… that helped…!!!

SpongePatoche · September 29, 2011, 3:32pm

In my app, every call to MyFunction runs a kernel on fresh input data, and that data is not needed afterwards, so I overwrite it…

But if you want to call the kernel many times with different data, what happens to the old values in your app?
Can you describe your setup in more detail?

mfatica · September 29, 2011, 3:56pm

“…my perception was that the variable could live in the device for the whole application and keep its values…”

Global memory has the lifespan of the application.

thanasio · September 29, 2011, 4:17pm

To mfatica: that is my point, and the title of this topic. But is it only the global memory allocation that spans the whole application, or also the values the pointer refers to? If you take a glance at the previous posts you’ll get the idea…

To sponge…: I basically have two pointers, pointing at constant data that will not change throughout the application. My application is an N-body problem. I cannot use constant memory, as I can have more than 40,000-50,000 bodies, so constant memory is totally out of the question. Apparently, for 1D arrays global memory performs better than texture on Fermi cards (compute capability 2.1)…

So an ideal scenario for me would be to have them resident in global memory for the whole application… Here is a very simple snippet that I am trying to run:

__device__ static float* r0_dev;
__device__ static float* kb_dev;

template <int threads>
__global__ void compute_bonds_energy2(float* e, Lock lock, float *pos_a,
float *pos_b, float *r0, float *kb) {

__shared__ float cache[threads];
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;


while (tid < molsize) {
	
	printf("|index : %d, tid %d, pos : %lf", cacheIndex, tid, kb[tid]);


	tid += blockDim.x * gridDim.x;
}

}

void bondGlobals(MMFF94S_bond_calcs_t* bondsIn, Bond_LookUp_t *lookUp_bonds,
int numAtoms) {

CUDA_SAFE_CALL(cudaMalloc((void**) &r0_dev, numAtoms * sizeof(float)));

CUDA_SAFE_CALL(cudaMalloc((void**) &kb_dev, numAtoms * sizeof(float)));


printf("global allocated \n");

}

float compute_e_bonds(MMFF94S_bond_calcs_t* bondsIn,
Bond_LookUp_t *lookUp_bonds, int numAtoms) {

CUDATimer cu_timer, cu_timer2;
int blocksPerGrid;

//wrapper function timer-----------------------
cu_timer2.Start();
//---------------------------------------------


blocksPerGrid
		= imin(cuda::numBlocks,(numAtoms+cuda::threadsPerBlock - 1)/cuda::threadsPerBlock);

float e, *dev_e, *dev_pos_a, *dev_pos_b;//



//allocate pos_a
CUDA_SAFE_CALL(cudaMalloc(
				(void**) &dev_pos_a,
				3*lookUp_bonds -> bonds_struct_length
				* sizeof(float)));
//allocate pos_b
CUDA_SAFE_CALL(cudaMalloc(
				(void**) &dev_pos_b,
				3*lookUp_bonds -> bonds_struct_length
				* sizeof(float)));

CUDA_SAFE_CALL(cudaMemcpy(dev_pos_a, bondsIn -> pos_a,
				3*lookUp_bonds -> bonds_struct_length * sizeof(float),
				cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(dev_pos_b, bondsIn -> pos_b,
				3*lookUp_bonds -> bonds_struct_length * sizeof(float),
				cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(r0_dev, bondsIn -> r0, numAtoms * sizeof(float),
				cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(kb_dev, bondsIn -> kb, numAtoms * sizeof(float),
				cudaMemcpyHostToDevice));
//----------------------------------------------------

//allocate e
CUDA_SAFE_CALL(cudaMalloc((void**) &dev_e, sizeof(float)));


//----------------------------------------------------
CUDA_SAFE_CALL(cudaMemcpy(dev_e, &e, sizeof(float),
				cudaMemcpyHostToDevice));

Lock lock;

//---------------------------------------------
cudaThreadSynchronize();

cu_timer.Start();
compute_bonds_energy2<cuda::threadsPerBlock> <<<blocksPerGrid, cuda::threadsPerBlock>>>(dev_e, lock, dev_pos_a, dev_pos_b, r0_dev, kb_dev/*, dev_lengths, dev_offsets */);
CUT_CHECK_ERROR("Kernel execution failed");

//stop kernel timer and output
cudaThreadSynchronize();
double t = cu_timer.GetET();

cudaFree(dev_pos_a);
cudaFree(dev_pos_b);

CUDA_SAFE_CALL(cudaMemcpy(&e, dev_e, sizeof(float),
				cudaMemcpyDeviceToHost));
CUDA_SAFE_CALL(cudaFree(dev_e));
//stop wrapper function timer and output
double t2 = cu_timer.GetET();

printf("Time to generate bonds wrapper : %3.1lf ms \n", t2);
printf("Time to generate bonds kernel : %3.1lf ms \n", t);
cudaThreadSynchronize();
return e;

}

At the moment, a call sequence like

bondGlobals();
kernel();
kernel();

runs the first kernel and fails on the second…

SpongePatoche · September 29, 2011, 4:56pm

OK… I have the same structure in my app and it works.

Except that I just have

static float * XXX

and not

__device__ static float *XXX

Have you tried that?

When you call:

bondGlobals()

compute_e_bonds()

compute_e_bonds()

does the app crash in the kernel or in compute_e_bonds?

Are you sure that at every call of cudaMemcpy(r0_dev, …) and cudaMemcpy(kb_dev, …), numAtoms doesn’t exceed the size you passed to cudaMalloc()?
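
One way to rule that out, as a rough sketch: remember how many elements were allocated and refuse any upload that asks for more. Only r0_dev comes from the posted code; r0_capacity and the helpers allocR0, uploadR0 and freeR0 are hypothetical names for illustration.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Plain host-side statics: the pointer value is a device address,
// but the variables themselves live in host memory.
static float *r0_dev = NULL;
static size_t r0_capacity = 0;   // number of floats currently allocated

// Allocate once and record the capacity (hypothetical helper).
cudaError_t allocR0(size_t numAtoms)
{
    cudaError_t err = cudaMalloc((void **)&r0_dev, numAtoms * sizeof(float));
    r0_capacity = (err == cudaSuccess) ? numAtoms : 0;
    return err;
}

// Upload at most r0_capacity floats; reject larger requests instead of
// writing past the end of the allocation (hypothetical helper).
cudaError_t uploadR0(const float *host, size_t numAtoms)
{
    if (r0_dev == NULL || numAtoms > r0_capacity)
        return cudaErrorInvalidValue;
    return cudaMemcpy(r0_dev, host, numAtoms * sizeof(float),
                      cudaMemcpyHostToDevice);
}

// Release the buffer and reset the bookkeeping (hypothetical helper).
cudaError_t freeR0()
{
    cudaError_t err = cudaFree(r0_dev);
    r0_dev = NULL;
    r0_capacity = 0;
    return err;
}
```

With this, an oversized numAtoms shows up as an error code on the host rather than as a later launch failure.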

alexish · September 29, 2011, 8:59pm

You can also do that for arrays, but like this:

Declare the pointer in the .cu file:

__device__ float* myglobal; (at this point the pointer could also be declared as constant, i.e. constant on the GPU side)

Allocate device memory (size in bytes):

float* devptr;

cudaMalloc(&devptr, size);

Copy the address held in devptr to myglobal:

cudaMemcpyToSymbol(myglobal, &devptr, sizeof(float *));

But why don’t you pass devptr directly to your kernel?
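
Put together, those steps could look roughly like the sketch below. It assumes the symbol is passed to cudaMemcpyToSymbol directly rather than by name; the kernel addOne and the host helpers feedMyGlobal, runAddOne, readBack and freeGlobs are hypothetical names for illustration.

```cuda
#include <cuda_runtime.h>

// Device-side pointer variable; it holds an address in global memory.
__device__ float *myglobal;

// Host-side copy of the same address, needed for cudaMemcpy and cudaFree.
static float *devptr = NULL;

// Hypothetical kernel: reads and writes the array through myglobal.
__global__ void addOne(int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        myglobal[tid] += 1.0f;
}

// Allocate, upload the data, then publish the address to the device symbol.
cudaError_t feedMyGlobal(const float *hostArray, int n)
{
    cudaError_t err = cudaMalloc((void **)&devptr, n * sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(devptr, hostArray, n * sizeof(float),
                     cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;
    // Copies the pointer value itself (sizeof(float *) bytes), not the array.
    return cudaMemcpyToSymbol(myglobal, &devptr, sizeof(float *));
}

// Launch the kernel once and wait for it to finish.
cudaError_t runAddOne(int n)
{
    addOne<<<(n + 127) / 128, 128>>>(n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}

// Copy the current device contents back to the host.
cudaError_t readBack(float *hostArray, int n)
{
    return cudaMemcpy(hostArray, devptr, n * sizeof(float),
                      cudaMemcpyDeviceToHost);
}

// Free the buffer; the host keeps the address, so no symbol read is needed.
cudaError_t freeGlobs()
{
    cudaError_t err = cudaFree(devptr);
    devptr = NULL;
    return err;
}
```

Calling feedMyGlobal once and runAddOne twice should leave every element incremented by 2, because the second launch sees what the first one wrote.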

thanasio · September 30, 2011, 9:24am

Sponge: it crashes in the kernel. The memcpy copies exactly the same values each time… so it does not exceed the initialisation value…
Alexish: I am trying to run an application that demands optimization, and I was hoping that I could malloc and copy once, read values from the array in multiple kernel calls, and free the memory before exiting the application.

alexish · September 30, 2011, 12:27pm

I am doing exactly the same thing. Allocations in global memory have application lifetime, so you simply:

cudaMalloc a devptr
call your kernels, passing devptr as an argument, as many times as you need
at the end of your computation, copy your results back to the CPU (if needed) and cudaFree devptr
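
As a rough sketch of those three steps, assuming read-only per-element constants that are uploaded once and an accumulator that is never re-copied between launches. The kernel accumulate and the wrapper runManyLaunches are hypothetical names for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds the constant array into an accumulator.
// Both arrays stay resident in global memory between launches.
__global__ void accumulate(float *acc, const float *constants, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        acc[tid] += constants[tid];
}

// Allocate and upload once, launch `launches` times, download once, free once.
// Returns true if every CUDA call succeeded.
bool runManyLaunches(const float *hostConstants, float *hostResult,
                     int n, int launches)
{
    float *devConstants = NULL, *devAcc = NULL;
    size_t bytes = n * sizeof(float);

    bool ok = cudaMalloc((void **)&devConstants, bytes) == cudaSuccess
           && cudaMalloc((void **)&devAcc, bytes) == cudaSuccess
           && cudaMemcpy(devConstants, hostConstants, bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemset(devAcc, 0, bytes) == cudaSuccess;  // all-zero bytes == 0.0f

    // No host-to-device copy inside the loop: each launch reads the
    // constants uploaded above and the sums left by the previous launch.
    for (int i = 0; ok && i < launches; ++i) {
        accumulate<<<(n + 127) / 128, 128>>>(devAcc, devConstants, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    ok = ok && cudaMemcpy(hostResult, devAcc, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(devConstants);
    cudaFree(devAcc);
    return ok;
}
```

With constants {1, 2, 3} and three launches the result should be {3, 6, 9}, which only works if both buffers keep their contents from one launch to the next.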

thanasio · October 3, 2011, 1:23pm

Thanks for the advice, guys… I sorted the problem: it was one of the variables I was passing… I tried a simple kernel and it worked… so it is just a case of debugging now really…
