Global memory's lifespan : Application. How do you do that? How could you allocate a variable to

Hi,

I have read in different articles and books (ie. Kirk : Chapter 5), that global memory’s lifespan can be the whole application. I want to declare a variable to reside in global memory (not texture!) and last there for the whole application but i have never seen such an example. Has anybody got prior experience to that? If yes a little code snippet would help loads…

Thanks in advance,
Than

Hi,
it’s very simple, declare it in your cu file like any global variable in C with prefix device.

Example:

device float myglobal;

global
void kernel()
{

float x = something*myglobal;

}

But be very careful if you need to write to the global variable.

Hi Alexish, thanks for the reply, does that apply to 1D arrays as well? For example :

device float *myglobal;

void feedMyGlobal(c++_array){

cudamalloc(size*float)

cudamemcopy (myglobal, c++_array)

}

global

void kernel()

{

....

if(tid<size)

float x = something*myglobal[tid];

.....

}

void freeGlobs(){

cudaFree(myGlobal)

}

So if i call feedMyGlobal(c++_array) the global variable pointer will be set in the device and then can i call kernel() without troubles?

Cheers Than

Hey Alexish,

are you sure that works? I’ve tried to implement it with pointer that i would allocate memory and copy values with cudamalloc and cudamemcopy. Although the code compiled…there was no success, kept getting :

Cuda error: Kernel execution failed in file ‘gpu_code/bonds_cuda.cu’ in line 205 : unspecified launch failure.

the same code would work fine when the variable was declared in the wrapper function, passed as a parameter in kernel and freed after kernel…

…strange…External Image)

Can you change

__device__ float *GlobalArray;

by

static float *GlobalArray;

And gives GlobalArray in params of your kernel.

Hey sponge patoche,

good advice it works…

however though if i call the same kernel again, the compiler throws :

Cuda error: Kernel execution failed in file ‘gpu_code/bonds_cuda.cu’ in line 244 : invalid device pointer.

basically if i free the global and reallocate it to the device, it is all fine, but if not ie. i live the global resident on the device for multiple kernel calls,

then on the second call i get the error. Also the same happens with texture1D…

So, where is the application lifetime of the global variable…?

How is structured your app ?

At home, I used global static variables in a shared library to keep a pointer to device memory between different calls.

(If you can make shared library contains your global static varaibles and kernel, I think your app run good)

So, you more or less mean you have :

device static float pointer;//or do you actually mean global static float…?

void feedPointer(){

cudamalloc (*pointer)

cudamemcopy (c++_pointer, *pointer)

}

global kernel(){

usesValuesOf(pointer)

}

and then you call feedPointer() once, then you can make as many kernel() calls as you like and free pointers before exiting…?

as i’ve tried this structure and doesn’t work for me…

cheers,

Than

In my shared library I have something like that :

static float *globalArray = NULL;

void

InitDeviceMemory(int size)

{

     // Init memory space

     cudaMalloc((void**) &globalArray, size * sizeof(float));

}

void

MyFunction(float *data, int size)

{

    // Copy input data in device memory

     cudaMemcpy(globalArray, data, size * sizeof(float), cudaMemcpyHostToDevice);

// Launch kernel with globalArray in params

     // Kernel <<< nBlock, nThreadPerBlock >>> (globalArray);

}

void

FreeDeviceMemory()

{

     cudaFree(globalArray);

     globalArray = NULL;

}

1 - InitDeviceMemroy

2 - MyFunction lots of time

3 - FreeDeviceMemory

That helps…! your global variable is allocated in cuda memory, but each time you call your kernel, you have to copy values to the global array.

It seems that i got things wrong here…so you can allocate global memory once and use the slot in a series of kernel calls, but values in memory only last for the kernel Therefore, we have to copy values to the same memory location each time before we call a kernel…

…my perception was that the variable could live in the device for the whole application and keep its values…something like a static variable in c/c++…

anyway…thanks…that helped…!!!

In my app, every call to MyFunction make a kernel processing from data sources, but these data not used after so I crushed them…

But if you want to call lots of time kernel with different data, what happens to the old values in your app ?
Can you precise your script ?

…my perception was that the variable could live in the device for the whole application and keep its values…

Global memory has the lifespan of the application.

To mfatica : that is my point and the present topics title. But is it global memory allocation that can span for the whole application or the actual pointer that holds the values. If you take a glance on the previous posts you’ll get an idea…

To sponge… : I basically have two pointers, pointing at constant data that will not change throughout the application. My application is an N-Body problem. I cannot use constant memory as i can have more than 40-50 000 bodies on my app so constant memory is totally out of the question. Apparently for 1d arrays global memory performs better than texture on Fermi cards (2.1 capability)…

So an ideal senario for me would be to have them resident on global memory for the whole application… Here is a very simple snippet that i am trying to run :

device static float* r0_dev;
device static float* kb_dev;

template
global void compute_bonds_energy2(float* e, Lock lock, float *pos_a,
float *pos_b, float *r0, float *kb) {

__shared__ float cache[threads];
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;


while (tid < molsize) {
	
	printf("|index : %d, tid %d, pos : %lf", cacheIndex, tid, kb[tid]);


	tid += blockDim.x * gridDim.x;
}

}

void bondGlobals(MMFF94S_bond_calcs_t* bondsIn, Bond_LookUp_t *lookUp_bonds,
int numAtoms) {

CUDA_SAFE_CALL(cudaMalloc((void**) &r0_dev, numAtoms * sizeof(float)));

CUDA_SAFE_CALL(cudaMalloc((void**) &kb_dev, numAtoms * sizeof(float)));


printf("global allocated \n");

}

float compute_e_bonds(MMFF94S_bond_calcs_t* bondsIn,
Bond_LookUp_t *lookUp_bonds, int numAtoms) {

CUDATimer cu_timer, cu_timer2;
int blocksPerGrid;

//wrapper function timer-----------------------
cu_timer2.Start();
//---------------------------------------------


blocksPerGrid
		= imin(cuda::numBlocks,(numAtoms+cuda::threadsPerBlock - 1)/cuda::threadsPerBlock);

float e, *dev_e, *dev_pos_a, *dev_pos_b;//



//allocate pos_a
CUDA_SAFE_CALL(cudaMalloc(
				(void**) &dev_pos_a,
				3*lookUp_bonds -> bonds_struct_length
				* sizeof(float)));
//allocate pos_b
CUDA_SAFE_CALL(cudaMalloc(
				(void**) &dev_pos_b,
				3*lookUp_bonds -> bonds_struct_length
				* sizeof(float)));

CUDA_SAFE_CALL(cudaMemcpy(dev_pos_a, bondsIn -> pos_a,
				3*lookUp_bonds -> bonds_struct_length * sizeof(float),
				cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(dev_pos_b, bondsIn -> pos_b,
				3*lookUp_bonds -> bonds_struct_length * sizeof(float),
				cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(r0_dev, bondsIn -> r0, numAtoms * sizeof(float),
				cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(kb_dev, bondsIn -> kb, numAtoms * sizeof(float),
				cudaMemcpyHostToDevice))
//----------------------------------------------------

//alocate e
CUDA_SAFE_CALL(cudaMalloc((void**) &dev_e, sizeof(float)));


//----------------------------------------------------
CUDA_SAFE_CALL(cudaMemcpy(dev_e, &e, sizeof(float),
				cudaMemcpyHostToDevice));

Lock lock;

//---------------------------------------------
cudaThreadSynchronize();

cu_timer.Start();
compute_bonds_energy2<cuda::threadsPerBlock> <<<blocksPerGrid, cuda::threadsPerBlock>>>(dev_e, lock, dev_pos_a, dev_pos_b, r0_dev, kb_dev/*, dev_lengths, dev_offsets */);
CUT_CHECK_ERROR("Kernel execution failed");

//stop kernel timer and ouitput
cudaThreadSynchronize();
double t = cu_timer.GetET();

cudaFree(dev_pos_a);
cudaFree(dev_pos_b);

CUDA_SAFE_CALL(cudaMemcpy(&e, dev_e, sizeof(float),
				cudaMemcpyDeviceToHost));
CUDA_SAFE_CALL(cudaFree(dev_e));
//stop wrapper function timer and output
double t2 = cu_timer.GetET();

printf("Time to generate bonds wrapper : %3.1lf ms \n", t2);
printf("Time to generate bonds kernel : %3.1lf ms \n", t);
cudaThreadSynchronize();
return e;

}

at the moment a call like

bondGlobals();
kernel();
kernel();

will execute for the first kernel and fail on the second…

Ok … I have the same behavior for my app and it’s work.

Except that I just have

static float * XXX

and not

__device__ static float *XXX

Have you tried?

When you call :

bondGlobals()

compute_e_bonds()

compute_e_bonds()

App crashes in the kernel or in compute_e_bonds ?

Are you sure at every call of cudaMemcpy(r0_dev, …) and cudaMemcpy(kd_dev, …), numAtoms doesn’t exceed initialization value of cudaMalloc() ?

You can also do that for arrays, but like this:

  1. declare the pointer in cu file

device float* myglobal; (at this point the pointer could also be declared as constant, on the GPU side constant)

  1. Allocate device memory

float* devptr;

cudaMalloc(&devptr, size);

  1. copy the address of the allocated pointer to myglobal

cudaMemCopyToSymbol(“myglobal”, &devptr, sizeof(float *), cudaMemcpyHostToDevice);

But why you don’t pass directly devptr to your kernel ?

Sponge : it crashes in kernel. Memcpy copies the same values exactly…so it does not exceed initialisation value…
Alexish: i am trying to run an application that demands optimization and i was dreaming that i could malloc and copy once, read values from the array ion multiple kernel calls and free their memory before exiting the application.

I am doing exactly the same thing. Allocations in global mem have application lifetime so you simply do:

  1. cudaMemalloc a devptr

  2. call your kernels passing devptr as argument how many times you need

  3. at the end of your computation copy your results back to cpu (if needed) and cudaFree devptr

Thanks for the advice guys…i sorted the problem, it was one of the variables i was passing…i tried a simple kernel and it worked…so it is just a case of debugging now really…