How to NOT free device variables. Is it possible?

Hello,

I've got a C++ program that uses CUDA through an external function, just like the cppIntegration SDK example.
I'm copying some large variables to the device (global memory) at the beginning of the program, and then
I don't use them in the CPU program at all. I was wondering if I could just leave them
on the device and not allocate, copy, or free them again.
The external function is called all the time; it's a real-time DSP program. So I'm copying these variables
to the GPU and back to the CPU all the time, at each iteration. The thing is that once I create them in the CPU main program and send them to the device,
I don't process them again in the CPU part of the program at all.
All the processing is done in CUDA. Then I copy them
back to the CPU (processed) and back to the GPU in the next iteration to process them again.
The point is, as I said, that I don't use them in the CPU part of the program, only in CUDA.

These variables are too large to fit in shared memory…

How would I allocate them just once on the device, leave them there without freeing them, and skip the
pointless copying every time?
In case I'm not making any sense, here is what happens, in short:

1. Load the filter response on the CPU.
2. Send the filter response and copy it to the GPU.
3. Send the real-time audio buffer to the GPU.
4. Send and copy the circular buffer to the GPU.
5. Process all three on the GPU.
6. Send the processed audio buffer back to the CPU, and from there to the output of the program.
7. Free the device arrays.
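
In code terms, each iteration currently looks roughly like this (a simplified sketch with made-up names, not my actual plugin code):

[codebox]#include <cuda_runtime.h>

// stand-in for the real filtering kernel
__global__ void processKernel(float* audio, const float* filter, float* circ, int n) { }

void iteration(float* h_audio, const float* h_filter, float* h_circ,
               int audioN, int filterN, int circN)
{
    float *d_audio, *d_filter, *d_circ;
    cudaMalloc((void**)&d_filter, filterN * sizeof(float));                           // step 2
    cudaMalloc((void**)&d_audio,  audioN  * sizeof(float));                           // step 3
    cudaMalloc((void**)&d_circ,   circN   * sizeof(float));                           // step 4
    cudaMemcpy(d_filter, h_filter, filterN * sizeof(float), cudaMemcpyHostToDevice);  // never changes!
    cudaMemcpy(d_audio,  h_audio,  audioN  * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_circ,   h_circ,   circN   * sizeof(float), cudaMemcpyHostToDevice);  // CPU never touches it!
    processKernel<<<(audioN + 255) / 256, 256>>>(d_audio, d_filter, d_circ, audioN);  // step 5
    cudaMemcpy(h_audio, d_audio, audioN * sizeof(float), cudaMemcpyDeviceToHost);     // step 6
    cudaMemcpy(h_circ,  d_circ,  circN  * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_filter); cudaFree(d_audio); cudaFree(d_circ);                          // step 7: the part I want to drop
}[/codebox]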

So my problem is that I don't want to send the filter response (and the circular buffer) back and forth all the time,
because I don't have to. I don't use them at all in the CPU part after initialization.
What do I do?

Thanks,
Filippos

The only thing you have to do is… nothing!
As long as the context stays alive, your malloc'ed device pointers will stay valid, and the data memcpy'ed to the device will remain accessible.
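
Something along these lines (an untested sketch; the init/process/free split and all the names are made up for illustration): keep the device pointers in file-scope statics inside the .cu file, allocate and upload the big arrays once, and only move the small audio buffer per iteration.

[codebox]#include <cuda_runtime.h>

static float* d_filter = 0;  // large, uploaded once, lives as long as the context
static float* d_circ   = 0;  // circular buffer, stays resident on the device
static float* d_audio  = 0;  // small per-block buffer, copied every call

// stand-in for the real filtering kernel
__global__ void applyFilter(float* audio, const float* filter, float* circ, int n) { }

extern "C" void initDevice(const float* h_filter, int filterN, int circN, int audioN)
{
    cudaMalloc((void**)&d_filter, filterN * sizeof(float));
    cudaMalloc((void**)&d_circ,   circN   * sizeof(float));
    cudaMalloc((void**)&d_audio,  audioN  * sizeof(float));
    cudaMemcpy(d_filter, h_filter, filterN * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_circ, 0, circN * sizeof(float));
}

extern "C" void processBlock(float* h_audio, int audioN)
{
    // only the small audio buffer crosses the bus each iteration
    cudaMemcpy(d_audio, h_audio, audioN * sizeof(float), cudaMemcpyHostToDevice);
    applyFilter<<<(audioN + 255) / 256, 256>>>(d_audio, d_filter, d_circ, audioN);
    cudaMemcpy(h_audio, d_audio, audioN * sizeof(float), cudaMemcpyDeviceToHost);
}

extern "C" void freeDevice()
{
    cudaFree(d_filter); cudaFree(d_circ); cudaFree(d_audio);
}[/codebox]

The one caveat: all three functions must be called from the same CPU thread, because the allocations belong to that thread's CUDA context.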

Thanks for the quick reply!

It's a bit trickier than it appears, though. I'm currently using a worker thread to call the external function and then the CUDA code.

It is a DWORD WINAPI threadfunc(LPVOID param) … declared globally in the source file where I've got the constructor and the most important functions of my plugin.

It is not working very well…

If I free all the data in each iteration, it works fine… but it crashes badly if I don't.

I've tried synchronising using critical sections, events, and mutexes, with no luck.

The truth is, it runs fine as a console application… but the same solution does not work for my DLL application (a VST plugin, that is). Here is the code:

[codebox]// includes, system
#include <stdlib.h>
#include <iostream>

// Required to include CUDA vector types
#include <vector_types.h>
#include "cutil_inline.h"

#include <windows.h>
#include <process.h>
#include "thread.h"
#include "criticalsection.h"

using namespace std;

extern "C" void runTest(int* data, int* data2, int len, int height, int start);

int loop;
int* data;
int* data2;
int length = 12435;
int init = 1;
int height = 253;
CRITICAL_SECTION cs;

int DivUp(int a, int b)
{
    return (a % b != 0) ? (a / b + 1) : (a / b);
}

DWORD WINAPI ThrdFunc(LPVOID n)
{
    HANDLE hEvent     = OpenEventA(EVENT_ALL_ACCESS, false, "MyEvent");
    HANDLE StartEvent = OpenEventA(EVENT_ALL_ACCESS, false, "Start_Event");
    if (!hEvent)     { return -1; }
    if (!StartEvent) { return -1; }

    while (1)
    {
        WaitForSingleObject(StartEvent, INFINITE);
        runTest(data, data2, length, height, init);

        EnterCriticalSection(&cs);
        ResetEvent(StartEvent);
        SetEvent(hEvent);
        LeaveCriticalSection(&cs);
    }

    CloseHandle(hEvent);
    CloseHandle(StartEvent);
    cout << "End of the Thread......" << endl;
    return 0;
}

int main(int argc, char** argv)
{
    InitializeCriticalSection(&cs);
    int hcounter = 0;
    init = 1;
    int global_counter = 0;

    HANDLE hEvent     = CreateEventA(NULL, true, false, "MyEvent");
    HANDLE StartEvent = CreateEventA(NULL, true, false, "Start_Event");

    float t_size = height * length;
    DWORD Id;
    HANDLE hThrd = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)ThrdFunc, 0, 0, &Id);

    data  = new int[length];
    data2 = new int[height * length];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < length; j++)
        {
            data2[j + i * length] = j;
        }
    }

    for (int counter = 0; counter < 250; counter++)
    {
        SetEvent(StartEvent);
        WaitForSingleObject(hEvent, INFINITE);

        if (counter == 0) {   // allocate and copy all arrays only in the first iteration
            init = 0;
        }

        ResetEvent(hEvent);
        cout << data2[120] << endl;
    }

    init = 2;   // run one last time and free all device variables
    SetEvent(StartEvent);
    WaitForSingleObject(hEvent, INFINITE);
    //ResetEvent(hEvent);

    getchar();
    //cutilExit(argc, argv);
    return 0;
}[/codebox]

Do I need to do anything else? Why is it crashing?

What is a good (or at least working) way to keep the worker thread alive without having to free
and re-copy the device variables all the time?

I've also read that it's possible to make the worker thread function a static member
of the main application class… Would that make any difference?

Again, I don't know why, but the above code works for a console app and not for a DLL plugin…

Thank you,

Filippos

I've not tried this yet, so I might be wrong, but I'm facing a similar scenario and this is how I would do it…

Using OpenMP you can declare that several functions run in parallel; have one of these functions do your CUDA work, something like the following…

void function_in_a_thread()
{
    // load stuff to the device
    do {
        // kernel call
        // spin until the kernel is finished
        // copy the result back
    } while (/* some other global flag is not set */);
    cudaFree(...);
}

Then the other function can run everything CPU-side, making other threads as and when needed. Your kernel should also spin until some GPU variable is set as new data is copied to the device.
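
In compilable form it would be something like this (untested, all names invented; the CUDA calls are elided to comments):

[codebox]#include <omp.h>

volatile int done = 0;       // global flag, set by the CPU side when finished

void function_in_a_thread()  // owns the CUDA context for its whole lifetime
{
    // cudaMalloc / cudaMemcpy the persistent arrays once, here
    while (!done) {
        // launch the kernel, then block until it finishes:
        // cudaThreadSynchronize();
        // cudaMemcpy the result back
    }
    // cudaFree(...) only here, when everything is over
}

void cpu_side_work()
{
    // feed new buffers, talk to the host app, spawn other threads as needed...
    done = 1;                // ...and finally tell the CUDA thread to quit
}

int main()
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        function_in_a_thread();
        #pragma omp section
        cpu_side_work();
    }
    return 0;
}[/codebox]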

Does this help?

I have exactly the same question right now. What do you mean by “context staying alive”?

I tried doing the full memcpy on the first call to the CUDA function and then bypassing most of the memcpy and cudaFree calls, but it stops with an “out of memory” error.

I have 4 large arrays which do not change and 2 smaller arrays that change at each iteration.

Any more clues?

The CPU thread that mallocs the device memory must remain, and it must be the one to invoke kernel calls. If that thread is destroyed, then it's all over, at least that's my understanding. This raises a question though…

What if I declare a pointer to device memory globally, then allocate the memory for that pointer in one thread: can I not still use that memory from another thread, and will it not persist even after that thread is destroyed? This would seem more useful, as I would no longer need to keep that CPU thread alive when I could be using the CPU for other stuff.

Pointers are only valid inside the context they were created in. If you want to keep an allocation, you have to keep the context, either by keeping the thread that created it alive, or by using context migration if you are using the driver API. Once the thread holding the context exits, the context is gone, and so is the memory allocation tied to it.
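
Context migration looks roughly like this with the driver API (a minimal sketch, error checking omitted, thread functions purely illustrative): one thread pops the context off its current-context stack, and another thread pushes the same CUcontext handle and can then use every allocation made in it.

[codebox]#include <cuda.h>    // driver API

CUcontext   ctx;         // handles shared between the threads
CUdeviceptr d_buf;

void threadA()           // creates the context and the allocation
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_buf, 1024 * sizeof(float));  // allocation is tied to ctx
    cuCtxPopCurrent(&ctx);                     // detach ctx from this thread; it stays alive
}

void threadB()           // adopts the context and keeps using the allocation
{
    cuCtxPushCurrent(ctx);   // ctx (and d_buf) is now usable from this thread
    // ... cuMemcpyHtoD / kernel launches / cuMemcpyDtoH on d_buf ...
    cuCtxPopCurrent(&ctx);   // or cuCtxDestroy(ctx) when finally done
}[/codebox]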

Hi,

thanks for all the answers guys… anthonyfmorse, I'll have a look at OpenMP… but I really
don't want to use 100 different libraries in one simple program, especially for
basic operating-system concepts like CPU threads.

Does anyone have a working implementation of this?
It really looks like a very trivial scenario, and I cannot understand why the usual Windows
mutexes, events, etc. don't work very well.

On the other hand, I was wondering if there is any way to keep the CPU waiting by using
a command from inside the GPU (I don't think so, but you never know…)? Is that possible? Something like cudaThreadSynchronize(),
kind of?
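
What I have in mind is roughly this fragment (cudaThreadSynchronize() at least blocks the calling CPU thread until all previously launched device work has finished; the kernel and variable names here are made up):

[codebox]myKernel<<<grid, block>>>(d_data);  // returns immediately, the launch is asynchronous
cudaThreadSynchronize();            // the CPU thread blocks here until the kernel is done
cudaMemcpy(h_out, d_data, bytes, cudaMemcpyDeviceToHost);  // a memcpy also implicitly syncs[/codebox]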

Anyway, if anyone has actually implemented this in C++, PLEASE share it with us, or at least
give us some good hints.

By the way (and this is not CUDA related, but it would provide an answer to my problem),
does anyone know if VST plugins dislike worker threads? I mean, since the same code works
for a console app, why isn't it working for the VST DLL program? Is there a compatibility issue
between VST and CPU worker threads?

A member named Mathew Hill answered this question (not with a code example, though) a while ago in one of my previous posts (VST-CUDA integration), so if any of you want to have a look there…

Also, if you're still out there, Matthew, can you please give me a more detailed description of how
you did this? As I said, I've tried the global worker thread, and it works, but ONLY if I close it in each iteration…

Best regards,
Filippos

I have a plugin working in our own internal system with a similar workflow, and I know of a few plugins for Maya that work like that. There should be no problem keeping the worker thread alive and frozen, waiting for the next job to come along.

Thanks erdooom,
Good to know that, for a start.
Could you possibly give me some more details regarding that thread? How are you doing it?
Any code would be great… but if you can't post code, can you at least describe
the steps you're taking?

For example:
Is your worker thread a class member, or is it global?
Are you using events, mutexes, or any of these?
If you are, how are you doing it? Does it look like the code I posted above?
Why is that code not working, anyway?
And of course, are you calling CUDA code through that thread, or is it something else?

Thanks again,
Filippos