VST - CUDA integration

Hello.
I m really new in CUDA programming…
I have been trying to integrate CUDA with a steinberg VST plugin project in Visual studio. I have installed CUDA in VS and everything works fine. Within the project I have managed to link the necessary CUDA libraries and include files.
Everything works fine except one thing…
Whenever I compile the project the compiler stops and gives me an error for using the <<< >>> syntax:
e.g.

functionName<<<x,y>>>() ;

The error is the following:

error C2059: syntax error : ‘<’

I have also checked the example project called “cppIntegration” and tried to replicate the method used there. It still gives me the same error. Although I m not using this syntax within a .cpp file but only in .cu files.

Another thing I 've read in forums is that you can enclose the CUDA program within a .dll library and use it in another project from there. I tried doing this as well and it gave me exactly the same error.

Does anyone know what the problem might be??If I m doing anything wrong.?
Any help would be much appreciated .
Thank you
Filippos

compile with nvcc, not cl

I’m unsure how your project is configured, but I have had success doing VST and CUDA by adding a custom build rule to the project for the .cu extension (you get one with the CUDA SDK in common\scripts). Then when you add in .cu files nvcc will compile them for you, as tmurray recommended you do. Your .cu file needs to have some extern C functions to be called from the cpp VST files (which also need declaring in a .h file for cpp to be made aware of them), but the helpful cpp integration examples supplied by NVIDIA should assist you if you get stuck with that.

When you’re compiling stuff successfully, the next challenge is going to be that VST+CUDA plays funny tricks on you with threads/contexts. Although it’s up to the host what it does, the VST constructor is typically called from a different thread to processReplacing, so any memory you allocate in CUDA using the constructor won’t be visible to the audio processing thread calling processReplacing. You’re probably using Runtime API so suggesting moving contexts around won’t be much help. The way around this is to get your VST constructor to start up a thread and then make all memory allocations and CUDA audio processing requests via this worker thread. You’ll have lots of fun with critical sections once you get into this properly as now almost everything you do has thread safety issues - particularily if you need to change the quantity of data allocated dependant on user interactions with the plug-in during its audio processing lifetime. You should also consider doing some double/triple buffering on your audio processing requests as you probably will find typical ASIO/VST block sizes these days (e.g 256 samples) are too small to get anything near to decent performance on many cards (a discussion about these factors could run to many paragraphs and I still don’t really feel like I understand all the issues).

Hope that helps.

For some time I have wanted to put up an example project file to do just this based on the adelay sample, but Steinberg’s restrictive rules on redistribution of the SDK prohibit it. I also figured most people who want to do this have already worked it out as the number of questions on the topic is pretty minimal, although I didn’t personally find the process especially easy - a thorough read of the CUDA documentation on a long flight certainly turned out to be a good use of time but resulted in some strange glances at my lap :)).

Out of interest, what kind of plug-in are you intending to develop?

Matt

Hi again.

First of all thanks for the replies. Been very helpful as I ve managed to build the VST project, integrated with CUDA, using the build rule that NVIDIA provides for the .cu files.

To answer to your question Matthew, I really dont have anything in particular in mind. I suppose that I will try to implement something based on FFT. A filter or a reverb… something that would slow down the CPU noticeably… Just to see it calculating it without and not slowing down everything…magik! But dunno really.

I m just going to do my last year in September at a Music technology software development course, and I ve started thinking to use CUDA for my desertation.

Regarding the code now… as you said Matthie, I used a worker thread in the .cpp file of the VST, from inside which, I call the “extern “C” void” function, which in turn runs the kernel using the <<< >>> syntax. This way it compiles the code, but I have some other problems…

FIrst, the plugin is no longer recognised by ABLETON LIVE as a valid VST plugin and it cannot load it. On the other hand, the plugin still works in AUDACITY, but as you may know, Audacity does not support real-time processing with a VST. I don’t have any other DAW programs to check the VST, just these two. Would anyone happen to know why its doing this?

Here is the linker output from the compilation:

Linking...

   Creating library C:\Documents and Settings\flag\Desktop\Programs&Source\vst_sdk2_4_rev1\vstsdk2.4\public.sdk\samples\vst2.x\Backup of again\win\Release\again.lib and object C:\Documents and Settings\flag\Desktop\Programs&Source\vst_sdk2_4_rev1\vstsdk2.4\public.sdk\samples\vst2.x\Backup of again\win\Release\again.exp

LINK : warning LNK4098: defaultlib 'LIBCMT' conflicts with use of other libs; use /NODEFAULTLIB:library

And that is the only thing I found that might be responsible for the problem. I have only added the “cudart.lib” and “cutil32.lib” to the linker additional dependencies and of course, the paths to those libraries. Why is it doing that?

I also have some questions regarding the worker thread that will execute the GPU code. I tried to use MFC threads, but whenever I tried including afxwin.h or afxmt.h etc it throws the following error:

C:\Program Files\Microsoft Visual Studio 9.0\VC\atlmfc\include\afx.h(24) : fatal error C1189: #error :  Building MFC application with /MD[d] (CRT dll version) requires MFC shared dll version. Please #define _AFXDLL or do not use /MD[d]

If I then go project settings and change the configuration to use MFC in a shared DLL it throws a different error :

Linking...

mfcs90d.lib(dllmodul.obj) : error LNK2005: _DllMain@12 already defined in vstplugmain.obj

   Creating library C:\Documents and Settings\flag\Desktop\Programs&Source\vst_sdk2_4_rev1\vstsdk2.4\public.sdk\samples\vst2.x\Backup of again\win\Release\again.lib and object C:\Documents and Settings\flag\Desktop\Programs&Source\vst_sdk2_4_rev1\vstsdk2.4\public.sdk\samples\vst2.x\Backup of again\win\Release\again.exp

LINK : warning LNK4098: defaultlib 'LIBCMT' conflicts with use of other libs; use /NODEFAULTLIB:library

C:\Documents and Settings\flag\Desktop\Programs&Source\vst_sdk2_4_rev1\vstsdk2.4\public.sdk\samples\vst2.x\Backup of again\win\Release\again.dll : fatal error LNK1169: one or more multiply defined symbols found

So at the end I just used WIN32 threads. Here is the code for the worker thread:

extern "C" void runTest(pFloat* data, unsigned int len);

float* param1;

int fSIZE;

unsigned __stdcall ThreadFunc(void* pArguments)

{	

		float* pFloat;

		pFloat = param1;

	

	runTest(pFloat, fSIZE);			   // run GPU code

	_endthreadex(0);

	return 0;

}

This compiles fine and it runs the GPU code too. The code is in the .cpp file of the VST( e.g. again.cpp). The float variable “param1” can be used by both the VST process replacing function, as well as the worker thread. Iwas wondering how can I copy the audio input array and send it to the runTest() function as a parameter. Can I do it the way it is shown above?? I mean by copying the input array somewhere in the processReplacing to the “param1” variable and its size to the “fSIZE” variable, and from there, use it in the worker thread like that…? It seems to me a bit scruffy and wrong to tell the truth…

I really don’t know if I should be using different threading system… I suppose I need to add synchronization but I haven’t done it before except for MFC threads, where I just created a class for data storage, which was accessible from both the worker thread and the main thread. I had also created a struct in the same header file with the worker thread’s declaration, which helped the worker thread access the class used for data storage. For synchronization, I had used mutexes and it all worked fine really. But I don’t know how to implement this for WIN32. I tried doing it the same way, but was getting an error when I was trying to access the data storage class from the worker thread.

Does anyone know what kind of threading system would the best for my case? WHere I would be able to share data between the worker thread and the main thread?

Thank you, filippos

If you’re having issues with a host not running a plug-in, the best thing to do really is to attach the debugger to your host so you can see where it’s going wrong; set up a breakpoint at the beginning of your constructor, attach the debugger to the host and then make your host re-scan the plug-ins directory and then step through your plug-in when it hits the break point until it dies. This is assuming you have the cudart.dll in a location the plug-in can find it (for you that may just mean making sure it’s in the system path somewhere which the SDK will do for you, but if you deploy to other systems it might need a bit more thought).

I remember errors with things like LIBCMT, I think I fixed it by going to the C++ -> Code Generation section make sure the runtime library is set to /MT and on Linker -> Input the ‘ignore specific library’ putting MSVCRT.lib in the box. I’d also say there is no need to complicate matters with MFC.

My advice is to use win32 threads, events (to notify a waiting thread loop when you’ve filled it’s buffers up with audio data) and Critical Sections for your locking (they are pretty light-weight). I found this resource useful http://www.computersciencelab.com/MultithreadingTut1.htm for threading and you can learn the rest from MSDN if you don’t know it already.

Hello,

Had a good summer break and now trying to continue this project I started before summer.
I 've managed so far to copy the audio input to the GPU, and send it back to the vst host code. Everything works fine!

There is one problem tho. As I mentioned before in this post, the plugin is not working real time.
If I try to load the plugin with ABLETON for example it will give me an error. The only program that loads the plugin is AUDACITY. But I m suspecting that its working with audacity because it does not support real time effect processing.

I ve spotted what is causing the problem. If I go to the Linker additional dependencies and remove “cutil32.lib” its working!
But then if I remove this library I cannot use CUDA!

No matter what I do after I include this library, the VST is not recognized by the Host Programs(except audacity)…
Even if I have no CUDA code or no “.cu” files at all in the project it will still go wrong. As soon as I remove the library from the dependencies …it works!

I really dont know what to think…
Please help, if anyone knows how to solve this.

The project Im working on is a sample project from the vstsdk 2.4 with modified properties to work with CUDA(CUDA include dirs, CUDA libs etc).

cheers

I’d compile yourself a debug build and step in to the plug-in at the entry point and see where it’s crashing. I don’t see what real-time processing should really be doing incorrectly (assuming you have gotten a decent threading model sorted out as I already discussed).

Hello matt.

Thanks again for the reply. Really appreciate it. Well… this problem is solved now!. I just removed the cutil.lib from the Linker’s input completely and included cuda.h in my source code and it worked. Now its working in Ableton Live as well.! All good so far.

What I m trying to do now is use the CUFFT library to convert from amplitude to frequency and vice versa. But I can’t understand some of the results I m getting back.

What I do first is copy the audio input array to the CUDA device and then to a cufftReal type variable. Here is some code :

[codebox]

extern “C” void

runTest(float* inL, float* inR, float* outL, float* outR, float Fc, unsigned int len)

{

cudaSetDevice( cutGetMaxGflopsDeviceId() );

const unsigned int numThreadsPerBlock = 100;

const unsigned int numBlocks = len / numThreadsPerBlock;

const unsigned int mem_size = sizeof(float) * len;

// allocate device memory

float* LEFT_IN;

cutilSafeCall(cudaMalloc((void**) &LEFT_IN, mem_size));

float* RIGHT_IN;

cutilSafeCall(cudaMalloc((void**) &RIGHT_IN, mem_size));

float* LEFT_OUT;

cutilSafeCall(cudaMalloc((void**) &LEFT_OUT, mem_size));

float* RIGHT_OUT;

cutilSafeCall(cudaMalloc((void**) &RIGHT_OUT, mem_size));

// copy host memory to device

cutilSafeCall(cudaMemcpy(LEFT_IN, inL, mem_size,                 // copy the audio input to a local variable

                        cudaMemcpyHostToDevice) );

cutilSafeCall(cudaMemcpy(RIGHT_IN, inR, mem_size,

                        cudaMemcpyHostToDevice) );

cufftHandle plan, plan2, plan3, plan4, plan5, plan6;

cufftReal* REAL_DATA_L;

cufftReal* REAL_DATA_R;

cufftComplex* COMPLEX_DATA, *COMPLEX_DATA2;

float *w_L, *w_R, *GAIN_L, *GAIN_R;

cutilSafeCall(cudaMalloc((void**) &w_L, mem_size));

cutilSafeCall(cudaMalloc((void**) &w_R, mem_size));

cutilSafeCall(cudaMalloc((void**) &GAIN_L, mem_size));

cutilSafeCall(cudaMalloc((void**) &GAIN_R, mem_size));

cutilSafeCall(cudaMalloc((void**) &REAL_DATA_L, mem_size));

cutilSafeCall(cudaMalloc((void**) &REAL_DATA_R, mem_size));

cutilSafeCall(cudaMalloc((void**)&COMPLEX_DATA, sizeof(cufftComplex)*len));

cutilSafeCall(cudaMalloc((void**)&COMPLEX_DATA2, sizeof(cufftComplex)*len));

for(int i = 0;i < len; i++)

{

	REAL_DATA_L[i] = LEFT_IN[i];                           // copy the audio input to the cufftReal type variable to pass to

	                                                                      // the FFT.

	REAL_DATA_R[i] = RIGHT_IN[i];

	

}

int BATCH = len / numThreadsPerBlock;

cufftPlan1d(&plan, numThreadsPerBlock, CUFFT_R2C, BATCH);

    cufftPlan1d(&plan2, numThreadsPerBlock, CUFFT_R2C, BATCH);

cufftExecR2C(plan, REAL_DATA_L, COMPLEX_DATA);

cufftExecR2C(plan2, REAL_DATA_R, COMPLEX_DATA2);

cufftPlan1d(&plan3, numThreadsPerBlock, CUFFT_C2C, BATCH);

    cufftPlan1d(&plan4, numThreadsPerBlock, CUFFT_C2C, BATCH);

cufftExecC2C(plan3, COMPLEX_DATA, COMPLEX_DATA, CUFFT_FORWARD);

cufftExecC2C(plan4, COMPLEX_DATA2, COMPLEX_DATA2, CUFFT_FORWARD);

cufftPlan1d(&plan5, numThreadsPerBlock, CUFFT_C2R, BATCH);

    cufftPlan1d(&plan6, numThreadsPerBlock, CUFFT_C2R, BATCH);

cufftExecC2R(plan5, COMPLEX_DATA, REAL_DATA_L);

cufftExecC2R(plan6, COMPLEX_DATA2, REAL_DATA_R);

float RC = 1 / 2 * PI * (Fc*20000.f); // RC is actually Resistance * Capacitance

                                                                       // I've change d here the formula from Fc = 1 / 2*pi*R*C to 

                                                                           // to calculate R*C with a given cut-off frequency

for(int i = 0; i < len; i++)

{

	w_L[i] = 2 / PI * REAL_DATA_L[i];              // Calculating angular frequency

	w_R[i] = 2 / PI * REAL_DATA_R[i];

	GAIN_L[i] = 1 / sqrt(1 + (pow(w_L[i] * RC, 2)));          // Calculating gain of the signal according to cut-off freq  

	GAIN_R[i] = 1 / sqrt(1 + (pow(w_R[i] * RC, 2)));            

    }

cutilSafeCall(cudaMemcpy(outL, GAIN_L, mem_size, // Copy the gain array to the host to multiply with audio input

                        cudaMemcpyDeviceToHost));               // in processReplacing() VST function

cutilSafeCall(cudaMemcpy(outR, GAIN_R, mem_size,

                        cudaMemcpyDeviceToHost));

[/codebox]

So this is the part where I use cufft in the .cu file. I hope I didn’t miss anything as I copied and pasted parts of the code and not all of it. SO…the problem here is that I m getting half positive and half negative values from the FFT. From what I know so far a forward FFT will return an array of frequencies if you pass an array of amplitudes for an argument. SO… Iwas expecting to get that frequency array after I executed this :

[codebox]

cufftExecC2C(plan, COMPLEX_DATA, COMPLEX_DATA, CUFFT_FORWARD);

cufftExecC2R(plan, COMPLEX_DATA, REAL_DATA_L);

[/codebox]

What I think I should be getting is an array of frequencies. Right? I might be completely wrong. Honestly. I m still a bit confused with the Fourrier transforms, so please correct if i’m wrong. But what I get in the real data array at the end is something like this :

[codebox]

36.8230

-38902923

122.5556

-7.77028 // these are printed elements of the real data array

124.6875

-95.4375

343,985

-1.083464

etc etc etc

[/codebox]

Could somebody please help me out with this? What do I need to do to get the frequency array??

Cheers

If you’re having trouble with the fundamentals of FFTs then I would recommend getting yourself a copy of Matlab (or Octave if you don’t have access to it which should do all you need) and try to get your head around using FFTs in an environment where there are fewer degrees of freedom to get confused with. Matlab’s FFT commands are very easy to use and will help you debug your output from CUFFT much more easily. The details of how FFT results get packed and scaled by various FFT libraries can get confusing at times.

Hey there, iam planning to make a VST-Cuda-Plugin myself.
FLag is your project still in progress?

also i would like to know more about the worker-thread thing. and why i wont do something like that:

vst::processreplacing()
{
inputs()
cudastuff
run kernel

outputs
}

thank you

CUDA_VST_RT - you will find it’s not going to work because the CUDA stuff all needs to happen from the same thread, i.e. the memory allocation and kernel calls need to be done by the same thread (ok not strictly true if you use Driver API but I doubt that was your plan) and a typical VST host isn’t going to use the same thread to instantiate as it is going to use to process audio.

Hey hill_matthew, thank you very much for the reply.
ok then thats a good thing to know, but iam not a “prof.programmer”. could you please give me some further information about the suggested workerthread solution?
that would be very kind.

thanks again,
cheers

Hi guys, Matthew, FLag, etc.,

I’m working to do the same thing with attempting to use CUDA within a VST and I am wondering if you are still working on your projects or if there is a newer thread on the topic.

I am working through my university and we have a powerful computer with 4 Teslas at our disposal and we’re excited to see what’s possible.

We’re running Windows XP 64 bit with Visual Studio 2008 and I have experience with VSTs but I am brand new to CUDA.

Thanks,
Matt

Hello Matt,

I’m still working with CUDA and VST… I have managed to get all those issues mentioned in this post sorted… And I’m very happy that a lot of people are interested in this stuff.

I’m also in the final year of an audio software development course… and doing for my dissertation an IR reverb on VST using CUDA for the convolution part… So i’m still working on it full time!

If there is anything you need help with regarding CUDA-VST integration please post it here and I’ll answer as soon as I see it.

Best regards,

Filippos

Hey guys,

Is anybody still working on this stuff?

I am having all sorts of fun with CUDA+VST right now. Currently just finished sorting out the semaphores
for directing buffer traffic between threads in my vst plug. ouch my head! I had only vague knowledge of the producer-consumer
pattern until now.

As regards double buffering with VST, I was thinking of just collecting up enough buffers to fill up an AudioEffect::getBlockSize()
sized buffer to send to GPU as pinned memory. But…

Will this definitely mean I’m adding latency? Do I have to give buffers of same size back in processReplacing? If there was some way I
could delay and give a whole block in one without a penalty? Probably not, but I thought I’d put it out there.

I suppose might have 3 cuda streams or some such method with;
-one transferring to device
-one processing on the device
-one transferring back to host
and try and work out all sorts of fiddly sized blocks?

I don’t really know where to go to be honest! How have others gone about this?

Thanks! =)

A quick solution to circumvent thread issues, from the programming guide 2.3:

[i]"Portable Memory

A block of page-locked memory can be used by any host threads, but by default, the

benefits of using page-locked memory described above are only available for the

thread that allocates it. To make these advantages available to all threads, it needs to

be allocated by passing flag cudaHostAllocPortable to cudaHostAlloc()."

[/i]

Don’t make any assumptions on the number of samples processReplacing will be called with, especially don’t assume it is going to be the size returned by getBlockSize() as this is only a maximum. It’s up to the host how many samples it provides and some of them do choose to provide unexpected numbers of samples in processReplacing at times. So you need to be running a double buffer at least to be sure you get a convenient quantity of samples to process at once (imagine how you’d handle 256, 256, 13, 256, 243 as this kind of thing can happen in projects with a lot of automation). VST can be quite loosely defined at times, so you need to program VST plugins with a sprinkling of paranoia and pessimism!

A welcome addition for sure :)