Real Time Audio Processing AudioMulch 1.0, VST-Plugin, Cuda SDK 2.3


At the moment I have to run in emulation (EMU) mode. First, some code:

[codebox]//VST Plugin Main App cpp

void VSTServer::processReplacing (float** inputs, float** outputs, VstInt32 sampleFrames) // called every time the audio buffer is full
{
	// DSP! ////////////////////////////////////////////////////////////////////////////////

	// one-time setup: allocate the flat host buffers on the first call
	if (!CudaIsInit) {
		width = BufferSize;
		height = cEffect.numOutputs;

		inputBuffer = new float[width * height];
		outputBuffer = new float[width * height];

		for (int i = 0; i < width * height; i++)
			outputBuffer[i] = 0.0f;

		CudaIsInit = true;
	}

	GPUGAIN(fGain, width, height, inputBuffer, outputBuffer, BlockBufferIn); // calls the CUDA code integrated into the cpp project

	// processed audio goes to soundcard! DO I NEED SOME SYNC STUFF HERE? GPUGAIN should be done before routing the audio
}

[/codebox]

The code above leaves out some buffering details. The interesting part is the call GPUGAIN(fGain, width, height, inputBuffer, outputBuffer, BlockBufferIn);

fGain = gain for the audio, width = number of elements in the audio buffer, height = number of channels.

[codebox]



// includes, project
#include <cutil_inline.h>

// includes, kernels
#include <>

extern "C" void GPUGAIN(const float fGain, unsigned int width, unsigned int height, float* inputBuffer, float* outputBuffer, float** BlockBufferIn)
{
	float *a_d;			// pointer to device mem
	unsigned int i, j;
	size_t MEMsize = width * height * sizeof(float);
	unsigned int ArraySize = width * height;

	// allocate array on device
	cudaMalloc((void **) &a_d, MEMsize);

	// put all audio data into one vector: channel i, sample j -> index i * width + j
	for (i = 0; i < height; i++)
	{
		for (j = 0; j < width; j++)
		{
			unsigned int index = i * width + j;
			inputBuffer[index] = BlockBufferIn[i][j];
		}
	}

	// copy data from host to device
	cudaMemcpy(a_d, inputBuffer, MEMsize, cudaMemcpyHostToDevice);

	// do calculation on device:
	// Part 1 of 2. Compute execution configuration (round up so every sample gets a thread)
	int nBlocks = ArraySize / BLOCK_DIM + (ArraySize % BLOCK_DIM == 0 ? 0 : 1);

	// Part 2 of 2. Call GPUGAINOnDevice kernel
	GPUGAINOnDevice<<<nBlocks, BLOCK_DIM>>>(fGain, a_d, ArraySize);

	// Retrieve result from device and store in outputBuffer
	cudaMemcpy(outputBuffer, a_d, MEMsize, cudaMemcpyDeviceToHost);

	// cleanup
	cudaFree(a_d);
}


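#ifndef GPUGAIN_KERNEL_H
#define GPUGAIN_KERNEL_H

#define BLOCK_DIM 32

// Multiplies each of the `size` samples in `a` by fGain, one thread per sample.
__global__ void GPUGAINOnDevice(const float fGain, float *a, int size)
{
	int Xidx = blockIdx.x * blockDim.x + threadIdx.x;

	// guard: the last block may have more threads than remaining samples
	if (Xidx < size)
	{
		a[Xidx] = a[Xidx] * fGain;
	}
}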

#endif // GPUGAIN_KERNEL_H[/codebox]
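
For reference, here is a minimal host-only C++ sketch of what the snippets above do per buffer: flatten the per-channel buffers into one vector, work out the block count by ceiling division, and apply the gain. The names flattenChannels, blocksFor and applyGainReference are made up for illustration and are not part of the plugin or of CUDA. The gain loop is only a CPU stand-in for the kernel, which might be handy for checking the GPU output against.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copy `height` channels of `width` samples each into one
// row-major vector, so sample j of channel i lands at index i * width + j.
std::vector<float> flattenChannels(const float* const* channels,
                                   std::size_t width, std::size_t height)
{
    std::vector<float> flat(width * height);
    for (std::size_t i = 0; i < height; ++i)
        for (std::size_t j = 0; j < width; ++j)
            flat[i * width + j] = channels[i][j];
    return flat;
}

// Hypothetical helper: number of blocks needed to cover n elements
// (ceiling division, same result as the nBlocks expression above).
std::size_t blocksFor(std::size_t n, std::size_t blockDim)
{
    return (n + blockDim - 1) / blockDim;
}

// CPU stand-in for the gain kernel: scale every sample in place.
void applyGainReference(std::vector<float>& samples, float gain)
{
    for (float& s : samples)
        s *= gain;
}
```

A flat vector and a "matrix" indexed as i * width + j are the same memory layout, so for a per-sample gain the choice should not change the result.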

OK, it’s working… almost… In the audio output I get some clicking noise.

Would it be better to put the audio data in a matrix instead of a vector?

Do I have to sync the GPU to the CPU?

Are there any “no-gos” in the basic idea?

This code was made by editing some existing sample code (

Is no one here working on a VST plugin for CUDA?

You should be calling cudaThreadSynchronize() after your kernel call to ensure the kernel has finished execution before copying back the data from device memory to the VST buffer.

I didn’t look at your code too closely, but I didn’t see where you were copying the source audio data to the GPU or vice versa, so you’ll obviously need to make sure you are doing that as well before you’ll hear anything. You don’t need any further synchronization for these copies, since you aren’t using pinned memory and they will occur synchronously.

Lastly, do you have any idea what size of buffer is being passed to you? If the buffer is too small, the processing time of the kernel combined with the launch latency may be causing the VST buffer to underrun, which will result in what sounds like crackles or pops, as garbage data will be sent to the soundcard in the incomplete buffer.
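
To put a rough number on that: a buffer of N samples has to be handed back within N / sampleRate seconds. Below is a small C++ sketch of that budget and a way to time one processing pass against it. The function names, and the 44.1 kHz rate used in the note afterwards, are assumptions for illustration and not values from this thread.

```cpp
#include <chrono>

// Hypothetical helper: time available to process one buffer before the next
// one is due, i.e. bufferSize / sampleRate seconds, returned in milliseconds.
double bufferBudgetMs(unsigned int bufferSize, double sampleRateHz)
{
    return 1000.0 * static_cast<double>(bufferSize) / sampleRateHz;
}

// Hypothetical helper: run `process` once and report whether it finished
// within the budget for the given buffer size and sample rate.
template <typename Fn>
bool fitsInBudget(Fn&& process, unsigned int bufferSize, double sampleRateHz)
{
    const auto start = std::chrono::steady_clock::now();
    process();  // e.g. copy to device, launch kernel, copy back
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() <= bufferBudgetMs(bufferSize, sampleRateHz);
}
```

With a 64-sample buffer at 44.1 kHz the budget is about 1.45 ms, and that has to cover both copies plus the kernel launch, not just the multiply.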

There’s no need to call cudaThreadSynchronize() before memcopies, they implicitly synchronize.

Yes… that’s what I said.

OK, thanks. It seems the sync was the problem: there are no sound pops anymore, but EMU mode is way too slow.
The second code snippet in my first post shows where I call “( cudaMemcpy(a_d, inputBuffer, MEMsize, cudaMemcpyHostToDevice));”

The size of the data sent to the GPU is selectable by choosing the buffer size (= number of audio samples in the array).