CUDA Ring Buffer (Help?)

I am trying to set up a large array on the device and then copy smaller sections into it. I hope to use the large array as a ring buffer for frames.

In C++ I would do it like this:

// Create buffer
pFilterBuff   = new float[ImageSize*FilterNumFrames];
ppFilterFrame = new float*[FilterNumFrames];

for(int i = 0; i < FilterNumFrames; i++){
    ppFilterFrame[i] = &pFilterBuff[i*ImageSize];
    memset(ppFilterFrame[i], 0, sizeof(float)*ImageSize);
}

..... Some Host Code to load frame from file ....

for(int i = 0; i < NumFrames; i++){

    // Calculate indexes into my ring buffer
    int BuffIndx  = i % FilterNumFrames;
    int StartIndx = BuffIndx + 1;

    // Load current frame into the buffer
    memcpy(ppFilterFrame[BuffIndx], pCurrentFrame, sizeof(float)*ImageSize);

    // Do calculations on the buffer
    int AlphaIndex = 0;
    for(int k = StartIndx; k < StartIndx + FilterNumFrames; k++){
        int Indx = k % FilterNumFrames;
        for(int j = 0; j < ImageSize; j++){
            pResultFrame[j] += Alpha[AlphaIndex]*ppFilterFrame[Indx][j];
        }
        AlphaIndex++;
    }
}

I have got a version of this to compile in CUDA, and in emulation mode it gives the correct answer, but when I run it outside of emulation mode it gives an unknown launch failure.

Below is what I am doing now (I am only showing the code that deals with the buffer).

float* dg_ppFilterBuffer[FilterNumFrames];

// Allocate memory on device
for(int i = 0; i < FilterNumFrames; i++){
    CUDA_SAFE_CALL(cudaMalloc((void**)&dg_ppFilterBuffer[i], sizeof(float)*ImageSize));
}

// Copy current frame to device
CUDA_SAFE_CALL(cudaMemcpy(dg_ppFilterBuffer[BuffIndx], pThisFrame, sizeof(float)*ImageSize, cudaMemcpyHostToDevice));

// Perform filter
if(nFrameNum >= FilterNumFrames-1){
    FilterFrames_Global_kernel_new<<<grid,threads>>>(StartIndx, dg_ppFilterBuffer, dg_pResultFrame);
}

Here is my kernel code:

__global__ void
FilterFrames_Global_kernel_new(int StartIndex, float** dg_pFilterFrame, float* dg_pResultFrame)
{
    // pSums is a __shared__ float scratch array and dg_Alpha holds the filter
    // coefficients; both are declared elsewhere (not shown here)
    int Findx  = 0;
    int Mindex = (blockIdx.y*ImageW) + (blockIdx.x*blockDim.x) + threadIdx.x;
    int Alphaindex = 0;

    pSums[threadIdx.x] = 0;

    for(int i = StartIndex; i < StartIndex + FilterNumFrames; i++){
        Findx = i % FilterNumFrames;
        pSums[threadIdx.x] += dg_Alpha[Alphaindex] * dg_pFilterFrame[Findx][Mindex];
        Alphaindex++;
    }

    dg_pResultFrame[Mindex] = pSums[threadIdx.x];
    __syncthreads();
}

The only solution I have found is to build the buffer on the host and then recopy the whole buffer to the card every frame, but this makes the code run slower on the card than on the host. I am sure it is because of the copy time.

Does anyone have any suggestions for this problem? Can anyone see where I have an error?

Thanks!

Your dg_ppFilterBuffer seems to be on the host, and shouldn’t be accessed from the device.
You should copy it to constant memory or device memory first.

Thanks for your response, but I don’t think that is the problem, or else I am really confused here. (If I am confused I would appreciate any enlightenment!) I know that my variable names are hard to follow, but I think I am allocating dg_ppFilterBuffer on the device with cudaMalloc and then passing that address to the kernel. In the kernel I refer to the pointer as dg_pFilterFrame.

I have read some more forum posts about pointers on the device, and I am not sure I can do what I want here. I am going to see if I can figure out a way to do this with indexes into a large array instead of pointers. My only problem with this is that I think I will have to copy a frame to the device and then element-copy it into the larger array, which could be time consuming.

I could be wrong, but dg_ppFilterBuffer isn’t on the device. The buffers in dg_ppFilterBuffer are, but the actual array itself isn’t. I’m not sure how CUDA handles the array. It’s possible that the compiler could be smart enough to figure it out, but you may need to allocate dg_ppFilterBuffer first, and then allocate all the buffers in the array afterwards. I haven’t tried doing that, so I’m not sure if it would work or if this is your problem. I believe this is your problem, though. So you would do something like:

float** dg_ppFilterBuffer;

cudaMalloc((void**)&dg_ppFilterBuffer, FilterNumFrames*sizeof(float*));

// Allocate memory on device
for(int i = 0; i < FilterNumFrames; i++){
    CUDA_SAFE_CALL(cudaMalloc((void**)&dg_ppFilterBuffer[i], sizeof(float)*ImageSize));
}

Not sure if that code is 100% correct, but hopefully it makes sense. It is pretty similar to what you are doing when creating your ppFilterFrame, except that dg_ppFilterBuffer is now allocated on the device. You may have to modify the code above to get it to work; I’m not sure if you can cudaMalloc into memory that is itself on the device.
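For reference, a pattern along these lines that should work is to build the pointer table in ordinary host memory, cudaMalloc each frame buffer into it, and then copy the table of device pointers into a device-side allocation with one cudaMemcpy. A minimal sketch (the h_/d_ names here are made up for illustration):

// Host-side table of device pointers (hypothetical names)
float* h_frameTable[FilterNumFrames];

// Allocate each frame buffer on the device; the pointers themselves land in host memory
for(int i = 0; i < FilterNumFrames; i++){
    CUDA_SAFE_CALL(cudaMalloc((void**)&h_frameTable[i], sizeof(float)*ImageSize));
}

// Allocate the pointer table itself on the device...
float** d_frameTable;
CUDA_SAFE_CALL(cudaMalloc((void**)&d_frameTable, sizeof(float*)*FilterNumFrames));

// ...and copy the host table of device pointers into it
CUDA_SAFE_CALL(cudaMemcpy(d_frameTable, h_frameTable, sizeof(float*)*FilterNumFrames, cudaMemcpyHostToDevice));

// d_frameTable can now be passed to a kernel as a float** argument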

You can also allocate one large array and cudaMemcpy data into it instead of using an array of arrays. I’ve done my ring buffer that way and it seems to work fine. I would think the array of arrays could be a little easier to work with, though.

Hope that helps.

Thank you both for your help. I went ahead and fixed my problem (for now) by allocating one big array on the device and doing an element-wise copy for each new frame.

doeney, I see what you are saying now, and I will try this just to see if it runs faster than my current solution. But as it stands my solution is almost 3X faster on the device compared to doing it on the host, for a filter 12 frames long, which I think is good. Of course any speed I can get will be helpful.

Again, thanks for the responses!!

I think I found out that this cannot be done. I get a memory error every time I try. I also seem to remember reading that this could not be done.

I have done something like this, but I cannot figure out how to do the cudaMemcpy to a location inside the array. If it is not too much trouble, could you put up an example? I hate to ask, but I am sure that if I have this question others might too.

I am going to keep working on the cudaMemcpy myself, and if I get something that works I will post it for others.

Thanks!!

Neeley

Yeah, I wasn’t sure about the allocation. I thought it might be possible, but I can see that it wouldn’t work.

The one-large-array approach is pretty similar to how you would do it on the CPU. Basically, allocate the 2D space that you want and then use cudaMemcpy2D to copy into the correct location. I have a whole cache class system built in C++, but here is a basic overview of how it works.

unsigned short* mDeviceData;
size_t mPitch;
int mSize[2];

cudaMallocPitch((void**)&mDeviceData, &mPitch, width * numBands * bytesPerBand, height);
mSize[0] = width;
mSize[1] = height;

void load(unsigned short* buffer, const int x, const int y, const int width, const int height)
{
    cudaError_t err;
    int xOffset = x % mSize[0];
    int yOffset = y % mSize[1];

    err = cudaMemcpy2D(((char*)mDeviceData) + (yOffset * mPitch) + (xOffset * sizeof(unsigned short)),
                       mPitch,
                       buffer, width * sizeof(unsigned short),
                       width * sizeof(unsigned short), height,
                       cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        std::cerr << "Failed to copy data onto the GPU: " << err << std::endl;
    }
}

void read(const int x, const int y, const int width, const int height, unsigned short* buffer)
{
    cudaError_t err;
    int xOffset = x % mSize[0];
    int yOffset = y % mSize[1];

    err = cudaMemcpy2D((void*)buffer, width * sizeof(unsigned short),
                       ((char*)mDeviceData) + (yOffset * mPitch) + (xOffset * sizeof(unsigned short)),
                       mPitch,
                       width * sizeof(unsigned short), height,
                       cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        std::cerr << "Failed to copy data from the GPU: " << err << std::endl;
    }
}

I then call load, passing in a buffer I want copied onto the device, and call read to copy data out of the device. I assume that I will always load or read without wrapping, so I don’t have to do multiple I/Os to the device. I am creating a 12000x1024 buffer to run convolution kernels on a very large image. I stream data from disk into the buffer and run the kernel(s) when I have loaded a reasonably large chunk of the buffer onto the device. Once the data is processed I copy it off the device and allow new data to be loaded in its place.
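A typical call sequence might look something like the following sketch (runConvolutionChunk, the chunk size, and the host buffers are all hypothetical):

// Hypothetical driver loop: stream a tall image through the device ring buffer
for (int row = 0; row < imageHeight; row += chunkHeight) {
    // Copy the next chunk in; load() wraps the destination row via the modulus
    load(hostChunk, 0, row, imageWidth, chunkHeight);

    // Process the freshly loaded section (hypothetical kernel launch wrapper)
    runConvolutionChunk(mDeviceData, mPitch, row, chunkHeight);

    // Copy the processed section back out so its rows can be reused
    read(0, row, imageWidth, chunkHeight, hostResult);
}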

The kernels have to know how to address the memory correctly as well, which means they will need to do quite a bit of modulus math. I’m hoping to optimize those calculations, as the programming guide states that integer modulus is quite slow. Right now it seems pretty good, so I’m not too worried.
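Inside a kernel, the wrapped addressing would look roughly like this sketch (bufferHeight, rowBase, and the element type are assumptions):

// Sketch: wrapped row addressing into a pitched ring buffer
__global__ void wrapExample(unsigned short* data, size_t pitch, int bufferHeight, int rowBase)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = (rowBase + blockIdx.y) % bufferHeight;  // modulus wraps back to row 0

    // Step through pitched memory in bytes, then index the row as elements
    unsigned short* rowPtr = (unsigned short*)((char*)data + row * pitch);
    unsigned short value = rowPtr[col];
    // ... real filter math would go here ...
    rowPtr[col] = value;
}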

What I’d really like to see, though, is a way to asynchronously copy data into the buffer while the kernel is working on a different section. This would allow me to hide the loading and reading of the data onto and off the device.
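Where the stream API is available, cudaMemcpyAsync can overlap with kernel execution, provided the host buffer is page-locked (allocated with cudaMallocHost) and the device supports copy/compute overlap. A rough sketch of the double-buffering idea (processSection, the section pointers, and chunkBytes are hypothetical):

// Create one stream for copies and one for compute
cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);

// Async copies require a page-locked host buffer
unsigned short* pinnedChunk;
cudaMallocHost((void**)&pinnedChunk, chunkBytes);

// Copy the next chunk into section B while the kernel works on section A
cudaMemcpyAsync(devSectionB, pinnedChunk, chunkBytes, cudaMemcpyHostToDevice, copyStream);
processSection<<<grid, threads, 0, computeStream>>>(devSectionA);

// Wait for both to finish before swapping the sections
cudaStreamSynchronize(copyStream);
cudaStreamSynchronize(computeStream);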

Also note that if you aren’t running any kernels that require multiple passes, then you can likely use textures instead. Textures already support wrap-mode addressing, which is what you want. But textures are read-only, so you have to do device-to-device copies if you want to do multiple passes. The convolutionSeparable and convolutionTexture examples are good to look through.
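For reference, wrap-mode addressing with the texture reference API looks something like this sketch (texFrames and cuArray are made-up names; note that cudaAddressModeWrap only takes effect with normalized coordinates):

// File-scope texture reference with wrap addressing in both dimensions
texture<float, 2, cudaReadModeElementType> texFrames;

// Host side: configure and bind; wrap mode requires normalized coordinates
texFrames.addressMode[0] = cudaAddressModeWrap;
texFrames.addressMode[1] = cudaAddressModeWrap;
texFrames.filterMode     = cudaFilterModePoint;
texFrames.normalized     = true;

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaBindTextureToArray(texFrames, cuArray, desc);  // cuArray: a cudaArray holding the frames

// Device side: coordinates outside [0,1) wrap around automatically, e.g.
//   float sample = tex2D(texFrames, u, v);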

doeney, thanks again!

I figured out what my problem was earlier when I was trying to do a cudaMemcpy into the middle of an array on the device. That is kind of what led me down this path. I am sure your example will help lots of folks, and I am posting mine, which works on 1D arrays, to help as well.

// Allocate one big device buffer holding FILTERNUMFRAMES frames
// (note: sizeof(float), not sizeof(float*), since the buffer holds floats)
CUDA_SAFE_CALL(cudaMalloc((void**)&dg_pFilterBuffer, sizeof(float)*FILTERNUMFRAMES*IMAGESIZE));

// Offset of the ring-buffer slot for this frame
int CpyIndx = (nFrame % FILTERNUMFRAMES)*IMAGESIZE;

// Copy the new frame directly into its slot inside the device buffer
CUDA_SAFE_CALL(cudaMemcpy(&dg_pFilterBuffer[CpyIndx], pInFrame, sizeof(float)*IMAGESIZE, cudaMemcpyHostToDevice));

I am sure that a lot of folks know how to do this, but in case there are others (like me), this example should work.

Neeley