CUDA image processing acceleration tips anyone?

This is code I’ve written as part of an image processing library; it does a simple addition of two images.

Kernel and host wrapper methods inserted here:

[codebox]

#pragma warning(disable : 4996) // this is a known VS2005 issue and was solved in 2008 where the original code was compiled

#include "CudaHostAPI.h"

#pragma message ("Note: Including CudaHostAPI.h")

#include "CudaHostDefs.h"

#pragma message ("Note: Including CudaHostDefs.h")

#include <cuda.h>

#pragma message ("Note: Including cuda.h")

// GPU function to add two float images (the result is written back into the first input buffer for faster access)

// ************************************************************


__global__ void AddComp(float* fpIn1Img, float* fpIn2Img)

{

// for all threads in the block grid, compute the X & Y pixel indices in the source image of the starting pixel that THIS INDEXED THREAD will process

unsigned int uiOffset = __mul24(__mul24(__mul24(blockIdx.y, blockDim.y) + threadIdx.y , gridDim.x ) + blockIdx.x , blockDim.x) + threadIdx.x;

// compute for this thread

fpIn1Img[uiOffset] = fpIn1Img[uiOffset] + fpIn2Img[uiOffset];

return;

}

// GPU host wrapper for addition of two float images

// ************************************************************


int AddGPU(float* fpIn1Img, float* fpIn2Img, float* fpOutImg, size_t szWidth, size_t szHeight)

{

// locals to manage CUDA calls and setup thread parameters

cudaError_t cuErr1, cuErr2;

// set up block and grid dimensions

dim3 blockSz(BLOCKWIDTH,BLOCKHEIGHT);

// To cover the whole image with a grid of thread blocks, 

//		divide image width in pixels by the thread block width and set image height as grid height

dim3 gridSz(CUDA_DIVUP(szWidth, BLOCKWIDTH), CUDA_DIVUP(szHeight, BLOCKHEIGHT));
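
// (Assumption: BLOCKWIDTH, BLOCKHEIGHT and CUDA_DIVUP come from CudaHostDefs.h;
//  CUDA_DIVUP is presumably integer ceiling division, e.g. (((a) + (b) - 1) / (b)).)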



cuErr1 = cudaMemcpy(fpCudaIn1Img, fpIn1Img, szWidth * szHeight * sizeof(float), cudaMemcpyHostToDevice);

cuErr2 = cudaThreadSynchronize();

if (!((cuErr1 == CUDA_SUCCESS) && (cuErr2 == CUDA_SUCCESS))) {

	return -11;

}

cuErr1 = cudaMemcpy(fpCudaIn2Img, fpIn2Img, szWidth * szHeight * sizeof(float), cudaMemcpyHostToDevice);

cuErr2 = cudaThreadSynchronize();

if (!((cuErr1 == CUDA_SUCCESS) && (cuErr2 == CUDA_SUCCESS))) {

	return -11;

}

// Call the GPU kernel to do the core computations

AddComp <<<gridSz, blockSz>>>(fpCudaIn1Img, fpCudaIn2Img);

cuErr1 = cudaGetLastError();

// check for kernel launch errors

if (cuErr1 != CUDA_SUCCESS) {

	return -11;

}

if (fpOutImg) {

	cuErr1 = cudaMemcpy(fpOutImg, fpCudaIn1Img, szWidth * szHeight * sizeof(float), cudaMemcpyDeviceToHost); // the kernel wrote the result into fpCudaIn1Img

}

else {

	cuErr1 = cudaMemcpy(fpIn1Img, fpCudaIn1Img, szWidth * szHeight * sizeof(float), cudaMemcpyDeviceToHost);

}

cuErr2 = cudaThreadSynchronize();

if (!((cuErr1==CUDA_SUCCESS)&&(cuErr2==CUDA_SUCCESS))) {

	return -12; // return HandleError (-12, "AddGPU(): cudaMemcpy(fpOutImg)", NOMSG);

}

return (szWidth * szHeight);

}

[/codebox]

I tried playing around with the block and grid dimensions, but to no avail. I also tried replacing some of the dimensions with loops and pointer advancement, but that also came out slower. The code above is only about 1.5-2 times faster than the naive C++ code (which is of course hardly optimizable, barring multi-core code).

Oh yeah, I’m running this on a 32-core GeForce 9500 GT.

I noticed that switching to loops instead of 2D block dimensions can take a very bad performance hit even if I maintain 16-float coalescing (the image dimensions for testing are 1024x1024). I’m thus left clueless; can any of you veterans tell me whether there is any way to accelerate this code?

First thing: what exactly are you timing? If it is the whole operation and not just the kernel, I’m really surprised that it’s not slower than the CPU version, as you have the copy to and from the GPU for a very low-bandwidth, low-arithmetic-intensity kernel (a single read from each buffer, one operation, a single write). Your timing would be pretty much dominated by the PCI-E transfer time, which is slower than the CPU’s memory bandwidth.

Second

I see that you are not allocating memory, so I assume it is allocated somewhere else. You would get the best performance by using pinned memory allocations, which brings me to the next three points:

  • why use cudaThreadSynchronize with a synchronous cudaMemcpy? It’s completely unnecessary and wastes time, if it does anything at all

  • if you use pinned memory you should probably use the async version of cudaMemcpy, and maybe split the work into several kernels so that you can combine this with concurrent copy and execute (see the sketch after this list), although the kernel may have too low an arithmetic intensity to see the benefit

  • you should use pitched allocations and 2D memcpys (unless you are using async memcpys, in which case you should emulate that), or you may have trouble with coalescing. I’m not clear on your pixel indexing math; it’s quite complex

OR, seeing as your image in CPU memory uses contiguous pixels (not pitched memory), you could use 1D kernels instead of 2D kernels, which would be even better.
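
For example, a minimal sketch of the pinned-memory / async-copy idea, reusing the buffer, grid and block names from your wrapper (the pinned buffer and stream names are illustrative, not from the original code):

[codebox]
// Sketch only: pinned (page-locked) host buffer plus async copies on a stream.
float* fpPinnedIn1 = NULL;
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMallocHost((void**)&fpPinnedIn1, szWidth * szHeight * sizeof(float)); // pinned host memory
// ... fill fpPinnedIn1 with the image data ...
cudaMemcpyAsync(fpCudaIn1Img, fpPinnedIn1, szWidth * szHeight * sizeof(float),
                cudaMemcpyHostToDevice, stream);                     // returns immediately
AddComp<<<gridSz, blockSz, 0, stream>>>(fpCudaIn1Img, fpCudaIn2Img);  // queued on the same stream
cudaMemcpyAsync(fpPinnedIn1, fpCudaIn1Img, szWidth * szHeight * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream); // block only when the result is actually needed
cudaFreeHost(fpPinnedIn1);
cudaStreamDestroy(stream);
[/codebox]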

Last thing, your indexing arithmetic is very complex. You should split it up and let the compiler optimize away the registers if needed, so that it stays readable. And if I understand it correctly, it is limited to 16-megapixel images before you run into an overflow (__mul24 only handles 24-bit operands).
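
For instance, the offset could be computed like this (a readability-only rewrite of the same arithmetic; the __mul24 intrinsics are dropped here and can be reintroduced if needed):

[codebox]
// Same arithmetic as the original single expression, split for readability:
unsigned int uiX      = blockIdx.x * blockDim.x + threadIdx.x;  // column of this thread
unsigned int uiY      = blockIdx.y * blockDim.y + threadIdx.y;  // row of this thread
unsigned int uiRowLen = gridDim.x * blockDim.x;                 // pixels per row covered by the grid
unsigned int uiOffset = uiY * uiRowLen + uiX;
[/codebox]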

Thanks for the speedy reply.

I forgot to mention some things:

    [*]BLOCKWIDTH is 32 and BLOCKHEIGHT is 16 (it won’t let me run bigger blocks and returns a bad configuration error).

    [*]The fpCudaIn1/In2/OutImg pointers are cudaMalloc’ed just once (I know this means I can’t run operations in parallel, but that’s OK; it saves me the malloc operation, which takes some time).

    [*]The whole method is wrapped by a simple C++ method in a class which supports this and many other methods.

    [*]The method is run, say, 100 times and the timing is the average of the run times.

[list=1]

[*]I actually started with 2D memcpy and pitched code but switched to this contiguous (no padding) image since I thought it would be faster. I accept the comment and will revert if it doesn’t take longer to run. Is 2D memcpy as fast as 1D memcpy? I’m still just playing around to see what I can get with these simple image processing methods.

[*]The way I see it, if the average to-and-back bandwidth is more than twice as fast as the CPU’s, I should be able to accelerate even simple methods like these for big enough images, no?

[*]The timed execution I got (averaging 100 runs for a 1024x1024 image) was 3.5 ms for the CUDA code and 4.5 ms for the fairly naive C++ code.

[*]I saw that if the kernel is not run at all, the CUDA execution drops to about 2.2 ms, and it varies very little when the kernel is run but is empty, i.e. contains no code (this would suggest that launching the kernel takes more time than the code inside it takes to run, which makes sense).

[*]This is not the best card; higher-bandwidth cards should outrun the CPU even more convincingly (this might not be the case if the C++ code uses OpenMP, etc.).

[*]What else... oh yes, the cudaThreadSynchronize was in the code I took the skeleton from (I was wondering if I could take it out, and I will).

[*]As for splitting the work into several concurrent kernels, are you referring to streams? I’ve not done this yet, but as I understand it, it probably won’t work when you need a result for the current operation while wrapping it with an image processing class that expects an answer from each method call. Is this not true?

  1. The block width/height you gave is 512 threads, which is the maximum number of threads per block. On lower-end cards you may prefer 256 threads (16x16), as it will give you better occupancy.

  2. You can avoid the static malloc issue by also passing a pointer to the working buffers

  3. 2D memcpy is probably implemented in a loop, so it won’t be as effective, but it comes close. Pitched memory does waste memory (the pitch is a multiple of 256), but it’s useful for code that is actually 2D, more so with compute 1.1 cards or when using textures, due to coalescing/alignment issues. Your current code is actually 1D and should probably be implemented that way, unless you want it to be more generic.

  4. Like I said, I’m surprised the CPU version is slower, as the GPU is limited by the PCI-E bandwidth: about 2.5 GB/s with non-pinned memory and, depending on the exact setup, up to around 5.5 GB/s with pinned memory. CPU memory bandwidth is at least 8 GB/s, depending on your setup, and can go higher.

  5. You can also get more acceleration on the CPU side rather easily by using SSE with this code.

  6. Concurrent copy and execute depends on streams; you should probably look at the simple streams example for that. What you would do is copy, execute, and copy back each patch of the image in a loop, and then wait for the results to be ready before returning. A rough sketch follows.
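
Something like this, reusing your buffer names (a sketch only: it assumes the host buffers are pinned and that the pixel count divides evenly into the chunks; iChunks is an illustrative value):

[codebox]
// Split the image into chunks and overlap copies with execution using streams.
const int iChunks = 4;
size_t szChunk = (szWidth * szHeight) / iChunks;     // pixels per chunk (assumed to divide evenly)
cudaStream_t streams[iChunks];
for (int i = 0; i < iChunks; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < iChunks; ++i) {
    size_t szOff = i * szChunk;
    cudaMemcpyAsync(fpCudaIn1Img + szOff, fpIn1Img + szOff, szChunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(fpCudaIn2Img + szOff, fpIn2Img + szOff, szChunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    AddComp<<<(unsigned int)(szChunk / 256), 256, 0, streams[i]>>>(fpCudaIn1Img + szOff, fpCudaIn2Img + szOff);
    cudaMemcpyAsync(fpIn1Img + szOff, fpCudaIn1Img + szOff, szChunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}

cudaThreadSynchronize();                             // wait for all streams before returning
for (int i = 0; i < iChunks; ++i)
    cudaStreamDestroy(streams[i]);
[/codebox]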

[list=1]

[*]Check. Why will the occupancy be better? I thought more threads meant better occupancy and better scalability; not true?

[*]This I don’t understand. I was thinking of letting an image stay in card memory, with a flag, for long multi-operator formulas, but how can I pass a pointer to such a buffer from the host? (If that is what you mean by working buffer.)

[*]I’ve changed the code to 2D (mainly because I can’t ensure that the image width will be divisible by BLOCKWIDTH). Wasting memory is not such a big issue since I only use one 2048x2048 working image and no image exceeds this.

[*]

[*]Maybe eventually I’ll implement just the heavy operators (box filters, Sobel, Canny, Hough, etc.) on the GPU and use OpenMP or IPP for these simple methods, or for the smaller images (I think I should easily get a x3.5 speed-up on my quad-core), but right now I wanted to put the CUDA framework into the project (and see that it is well designed, understandable, stable, functional, modifiable, etc.).

[*]I’ve thought about this, but I’ll also leave it to a second phase of optimization. Any idea what sort of performance benefit I could get, or any tips on how I should go about implementing it? (I mean how much to break up the image, that sort of thing.)

Also… say I have a 1024x1024 image and my block is 16x16. What is the efficiency difference between (a) running with a 64x64 grid, computing the pixel index with a heavy two-level, two-dimensional calculation and performing a single operation per thread, and (b) running a double 64x64 for loop inside the kernel, advancing pointers? My logic says the second way should be faster since fewer computations are done, but this doesn’t seem to be the case. What am I missing? (Perhaps I mis-wrote the loop code?)

  1. Because compute 1.1 devices support a maximum of 756 threads. 756/256 = 3, which means you can schedule three blocks at a time for full occupancy. 756/512 = 1.5, which means you can schedule only one block at a time, for 0.66 occupancy. Although in this case it probably won’t matter, as you only need to hide register read/write latencies, for which 192 threads are enough.

  2. You can supply an allocation buffer so the user allocates device memory for the flow, or use some plan approach such as with FFT, and pass work buffer pointers to your function instead of just host pointers.

  3. You also can’t ensure that it divides evenly in 1D. What you would do is allocate the number of pixels as you do now, make a 1D kernel, i.e. dimBlock.x = 256, dimBlock.y = 1, and in the kernel compute offset = blockDim.x * blockIdx.x + threadIdx.x and do if (offset >= numPixels) return;, the same as you would for 2D (see the sketch after this list).

  4. If you implement heavier operators you are probably best off with 2D images. Box filters are implemented in the SDK, I think, but box and Sobel are relatively easy (Sobel is a convolution so it’s a bit harder, but good to do at least once).

  5. You would get a performance benefit for heavier kernels, as the copy can take a bit longer; depending on the image size it can be a rather good benefit. How much to break up the image depends on your algorithm and core count, but basically you want chunks big enough to keep the GPU busy: say, for a Tesla C1060 with 30 multiprocessors and full occupancy of 1024 threads per multiprocessor, you want around 30*1024 ≈ 30K pixels per copy. On Fermi you would get an even better boost, as you can get concurrent upload and download; G200 gives you only one at a time (again, this depends on kernel runtime). You can alternatively use mapped (zero-copy) memory, which does everything behind your back, if it’s available on your card.
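
For reference, a minimal sketch of that 1D kernel (illustrative names, not the original code):

[codebox]
// 1D version of the add kernel with a guard for the last, partially filled block.
__global__ void AddComp1D(float* fpIn1Img, float* fpIn2Img, unsigned int uiNumPixels)
{
    unsigned int uiOffset = blockDim.x * blockIdx.x + threadIdx.x;
    if (uiOffset >= uiNumPixels)
        return;
    fpIn1Img[uiOffset] += fpIn2Img[uiOffset];
}

// Host side, assuming CUDA_DIVUP is integer ceiling division:
// dim3 blockSz(256, 1);
// dim3 gridSz(CUDA_DIVUP(szWidth * szHeight, 256), 1);
// AddComp1D<<<gridSz, blockSz>>>(fpCudaIn1Img, fpCudaIn2Img, szWidth * szHeight);
[/codebox]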

  1. Thanks, I get it. Isn’t this a pretty heavy dependency on the CUDA SDK model version? You mean 768, right?

  2. I still don’t see what you are getting at here. What user? The library user? The one-time mallocs at CUDA initialization and frees at de-initialization work pretty well for me in any case; why should I change this?

  3. I get it: 1D no-pitch image, 1D grid and block. Makes sense.

  4. I already implemented Sobel and box filters, which gave me a speed-up factor of about 20 (depending on block size). boxFilter was implemented in the examples (not the SDK itself, as far as I know).

  5. Thanks again.

Thanks for all the assistance

  1. It depends on the card version. Compute 1.0 (should be non-existent by now) and 1.1 support a maximum of 768 threads, 24 warps and 8 blocks per multiprocessor (whichever limit is hit first). Compute 1.2/1.3 devices support 1024 threads, 32 warps and 8 blocks. It can be beneficial to adapt the algorithm to the specific card you are working on in terms of block size (compute capability and shared memory size), number of blocks (depending on the number of multiprocessors, in order to use partial serialization in the code), etc. The gain is often not huge, but I’ve seen factors of x2 in some cases.

  2. If you want to chain several algorithms in a row, you want to keep the data on the GPU. One way is static buffers, but that limits the option of multiple threads working on multiple images. The other option is to give the user a “handle” in one form or another to a set of work buffers that is passed along to each algorithm call. This allows the user more flexibility in terms of multi-threading.

You would do something like this (written for clarity, not good coding style/design):

HANDLE h = AllocateGPUBuffers();

UploadData(img, h);

Sobel(h);

BoxFilter(h);

CopybackData(out, h);

where HANDLE can be something like:

struct HANDLE_t {

unsigned char *a;

unsigned char *b;

unsigned char *c;

};

typedef HANDLE_t * HANDLE;
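
One possible (purely hypothetical) shape for the allocation call behind that handle, matching the pseudo code above:

[codebox]
// Hypothetical allocation of the work buffers behind the handle; error checking omitted.
HANDLE AllocateGPUBuffers(size_t szWidth, size_t szHeight)
{
    HANDLE h = new HANDLE_t;
    size_t szBytes = szWidth * szHeight * sizeof(unsigned char);
    cudaMalloc((void**)&h->a, szBytes); // upload target
    cudaMalloc((void**)&h->b, szBytes); // intermediate result between chained operators
    cudaMalloc((void**)&h->c, szBytes); // download source
    return h;
}
[/codebox]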

  1. You’re welcome.

I’m not sure exactly what you mean, but if you mean one block sized 16x16 inside which you have a double for loop sized 64x64, then you are not filling the GPU properly.

For a card with a single SM this should be faster, but on a card with several SMs you would be using only one, and only partially at that. You would be better off doing partial serialization, i.e. running a 64-iteration for loop inside the kernel (only in y, so you keep coalescing) and launching a grid sized 64x1. This way you have enough blocks to fill the card. You should also do partial loop unrolling to reduce the number of conditionals. A sketch is below.
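
For a 1024x1024 image, something along these lines (an untested sketch, not your code):

[codebox]
// Grid of 64x1 blocks, each block 16x16 threads; every thread walks 64 pixels down its column.
__global__ void AddCompSerialized(float* fpIn1Img, float* fpIn2Img, unsigned int uiWidth)
{
    unsigned int uiX = blockIdx.x * blockDim.x + threadIdx.x; // column, coalesced across the warp
    unsigned int uiY = threadIdx.y;                           // starting row for this thread
    for (int i = 0; i < 64; ++i, uiY += blockDim.y)           // step blockDim.y (16) rows per iteration
    {
        unsigned int uiOffset = uiY * uiWidth + uiX;
        fpIn1Img[uiOffset] += fpIn2Img[uiOffset];
    }
}

// Launch: AddCompSerialized<<<dim3(64, 1), dim3(16, 16)>>>(fpCudaIn1Img, fpCudaIn2Img, 1024);
[/codebox]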

By the way, I personally found that pointer advancement has the potential to be very ineffective in CUDA due to problematic register usage. It depends on the case, but using indexing can be more effective (although in that specific case, it was 12 pointers that were indexed using the same value).

I actually implemented something like this when I tried using NPP instead of CUDA. This is a good option, but the reason I moved to CUDA is to get better control of the operations (NPP is closed source), and I wanted it to be transparent to the user, since the user of the library is already quite large and fully implemented (a pretty complex proprietary image processing application).

Hi everyone,

I have a problem with the results of image processing …

I’m trying to implement edge detection with Sobel, but as you can see, unwanted lines appear in the picture.

Does anyone know why this is so?

The program is similar to the CUDA SDK one, but I process about 1000 images with 3-way overlap and therefore cannot use texture memory.

Here is a simple example of how it works.

What is wrong?

__device__ unsigned char Sobel(unsigned char pix00,unsigned char pix01, unsigned char pix02,

					  unsigned char pix10,unsigned char pix11, unsigned char pix12,

					  unsigned char pix20, unsigned char pix21, unsigned char pix22){

					  short horizontal  = pix02 +2*pix12 +pix22 - pix00 -2*pix10 - pix20;

					  short vertical = pix00 + 2*pix01 + pix02 - pix20 -2*pix21 - pix22;

					  short suma = (short)(abs(horizontal) + abs(vertical));

					  if(suma<0)

						return 0;

					  else if(suma>0xff)

						return 0xff;

					  else

						  return (unsigned char)suma; 

}

__global__ void kernel( unsigned char *d_Data,int width, int hight){

	

	unsigned char pix00 = d_Data[threadIdx.x-1 + (blockIdx.x-1)*width];

	unsigned char pix01 = d_Data[threadIdx.x+0 + (blockIdx.x-1)*width];

	unsigned char pix02 = d_Data[threadIdx.x+1 + (blockIdx.x-1)*width];

	unsigned char pix10 = d_Data[threadIdx.x-1 + (blockIdx.x+0)*width];

	unsigned char pix11 = d_Data[threadIdx.x+0 + (blockIdx.x+0)*width];

	unsigned char pix12 = d_Data[threadIdx.x+1 + (blockIdx.x+0)*width];

	unsigned char pix20 = d_Data[threadIdx.x-1 + (blockIdx.x+1)*width];

	unsigned char pix21 = d_Data[threadIdx.x+0 + (blockIdx.x+1)*width];

	unsigned char pix22 = d_Data[threadIdx.x+1 + (blockIdx.x+1)*width];

	d_Data[threadIdx.x + blockIdx.x*width ] = Sobel(pix00, pix01, pix02, 

							pix10, pix11, pix12,

							pix20, pix21, pix22);

	

}

int main ( int argc, char ** argv ){

	unsigned char* h_Data;

	unsigned char* d_Data;

	unsigned char* image;

	char *name_image_load="tony10070.pgm";

	char *name_image_save="tony10070_res.pgm";

	int size;

	unsigned int hight;

	unsigned int width;

	// load image in unsigned char* image 

	size=width*hight;

	cudaMallocHost((void ** )&h_Data, size*sizeof(unsigned char));

	

	cudaMalloc ((void ** ) &d_Data, size*sizeof(unsigned char));

	

	cudaMemcpy(h_Data, image, size, cudaMemcpyHostToHost);

	cudaMemcpy(d_Data, h_Data, size, cudaMemcpyHostToDevice);

	

	kernel<<<hight, width>>>(d_Data, width, hight);

	cudaMemcpy(h_Data,d_Data, size, cudaMemcpyDeviceToHost);

	

	cudaThreadSynchronize();

	

	// save image

	cudaFree (d_Data);

	cudaFreeHost(h_Data);

	return 0; 

}

Thank you very much