Sequential faster than parallel


Need some help (again) hahaha.

I’ve been trying to make an RGB to Grayscale conversion.

I have a sequential algorithm and the parallel one, but still the sequential seems to have better performance than the parallel, I know something is wrong but don’t know what.

My codes are the following:

void pprRGB2Gray(pprColorImage *imageRGB, pprGrayImage *imageGray)


	u_int idxI, idxJ, idxK;

	float r, g, b, gray;


	imageGray->row = imageRGB->row;

	imageGray->col = imageRGB->col;




	for (idxI = 0; idxI < imageRGB->row; idxI++) {

		for (idxJ = 0; idxJ < imageRGB->col; idxJ++) {

			gray = ((float)imageRGB->data[0][idxI*imageRGB->col+idxJ])*0.3 +

			((float)imageRGB->data[1][idxI*imageRGB->col+idxJ]*0.59) +


			imageGray->data[idxI*imageRGB->col+idxJ] = gray;



__global__ void kernelRGB2Gray(u_char *imageR, u_char *imageG, u_char *imageB, u_char *imageGray, u_int size)


	int tid = blockIdx.x * blockDim.x + threadIdx.x;


	while (tid < size){

		imageGray[tid] = (float)imageR[tid]*0.3f + (float)imageG[tid]*0.59f + (float)imageB[tid]*0.11f;

		tid += blockDim.x * gridDim.x;



void pprRGB2Gray(pprColorImage *imageRGB, pprGrayImage *imageGray)


	u_int size;

	u_char *d_imageR, *d_imageG, *d_imageB, *d_imageGray;


	size = imageRGB->row*imageRGB->col;

	imageGray->row = imageRGB->row;

	imageGray->col = imageRGB->col;


	//Allocate memory on GPU for each channel of the RGB image and for the Grayscale image.

	CHECK_ERROR( cudaMalloc( (void**)&d_imageR, sizeof(u_char)*size ) );

	CHECK_ERROR( cudaMalloc( (void**)&d_imageG, sizeof(u_char)*size ) );

	CHECK_ERROR( cudaMalloc( (void**)&d_imageB, sizeof(u_char)*size ) );

	CHECK_ERROR( cudaMalloc( (void**)&d_imageGray, sizeof(u_char)*size ) );


	//Copy each channel of the RGB image to the GPU.

	CHECK_ERROR( cudaMemcpy( d_imageR, imageRGB->data[0], sizeof(u_char)*size, cudaMemcpyHostToDevice ) );

	CHECK_ERROR( cudaMemcpy( d_imageG, imageRGB->data[1], sizeof(u_char)*size, cudaMemcpyHostToDevice ) );

	CHECK_ERROR( cudaMemcpy( d_imageB, imageRGB->data[2], sizeof(u_char)*size, cudaMemcpyHostToDevice ) );


	//Launch kernel with each channel of RGB plus an empty Grayscale array.

	kernelRGB2Gray<<<128,256>>>(d_imageR, d_imageG, d_imageB, d_imageGray, size);


	//Copy the result of Grayscale back to the CPU and fill the pprGrayImage structure with the new computed data.

	CHECK_ERROR( cudaMemcpy( imageGray->data, d_imageGray, sizeof(u_char)*size, cudaMemcpyDeviceToHost ) );


	//Free the allocated memory on GPU.






I should add somethings:

Working on a Macbook Pro with nVidia 320M

The number of blocks and threads was set empirically

your app is memory-bound and bandwidth of nVidia 320M is only 20GB/s at most.
Also unit of read/write is character, this is not good on sm11 or sm12.

You can try your app on Fermi card, and you will see difference.
otherwise, you should read/write integers, not characters.

Thank you,

So if I try it with integers instead of chars I will notice a better performance? why is that?

Should I test all my CUDA apps not in the 320M? which video card is the a good one but no so expensive?

I think LSChien is referring to limitations in the device memory controllers for different CUDA hardware versions. In compute capability 1.1 devices, reading 1-byte characters from memory forces the controller to issue 16 separate 32-byte transactions to service a half-warp, even if the characters being read are located in consecutive memory locations. In contrast, if each thread in a half-warp accesses consecutive integers at a particular read instruction, then the memory controller will only need to issue one transaction.

With compute capability 1.2 and 1.3, the memory controller is much smarter, but still not optimal for reading characters. The smallest memory transaction is 32-bytes per half-warp, which means a factor of two reduction in efficiency as a half-warp only needs 16 characters per read instruction.

Once you get to compute capability 2.0 and later, there is an L1 and L2 cache which will hide this minimum transaction size inefficiency when reading large blocks of consecutive characters. Normally, the size of a cache line is 128 bytes, which means the the first half-warp will cause 8x more data to be read than necessary, but the next seven half-warps reading characters will already have their data in the L1 cache and complete their reads with very little latency. On average, you’ll get full utilization of the memory bus.

If you want to see the effect of the cache on your code, the GTS 450 is a reasonable cheap card with compute capability 2.1. This GPU requires two PCI-Express slots of space in your computer case and a 6-pin PCI-Express power connector.

However, you might be able to improve this code on your existing device. There are two possible tricks:

  1. A standard technique when dealing with reading data that cannot be handled efficiently by the memory controller in your CUDA device is to stage it to shared memory. Cast your character memory pointer to an integer pointer, and have each thread copy an integer to a shared memory array. Call __syncthreads() to ensure all threads have finished their write, then access the shared memory array through a character pointer. Each thread will need to process 4 pixels. This will make your code ugly, unfortunately.

  2. Put your array of characters into a texture. All CUDA devices have a texture cache which might be able to help here. I don’t recall if someone has benchmarked this approach already, so I can’t say for sure it will work. Also makes your code a little ugly, but less so than option 1. Take a look at tex1Dfetch() in the programming guide.

Thank you so much.
I’ll try those things and see if I can get good results with my card.

Why do I need two PCI-e slots? did you try to say 2 6-pin connectors and 1 PCI-e slot?

Thanks again.

The GTS 450 is physically the width of two PCI-Express slots to make room for the heatsink and fan on the GPU and memory chips. (If you look under the plastic shroud, it looks like it could have been reduced to a single slot, but maybe that was pushing the heat envelope.) The card only electrically connects to one PCI-Express slot, but the cooling hardware extends over the PCI-E slot next to it. The GTS 450 only requires one 6-pin power connector.

Ok thanks so much.
Is there an algorithm that would run faster in parallel than in sequential? I really want to try CUDA on my Mac but I would love to see the better performance.

GEMM is computational-intensive, you can try GEMM in CUBLAS.

I will suggest that you can write a simple copy kernel, then compare performance of GPU with performance of CPU.

This ratio is expected gain.

for example, bandwidth of GTX480 is 180GB/s and bandwidth of corei7 is 25GB/s (runnung 4~8 threads),

so we expect that GPU is 7x faster than CPU on memory-bound app.

Ok thank you.
One last question, should I give up doing things on my Mac and try using a video card with compute capability 2.0 at least?

it depends on your purpose. If you want to optimize your app, then Fermi card is better

because its computational power and bandwidth are much better than CPU.

Fermi or not, reading x GB of data from CPU main memory, transferring it on GPU, processing it (even if no work! TRY IT!), and then copying back to main memory, will be slower than just processing it sequentially on one thread of your CPU.

The ratio memory access/processing is just too low to benefit from GPGPU, PCI Express is too slow on Fermi cards (around 3GB/s) and CUDA is not the technology to use as your processing is:


Memory stream bound

one-pass simple processing

subject to SSE optimizations to cope with your memory bandwidth

My best guess is to code that using SSE instead CUDA, sadly.

Too much striding is going on.

You’ll either need to process an red plane entirely by itself, then green plane, then blue plane.

So red plane: R,R,R,R,R, etc
So green plane: G,G,G,G,G, etc
so blue plane: B,B,B,B,B,etc

^ Nice sequential access.

So for example 3 kernels one for red one for green and one for blue, or simply one kernel and do plane by plane, red, green, blue, but not all three at the same time.

Or you need to interleave the R,G,B elements as pairs like so:


^ Nice sequential access.

The question remains is it faster… this is indeed a good question… if you make sure all access is sequential, then at least the kernel will be fast.

If memory copy back/forth over pci express is faster or slower than cpu remains to be seen, this is good point though.

First thing you should do is put some streams and event code into your example, it’s quite easy to do that once that is done you’ll know which parts are actually the slowest ;)

So time kernel, time memory copies… then post results here ;) Also time cpu code ;)