Kernel is slow - don't know why


i’m currently trying to figure out why my kernel runs slow. It processes two stereo images (1280x960) and calculates the 3d cost space. So every pixel is tested against x (in the test case 128) disparity levels (so pixels from the other image). To achieve good performance, the pixels from the right image get cached in the shared memory so global memory should not be the bottleneck. It’s based on the ideas of a paper which reported ~2ms execution time for that settings. With my kernel I only achieve 8ms…

They used a Tesla C2050 for this, I’m working on a GTX 460 which should be roughly even right?

The Visual Profiler says there is only ~3GB/s global memory bandwidth (load) used - but I don’t know why because to my understanding the load operations should all be coalesced and therefor quite fast.

Here is the kernel (I know it’s hacky but it’s currently only for testing)

It’s currently called with a block size of [32, 1, 4]

__global__ void costEstimation(unsigned char* imageLeft, unsigned char* imageRight, unsigned char* costs){

	const uint3 id = make_uint3(

		blockIdx.x * blockDim.x + threadIdx.x,

		blockIdx.y * blockDim.y + threadIdx.y,

		blockIdx.z * blockDim.z + threadIdx.z);

	const uint3 globalSize = make_uint3(

		gridDim.x * blockDim.x,

		gridDim.y * blockDim.y,

		gridDim.z * blockDim.z);

	const uint3 localId = threadIdx;

	const uint3 localSize = make_uint3(blockDim.x,



	int2 size = make_int2(1280, 960);

__shared__ uchar4 rightPixelCache[DISPARITY_LEVELS + 32];

	// fetch into cache

	int cacheSize = DISPARITY_LEVELS + localSize.x;

	if(localId.z == 0) {

		int cacheBasePosX = id.x - localId.x - DISPARITY_LEVELS;

		int cacheId = localId.x;

		int fetchesPerItem = cacheSize / localSize.x;

		for(int i = 0; i < fetchesPerItem; i++) {

			const int2 rightPos = make_int2(cacheBasePosX + cacheId, id.y);

			if(cacheId < cacheSize) {

				rightPixelCache[cacheId] = readPixel(imageRight, rightPos, size);

				//rightPixelCache[cacheId] = make_uchar4(id.x, 1, 1, 0);

				cacheId += localSize.x;






	const int2 leftPos = make_int2(id.x, id.y);

	const uchar4 leftPixel = readPixel(imageLeft, leftPos, size);

	//const uchar4 leftPixel = make_uchar4(0,0,0,0);

	int disparitiesPerItem = DISPARITY_LEVELS / localSize.z;

	for(int i = 0; i < disparitiesPerItem; i++) {

		int disparity = i * localSize.z + localId.z;

		const int2 rightPos = make_int2(id.x - disparity, id.y);

		unsigned char cost = 255;

		int cacheBaseIndex = localId.x + DISPARITY_LEVELS;

		//const uchar4 rightPixel = readPixel(imageRight, rightPos, size);

		const uchar4 rightPixel = rightPixelCache[cacheBaseIndex - disparity];

		cost = abs(leftPixel.y - rightPixel.y);

		costs[(id.y * size.x + id.x) + disparity * size.x * size.y] = cost;



Any help would be appreaciated.



On a GTX 570 I get 5,7ms

load efficiency is ~83%

Can you link us the paper so we can have all information to help you?

Hello zish,

comparing results with other algorithms not implemented by you may be not trivial. Said that, how do you / they time the algorithm?

    [*]Are you/they including input CPU-GPU and output GPU-CPU memory transfer (memcpy)? If not, we should talk about computing time.[*]How much shared memory need your code?[*]Do you/they use the same driver and CUDA versión?.[*]Did you compile the code with compiler optimizations (-O2, -O3)? and they?[*]Did you do a 'hot start' to get the time?

I’m not sure if the term ‘hot start’ is well described neither if it is a cause for concern at present, but the first kernel call in your program takes a bit more time because the driver does some initialization. So, to get the best results and show off to your friends you could try:

    [*]Compile with -O2 and -O3.[*]Be sure you add the arch=sm_20 to the compiler.[*]If your kernel use less than 16KB of shared memory then try with cudaFuncCachePreferL1 (48KB of shared memory).[*]Launch a dummy kernel, i.e., copy data from buffer A to buffer B.[*]Get the times w/o the CPU-GPU transfer time.

It would be interesting to know the results based on these points.

Best regards,


Maybe the paper he means is this

I found it by googling for “3d cost space”.

The GTX 460 has notably lower texture access performance compared to some previous generation Compute 2.0 cards. I am not sure if this would also affect global memory reads. But maybe this is the reason why it appears to be slower than the Tesla C2050.

See here for reference: and and

You should try building the kernel for sm_12 as well as sm_20 and compare results.


Sorry for the late response,

the paper i’m trying to implement is

(couldn’t find it somewhere else, so I hope you have access to it)

They don’t specify which compiler options the used.

I’ve got it down to 3,7ms on the GTX 460 - the difference is pretty small and actually i’m not sure why it made such a huge difference

for(int i = 0; i < disparitiesPerItem; i += 4) {

		int disparity = i * localSize.z + localId.z;		

		int cacheBaseIndex = localId.x + DISPARITY_LEVELS;

		uchar4 costBundle;

		// a

		costBundle.x = abs(leftPixel.y - rightPixelCache[cacheBaseIndex - disparity - 0].y);

		// b

		costBundle.y = abs(leftPixel.y - rightPixelCache[cacheBaseIndex - disparity - 1].y);

		// c

		costBundle.z = abs(leftPixel.y - rightPixelCache[cacheBaseIndex - disparity - 2].y);

		// d

		costBundle.w = abs(leftPixel.y - rightPixelCache[cacheBaseIndex - disparity - 3].y);


		costs4[(id.y * size.x + id.x) + (disparity / 4) * size.x * size.y] = costBundle;


Currently I compile with -O2, tried fast-math but it didn’t change anything and make the calculation twice so I get a “hot start”.

The option arch=sm_20 didn’t change anything. The Kernel uses 1KB shared memory.

The timings are only the kernel execution - using the Visual Profiler so it should be accurate.

There are no texture accesses so I guess that shouldn’t be a problem.

I also tried to implement the other parts of the paper but I’ve got similar frustrating results.

Just a bit background:

The plan was to implement this method and use it in the evaluation of my bachelor thesis (I’ve already implemented another method which works well).

If anyone has/knows a cuda/opencl implementation for stereo matching and can point me to it, it would also pretty awesome.

Thanks for your replies,

I will let you know if I make any progress