For loop in kernel code

Can anyone help check this code? I always get an error if I uncomment the two lines that are currently commented out. The code is very long, so I'm posting only the part that has the problem.

double sum_w = 0, r, g, b, w;
double *src, *tag, *Ir, *Ig, *Ib; // these are pointers to memory blocks in device memory
int sRy, sRy_c;

for (int x = 0; x < wx; x++)
{
	// running read pointers into the current src and tag lines
	double *_s = src;
	double *_t = tag;

	for (int y = 0; y < wy; y++)
	{
		// sum of absolute differences over all channels
		double val = 0;
		for (int i = 0; i < nChannel; i++)
			val += fabs(*_s++ - *_t++); // fabs makes the double-precision version explicit
		r = Ir[y] - mean_Ir;
		g = Ig[y] - mean_Ig;
		b = Ib[y] - mean_Ib;
		w = ar * r + ag * g + ab * b + 1;

		// sum_w += w;
		// dist += w * val;
	}

	// advance all pointers to the next line
	src += sRy_c;
	tag += sRy_c;
	Ir += sRy;
	Ig += sRy;
	Ib += sRy;
}

BTW, I’ve checked that all the parameters passed into the kernel are correct.

Hope to have some comments and suggestions. Thanks!

What errors do you get? As posted, the code seems OK…



Thanks for the hint… I never properly checked the error message until you mentioned that. The error message I got is this: cudaErrorLaunchOutOfResources…

I already knew that this part of the code had a problem, because its result differs from that of my C++ code.

Now I’m checking the possible reason for this error :).

What block configuration are you using? This sounds like the thread block needs too many registers to launch.

Hi seibert,

I solved this problem by reducing the block size from 32 to 16, but now I have a new problem :(. I get a CUDA error that says "all CUDA-capable devices are busy or unavailable". Do you know whether there is any way to reset the CUDA device?

Is this Windows or Linux? On Linux, you might try unloading the nvidia module (rmmod nvidia), but on Windows I have no idea. Rebooting the computer is probably the only guaranteed way to fix a stuck driver.

Also, how many registers is your kernel using? (Pass the --ptxas-options=-v argument to nvcc.) If you can only launch 16 threads per block, the GPU will be idle most of the time.
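
To make the register arithmetic concrete, here is a rough host-side sketch of the check a launch has to pass. All numbers are assumptions for illustration: the thread never states the kernel's register count or the device, so the 40 registers per thread and the 16384-register limit used in the note below are hypothetical, and the function names are made up. In practice the first number comes from the ptxas output and the second from the regsPerBlock field of cudaDeviceProp. Real hardware also allocates registers in chunks, so this is an approximation rather than an exact model.

```cpp
#include <cassert>

// Simplified launch check: every thread of a block needs its registers at
// the same time, so the product must fit within the per-block limit.
// (Hypothetical helper; ignores the hardware's allocation granularity.)
bool fitsRegisterBudget(int regsPerThread, int threadsPerBlock, int regsPerBlockLimit)
{
    assert(regsPerThread > 0 && threadsPerBlock > 0 && regsPerBlockLimit > 0);
    return static_cast<long long>(regsPerThread) * threadsPerBlock <= regsPerBlockLimit;
}

// Largest thread count per block that fits the budget, rounded down to a
// whole number of warps (32 threads each).
int maxThreadsForRegisters(int regsPerThread, int regsPerBlockLimit)
{
    assert(regsPerThread > 0 && regsPerBlockLimit > 0);
    const int warpSize = 32;
    int threads = regsPerBlockLimit / regsPerThread;
    return (threads / warpSize) * warpSize;
}
```

With those assumed numbers, a 32x32 block (1024 threads) would need 40960 registers and fail the check, while a 16x16 block (256 threads) would need 10240 and pass. This would be consistent with the launch starting to work after the block size was reduced, if "32" and "16" refer to the block edge length.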

Hi seibert,

I tried halving the resolution of the images I process with this code, and the code works fine. I think it could be that I’m passing in too many variables. I’m going to debug the CUDA code on small images first.
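
As a side note on how image size feeds into the launch, here is a small sketch of the usual grid-size arithmetic. It assumes one thread per pixel and square blocks, neither of which is stated in this thread, and the helper names are hypothetical. It does not explain the error by itself; it only shows how quickly the amount of launched work shrinks when the resolution is halved.

```cpp
#include <cassert>

// Number of blocks needed to cover n pixels along one axis when each block
// spans blockSize pixels (integer ceiling division).
int blocksNeeded(int n, int blockSize)
{
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Total number of threads launched for a width x height image with square
// blockSize x blockSize blocks (assumed one-thread-per-pixel mapping).
long long totalThreads(int width, int height, int blockSize)
{
    long long bx = blocksNeeded(width, blockSize);
    long long by = blocksNeeded(height, blockSize);
    return bx * by * blockSize * blockSize;
}
```

For a hypothetical 1024x768 image with 16x16 blocks this gives 64 x 48 blocks; at half resolution (512x384) it gives 32 x 24 blocks, a quarter of the threads.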

Thanks for your help!