one kernel accesses two textures simultaneously

littlehead · July 30, 2009, 6:59pm

Hi Everyone,

I am trying to compare two 3D images. I bound two textures (say tex1 and tex2) to the two image arrays.

Now in the kernel, I did this:

[codebox]prevalue = tex3D(tex1, (float)PixCoord1.x, (float)PixCoord1.y, (float)PixCoord1.z);
postvalue = tex3D(tex2, (float)PixCoord2.x, (float)PixCoord2.y, (float)PixCoord2.z);
Diff = abs(postvalue - prevalue);[/codebox]

The result is strange: it seems both postvalue and prevalue fetch their values from tex1. How can I solve this problem?

Any help will be appreciated!

Yuping

_Big_Mac · July 31, 2009, 12:00pm

Maybe check that you’re not binding both textures to the same cudaArray in the host code; look for typos.
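Host-side mistakes like this are easier to spot when every runtime call and the launch itself are checked. Below is a minimal sketch of that idea, not code from this thread: `CUDA_CHECK`, `dummyKernel` and `launchAndCheck` are hypothetical names, and `cudaDeviceSynchronize` is the newer spelling of what the 2009 toolkit called `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the file, line and message of a failed runtime call.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess)                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
    } while (0)

// Placeholder kernel standing in for the real comparison kernel.
__global__ void dummyKernel(int *out) { out[threadIdx.x] = threadIdx.x; }

// Launches the placeholder kernel with the given block size and returns
// the first error seen, or cudaSuccess if the launch and the run were fine.
cudaError_t launchAndCheck(int threadsPerBlock)
{
    int *d_out = NULL;
    CUDA_CHECK(cudaMalloc(&d_out, 1024 * sizeof(int)));

    dummyKernel<<<1, threadsPerBlock>>>(d_out);
    cudaError_t launchErr = cudaGetLastError();      // bad configuration, too many resources
    cudaError_t syncErr   = cudaDeviceSynchronize(); // errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_out));
    return launchErr != cudaSuccess ? launchErr : syncErr;
}
```

Calling `cudaGetErrorString(launchAndCheck(n))` after a failed launch gives a readable message; a kernel that fails to launch otherwise leaves its output buffers untouched, which can look like wrong texture data.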

littlehead · July 31, 2009, 5:45pm

No typo. I tried cudaGetLastError and found the problem.

I got: too many resources requested for launch.

I know it happens when I am fetching values from the textures inside two sets of nested loops. But I have no idea how to fix it…

Below is the snippet from the kernel where the problem arises. Num = {9, 9, 2}, SearchSize = {9, 9, 3}, texImage1 and texImage2 are each 512 * 512 * 65 * sizeof(unsigned short int) bytes, about 65 MB for the pair. Grid size is 16 * 16, block size is 4 * 4 * 32, and I’m using a Quadro FX 3800. I have a structure array which occupies 1.5 MB of global memory, a float array which occupies 0.5 MB of global memory, and some local variables which only occupy 180 B.

[codebox]// Outer loops: one candidate position per (i, j, k) in the search region.
for(int k = 0; k < Num.z; k++)
    for(int j = 0; j < Num.y; j++)
        for(int i = 0; i < Num.x; i++)
        {
            int Total = 0;

            // Inner loops: sum of absolute differences over the template.
            for(int iZ = 0; iZ < SearchSize.z; iZ++)
                for(int iY = 0; iY < SearchSize.y; iY++)
                    for(int iX = 0; iX < SearchSize.x; iX++)
                    {
                        PixCoord.x = Start.x + iX;
                        PixCoord.y = Start.y + iY;
                        PixCoord.z = Start.z + iZ;

                        prevalue = tex3D(texImage1, (float)PixCoord.x, (float)PixCoord.y, (float)PixCoord.z);
                        postvalue = tex3D(texImage2, (float)(TemStart.x + iX), (float)(TemStart.y + iY), (float)(TemStart.z + iZ));

                        Total = Total + abs(postvalue - prevalue);
                    }

            // Keep the smallest total seen so far for this pixel.
            if((float)Total < MinMeasure[Idx])
                MinMeasure[Idx] = (float)Total;
        }
[/codebox]

The two textures occupy the most memory, but that’s OK, the board has 1 GB of global memory, and all those textures and variables together could not consume the whole of it. When I was fetching from the two textures and returning the values to two 256 kB arrays without any loop, the kernel ran well. So I am wondering whether fetching from textures inside a loop consumes more resources as the number of iterations grows?

Thanks,

Yuping

_Big_Mac · August 1, 2009, 1:51pm

How many registers does this code use? With 512 threads per block, anything over 16 registers per thread (32 for CC 1.2+) exceeds the 8192 registers (16384 for CC 1.2+) available on a multiprocessor, so the kernel cannot launch. In that case, try smaller blocks.

By the way, is this your kernel code? Any chance of parallelizing the outermost loops (the ones over Num)?
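The register count can be read at compile time with `nvcc --ptxas-options=-v`, or queried at run time. Here is a minimal sketch of the run-time route using `cudaFuncGetAttributes`; `matchKernel` is a placeholder for the real kernel and `maxThreadsPerBlockFor` is a hypothetical helper name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute the real template-matching kernel.
__global__ void matchKernel(float *minMeasure)
{
    minMeasure[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

// Prints the per-thread resource usage of matchKernel and returns the largest
// block size the current device can launch it with (0 if the query fails).
int maxThreadsPerBlockFor()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, matchKernel) != cudaSuccess)
        return 0;

    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %lu bytes\n", (unsigned long)attr.localSizeBytes);
    printf("max threads per block:   %d\n", attr.maxThreadsPerBlock);
    return attr.maxThreadsPerBlock;
}
```

If the returned value is below the block size you intend to use (512 in this thread), the launch fails with "too many resources requested for launch", and the block has to shrink or the kernel has to use fewer registers.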

littlehead · August 3, 2009, 2:17pm

Thanks, Big_Mac!

I calculated the resources used by each block: there are 106 kB of local variables. I don’t know the size of the GPU’s register file, but that seems to be the problem. When I halve the block size, the program enters the kernel.

But I have another problem now. Since I have two sets of nested loops inside the kernel, the GPU running time exceeds the OS’s timeout limit. I thought about parallelizing the outer loop, but then I would have to call another kernel from inside a kernel. I think this is not possible, right?

I also thought about splitting the kernel into several smaller ones. But then I would have to copy many large arrays between kernels, right? How would that be more efficient than CPU multithreading? Right now, without multithreading, this program takes 18 seconds on the CPU, but the CUDA kernel has already consumed 11 seconds before the timeout…

Yuping

_Big_Mac · August 4, 2009, 12:16am

You could just launch Num.x * Num.y * Num.z times more threads. The iterations of the three outermost loops look independent, so it seems pretty trivial. You’re getting 162 times “more parallelism” out of the box.

The inner loops (over SearchSize) aren’t independent: they do sequential addition. You could instead perform a parallel scan to speed things up; you could then potentially add the 243 elements in 8 parallel steps instead of 242 sequential ones. That would naturally mean launching 243 times more threads (and those would need to end up in the same block to share memory). It’s a pity you have those irregular sizes - powers of two would be great here. You might run into uncoalesced writes.

You can potentially get 162 * 243 ~= 40 000 times more threads. They would be finer grained, and since generally there’s no such thing as too many threads in CUDA, I recommend parallelizing those loops.

The idea here is to have blocks with 243 threads each. Each block represents one iteration of the outer loops. Each thread in a block takes part in performing a parallel scan (look this up if you’re not familiar with it - it’s an important technique) to carry out the inner loops. You’re gonna launch 162 * (16*16) * (4*4*32) blocks (21 233 664) - now a single block does less work than a thread did in your code. The job of a block is to carry out the innermost loops (over SearchSize) in parallel. Then you take 162 of such blocks and you have the outer loops done - this is what a thread in your code was responsible for.

If that’s too much for a start, at least parallelize the outermost loops. Have a grid of (16*16*162) blocks and have each thread of a block compute the inner loops (sequentially) and write the MinMeasure value.
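The "8 steps instead of 242" idea can be sketched as a shared-memory tree reduction; only the sum is needed here, so a plain reduction is used rather than a full scan. The sketch makes several assumptions that are not in the thread: the two templates are read from plain global arrays instead of textures, the block is padded from 243 to 256 threads so the halving works cleanly, and `sadPerBlock` and `sadOnDevice` are hypothetical names. Error checking is left out to keep it short.

```cuda
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define TEMPLATE_VOXELS 243  // SearchSize.x * SearchSize.y * SearchSize.z = 9 * 9 * 3
#define BLOCK_THREADS   256  // next power of two; the 13 spare threads contribute 0

// One block sums |post - pre| over one 243-element template.
__global__ void sadPerBlock(const unsigned short *pre,
                            const unsigned short *post, int *sad)
{
    __shared__ int partial[BLOCK_THREADS];
    int t = threadIdx.x;
    int base = blockIdx.x * TEMPLATE_VOXELS;

    // Each thread computes one absolute difference (or 0 for padding threads).
    int d = 0;
    if (t < TEMPLATE_VOXELS)
        d = abs((int)post[base + t] - (int)pre[base + t]);
    partial[t] = d;
    __syncthreads();

    // Tree reduction: strides 128, 64, ..., 1, i.e. 8 steps.
    for (int stride = BLOCK_THREADS / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            partial[t] += partial[t + stride];
        __syncthreads();
    }
    if (t == 0)
        sad[blockIdx.x] = partial[0];
}

// Host wrapper: pre and post hold numBlocks consecutive templates of equal size.
std::vector<int> sadOnDevice(const std::vector<unsigned short> &pre,
                             const std::vector<unsigned short> &post)
{
    int numBlocks = (int)(pre.size() / TEMPLATE_VOXELS);
    std::vector<int> sad(numBlocks);
    if (numBlocks == 0)
        return sad;

    size_t bytes = (size_t)numBlocks * TEMPLATE_VOXELS * sizeof(unsigned short);
    unsigned short *dPre = NULL, *dPost = NULL;
    int *dSad = NULL;
    cudaMalloc(&dPre, bytes);
    cudaMalloc(&dPost, bytes);
    cudaMalloc(&dSad, numBlocks * sizeof(int));
    cudaMemcpy(dPre, pre.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dPost, post.data(), bytes, cudaMemcpyHostToDevice);

    sadPerBlock<<<numBlocks, BLOCK_THREADS>>>(dPre, dPost, dSad);

    cudaMemcpy(sad.data(), dSad, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dPre);
    cudaFree(dPost);
    cudaFree(dSad);
    return sad;
}
```

In the real kernel each thread would derive its (iX, iY, iZ) from `threadIdx.x` and fetch with `tex3D` instead of indexing an array; the reduction loop stays the same.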

littlehead · August 4, 2009, 1:28pm

Thank you so much for your patient reply!!

I’m going to check parallel scan now.

But I have to apologize for not explaining clearly. Actually none of the three loops is independent. In the outermost loop I am finding, for each pixel, its matching point in the other image. For each pixel, I set a search region around it. The size of the search region is Num.x * Num.y * Num.z. When searching in the region, I do template matching; the size of the template is SearchSize.x * SearchSize.y * SearchSize.z. Maybe parallel scan will help figure this out. I’ll read the paper and try it.

Btw, isn’t the 3rd dimension of grid always 1?

Thanks,

Yuping

_Big_Mac · August 4, 2009, 10:42pm

Is MinMeasure[Idx] local to an iteration of the Num loop or local to the whole loop? I.e., do all iterations of the Num loop write to the same MinMeasure[Idx]? If so, there is indeed a dependency, but it’s fixable by a parallel scan the same way as with the inner loops.

Yes, I was referring to the number of blocks and/or threads, not their dimensionality. That is, by (16*16*162) I meant the number 41 472, not a vector of (16, 16, 162).
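A flat count of 41 472 blocks fits in a one-dimensional grid, and each block can recover which pixel tile and which search candidate it owns from `blockIdx.x`. The sketch below also shows one possible way to handle the shared MinMeasure[Idx] dependency. It swaps the scan for `atomicMin` on integer totals, which is a substitution and not what the thread proposes, and it needs CC 1.1 or later. The `cost` array stands in for the template-matching totals, and `decodeBlock`, `minOverCandidates` and `minOnDevice` are hypothetical names.

```cuda
#include <climits>
#include <vector>
#include <cuda_runtime.h>

// Split a flat block index into a pixel-tile index and a candidate index.
__host__ __device__ inline void decodeBlock(int flat, int numCandidates,
                                            int *tile, int *cand)
{
    *tile = flat / numCandidates;
    *cand = flat % numCandidates;
}

// Each block handles one candidate for one tile of pixels; each thread owns
// one pixel. cost[pixel * numCandidates + cand] is an assumed stand-in for
// the Total computed by the inner loops.
__global__ void minOverCandidates(const int *cost, int numCandidates,
                                  int numPixels, int *minTotal)
{
    int tile, cand;
    decodeBlock(blockIdx.x, numCandidates, &tile, &cand);
    int pixel = tile * blockDim.x + threadIdx.x;
    if (pixel < numPixels)
        atomicMin(&minTotal[pixel], cost[pixel * numCandidates + cand]);
}

// Host wrapper: returns the minimum cost per pixel.
std::vector<int> minOnDevice(const std::vector<int> &cost,
                             int numCandidates, int threadsPerBlock)
{
    int numPixels = (int)(cost.size() / numCandidates);
    int tiles = (numPixels + threadsPerBlock - 1) / threadsPerBlock;
    std::vector<int> result(numPixels, INT_MAX);  // start above any real total
    if (numPixels == 0)
        return result;

    int *dCost = NULL, *dMin = NULL;
    cudaMalloc(&dCost, cost.size() * sizeof(int));
    cudaMalloc(&dMin, numPixels * sizeof(int));
    cudaMemcpy(dCost, cost.data(), cost.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dMin, result.data(), numPixels * sizeof(int), cudaMemcpyHostToDevice);

    // 1D grid: tiles * numCandidates blocks (16*16 * 162 = 41 472 in this thread).
    minOverCandidates<<<tiles * numCandidates, threadsPerBlock>>>(
        dCost, numCandidates, numPixels, dMin);

    cudaMemcpy(result.data(), dMin, numPixels * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dCost);
    cudaFree(dMin);
    return result;
}
```

Integer totals are used because the kernel in this thread accumulates `Total` as an `int` and `atomicMin` has integer overloads; the result can be converted to float afterwards if MinMeasure must stay a float array.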

littlehead · August 5, 2009, 2:31pm

Yes, each Num iteration writes to the same MinMeasure[Idx]…

Seems parallel scan is the best way to improve this program. I will give it a try.

Thanks!!

Yuping