Running partly empty blocks

Is there any way to specify partly-empty blocks? For example, suppose my image is 129x129 pixels and my blocks are 8x8. I could pad the image out to 136x136, run 17x17 blocks of 8x8 pixels each, and throw out the extra data when I’m done. That’s fine, but dealing with the padding is extra work and may involve an extra copy at some point (e.g., if some external user created the original data array and wasn’t aware of the block size).

Another solution is to do a bounds check inside the kernel, but I want to avoid that for performance reasons.

What I really want is to specify (when I invoke the kernel) that I want to run 17x17 blocks of 8x8 threads each, but only the first 129x129 threads are actually valid and should be run. That means the 16 blocks on the right edge only run a single column of 8 valid threads, the 16 blocks on the bottom only run a single row of 8, and the bottom-right corner block has only 1 valid thread.

I don’t mind if the cost of running the partly-empty blocks is the same as if they were full; I just don’t want the device to actually read or write any memory for the invalid threads. That way, I get the same effect as padding without the overhead, with the added bonus that my warps in the edge blocks don’t diverge, like they would with a bounds check.

Any thoughts?

Brian

Would the performance impact be significant if the padded number of threads were launched and the unnecessary threads were kept from doing anything by putting an if around the whole kernel?

if (adr(pixelToCalculate) < pixelsInImage)
{
    // do stuff
}

It would add the if to every thread, but apart from that I doubt the divergence would hurt performance, because the divergent threads wouldn’t do anything.
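For concreteness, here is a minimal sketch of that whole-kernel guard. The kernel name, parameters, and the copy-one-pixel body are made up for illustration; only the guard itself is the point:

__global__ void processImage(const float *in, float *out, int width, int height)
{
    // Global pixel coordinates of this thread.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Whole-kernel guard: threads that fall outside the 129x129 image do nothing.
    if (x < width && y < height)
    {
        out[y * width + x] = in[y * width + x];  // placeholder for the real work
    }
}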

You might try surrounding your kernel with

if (borderblock)
{
    // care about border stuff…
}
else
{
    // efficient solution without border stuff
}

This is effectively what you want to do. Border blocks effectively execute a different kernel from the main one. There is no divergence, only some possible code duplication. The only problem with making your kernels ‘fat’ in this way is that the register count increases (even when it doesn’t need to, but nvcc does it anyway…), so you’ll have to benchmark a bit to see what is best.
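A rough sketch of that structure, assuming 8x8 blocks and a 129x129 image (the kernel name and the per-pixel work are invented placeholders):

__global__ void processImage(const float *in, float *out, int width, int height)
{
    // True only for blocks that stick out past the image on either axis.
    // The value is the same for every thread in a block, so this if does not diverge.
    bool borderBlock = ((blockIdx.x + 1) * blockDim.x > width) ||
                       ((blockIdx.y + 1) * blockDim.y > height);

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (borderBlock)
    {
        // Border path: per-thread bounds check, so any divergence stays in these blocks.
        if (x < width && y < height)
            out[y * width + x] = in[y * width + x];  // placeholder for the real work
    }
    else
    {
        // Interior path: no bounds check at all.
        out[y * width + x] = in[y * width + x];  // placeholder for the real work
    }
}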

Another thought:

You say that padding the image may involve an extra copy at some point. I’d say you have to copy anyway, from the host to the device. Allocating the device memory at the padded size could do the trick, so you might not need an extra copy. At least if you don’t care what’s in those border pixels; if they have to be zero, you have to initialize the memory, which would add some computation time.

I don’t understand why there would be no divergence in the first solution. Threads in the blocks containing both image and border pixels would execute different instructions at the same time, right? Isn’t that divergence?

Indeed, you could use cudaMemcpy2D when copying to the device to automatically do the conversion between the different “pitch” values, from unpadded to padded and back.
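For example, a sketch of that approach for an unpadded float image kept on the host (the function and variable names are made up; error checking omitted):

#include <cuda_runtime.h>

// Copy a width x height unpadded host image into pitched (padded) device memory and back.
void copyWithPitch(const float *h_in, float *h_out, int width, int height)
{
    float *d_img = 0;
    size_t pitch = 0;  // row stride in bytes, chosen by cudaMallocPitch

    cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);

    // Host rows are tightly packed (pitch = width * sizeof(float)); device rows use 'pitch'.
    cudaMemcpy2D(d_img, pitch, h_in, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    // ...launch the kernel here, passing pitch / sizeof(float) as the row stride...

    // Copy back, dropping the padding again.
    cudaMemcpy2D(h_out, width * sizeof(float), d_img, pitch,
                 width * sizeof(float), height, cudaMemcpyDeviceToHost);

    cudaFree(d_img);
}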

No, having an entire block perform different instructions than another block will not produce a divergence, since every warp in that block will be performing the same instructions.

Of course, inside the “if(borderblock)” you will probably have a few warps diverge where you have an “if (threadIdx.x < pixelWidth blah blah)”. If you are very careful, you may get the compiler to generate predicated instructions. But, I wouldn’t worry about it until you try it and find that the performance is unacceptable. Premature optimization is the root of all evil. That, and the programming guide gives you the impression that divergent warps are to be avoided at all costs. I have found that this is not the case. I put avoiding divergent warps at the bottom of my priority list now, after months of development spent trying things the hard way.

Anyways, the advantage of wumpus’s “if(borderblock)” suggestion is that the middle blocks (of which there will be many more than border blocks) will not even need to perform the “if (threadIdx.x < pixelWidth blah blah)”, saving you a few cycles even when the if wouldn’t produce a single divergent warp. Whether or not those few cycles change your performance by more than 1% is entirely dependent on your application. The best thing to do is try several different options and evaluate them based on performance and code readability/modifiability. I.e., if you can only get a 1% performance boost by doing the “if(borderblock)” thing, it might not be worth it. You will have added headaches because you will have the same code in two places, meaning you will need to update the same thing twice every time you tweak your code. That adds a lot of development time and even more debugging time when you forget to do it :)

Anyways, my suggestion is always to try the simplest to code solution first. Once you have it working, evaluate the performance based on the memory transfer your kernel sustains and the number of GFLOPs you get. If you are pushing device limits already, there probably isn’t much room for improvement by tweaking things like divergent warps.
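If it helps, here is a minimal sketch of that kind of evaluation using CUDA events. The kernel, launch configuration, and byte count are placeholders you would replace with your own:

#include <cuda_runtime.h>
#include <cstdio>

// Dummy kernel standing in for the real one.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

void benchmark(const float *d_in, float *d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Effective bandwidth: one read plus one write per element.
    double bytes = 2.0 * n * sizeof(float);
    printf("%.2f GB/s\n", bytes / (ms / 1000.0) / 1.0e9);

    // A GFLOP/s figure works the same way, counting the kernel's floating-point ops instead.

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}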

I realize that divergence can only occur within the same block and that divergence in different blocks isn’t that much of a concern as we have multiple multiprocessors. What I meant is actually what you describe in your second paragraph (the few warps diverging). Anything else would not make any sense because the author is only concerned with the pixels of the image and not with the border pixels.

The blocks that calculate only pixels outside the image wouldn’t be launched (see problem description) so the only thing that matters is where image borders are in the middle of a block.

Of course you’re giving a more general solution and I completely agree with what you say. I will investigate the “Predicated Instructions” you mentioned as I’m not familiar with that concept.

The programming guide is a little lacking when it comes to predicated instructions.

The idea behind a predicated instruction is that the “if” calculates a 1-bit value, then both branches have their instructions executed in all threads. The 1-bit value predicate is passed into the instruction and it will not write results if the predicate doesn’t match the correct branch of the if.

Now, getting nvcc to produce predicated instructions is challenging. You need to have a VERY simple line of code inside the if, otherwise the compiler deems that divergent warps are better. And don’t do “if (a && b)”, since the compiler really generates “if (a) { if (b) { … } }”, and that prevents the use of predicated instructions. I’ve had some luck with using “v = a & b; if (v)”, but it doesn’t always work.
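To illustrate the two forms (the kernel and variable names are invented; whether the compiler actually predicates either version is something you have to check in the ptx):

__global__ void guarded(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = y * width + x;

    // Form the compiler tends to turn into nested branches:
    //   if (x < width && y < height) out[idx] = in[idx];

    // Folding the condition into a single value first sometimes helps:
    int inside = (x < width) & (y < height);
    if (inside)
        out[idx] = in[idx];
}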

The only way to actually know if you got predicated instructions or not is to generate the ptx assembly and have a look at it.

The compiler and the runtime can do weird things, no matter what logic or the Guide say.

Please let us know the observed performance you get when you wrap the whole kernel in an if(thx<129) and when you have if(!border) { … } else { if(thx<129) { … } }. The performance ought to be the same, and similar to the no-borders case, but something tells me it won’t be.

Also, what happens with

if (thx >= 129)
    return;

I did the experiment, and the in-kernel boundary check (of a form like “if (threadIdx.x < pixelWidth blah blah)”) is pretty cheap: barely noticeable slowdown compared to padding, for most of my kernels. I think in the interest of simplicity, I’ll probably stick with the boundary check.

My take-home lesson for today is that MisterAnderson42 is right about premature optimization. :-)

Thanks for all your suggestions!

Brian

Have you also run into such weird cases yourself? One of my kernels’ performance doubles/halves between strictly equivalent versions with different flow-control structure (e.g., adding a goto inside an always-false if).

My guess is that somewhere in the pipeline, all flow control is translated into predication, if-else, and do-while, like in the shader era. In two of my kernels, using more “free” flow control halves the performance (possibly by generating predicated code for my bottleneck loop).

I really want freedom of flow control, but it may be that the 8800 simply doesn’t support it.

I’ve noticed as well… like in the shader era it’s important to move flow control out of inner loops, which is why I suggested in my post to put it around the entire kernel.

It would be nice if some #pragmas were added to CUDA so you could choose manually whether to generate predicated instructions or real flow control.

There’s gotta be an intrinsic for predication. There’s also gotta be an intrinsic for non-divergence (i.e. an analog of the PTX directive).

Some background: Flow control IS always translated into a form of predication. A warp always has to execute the exact same instruction, and if flow control splits the threads of a warp onto different paths, it has to be “emulated.” The threads in the warp that aren’t supposed to execute some instructions do execute them anyway but don’t record the results. This is a fundamental principle of the architecture paradigm, and it’s directly responsible for much of the performance gains. Please don’t complain about it. However, there exist several flavors of how this flow-control ‘emulation’ is carried out.

“Predication” is the most lightweight and “natural.” It doesn’t actually try to emulate CPU-style flow control, it’s simply an argument that every instruction can have. It’s a register, and if the register is false for a particular thread, the relevant thread is idled for that instruction. It is a simple mask, and is quick and effective.

Truly emulating the type of flow-control that one finds on traditional CPUs, however, is more difficult. CPUs support instructions that tell the processor to stop executing instructions from the current spot in memory and to start reading from a different spot. These instructions are essentially 'goto’s. This capability is very free-form, and conflicts directly with the GPU’s design philosophy.

Thankfully, the flow-control capability of CPUs is so free-form that language designers long ago decided not to let programmers use it. “Structured” programming (like that practiced in C) emerged by defining exactly two constructs that would make programs much more readable and maintainable: if statements and loops. These constructs don’t just jump all over a program; they have a defined entry point and a defined exit point, and they must nest fully rather than partially overlap. GPUs can handle these much better because such statements “diverge” and “reconverge” predictably. Threads can diverge because of an if statement, but once the if statement is over they come back together.

When a warp diverges, a G80 multiprocessor still can execute no more than one instruction per two cycles across its 16 cores. If there is an if-else statement, a G80 multiprocessor has to spend its 16 processors computing the ‘if’ and then the ‘else.’ Even if only 1 thread actually wants to execute the ‘else,’ all the other threads have to sit and wait for it. (You can think of it as waiting, or as executing predicated instructions whose predicate bits are false).

The crappy compiler is entirely charged with controlling the predication and divergence capabilities, and very often makes the wrong decisions. When it makes a wrong choice, scores of processors end up sitting and waiting on each other when they don’t have to. In PTX assembly, the programmer has the ability to use predication directly (bypassing the entire expensive divergence-reconvergence mechanism). The programmer can also assert that a given if statement will always execute the same for all threads in a warp, so that the divergence-reconvergence mechanisms don’t need to be invoked.

This control is very useful, but is unfortunately not possible from cu.
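Roughly, and only as an illustration in the style of the PTX listings below (the label names are invented and exact syntax depends on the PTX version): a predicated instruction just carries a guard register, a branch invokes the divergence/reconvergence machinery, and a branch marked .uni is the programmer’s assertion that the whole warp takes it the same way.

setp.ge.f32  $p1, $f1, $f5;      // compute the predicate
@$p1  add.f32  $f2, $f1, $f2;    // predicated: inactive threads are masked, no branch at all

@!$p1  bra  $Lt_skip;            // real branch: divergence/reconvergence machinery kicks in
add.f32  $f2, $f1, $f2;
$Lt_skip:

@!$p1  bra.uni  $Lt_skip2;       // .uni asserts the branch is uniform across the warp
add.f32  $f2, $f1, $f2;
$Lt_skip2: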

Also, cu, due to its C heritage, supports the unstructured ‘goto’ statement. Goto’s must NEVER be used in cu because they make the diverge-converge mechanism poop itself. The result is that execution never resynchronizes and the multiprocessor runs the entire kernel (rather than a piece of it) multiple times. However, the fact that an always-false if statement that contains a goto still causes problems is likely caused by the compiler being bad.

While I hate nvcc and like ptxas, I have to say that nvcc is innocent here. Nvcc never generates predicated code; it always emits branches. The actual decision is made by ptxas, i.e. ptxas changes branches into predication even though it supports predication via intrinsics.

Alex’s theory about goto may indeed be the cause of my problem. That goto DECREASED my cubin by two dwords. Maybe the compiler indeed failed to generate a convergence intrinsic or something.

However, I’d like to say pure structured programming is rather problematic. Sometimes goto is essential in expressing algorithm logic, especially in that kernel of mine. I only use goto to jump over unnecessary arithmetic within one loop iteration, or to break out of an inner loop (by the way, break becomes goto after nvcc). Also, my test condition is really coherent across warps.

If I could get hold of the converge intrinsic, I could place it manually to make things still converge well. However, this currently seems to be done behind ptxas, or doesn’t exist at all. I have to resort to the old “set a flag and test it” trick from the TP6.0 era. That’s really a big and annoying anti-pattern.
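A minimal sketch of that flag trick, replacing a forward goto that skips the rest of a loop iteration (the kernel body and names are invented for illustration):

__global__ void flagTrick(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float acc = 0.0f;
    for (int k = 0; k < 16; ++k)
    {
        float v = in[i * 16 + k];

        // Instead of:  if (v < 0.0f) goto skip;  ...expensive arithmetic...  skip:
        bool doWork = (v >= 0.0f);
        if (doWork)
        {
            acc += v * v;  // stands in for the expensive arithmetic the goto used to jump over
        }
    }
    out[i] = acc;
}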

Not true.

__global__ void testkernel(float *a, float *b)
{
    float sum = 1.0f;
    float tmp = *a;
    if (tmp >= 5.0f)
        sum += tmp;
    *b = sum;
}

nvcc -ptx test.cu

ld.param.u64  $rd1, [__cudaparm_a];    //  id:14 __cudaparm_a+0x0
ld.global.f32  $f1, [$rd1+0];          //  id:15
.loc    12    7    0
mov.f32  $f2, 0f3f800000;              //  1
add.f32  $f3, $f1, $f2;
mov.f32  $f4, 0f3f800000;              //  1
mov.f32  $f5, 0f40a00000;              //  5
setp.ge.f32  $p1, $f1, $f5;
selp.f32  $f6, $f3, $f4, $p1;
ld.param.u64  $rd2, [__cudaparm_b];    //  id:16 __cudaparm_b+0x0
st.global.f32  [$rd2+0], $f6;          //  id:17

Though I agree that it would be nice to have some pragma control over when the compiler chooses to do this. It can be very finicky. I especially have trouble getting it to use predicated instructions when the conditional is more complicated.

That’s not really real predication. Real predication looks like “@p” and masks an instruction (unless that’s just a front and it gets translated into a multi-instruction mess in the back).

An implementation using predication would look like:

ld.param.u64  $rd1, [__cudaparm_a];    //  id:14 __cudaparm_a+0x0
ld.global.f32  $f1, [$rd1+0];          //  id:15
.loc    12    7    0
mov.f32  $f2, 0f3f800000;              //  1
mov.f32  $f5, 0f40a00000;              //  5
setp.ge.f32  $p1, $f1, $f5;
@$p1  add.f32  $f2, $f1, $f2;
ld.param.u64  $rd2, [__cudaparm_b];    //  id:16 __cudaparm_b+0x0
st.global.f32  [$rd2+0], $f2;          //  id:17

I agree, goto gets an undeserved bad rap, at least among language instructors. But it’s often actually much more readable than an equivalent mess of if’s and braces. The fact that it’s been put into C#, 40 years after it had been condemned, says a lot.

And you’re right, reconverging from it should be possible in many cases. Even automatable. (maybe all that’s required is that the goto goes forward).

e.g., The compiler should easily understand how to unroll divergence in such a case:

if (a) {
    if (b) {
        if (c) { goto Label; }
    }
}

// Some more code

Label:

Indeed, nvcc may emit selp, but not predication. I forgot that when posting.

That’s exactly my double/half performance case: c is an always-false condition on a value loaded from a texture, and that goto jumps over half of my bottleneck code. That was a stupid mistake, but after that I managed to double the performance of another similar kernel by replacing the goto with a flag.