High execution dependency and register usage

Hi,

I’m trying to implement a real-time image reconstruction algorithm using CUDA. Although my initial attempt produces correct results when compared against a serial implementation, the performance is rather poor. I’m trying to optimise the function using the Visual Profiler and found that performance is bounded by instruction and memory latency. Looking at the kernel profile result, the texture interpolation (tex1DLayered) is stalling the pipeline due to execution dependency.
Another problem is the high register usage (38 registers/thread), which limits the SM to simultaneously executing only 1 block out of 32 blocks on a GTX 980.

I would be very grateful if someone could have a look at the code below and suggest some solutions.

Thanks.
Andy

__global__ void kernel_DAS_DynApo7(float2* __restrict__  d_LRI,
	const float* __restrict__ d_pixelMapX, const float* __restrict__ d_pixelMapZ, const float angle,
	const unsigned int imageHeight, const unsigned int imageWidth, const unsigned int numChannel)
{
	//Define linear index;
	const int ind = blockIdx.x * blockDim.x + threadIdx.x;

	if (ind < imageWidth * imageHeight)
	{
		const int i = ind % imageHeight;
		const int j = ind / imageHeight;
		int ChannelInd = blockIdx.y;
		float zIm = d_pixelMapZ[i];
		float xIm = d_pixelMapX[j];
		float xObj = d_pixelMapX[j] - d_ElePosC[ChannelInd];
		float apoTheta = atanf(__fdividef(xObj, zIm));
		float k = 0.7854f * d_pitch * d_t2wl * __sinf(apoTheta);
		float apo;
		if (k == 0) {
			apo = 1;
		}
		else {
			apo = __fdividef(__sinf(k) * __cosf(apoTheta), k);
		}

		if (apo > 0)
		{
			float t_tx = zIm * __cosf(angle) + xIm * __sinf(angle);
			float t_rx = __fsqrt_rn((zIm * zIm) + xObj * xObj);
			float t_bf = (t_tx + t_rx)*d_t2wl + d_t0;

			//Texture interpolation
			float2 temp = tex1DLayered(d_RF_Layertex, t_bf, ChannelInd);
			temp.x = temp.x*apo;
			temp.y = temp.y*apo;
			d_LRI[ind].x += temp.x;
			d_LRI[ind].y += temp.y;
		}
	}
}

Some random comments:

(1) In all likelihood you want atan2f() here, not atanf()

(2) If the call to __fsqrt_rn() is intended to result in a fast square root: this is not what it does. Consider instead compiling with --prec-sqrt=false --ftz=true. Or use hypotf().

(3) You can likely replace most trig function calls with simple algebra: sin(atan2(y,x)) = y * rhypotf(x,y) and cos(atan2(y,x)) = x * rhypotf(x,y)

(4) When both sine and cosine of the same angle are needed, it’s best to use sincosf(). Probably doesn’t make any difference here because these trig functions are already reduced to device intrinsics.

[Later:]

I wonder whether the magic constant 0.7854f really represents π/4 = 0.78539816… It would probably make sense to review the mathematics from an algorithmic perspective (i.e. look at what is actually being computed here) with the goal of reducing the mathematical computations to a minimum. That will likely help with the register pressure.

Integer division and modulo operations have become quite expensive again with recent GPU architectures, and it is doubtful that the compiler can recognize coupled computation such as occurs here. Check the SASS from cuobjdump --dump-sass. If the compiler cannot simplify it, try

const int j = ind / imageHeight;
const int i = ind - imageHeight * j;

If imageHeight is guaranteed to be a power of 2, use shifts and masks to compute i, j, avoiding the division. Or, if imageHeight is one of a few well-known, non-power-of-2 values, consider making it a template parameter, allowing the compiler to optimize division by constant. At least some of these techniques should reduce register usage.

In terms of clarity, it might make sense to rewrite the code in a more natural manner (e.g. using actual divisions instead of __fdividef() calls), then use --use_fast_math to take care of desired optimizations aspects.

Hi Njuffa,

Thanks for your quick suggestions. I can see that simply switching to atan2f() and hypotf() and changing the modulo operation reduces the execution-dependency problem and brings register usage down to 30 registers/thread.

Also I can see that you have worked out that the constant is calculated from pi/4. If I replace pi with a constant symbol, do you think it might help with the register pressure?

Again, I can see that the float2 layered texture interpolation is still the main stalling factor. Do you have any idea how I could improve this?

Below is the kernel after modification:

//with implicit fast math
__global__ void kernel_DAS_DynApo7(float2* __restrict__  d_LRI,
	const float* __restrict__ d_pixelMapX, const float* __restrict__ d_pixelMapZ, const float angle,
	const unsigned int imageHeight, const unsigned int imageWidth, const unsigned int numChannel)
{
	//Define linear index;
	const int ind = blockIdx.x * blockDim.x + threadIdx.x;

	if (ind < imageWidth * imageHeight)
	{
		const int j = ind / imageHeight;
		const int i = ind - imageHeight * j;

		int ChannelInd = blockIdx.y;
		float zIm = d_pixelMapZ[i];
		float xIm = d_pixelMapX[j];
		float xObj = d_pixelMapX[j] - d_ElePosC[ChannelInd];
		float apoTheta = atan2f(xObj, zIm);
		float k = 0.7854f * d_pitch * d_t2wl * __sinf(apoTheta);
		float apo;
		if (k == 0) {
			apo = 1;
		}
		else {
			apo = __sinf(k) * __cosf(apoTheta) / k;
		}
		if (apo > 0)
		{
			float t_tx = zIm * __cosf(angle) + xIm * __sinf(angle);
			float t_rx = hypotf(zIm, xObj);
			float t_bf = (t_tx + t_rx)*d_t2wl + d_t0;

			//Texture interpolation
			float2 temp = tex1DLayered(d_RF_Layertex, t_bf, ChannelInd);
			temp.x = temp.x*apo;
			temp.y = temp.y*apo;
			d_LRI[ind].x += temp.x;
			d_LRI[ind].y += temp.y;
		}
	}
}

Thanks in advance!

Andy

I would think the goal here is to make the code purely memory bound by minimizing the computation, meaning the texture access would dominate the overall performance (even more so than now). The texture access is what is ultimately doing the useful work here.

It seems you have not taken up my point (3), eliminating the call to atan2f(). I am fairly certain that all that this is used for is the normalization of the vector (xObj, zIm) into a unit-length vector (at least that is what I see when I make a graph of the whole situation):

float apoTheta = atan2f(xObj, zIm);
... __sinf(apoTheta)
... __cosf(apoTheta)

This can be replaced by

float oneOverApoMag = rhypotf(xObj, zIm);
... xObj*oneOverApoMag
... zIm*oneOverApoMag

My point about the use of the constant π/4 is that this is used further on in the computation of ‘apo’ with trig functions, and I am wondering what the significance of that is in a geometric sense. Just like we can eliminate the call to atan2f() by a simple algebraic computation, something similar might be possible there also, but I can’t easily visualize what is actually being computed there. That’s why I suggested it would make sense to review that to see whether all the trig functions are actually needed.

The code as posted isn’t buildable so I cannot perform experiments. I would think that we would like to overlap the computation of ‘apo’ with the texture access. That would require moving the safeguard as follows:

float t_tx = zIm * __cosf(angle) + xIm * __sinf(angle);
float t_rx = hypotf(zIm, xObj);
float t_bf = (t_tx + t_rx)*d_t2wl + d_t0;

//Texture interpolation
float2 temp = tex1DLayered(d_RF_Layertex, t_bf, ChannelInd);
if (apo > 0) {
    d_LRI[ind].x += temp.x * apo;
    d_LRI[ind].y += temp.y * apo;
}

So the write-back would be safeguarded, but not the texture read. Not sure whether that is feasible in your context. Potential out-of-bounds access on a texture should be fine. Depending on the exact texture mode used that would result in junk data of one kind or another, but when that happens the result is ultimately discarded, so no harm done.

With the safeguard out of the way, the compiler should be able to schedule the texture access early, minimizing stall times. If that doesn’t happen you would need to physically move the entire code that feeds into the computation of ‘apo’ to after the call to tex1DLayered().

Below I show the code transformations I had in mind. This code was only re-arranged in an editor, so it may well have bugs. This should save about 30 instructions and some registers compared with your last version in #3. And it should provide the compiler with more opportunities to schedule loads early.

{
    const float PI_OVER_FOUR = 3.14159265358979323f * 0.25f;

    //Define linear index;
    const int ind = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (ind < imageWidth * imageHeight) {
        const int j = ind / imageHeight;
        const int i = ind - imageHeight * j;
        const int ChannelInd = blockIdx.y;

        float zIm = d_pixelMapZ[i];
        float xIm = d_pixelMapX[j];
        float xObj = d_pixelMapX[j] - d_ElePosC[ChannelInd];

        float t_tx = zIm * __cosf (angle) + xIm * __sinf (angle);
        float t_rx = hypotf (zIm, xObj);
        float t_bf = (t_tx + t_rx) * d_t2wl + d_t0;
            
        // Texture interpolation
        float2 temp = tex1DLayered (d_RF_Layertex, t_bf, ChannelInd);

        float oneOverApoMag = 1.0f / t_rx;  // rhypotf (xObj, zIm)
        float kp = PI_OVER_FOUR * d_pitch * d_t2wl * xObj;
        float k = kp * oneOverApoMag;
        float apo = (k == 0.0f)? 1.0f : (__sinf (k) * zIm / kp);
        if (apo > 0.0f) {
            d_LRI[ind].x += temp.x*apo;
            d_LRI[ind].y += temp.y*apo;
        }
    }
}

Hi Njuffa,

Really appreciate your help! There isn’t any problem with the code you modified, and it gains roughly a 5% time improvement. However, checking with the Visual Profiler, it seems register usage has gone up from 30 registers/thread to 31 registers/thread.

And sorry I haven’t posted the whole program, as the initial program loaded the data from MATLAB. I just managed to create another version which generates some random data for the image reconstruction. You may find the solution file in the link below.

https://drive.google.com/drive/folders/0B4VtavMHEP2LM051cDd1Y0VRN2s?usp=sharing

Thanks very much!

The register usage you observe seems to be much higher than I would expect. Do you use CUDA 8? Do you compile with --use_fast_math? Are you building in release mode?

The code that you previously posted did not build because there were undefined pieces of data. When I did a simple fix-up of these dangling references, the resulting code compiled into something that uses around 20 registers, depending on exactly how I did the fix-ups.

That would lead me to believe that the additional register pressure originates in code that you did not show and that you may want to simplify.

Note that registers are allocated with a certain granularity by the hardware. The details differ by architecture and I don’t recall them, but I think it is reasonably safe to assume that 30 vs 31 registers doesn’t make any difference as the hardware probably allocates 32 registers in either case. What’s the occupancy of the code now?

Ultimately, the code should become limited by the memory accesses as we optimize. The exact balance of flops/bytes for GPUs differs but is typically somewhere around 20-25; it would seem we are getting close to that here. What’s the total memory bandwidth consumed by this code?

Hi Njuffa,

Thanks very much for pointing out that I need to build the program in release mode. I have been compiling in debug mode so far and never noticed such a huge impact (more than 100x speedup). I can see now that the kernel performance is bound by memory bandwidth. The occupancy is now 85.6% and register usage has gone down to 16 registers/thread.

This is really exciting, and I’m a step closer to implementing real-time image reconstruction (10000 fps). I guess for now I could try to overlap the compute and memcpy by using CUDA streams?

Thank you so much Njuffa. I really appreciate your help.

Andy

I am pleased to hear that we got it all sorted out now. General advice for any platform (not just CUDA): Never do performance analysis using debug builds.

Yes, using CUDA streams to build a processing pipeline that overlaps data download to the GPU, data upload to the CPU, and kernel computation is typically a good idea. In many cases the data transfers can overlap kernel execution almost perfectly.
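A minimal sketch of such a pipeline (names like h_in, d_in, h_out, chunkBytes, and gridPerChunk are placeholders, the kernel argument list is elided, and h_in/h_out must be allocated with cudaHostAlloc()/cudaMallocHost() for the async copies to actually overlap):

```cuda
#define NSTREAMS 4
cudaStream_t streams[NSTREAMS];
for (int s = 0; s < NSTREAMS; ++s)
    cudaStreamCreate(&streams[s]);

// Split the work into NSTREAMS chunks so the H2D copy, kernel, and
// D2H copy of different chunks can execute concurrently.
for (int s = 0; s < NSTREAMS; ++s) {
    size_t off = s * (chunkBytes / sizeof(float2));
    cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    kernel_DAS_DynApo7<<<gridPerChunk, block, 0, streams[s]>>>(d_LRI + off, /* ... */);
    cudaMemcpyAsync(h_out + off, d_LRI + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}
for (int s = 0; s < NSTREAMS; ++s)
    cudaStreamSynchronize(streams[s]);
```

Whether the copies actually overlap with compute can be verified in the Visual Profiler timeline; GPUs with two copy engines (such as the GTX 980) can overlap H2D and D2H transfers simultaneously with kernel execution.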

Great and thanks again Njuffa!