Profiling Challenge: Unexplained Massive Slowdown - demonstrated by replacing a single line of code!

This simplified example exhibits a massive ~4x slowdown in performance, caused by modifying a single line of code.
The block size is set to 128, specifically tailored to the GTX 750 Ti (1st-gen Maxwell, 128 cores x 5 SMs).

See the full code below, and the two alternative lines in the inner loop in particular:

int val = fabsf(ind); // 529msec
//int val = fabsf(i - ind); // 2084msec - ~4 times slower!

Note that the PTX of the two versions differs by more than a single instruction.
This may provide a lead as to the culprit.

Configuration: GTX750Ti, Win7 x64, VS 2013, CUDA 7.0
The same results are obtained regardless of whether the code is compiled for cc2.0 or cc5.0.

#include "cuda_runtime.h"

#include <vector>
#include <iostream>
using namespace std;

// Global device buffers, allocated and freed once
// input buffer
float *pVol_d = NULL;
// output buffer
float *pOut_d = NULL;

__global__ void kernel(const float *pVol, const int volSize, float *pOut)
{
	const int nCols = blockDim.x * gridDim.x;
	const int iCol = threadIdx.x + blockDim.x * blockIdx.x;

	const int ind = blockIdx.y * nCols + iCol;

	// All threads in the block join forces in reading the voxels into the shared memory in a perfectly coalesced manner
	extern __shared__ float currVoxels[]; // dynamic shared memory (allocated at runtime to be the size of the cuda block)

	float acc = 0;
	const int nVoxelsPerIter = volSize / blockDim.x;
	for (int j = 0; j < nVoxelsPerIter; ++j)
	{
		currVoxels[threadIdx.x] = pVol[threadIdx.x + j * blockDim.x];
		__syncthreads();

		for (int i = 0; i < blockDim.x; ++i)
		{
			int val = fabsf(ind); // 529msec
			//int val = fabsf(i - ind); // 2084msec - ~4 times slower!
			val = val >= 0 ? 1 : val; // effectively set to 1, regardless of the previous assignment (same result in both cases)
			acc += val*currVoxels[i]; // cc2.x: broadcast (no bank conflict)
		}
	}

	pOut[ind] = acc;
}

int CUDABuffersAllocate(int volSize, int nCols, int nRows)
{
	// return at once if the buffers have already been allocated
	if (pOut_d)
		return cudaSuccess;
	
	// Allocate device buffers

	cudaError cudaStatus;
	// inputs
	cudaStatus = cudaMalloc(&pVol_d, volSize * sizeof(float));
	if (cudaStatus != cudaSuccess)
		return cudaStatus; // don't let the second allocation's status overwrite this error
	// output
	cudaStatus = cudaMalloc(&pOut_d, nCols * nRows * sizeof(float));

	return cudaStatus;
}

void CUDABuffersFree(void)
{
	// Free device buffers

	if (pOut_d)
	{
		cudaFree(pOut_d);
		cudaFree(pVol_d);

		pOut_d = NULL;
		pVol_d = NULL;
	}

	// cudaDeviceReset must be called before exiting in order for profiling and
	// tracing tools such as Nsight and Visual Profiler to show complete traces.
	if (cudaDeviceReset() != cudaSuccess)
	{
		cerr << "cudaDeviceReset failed!" << endl;
	}
}

void CallCUDA(const float *pVol, int volSize, float *pOut, int nCols, int nRows)
{
	// bail out at once if the buffers have not been allocated yet
	if (!pOut_d)
	{
		cerr << "Uninitialized buffers. CUDABuffersAllocate must be called once before calling CallCUDA()." << endl;
		return;
	}

	cudaError_t cudaStatus;

	// Copy data pertaining to current execution to device

	cudaStatus = cudaMemcpy(pVol_d, pVol, volSize * sizeof(float), cudaMemcpyHostToDevice);

	// Timing net kernel execution

	float time;
	cudaEvent_t start, stop;

	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start, 0);

	// Launch the CUDA kernels, with dynamic shared memory

	// hardcoded block/grid sizes
	dim3 blockSize(128, 1, 1); // 128 = 640 cores / 5 SMs
	dim3 gridSize(nCols / blockSize.x, nRows, 1); // gridSize.x = 128^2 / 128 = 128

	kernel<<<gridSize, blockSize, blockSize.x * sizeof(float)>>>(pVol_d, volSize, pOut_d);

	// Check for any errors launching the kernel
	cudaStatus = cudaGetLastError();
	if (cudaStatus != cudaSuccess)
	{
		cerr << "kernel launch failed: " << cudaGetErrorString(cudaStatus) << endl;
		goto KernelError;
	}

	// cudaDeviceSynchronize waits for the kernel to finish, and returns any errors encountered during the launch.
	cudaStatus = cudaDeviceSynchronize();
	if (cudaStatus != cudaSuccess)
	{
		cerr << "cudaDeviceSynchronize returned error code " << cudaStatus << " after launching kernel!" << endl;
		goto KernelError;
	}

KernelError:

	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&time, start, stop);

	cout << "Net runtime: " << time << " ms" << endl;

	// copy output to host
	cudaStatus = cudaMemcpy(pOut, pOut_d, nCols * nRows * sizeof(float), cudaMemcpyDeviceToHost);

	cout << "Done." << endl;
}

int main()
{
	const int volSize = 128 * 128 * 128;
	const int nCols = 128 * 128;
	const int nRows = 2;

	vector<float> vol(volSize, 1.f); // initialize to 1's
	vector<float> out(nCols * nRows); // output

	CUDABuffersAllocate(volSize, nCols, nRows);
	CallCUDA(&vol[0], volSize, &out[0], nCols, nRows);
	CUDABuffersFree();

	system("Pause"); // required for an empty project

	return 0;
}

An elementary question, if I may:

What happens when you declare ind as int as opposed to const int?

The same behavior holds. Feel free to try it out yourself!

My guess would be that the compiler optimized the fast case: the result of fabsf(ind) is independent of both loop counters and can therefore be computed once, where ind is defined. In the slow version, fabsf(i - ind) needs to be evaluated nVoxelsPerIter * blockDim.x times.

BTW, why fabsf()? Source and destination are integers.

I’m using fabsf in the first place because the original code contained floats.

There is no reason for 128^3 fabsf & subtract operations to take 1.5 sec, regardless of any optimization applied to the 1st version.

Note that if I simply typecast to float in these two versions, the 2nd version is only ~35% slower.
Still, ~180 msec is a lot relative to the number of instructions added.

float val = fabsf((float)ind); // 500msec
float val = fabsf((float)(i - ind)); // 678msec

There is definitely something weird going on!

I think you should be able to predict the value or outcome of val, such that the fabs is not required:

if (ind > blockDim.x)
{
	val = ind - i;
}
else
{
	val = i; // 0
}

Indeed, that is true in this case, but it wasn’t the point I was trying to make.
The difference between using ‘int val = fabsf(i - ind)’ and ‘float val = fabsf((float)(i - ind))’
only shows there is something else going on under the hood.

I think I would probably agree with that statement, but that statement has no bearing on your code or the observation being presented.

Rather than trying to parse through your code to calculate the number of times that statement actually gets executed, I modified the code to perform an atomicAdd on a global (device) variable, immediately after the fabs statement, in the “slow” case.

The total number of times that statement gets executed is not 128^3 = 2097152 but actually 68719476736, i.e. 128^5 * 2. If you study your code carefully, I think you’ll be able to derive why this number is correct. You are launching a grid of 128x2 blocks of 128 threads each. Each thread runs through an outer for-loop of volSize (= 128^3) / 128 = 128^2 iterations, each containing 128 iterations of the inner loop. This yields a total of 128^5 * 2 executions of the fabs statement when it is not hoisted out of the inner loop by the compiler.

Given that the statement in question involves a non-full-throughput floating point operation, additional ops like subtraction and type conversion, it’s quite reasonable that the difference suggested by Detlef Roettger is indeed the correct and full explanation for the timing difference. Hoisting that statement out of the body of the loops makes a huge difference. Instead of getting executed ~128^5 times, it is getting executed ~128^2 times (assuming hoisting out of both nested loops, or ~128^4 if only hoisted from inner loop), and the difference (128^5-128^2) is in fact about 128^5 executions of the statement. If you divide this number into 1.5 seconds, you get an average execution time of far less than 1ns for this statement.

The “common-expression-hoisting-by-the-compiler” theory is a completely rational explanation for the timing difference.

Many thanks for your highly educational answer, especially with respect to your profiling methodology.
Detlef’s answer already accounted for the optimization done on the 1st version,
and you’ve added the great number of times fabsf() is executed in the 2nd,
but there is an additional issue still remaining: the difference between these two:

int val = fabsf(i - ind); // 2084msec

, and

float val = fabsf((float)(i - ind)); // 678msec

Any insight?

We already know now that small differences in this inner loop can have a large impact on overall execution time due to the huge 128^5*2 multiplicative factor.

You should now be able to answer the question yourself by dumping the SASS code in each case:

cuobjdump -sass myexe

and comparing the actual differences in instructions emitted by the compiler for each case. Just based on looking at the source code, it would seem to me that there is one less float-to-int cast/conversion in the faster case, but I’m sure there are other differences as well. My suspicion is that fabsf with an integer argument takes a substantially different code path than fabsf with a float argument (i.e. more than just an int-to-float conversion).

I could certainly do the legwork for you, by dumping the SASS and comparing, but your learning would be substantially reduced in that case. I already have enough data to convince myself that very minor differences in this inner loop can be dramatic.

The binary analysis tools are documented here:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#abstract

As an aside, I’m not proud of that “profiling methodology”. I took one look at your code and could not deduce the 128^5*2 number in less than 10 seconds. So I’m lazy and let the machine do the work for me. Adding that atomicAdd makes your code run much slower, so I don’t think it’s a good “profiling” technique.

I enjoy the legwork…
I’ve already inspected the PTX code prior to submitting my reply involving the typecast, and it is considerably different. That was sufficient to determine that the two versions are not equivalent, but what amazed me was the great difference in running time, which I could not have predicted in advance.

In any case, your reply answers my question in full, but as a rule,
would you necessarily recommend dumping SASS rather than PTX?

I almost never recommend analysis of code behavior based on PTX.

PTX code is not actually the machine code that gets executed (SASS is what gets executed), but more importantly the tool (ptxas) that converts PTX to SASS is really more like a compiler than an assembler, and it can do significant optimization on its own. Therefore, the executed code may differ markedly from what the PTX would suggest, so PTX is not a reliable indicator of performance.