Solved - Kernel runs 5 times slower on GTX 960 than GTX 560

I have a problem with this kernel: when I run it (the same .exe), it is 5x slower on a GTX 960 than on a GTX 560 Ti. The code was optimized for the 560's hardware parameters, but I did not expect a slowdown like this.

Here is the code:

#define	PERM_CYCLES		10000
#define	NUM_OF_COMBINATIONS	(3550*4)

// permutations based on code from
__global__ void CUDA_Generate_Permutations_Inlined (CUDA_PERM_STATE *perm_state, unsigned char *out_buffer)
{
	char set[CUDA_PERMUTE_SIZE];		// our permuted 16B sequence
	int p[CUDA_PERMUTE_SIZE + 1];
	int i;
	char tmp;
	unsigned int dwEAX, dwEBX, dwECX, dwEDX, dwEDI, dwESI;
	unsigned long long uLong;
	int j, intIterations = 0;

	int index = blockIdx.x * blockDim.x + threadIdx.x;		// thread index
	if (index >= NUM_OF_COMBINATIONS) return;			// out of data bounds
	if (*(unsigned int*)out_buffer >= OUT_BUFFER_SIZE) return;	// out of range of output buffer (20 MB)

	// reload last permutation state
	for (j = 0; j < CUDA_PERMUTE_SIZE; j++) {
		set[j] = perm_state[index].set[j];
		p[j] = perm_state[index].p[j];
	}
	p[CUDA_PERMUTE_SIZE] = perm_state[index].p[CUDA_PERMUTE_SIZE];
	i = perm_state[index].i;
	i = 1;

	// create permutations - start of main cycle
	while (i < CUDA_PERMUTE_SIZE) {
		// fill registers and compute hash
		dwEDX = ((unsigned int*)set)[0];
		dwECX = ((unsigned int*)set)[1];
		dwEBX = ((unsigned int*)set)[2];
		dwEAX = ((unsigned int*)set)[3];

		dwEAX += dwEBX + dwECX + dwEDX;
		dwEBX ^= dwEAX ^ dwECX ^ dwEDX;
		if (dwEBX) {		// guard against division by zero
			dwECX -= dwEAX;
			// dwEDX:dwEAX = dwEAX * dwECX;
			uLong = (unsigned long long)dwEAX * (unsigned long long)dwECX;	// MUL
			dwEAX = uLong & 0xFFFFFFFF;
			dwEDX = uLong >> 32;

			dwEAX ^= dwEDX;
			dwEDX += dwECX;
			dwESI = dwEAX % dwEBX;	// DIV
			dwEDI = dwESI * 3;
			dwEAX += dwESI;
			dwECX ^= dwEDX;
			dwEDX ^= dwESI;
			dwEAX = dwEAX * dwEDX;	// MUL
			dwECX ^= dwEDI;
			dwECX += dwESI;

			if (dwECX) {		// guard against division by zero
				// dwEDI = dwEAX % dwECX;	// IDIV
				long long llEAX = dwEAX;
				dwEDI = llEAX % int(dwECX);	// required to mimic the x86 IDIV instruction
				dwECX = dwEDI * 0xE4;

				if (dwECX == 0xAD249F04) {	// if it passed the ECX test, write the sequence to the output buffer
					*(uint4*)&out_buffer[*(unsigned int*)out_buffer] = *(uint4*)set;
					*(unsigned int*)out_buffer += 16;
				}
			}
		}

		// next permutation (countdown scheme)
		p[i]--;
		if (i & 1)
			j = p[i];
		else
			j = 0;
		tmp = set[j];
		set[j] = set[i];
		set[i] = tmp;

		for (i = 1; !p[i]; i++)		// reset exhausted counters
			p[i] = i;

		if (++intIterations >= PERM_CYCLES)	// cycle breaker
			break;
	}

	// save last permutation state
	for (j = 0; j < CUDA_PERMUTE_SIZE; j++) {
		perm_state[index].set[j] = set[j];
		perm_state[index].p[j] = p[j];
	}
	perm_state[index].p[CUDA_PERMUTE_SIZE] = p[CUDA_PERMUTE_SIZE];
	perm_state[index].i = i;
}

void CallKernel_Generate_Permutations (CUDA_PERM_STATE *perm_state, unsigned char *out_buffer)
{
	CUDA_Generate_Permutations_Inlined <<<PERM_NUMBER_OF_BLOCKS, PERM_THREADS_PER_BLOCK>>> (perm_state, out_buffer);
}

What the kernel does:
It generates 16-char permutations, computes a custom 16-byte hash from each one, and if 4 bytes of the hash equal a given constant, writes that permutation to the output buffer. The probability of finding a hash match is very low (about 1 in 4 billion). Permutation states are kept in CUDA_PERM_STATE structures for save/resume.

The problem:
Global memory access is not the issue. The kernel accesses global memory only twice, at the start and at the end. The main cycle (which runs 10,000 times) practically does not touch global memory.

The NVIDIA profiler shows 45% occupancy on the 560 and a great 85% occupancy on the 960. It also shows that the kernel uses 41 registers on the 560 but only 21 registers on the 960 (again, the same exe was used). So I assumed the problem was that some of the variables were somehow moved from fast registers to slow local memory.

So I rewrote the code so that the arrays set and p were moved to shared memory with coalesced access. The code works properly and the profiler shows Shared_Memory_Efficiency = 99.9%, so the coalescing was fine. But the code was even slower than without shared memory, so this is probably not the bottleneck.

I use Visual Studio 2010 on Windows 7 64-bit with the CUDA 7.0.28 SDK. Changing the target architecture to compute_52,sm_52 brought no speed improvement.


Now I don't know how to proceed further. Does anyone have an idea what could be wrong with this code?
What should be inspected is the inner cycle, which runs 10,000 times; the code outside the cycle has practically zero impact on performance.

The first time this binary ran, the PTX JIT compiler had to compile the kernel for the Compute 5.x architecture. This may have skewed your measurement.

Does it run faster when you run it again? The second time around, the kernel can be pulled from the JIT cache.

[Side remark: Multiple exclamation marks are only warranted in case of life-threatening emergency]

To avoid the JIT compilation you would want to build the app as a fat binary that incorporates the machine code for any architecture you intend to run on, plus PTX for the latest architecture. For example, to cover all platforms supported by CUDA 7.0, you would build with:

-gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52

That's the maximum list, and compiling for this many architectures will significantly increase compile time, so you would want to pare it down and compile only for the platforms you need to support.

You were right. I compiled for _20 instead of _52. But after the change there's no performance gain; the code is still as slow as before.

I'm using the CUDA 7.0.28 SDK in MS Visual Studio 2010.
I assume that target architecture is adjusted here:
Project Properties -> CUDA C/C++ -> Device -> Code Generation=compute_52,sm_52

So what to do now?

You might want to post self-contained, buildable code that other people can run. I assume the program runs error-free under cuda-memcheck and does in fact deliver identical results on both GPUs? You can easily check for use of local memory by disassembling the binary. What makes you believe that is the problem? For what it's worth, if I compare the use of LDL and STL inside the main loop, it seems largely identical between the sm_20 and sm_52 versions. Register use is similar: the sm_20 version uses 44 registers, the sm_52 version uses 46 registers.
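For reference, the local-memory check described above can be done from the command line (this assumes the CUDA toolchain is on the PATH; the file names are placeholders):

```shell
# Ask ptxas to report per-kernel register, spill, and local-memory usage
nvcc -arch=sm_52 -Xptxas -v -c kernel.cu

# Disassemble the embedded machine code and look for LDL/STL
# (local-memory load/store) instructions in the main loop
cuobjdump -sass app.exe | grep -E "LDL|STL"
```

On Windows, pipe into findstr instead of grep.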

Glancing over the code, it seems to rely heavily on integer division. Integer division is emulated on GPUs, relying heavily on integer multiplication, and on Maxwell most integer multiplications are themselves emulated. While I do not see how this could affect performance to the tune of a factor of 5, it might have some negative impact. Looking at the 64-bit division subroutine in the disassembly, it consists of 69 instructions on sm_20 but 198 instructions on sm_52; however, the throughput of the individual instructions in the sm_52 version is much higher, so in terms of clock-for-clock performance it should be about equal.

In general I would avoid the use of 'char' variables; the natural operand size of the hardware is 32 bits, so 'int' or 'unsigned int' is preferred. Unlike x86 processors, GPUs require natural alignment of all data, so simply converting a pointer to 'char' into a pointer to 'int' and then dereferencing it can lead to undefined behavior. The same applies to the later conversion to a 'uint4' pointer.

If you want to stay with 'char', consider turning set into a union of a uint4 and an array of 'unsigned char'; this takes care of the alignment. BTW, the signedness of the type 'char' is implementation-defined in C/C++, so where 'char' is used as a minimum-size integer one should always explicitly make it 'unsigned char' or 'signed char'.

While this posted use case of generating permutations is application-specific, in general factorial decomposition is a much better approach for generating permutations on parallel architectures.

I (with optimization help from Norbert) posted code about a year ago which generates and evaluates all permutations of an array.

The code is posted as well, with a specific example.

I have yet to find a faster implementation: it generates and evaluates all 13! array permutations in just over 1.3 seconds on a single reference Titan X GPU.

The code is verified against a serial CPU version using std::next_permutation(), which takes about 129 seconds on an overclocked 4.5 GHz CPU for the exact same problem of permuting and evaluating 13 elements.

The highest number of elements I have tested was 17 which took a couple of hours.

NOTE: it will only run on compute capability 3.5 and up due to the use of __shfl(). Optimized for Maxwell.

Also, for those interested in other brute-force CUDA implementations, I am very proud of my Magic Square solver, which can generate and evaluate 7^16 (33,232,930,569,601) possible arrangements of a game board in about 197.9 seconds on a single reference GTX Titan X.

Working on the 5x5 version of that problem currently as a hobby project.

I think __shfl() works on cc3.0 also.

Ok, so here is the complete source code. Speed is 177 mil/s on my GTX 960, and it was 880 mil/s on the old GTX 560 Ti.
The program requires the file permutation_states.dat (included in the archive). It's a database of 14,200 starting combination states for save/resume. Generated sequences should start with:
%+,}){_/-(>< %+,-]*}){_/[(><

Please, someone take a look at it.
Download here: deleted

Keep in mind that code which was optimized for a particular architecture will probably need to be updated for newer architectures.

Also, it is risky to download a .rar from a non-secure site, so I recommend putting the source up in a temporary post on Pastebin or GitHub.

Here is another download link: deleted

Has anyone tried the source code? What speed do you get, and on what GPU? Please help.

Yes, I tried the source code. I got about 199 mil/s on a Quadro 1000M. In that case I had a CUDA 7 system with VS2013 and the project imported correctly. I was in the process of testing it on a CUDA 7.5 RC system with VS2013 and a GTX 460M, but the project didn't import correctly, so I need to recreate the project. I don't have a GTX 560, so the GTX 460M is probably the closest I can come to it.

Then other priorities emerged. Unfortunately I don’t have hours and hours to spend looking at this. If time permits I will revisit it. Also, I do not have a GTX 960 to test on, closest I could come to it would be a GTX 970. So I don’t have the data, but even if I can get some measurements, I don’t know how meaningful they would be.

Also, based on my attempt it seemed to only compile correctly as a 32-bit project. I tried x64 but it choked during compile and I didn’t bother to investigate why.

If you made a standalone command line executable for benchmarking purposes that would run either on windows or linux I could make more rapid progress. The GUI is not relevant to this discussion, I don’t think. But nevertheless I’m not making any promises.

It's solved. It suddenly sped up: I deleted the cached kernels from the hard drive and also made some changes to the project properties (I don't remember which), and the code now runs significantly faster.

Guys thanks anyway for the effort.