I have problem with this kernel. When I run it (same .exe) it is 5x slower on 960 than on 560Ti. Code was optimized for 560 hw parameters but anyway this slowdown I did not expect.

Here is code:

```
/*
#define PERM_CYCLES 10000
#define NUM_OF_COMBINATIONS (3550*4)
#define PERM_NUMBER_OF_BLOCKS ((NUM_OF_COMBINATIONS + PERM_THREADS_PER_BLOCK-1) / PERM_THREADS_PER_BLOCK)
#define PERM_THREADS_PER_BLOCK 256
#define CUDA_PERMUTE_SIZE 16
struct CUDA_PERM_STATE
{
char set[CUDA_PERMUTE_SIZE];
int p[CUDA_PERMUTE_SIZE + 1];
int i;
};
*/
// permutations based on code from http://www.quickperm.org/
//---------------------------------------
__global__ void CUDA_Generate_Permutations_Inlined (CUDA_PERM_STATE *perm_state, unsigned char *out_buffer)
{
char set[CUDA_PERMUTE_SIZE]; // our permutated 16B sequence
int p[CUDA_PERMUTE_SIZE + 1];
int i;
char tmp;
unsigned int dwEAX, dwEBX, dwECX, dwEDX, dwEDI, dwESI;
unsigned long long uLong;
int j, intIterations = 0;
int index = blockIdx.x * blockDim.x + threadIdx.x; // index
if (index >= NUM_OF_COMBINATIONS) return; // out of data bounds
if (*(unsigned int*)out_buffer >= OUT_BUFFER_SIZE) return; // out of range of output buffer (20 MB)
// state = perm_state[index]; ... reload last permutation state
for (j=0; j < CUDA_PERMUTE_SIZE; j++) {
set[j] = perm_state[index].set[j];
p[j] = perm_state[index].p[j];
}
p[CUDA_PERMUTE_SIZE] = perm_state[index].p[CUDA_PERMUTE_SIZE];
i = perm_state[index].i;
if (i >= CUDA_PERMUTE_SIZE)
i = 1;
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// create permutations - start of main cycle
while (i < CUDA_PERMUTE_SIZE)
{
// fill registers and compute hash
dwEDX = ((unsigned int*)set)[0];
dwECX = ((unsigned int*)set)[1];
dwEBX = ((unsigned int*)set)[2];
dwEAX = ((unsigned int*)set)[3];
dwEAX += dwEBX + dwECX + dwEDX;
dwEBX ^= dwEAX ^ dwECX ^ dwEDX;
if (dwEBX) // zero division
{
dwECX -= dwEAX;
// dwEDX:dwEAX = dwEAX * dwECX;
uLong = (unsigned long long)dwEAX * (unsigned long long)dwECX; // MUL
dwEAX = uLong & 0xFFFFFFFF;
dwEDX = uLong >> 32;
dwEAX ^= dwEDX;
dwEDX += dwECX;
dwESI = dwEAX % dwEBX; // DIV
dwEDI = dwESI * 3;
dwEAX += dwESI;
dwECX ^= dwEDX;
dwEDX ^= dwESI;
dwEAX = dwEAX * dwEDX; // MUL
dwECX ^= dwEDI;
dwECX += dwESI;
if (dwECX) // zero division
{
// dwEDI = dwEAX % dwECX; // IDIV
long long llEAX = dwEAX;
dwEDI = llEAX % int(dwECX); // required for mimic x86 IDIV instruction
dwECX = dwEDI * 0xE4;
if (dwECX == 0xAD249F04) // if passed ECX test write it to the output buffer
{
*(uint4*)&out_buffer [*(unsigned int*)out_buffer] = *(uint4*)set;
*(unsigned int*)out_buffer += 16;
}
}
}
// next permutation
p[i]--;
if (i & 1)
j = p[i];
else
j = 0;
tmp = set[j];
set[j] = set[i];
set[i] = tmp;
for (i=1; !p[i]; i++)
p[i] = i;
if (++intIterations >= PERM_CYCLES) // cycle breaker
break;
}
// perm_state[index] = state; ...save back last permutation state
for (j=0; j < CUDA_PERMUTE_SIZE; j++) {
perm_state[index].set[j] = set[j];
perm_state[index].p[j] = p[j];
}
perm_state[index].p[CUDA_PERMUTE_SIZE] = p[CUDA_PERMUTE_SIZE];
perm_state[index].i = i;
}
//---------------------------------------
void CallKernel_Generate_Permutations (CUDA_PERM_STATE *perm_state, unsigned char *out_buffer)
{
CUDA_Generate_Permutations_Inlined <<<PERM_NUMBER_OF_BLOCKS, PERM_THREADS_PER_BLOCK>>> (perm_state, out_buffer);
}
```

**What kernel does:**

It generates 16-char permutations, compute custom 16B hash from it and if 4B of hash is equal to given constant writes this permutation to output buffer. Probability of finding hash match is very low (1:4bil). Permutation states are in CUDA_PERM_STATE structures for save/resume.

**The problem:**

Global memory access is not a problem. Kernel access global mem only twice at start and at the end. Main cycle (runs 10.000 times) practically do not touch global mem.

NVIDIA Profiler shows 45% occupancy on 560 and great 85% occupancy on 960. Also shows that kernel uses 41 registers on 560 but only 21 registers on 960. I repeat that same exe was used. So I assumed that problem is in used fields which was moved somehow from fast registers to slow local memory.

**Solution?:**

So I rewrite code so that fields set and p was moved to shared memory with coalesced access. Code works properly and profiler shows Shared_Memory_Efficiency = 99,9% so coalescing was OK. But code was even slower than without shared mem. So probably this is not the bottleneck.

I use Visual Studio 2010 on Windows 7 64-bit and Cuda SDK 7.0.28. Target architecture changed to compute_52,sm_52 with no speed improvement.

Here is full source code (Visual Studio 2010):

deleted

Speed is 177 mil/s on my GTX 960. And was 880 mil/s on old GTX 560Ti. Program requires file permutation_states.dat. Included in archive. It’s databaze of 14.200 starting combinations states for save/resume. Generated sequences should start with:

%+,*}){_/-(><
%+,-]*}){_/[(><
@.*#(]

*^/)[±>~<*

@.*+(]^/)[#->~<

@.*+(]

etc.

Now I don’t know how to proceed further. Has someone any idea what can be wrong with this code?

What should be inspected is inner cycle which run for 10.000 times. Code outside the cycle has practically zero impact on performance.