Performance optimization?

I have a piece of code with a for loop.

for (i = m_nPlainLenMax - 1; i >= m_nPlainLenMin - 1; i--)
{
	if (m_nIndex >= nPlainSpaceUpToX[i])
	{
		m_nPlainLen = i + 1;
		break;
	}
}

nPlainSpaceUpToX is a uint64* of 17 elements which is passed to the kernel function from the C code, so the data is the same for all threads.

Since the variable is accessed so often, I guess it would be better to copy it to shared memory?

But will it be slower when all threads try to access the same piece of shared memory simultaneously?

What would be the best way to implement this?

How about keeping it in constant memory? It is automatically cached. Also, the GPU doesn't handle uint64 very fast; do you really need uint64, or can you find a better alternative?
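A minimal sketch of the constant-memory suggestion, assuming a table of 17 uint64 entries as described above (the names `d_nPlainSpaceUpToX` and `FindPlainLen` are made up for illustration, not the poster's actual code):

```cuda
#include <stdint.h>

// Constant memory is cached and broadcast efficiently when all
// threads of a warp read the same element, as they do here.
__constant__ uint64_t d_nPlainSpaceUpToX[17];

// Host side, once before the kernel launch:
// cudaMemcpyToSymbol(d_nPlainSpaceUpToX, h_table, 17 * sizeof(uint64_t));

__device__ int FindPlainLen(uint64_t nIndex, int nLenMin, int nLenMax)
{
	// Same search as the original loop, reading from constant memory.
	for (int i = nLenMax - 1; i >= nLenMin - 1; i--)
	{
		if (nIndex >= d_nPlainSpaceUpToX[i])
			return i + 1;
	}
	return nLenMin; // fallback when no entry matches
}
```

The advantage over shared memory is that there is no per-block copy at kernel start; the data is uploaded once from the host.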

This small piece of code doesn’t look like it can be optimized very much, especially when the array is only 17 elements. It would be better to look at the rest of your algorithm and try to optimize that.

Accessing the same place in shared memory from all threads is very fast, as the value is broadcast to all the processors (this is in the programming guide somewhere).

As the previous poster mentions, using constant memory is an even better idea. It is just as fast, and saves the overhead of copying it at the beginning of the kernel.

What seems to be the most time consuming in my kernel code is the following:

unsigned int nIndexOfX32 = (unsigned int)nIndexOfX;
for (; i >= 0; i--)
{
	m_Plain[i] = m_PlainCharset[nIndexOfX32 & (m_nPlainCharsetLen - 1)];
	nIndexOfX32 /= m_nPlainCharsetLen;
}

i is an int equal to m_nPlainCharsetLen.

m_PlainCharset is an "unsigned char *" passed from the calling C code.

m_nPlainCharsetLen is an "unsigned int" passed from the calling C code.

m_Plain is a local "unsigned char m_Plain[16]" passed to this device function as an "unsigned char *" from the main kernel function.

The full kernel code takes ~4400 ms to run.

If I comment out the line below, it takes ~1925 ms:

m_Plain[i] = m_PlainCharset[nIndexOfX32 & (m_nPlainCharsetLen - 1)];

I did some more testing and tried replacing the code with the version below. It resulted in a time of ~3365 ms.

unsigned int nIndexOfX32 = (unsigned int)nIndexOfX;
for (; i >= 0; i--)
{
	m_Plain[i] = 0x55;
	nIndexOfX32 /= m_nPlainCharsetLen;
}

Finally I tried swapping out the other variable, as seen below. It resulted in a time of ~1920 ms.

unsigned int nIndexOfX32 = (unsigned int)nIndexOfX;
for (; i >= 0; i--)
{
	unsigned char test = m_PlainCharset[nIndexOfX32 & (m_nPlainCharsetLen - 1)];
	nIndexOfX32 /= m_nPlainCharsetLen;
}

Does anyone have any clue why that single line of code takes 50% of the execution time?

What can I do to help it out?

Be careful when you comment out innocent-looking lines of code. nvcc (like many modern compilers) is quite good at detecting and removing unused code. So in your case, if it detects that m_Plain is never used, it may optimize away much more of your kernel than you expect, causing this apparent speedup.
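One way to keep the timing honest is to make sure the result is observable, so dead-code elimination cannot remove the loop. A sketch, not the poster's actual kernel (`d_out` is an assumed output buffer):

```cuda
// If m_Plain feeds nothing observable, nvcc may delete the loop that
// fills it, and the whole computation behind it. Writing a value
// derived from m_Plain to global memory keeps everything live.
__global__ void TimingKernel(unsigned char *d_out /* assumed output buffer */)
{
	unsigned char m_Plain[16];

	// ... the loop being timed fills m_Plain here ...

	unsigned int sum = 0;
	for (int i = 0; i < 16; i++)
		sum += m_Plain[i]; // consume every element

	d_out[blockIdx.x * blockDim.x + threadIdx.x] = (unsigned char)sum;
}
```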

The other thing to check is whether any of your registers are being spilled into local memory. You have many arrays, and under some circumstances these can be placed in local memory, which is very bad for performance. You can check whether this is happening by having nvcc emit the cubin file (or by compiling with --ptxas-options=-v, which reports local memory usage) and examining the output.

In the original code I'm trying to port, there is some assembler code (shown below) as a replacement for the code above.

Is there any way to port this assembler code too?

for (; i >= 0; i--)
{
	//m_Plain[i] = m_PlainCharset[nIndexOfX32 % m_nPlainCharsetLen];
	//nIndexOfX32 /= m_nPlainCharsetLen;

	unsigned int nPlainCharsetLen = m_nPlainCharsetLen;
	unsigned int nTemp;
#ifdef _WIN32
	__asm
	{
		mov eax, nIndexOfX32
		xor edx, edx
		div nPlainCharsetLen
		mov nIndexOfX32, eax
		mov nTemp, edx
	}
#else
	__asm__ __volatile__ ("mov %2, %%eax;"
			"xor %%edx, %%edx;"
			"divl %3;"
			"mov %%eax, %0;"
			"mov %%edx, %1;"
			: "=m"(nIndexOfX32), "=m"(nTemp)
			: "m"(nIndexOfX32), "m"(nPlainCharsetLen)
			: "%eax", "%edx");
#endif
	m_Plain[i] = m_PlainCharset[nTemp];
}

Integer division is extremely slow in CUDA; avoid it at all costs inside loops. Can you find a way around it?