I have a piece of code with a for loop:
for (i = m_nPlainLenMax - 1; i >= m_nPlainLenMin - 1; i--)
{
if (m_nIndex >= nPlainSpaceUpToX[i])
{
m_nPlainLen = i + 1;
break;
}
}
nPlainSpaceUpToX is a uint64* of 17 elements which is passed to the kernel function from the C code, so the data is the same for all threads.
Since the array is accessed so often, I guess it would be better to copy it to shared memory?
But will it be slower when all threads try to access the same piece of shared memory simultaneously?
What would be the best way to implement this?
How about keeping it in constant memory? It is automatically cached. Also, the GPU doesn’t handle uint64 very fast; do you really need uint64, or can you find a better alternative?
This small piece of code doesn’t look like it can be optimized very much, especially since the array is only 17 elements. It would be better to look at the rest of your algorithm and try to optimize that.
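A minimal sketch of the constant-memory idea (the variable and kernel names here are hypothetical, not from your code): the host copies the 17-element table once with cudaMemcpyToSymbol before launching, and every thread then reads it through the constant cache.

```cuda
// Hypothetical sketch: 17-element lookup table in constant memory.
__constant__ unsigned long long c_nPlainSpaceUpToX[17];

__global__ void FindPlainLen(unsigned long long nIndex,
                             int nPlainLenMin, int nPlainLenMax,
                             int *pResult)
{
    int nPlainLen = 0;
    // Same search as the original loop, but reading from constant memory.
    for (int i = nPlainLenMax - 1; i >= nPlainLenMin - 1; i--)
    {
        if (nIndex >= c_nPlainSpaceUpToX[i])
        {
            nPlainLen = i + 1;
            break;
        }
    }
    *pResult = nPlainLen;
}

// Host side, once before the kernel launch:
// unsigned long long table[17] = { /* ... */ };
// cudaMemcpyToSymbol(c_nPlainSpaceUpToX, table, sizeof(table));
```

Since all threads read the same element in the same iteration, the constant cache broadcasts the value, so this access pattern is the best case for constant memory.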
wumpus
October 26, 2007, 1:18pm
3
Accessing the same place in shared memory from all threads is very fast, as the value is broadcast to all the processors (this is in the programming guide somewhere).
As the previous poster mentions, using constant memory is an even better idea. It is just as fast, and it saves the overhead of copying the data at the beginning of the kernel.
What seems to be the most time-consuming part of my kernel code is the following:
unsigned int nIndexOfX32 = (unsigned int)nIndexOfX;
for (; i >= 0; i--)
{
m_Plain[i] = m_PlainCharset[nIndexOfX32 & (m_nPlainCharsetLen-1)];
nIndexOfX32 /= m_nPlainCharsetLen;
}
i is an int and is initialized to m_nPlainCharsetLen.
m_PlainCharset is an “unsigned char *” passed from the calling C code.
m_nPlainCharsetLen is an “unsigned int” passed from the calling C code.
m_Plain is a local “unsigned char m_Plain[16]” passed to this device function as an “unsigned char *” from the main kernel function.
The full kernel takes ~4400 ms to run.
If I comment out the line below, it takes ~1925 ms:
m_Plain[i] = m_PlainCharset[nIndexOfX32 & (m_nPlainCharsetLen-1)];
I did some more testing and tried replacing the code with the version below. It resulted in a time of ~3365 ms:
unsigned int nIndexOfX32 = (unsigned int)nIndexOfX;
for (; i >= 0; i--)
{
m_Plain[i] = 0x55;
nIndexOfX32 /= m_nPlainCharsetLen;
}
Finally I tried swapping out the other variable, as seen below. It resulted in a time of ~1920 ms:
unsigned int nIndexOfX32 = (unsigned int)nIndexOfX;
for (; i >= 0; i--)
{
unsigned char test = m_PlainCharset[nIndexOfX32 & (m_nPlainCharsetLen-1)];
nIndexOfX32 /= m_nPlainCharsetLen;
}
Does anyone have any clue why that single line of code takes 50% of the execution time?
What can I do to help it out?
Be careful when you comment out innocent-looking lines of code. nvcc (like many modern compilers) is quite smart at detecting and removing unused code. So in your case, if it detects that m_Plain is never used, it may optimize away a lot more of your kernel than you expect, causing this apparent speedup.
The other thing to check is whether any of your registers are being spilled into local memory. You have many arrays, and under some circumstances some of these can get placed in local memory, which is very bad for performance. You can check whether this is happening by getting nvcc to output the cubin file and examining it (the lmem field), or by compiling with the ptxas verbose flag.
In the original code I’m trying to port, there is some assembler code used as a replacement for the commented-out code (shown below).
Is there any way to port this assembler code too?
for (; i >= 0; i--)
{
//m_Plain[i] = m_PlainCharset[nIndexOfX32 % m_nPlainCharsetLen];
//nIndexOfX32 /= m_nPlainCharsetLen;
unsigned int nPlainCharsetLen = m_nPlainCharsetLen;
unsigned int nTemp;
#ifdef _WIN32
__asm
{
mov eax, nIndexOfX32
xor edx, edx
div nPlainCharsetLen
mov nIndexOfX32, eax
mov nTemp, edx
}
#else
__asm__ __volatile__ ( "mov %2, %%eax;"
"xor %%edx, %%edx;"
"divl %3;"
"mov %%eax, %0;"
"mov %%edx, %1;"
: "=m"(nIndexOfX32), "=m"(nTemp)
: "m"(nIndexOfX32), "m"(nPlainCharsetLen)
: "%eax", "%edx"
);
#endif
m_Plain[i] = m_PlainCharset[nTemp];
}
wumpus
October 28, 2007, 10:46am
7
Integer division is extremely slow in CUDA; avoid it at all costs inside loops. Can you find a way around it?