My Fermi kernel is failing to execute properly, but when I insert a printf() or compile with -G to use cuda-gdb the problem goes away (but the kernel runs too slowly).
When I run without the printf() and without compiling with -G, I get the uninformative ‘unspecified launch error’ message.
When I run under cuda-gdb (despite not compiling with -G) it bails with the complaint ‘Program received signal CUDA_EXCEPTION_6, Warp Misaligned Address.’ This is probably the root issue, although I don’t understand exactly what the complaint means. Without -G, the debugger gives me only the program counter where the error was encountered, and I know of no way to trace that back to a location in my source code.
Any help is appreciated.
-Steve Fischer
You should simply align your accesses to GPU global memory. An address is aligned when it is a multiple of the size of the object
you are reading or writing; e.g. if you want to read or write an integer, the address must be a multiple of 4. Reading
or writing a char is always aligned.
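To make the rule concrete, here is a minimal sketch of an alignment check. The names is_aligned and HD are my own and not part of the CUDA API; HD is only there so the same helper compiles as plain C++ and as a host/device function under nvcc.

```cpp
#include <cstddef>
#include <cstdint>

// HD is a hypothetical convenience macro: it expands to the CUDA
// qualifiers under nvcc and to nothing in a plain C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Returns true when the address p is a multiple of `alignment` bytes.
HD inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For an int you would call is_aligned(ptr, sizeof(int)) before casting a char* to int*; with an alignment of 1 (a char) the check always passes.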
Suppose you have a large buffer called dummySpace allocated with cudaMalloc. This pseudo kernel code lays the data out in a way that would probably result in
CUDA_EXCEPTION_6, Warp Misaligned Address, once the integer is accessed as an int:
// input: a string and an integer; output: a big space with that string and integer in it
__global__ void func(char* stringInput, int stringSize, int* integerInput, char* dummySpace)
{
    // dummySpace is created by cudaMalloc, so it is aligned to at least 256 bytes
    int counter = 0;
    for (int i = 0; i < stringSize; i++)
        dummySpace[counter++] = stringInput[i]; // copying chars one at a time; sizeof(char) is 1, so these are always aligned
    // Problem: the first loop has advanced counter by stringSize, which is unknown,
    // so the integer may land at an address that is not a multiple of 4.
    // The byte-wise copy itself is aligned, but accessing that location as an int
    // (e.g. *(int*)(dummySpace + stringSize)) is a misaligned access.
    for (int i = 0; i < sizeof(int); i++)
        dummySpace[counter++] = ((char*)integerInput)[i];
}
The fixed one:
__global__ void func(char* stringInput, int stringSize, int* integerInput, char* dummySpace)
{
    int counter = 0;
    for (int i = 0; i < stringSize; i++)
        dummySpace[counter++] = stringInput[i];
    int sub = counter % 4; // or 8 or 16, depending on the size of the object that follows
    if (sub != 0)
        counter += (4 - sub); // pad up to the next multiple of 4; nothing to add if already aligned
    for (int i = 0; i < sizeof(int); i++)
        dummySpace[counter++] = ((char*)integerInput)[i]; // everything is OK: the integer is stored at an aligned address
}
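The padding step can also be written as a small round-up helper that leaves an already aligned offset unchanged. This is only a sketch under my own assumptions: align_up and HD are names I made up (not CUDA API), and the bit trick is valid only when the alignment is a power of two.

```cpp
#include <cstddef>

// HD is a hypothetical convenience macro: it expands to the CUDA
// qualifiers under nvcc and to nothing in a plain C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Rounds `offset` up to the next multiple of `alignment`.
// Assumption: `alignment` is a power of two (1, 2, 4, 8, 16, ...).
HD inline std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

In the kernel, counter = align_up(counter, sizeof(int)); could stand in for the padding computation. A 5-byte string would then put the integer at offset 8, and an 8-byte string would also put it at offset 8. Whoever reads the buffer back has to apply the same rule to find the integer.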
Hope it helps.