While debugging a CUDA kernel I ran into an odd result, and I have narrowed it down to the following code.
int idx = blockDim.x * blockIdx.x + threadIdx.x; // idx is the Thread ID
char buf[] = "abcdefghijkl"; // Temporary Variable
inputBuffer = buf; // inputBuffer & outputBuffer are of type char*
// NOTE: Endian : LITTLE
if ( idx == 0 ) // only Thread 0
{
    uint32_t threeBytes = *((uint32_t*) &inputBuffer[3]); // Temporary Variable
    outputBuffer[0] = threeBytes & 0xFF; // OUTPUT: a (This is WRONG)
}
else if ( idx == 1 ) // only Thread 1
{
    outputBuffer[1] = (*((uint32_t*) &inputBuffer[3])) & 0xFF; // OUTPUT: d (This is CORRECT)
}
After running the code above, outputBuffer[0] holds ‘a’ and outputBuffer[1] holds ‘d’. The only difference between the two branches is the temporary variable ‘threeBytes’.
Any idea why the results differ, and what is going on underneath?
Also, if I skip the temporary array ‘buf’ and assign the string “abcdefghijkl” to inputBuffer directly, then both outputBuffer[0] and outputBuffer[1] contain ‘d’. This is weird as well; what changed?
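For reference, &inputBuffer[3] is generally not 4-byte aligned, and dereferencing it through a uint32_t* is a misaligned load. C/C++ leaves that undefined, and CUDA requires naturally aligned addresses for multi-byte loads, so that may well be why the two branches disagree. A minimal sketch of an alignment-safe read is below. The helper name load_u32_le is hypothetical (it is not part of the code above), and it is written as plain host C++; in a .cu file it would presumably also need a __device__ qualifier.

```cpp
#include <cstdint>

// Hypothetical helper (not from the original post): build a 32-bit
// little-endian value from four single-byte loads. Byte loads have no
// alignment requirement, so this is safe at any offset into the buffer.
// Assumption: in a CUDA source file, add __host__ __device__ in front.
static inline uint32_t load_u32_le(const char *p)
{
    const unsigned char *b = reinterpret_cast<const unsigned char *>(p);
    return  static_cast<uint32_t>(b[0])
         | (static_cast<uint32_t>(b[1]) << 8)
         | (static_cast<uint32_t>(b[2]) << 16)
         | (static_cast<uint32_t>(b[3]) << 24);
}
```

With this helper, both branches could use load_u32_le(&inputBuffer[3]) & 0xFF, which yields ‘d’ whether or not the result is first stored in a temporary.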
Additional details from the generated PTX code:
.version 1.4
.target sm_12, map_f64_to_f32
// compiled with /usr/local/cuda/open64/lib//be
// nvopencc 3.1 built on 2010-06-08
//-----------------------------------------------------------
// Compiling /tmp/tmpxft_0000461d_00000000-7_gpu1.cpp3.i (/tmp/ccBI#.aP0shD)
//-----------------------------------------------------------
//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
// Target:ptx, ISA:sm_12, Endian:little, Pointer Size:32
// -O0 (Optimization level)
// -g2 (Debug level)
// -m2 (Report advisories)
//-----------------------------------------------------------