I have come across the following issue when using pointer arithmetic and increment operators on the device. In C compiled on the host, the following loop effectively does nothing: it assigns to the location pointed to by p the value already stored at that location, then increments the pointer and proceeds.
for(i = 0; i < length_of_p_data; i++){
    *p++ = *p;
}
where p is a pointer to some data of length length_of_p_data.
However, when code like this is executed on the device, it has the following effect:
for(i = 0; i < length_of_p_data; i++){
    p[i] = p[i+1];
}
If the device code is compiled in emulation mode (Emu), however, the code has the expected effect:
for(i = 0; i < length_of_p_data; i++){
    p[i] = p[i];
}
The following is a test program that illustrates the problem in a somewhat roundabout way. The program should simply assign each element of g_odata its index. The problem occurs only when a single statement both reads a memory location and assigns to that same location.
This code calls a kernel with one thread in one block and has it assign an array of 32 elements with their corresponding indices.
I’m not sure whether this is supposed to be defined behavior in C or whether it is undefined per the C standard (I suspect it is undefined, since the statement both reads and modifies p with no intervening sequence point), but the results are always the same when run on the CPU and always consistently different when run on the GPU.
[codebox]
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
// includes, project
#include <cutil.h>
////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);
__global__ void
testKernel(float* g_odata) // launched with one thread that processes 32 elements in a loop
{
    int i;
    float temp_data[32];

    // initialize local memory
    for (i = 0; i < 32; i++) {
        temp_data[i] = (float)i;
    }

    // the problematic statement: read and increment in one expression
    float *p_odata = &temp_data[0];
    for (i = 0; i < 32; ++i)
    {
        *p_odata++ = *p_odata;
    }

    p_odata = &temp_data[0];
    for (i = 0; i < 32; ++i)
    {
        g_odata[i] = p_odata[i];
    }
}
void
computeGold(float* reference)
{
    // perform the same computation on the host with pointer arithmetic
    int i;
    float temp_data[32];

    for (i = 0; i < 32; i++) {
        temp_data[i] = (float)i;
    }

    float *p_odata = &temp_data[0];
    for (i = 0; i < 32; ++i)
    {
        *p_odata++ = *p_odata;
    }

    p_odata = &temp_data[0];
    for (i = 0; i < 32; ++i)
    {
        reference[i] = p_odata[i];
    }
}
////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main(int argc, char** argv)
{
    runTest(argc, argv);
    CUT_EXIT(argc, argv);
}
////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void
runTest(int argc, char** argv)
{
    CUT_DEVICE_INIT(argc, argv);

    unsigned int num_threads = 1;
    unsigned int num_elements_per_thread = 32;
    unsigned int mem_size = num_elements_per_thread * num_threads * sizeof(float);

    // allocate device memory for the result
    float* d_odata;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_odata, mem_size));

    // setup execution parameters
    dim3 grid(1, 1, 1);
    dim3 threads(num_threads, 1, 1);

    // execute the kernel
    testKernel<<< grid, threads >>>(d_odata);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // allocate memory for the result on the host side
    float* h_odata = (float*) malloc(mem_size);

    // copy result from device to host
    CUDA_SAFE_CALL(cudaMemcpy(h_odata, d_odata, mem_size,
                              cudaMemcpyDeviceToHost));

    // compute reference solution
    float* reference = (float*) malloc(mem_size);
    computeGold(reference);

    // check if the result is equivalent to the expected solution
    CUTBoolean res = cutComparef(reference, h_odata, num_threads * num_elements_per_thread);
    printf("Test %s\n", (1 == res) ? "PASSED" : "FAILED");

    // cleanup memory
    free(h_odata);
    free(reference);
    CUDA_SAFE_CALL(cudaFree(d_odata));
}
[/codebox]
I am using cuda 2.0 in Windows XP 32bit SP3 on the following device :
[codebox]There is 1 device supporting CUDA
Device 0: “Quadro FX 3600M”
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536543232 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.25 GHz
Concurrent copy and execution: Yes
Test PASSED
Press ENTER to exit…[/codebox]