simple program does not work - why?

Hi,

I’ve written a simple program to extract a byte from a constant integer. Here is the code:

#include <stdio.h>
#include <stdlib.h>

#define CHECK_ERROR() do {                                                      \
    cudaError_t err = cudaGetLastError();                                       \
    if (err != cudaSuccess) {                                                   \
        fprintf(stderr, "Cuda error detected in file '%s' in line %i : %s.\n",  \
                __FILE__, __LINE__, cudaGetErrorString(err));                   \
        exit(1);                                                                \
    }                                                                           \
} while (0)

__global__ void my_kernel(unsigned char*);

int main() {
    dim3 grid_dim, block_dim;
    unsigned char* host_char;
    unsigned char* device_char;

    cudaMallocHost((void**)&host_char, 1);
    CHECK_ERROR();
    cudaMalloc((void**)&device_char, 1);
    CHECK_ERROR();

    grid_dim.x = 1;
    grid_dim.y = 1;
    grid_dim.z = 1;
    block_dim.x = 1;
    block_dim.y = 1;
    block_dim.z = 1;

    my_kernel<<<grid_dim, block_dim>>>(device_char);
    CHECK_ERROR();

    cudaMemcpy(host_char, device_char, 1, cudaMemcpyDeviceToHost);
    CHECK_ERROR();

    printf("result: %X\n", *host_char);
}

__global__ void my_kernel(unsigned char* result) {
    unsigned int a = 0x01234567;
    unsigned char* c = (unsigned char*)&a;
    *result = c[1];
}

I would expect the output “result: 45” (on a little-endian machine) or “result: 23” (on a big-endian machine). However, running this on the GPU gives me “result: 0”. Running it in device emulation mode gives me “result: 45”, as I would expect. Does anyone see anything I’m doing wrong?

You have missed “cudaThreadSynchronize()” after your kernel call. So you are copying out the results without ensuring that the kernel has completed execution. That’s why.

Constantinople

I don’t think this is true. From the Guide:
“Any kernel launch, memory set, or memory copy for which a zero stream parameter has been specified begins only after all preceding operations are done, including operations that are part of other streams, and no subsequent operation may begin
until it is done.”

So the cudaMemcpy() only proceeds after the kernel has finished executing.

Yes, Sarnath’s statement is incorrect.
There is an implicit cudaThreadSynchronize() in cudaMemcpy.

To be safe I added the cudaThreadSynchronize() call in after the kernel call, but I’m still getting the same results. So the updated code is:

#include <stdio.h>
#include <stdlib.h>

#define CHECK_ERROR() do {                                                      \
    cudaError_t err = cudaGetLastError();                                       \
    if (err != cudaSuccess) {                                                   \
        fprintf(stderr, "Cuda error detected in file '%s' in line %i : %s.\n",  \
                __FILE__, __LINE__, cudaGetErrorString(err));                   \
        exit(1);                                                                \
    }                                                                           \
} while (0)

__global__ void my_kernel(unsigned char*);

int main() {
    dim3 grid_dim, block_dim;
    unsigned char* host_char;
    unsigned char* device_char;

    cudaMallocHost((void**)&host_char, 1);
    CHECK_ERROR();
    cudaMalloc((void**)&device_char, 1);
    CHECK_ERROR();

    grid_dim.x = 1;
    grid_dim.y = 1;
    grid_dim.z = 1;
    block_dim.x = 1;
    block_dim.y = 1;
    block_dim.z = 1;

    my_kernel<<<grid_dim, block_dim>>>(device_char);
    CHECK_ERROR();

    cudaThreadSynchronize();
    CHECK_ERROR();

    cudaMemcpy(host_char, device_char, 1, cudaMemcpyDeviceToHost);
    CHECK_ERROR();

    printf("result: %X\n", *host_char);
}

__global__ void my_kernel(unsigned char* result) {
    unsigned int a = 0x01234567;
    unsigned char* c = (unsigned char*)&a;
    *result = c[1];
}

Interesting! Thanks for the corrections, guys…

Is this a CUDA 2.0 feature? Sorry about my ignorance…

Implicit thread synchronization has been around since the 0.8 beta (and earlier, I would guess).

Any operation in CUDA that touches global memory (even an async operation) waits in a queue on the GPU and runs after the previous operation finishes. If that operation involves copying to the host, there is an implicit cudaThreadSynchronize().

I think the problem is that you are taking the address of a local variable inside the kernel. Local variables are typically stored in registers, and registers are not addressable, so the cast to unsigned char* and the indexing through it don’t work.