device-based memcpy()

Hi,

I would like to code up a function devStrncpy(), similar to strncpy(), but that runs ON THE DEVICE. For my purposes, devStrncpy() can be very primitive and inefficient, e.g. it can run in a single thread. The primary goal of this devStrncpy() is to report errors from the device to the host code, in an assert-like manner. On the basis of this devStrncpy() I would like to define, WITHIN THE DEVICE CODE:

#define STRINGIFY(x) #x
#define TO_STRING(x) STRINGIFY(x)

#define REPORT_ERROR(errormsg, pError) devStrncpy(pError, errormsg __FILE__ ", line " TO_STRING(__LINE__), MAX_ERROR_STRLEN);

In my approach the host code:

  1. allocates device memory for pError, initializes it with zeros;
  2. invokes the kernel;
  3. copies the contents of pError from the device back to the host;
  4. examines pError and reports an error if pError[0] != 0 (i.e. the device code has set the error string).

while the device code does something like:

if (i<0)
    REPORT_ERROR("i must be positive here", pError);
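For concreteness, the host side of this plan would be roughly the following (just a sketch; myKernel is a placeholder for my actual kernel, and MAX_ERROR_STRLEN is the buffer size used in the macro above):

#include <stdio.h>
#include <cuda_runtime.h>

#define MAX_ERROR_STRLEN 256

__global__ void myKernel(char *pError) { /* calls REPORT_ERROR(..., pError) on failure */ }

int main()
{
  char *dError;
  cudaMalloc((void**)&dError, MAX_ERROR_STRLEN);
  cudaMemset(dError, 0, MAX_ERROR_STRLEN);            // 1. allocate and zero

  myKernel<<<1, 32>>>(dError);                        // 2. launch the kernel

  char hError[MAX_ERROR_STRLEN];
  cudaMemcpy(hError, dError, MAX_ERROR_STRLEN,
             cudaMemcpyDeviceToHost);                 // 3. copy pError back

  if (hError[0] != 0)                                 // 4. report if the device set it
    fprintf(stderr, "device error: %s\n", hError);

  cudaFree(dError);
  return 0;
}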

I coded up a very simple devStrncpy, which “kind of” works. However, depending on what’s AROUND the invocation of my REPORT_ERROR macro, the compiler can generate the following error:

Error: Unaligned memory accesses not supported

This is in addition to numerous

Advisory: Cannot tell what pointer points to, assuming global memory space

messages, which show up when I try to cast char* to unsigned* inside devStrncpy() in order to avoid unaligned memory accesses.
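For reference, my current devStrncpy is along these lines: a plain byte-by-byte copy executed by whichever single thread calls it (not my exact code, just the idea):

__device__ char *devStrncpy(char *dst, const char *src, int n)
{
  int i;
  for (i = 0; i < n && src[i] != '\0'; i++)   // copy up to n chars or the terminator
    dst[i] = src[i];
  for (; i < n; i++)                          // pad the remainder with zeros, like strncpy
    dst[i] = '\0';
  return dst;
}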

I've done some forum searching and found no memcpy()/strncpy() that can be used on the DEVICE. Could some CUDA expert please provide some pointers (better yet, example code) on how any of these functions can be implemented?

Thanks a lot in advance!

The very tricky part is the fact that your string constant "i must be positive here" is not characters, nor even an array.

It’s a POINTER to a character array. And worse, it’s a pointer to HOST memory!

But the CUDA compiler realizes that and seems to replace the pointer with 0.

BUT…

There's one hack. "hello" is a pointer to a constant array. But "hello"[0] is a character of that constant array, and the CUDA compiler is smart enough to do the evaluation at compile time, so char c = "hello"[0]; is effectively the same as char c = 'h';

Now if we unroll loops, and make sure we know the character array is at least some minimum length, we can do something like:

for (int i=0; i<MAX_ERROR_LEN; i++) error[i]="This is an error"[i];

But we must make strings a minimum length, or the compiler gags at "hello"[7], which can't be evaluated.

So… we can pad all strings with lots of tail 0s. And we can do that with the nice auto-concatenation of string constants in C.

We just append "\0\0\0\0\0\0\0\0\0\0" to the end, guaranteeing the character string is nice and long.

Wrap it up in a macro:

#define MAX_ERROR_LEN 40

#define PAD_NULL "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"

#define ERROR_REPORT(errptr, errmsg) \
  do { for (int i=0; i<MAX_ERROR_LEN; i++) errptr[i]=(errmsg PAD_NULL)[i]; } while (0)

You can do even more than this and pass back variables by value as well, which can be useful, but then you have to start coming up with rules for identifying them. One simple format could include exactly one integer, one float, and one character string: crude, but enough to send yourself diagnostic error messages like

1.000222 47 Exceeded tolerance after loops

You could easily get fancier if you needed more info.
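As a sketch of that one-int/one-float/one-string format (the struct and macro names here are made up), the report buffer could be a small struct instead of a bare char array:

struct ErrorReport {
  int   ival;
  float fval;
  char  msg[MAX_ERROR_LEN];
};

#define ERROR_REPORT_VALS(errptr, errmsg, i_, f_)                              \
  do {                                                                         \
    (errptr)->ival = (i_);                                                     \
    (errptr)->fval = (f_);                                                     \
    for (int i=0; i<MAX_ERROR_LEN; i++) (errptr)->msg[i]=(errmsg PAD_NULL)[i]; \
  } while (0)

The host copies the whole struct back and prints the three fields to get a line like the one above.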

There are lots of problems (for example, if more than one thread reports an error, their reports will overwrite each other).

You could just live with this, or perhaps use global atomics to guarantee unique reports.
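For the atomics idea, here's a rough sketch (it assumes a card with global atomics, i.e. compute capability 1.1 or later, plus an extra int flag that the host zeroes alongside the message buffer); only the thread that wins the atomicCAS writes its message:

#define ERROR_REPORT_ONCE(errptr, flagptr, errmsg)                             \
  do {                                                                         \
    if (atomicCAS((flagptr), 0, 1) == 0)                                       \
      for (int i=0; i<MAX_ERROR_LEN; i++) (errptr)[i]=(errmsg PAD_NULL)[i];    \
  } while (0)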

Here's some working, ready-to-compile code for you:

#include <stdlib.h>
#include <stdio.h>
#include <cutil.h>

#define MAX_ERROR_LEN 40
#define PAD_NULL "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"

#define ERROR_REPORT(errptr, errmsg) \
  do { for (int i=0; i<MAX_ERROR_LEN; i++) errptr[i]=(errmsg PAD_NULL)[i]; } while (0)

__global__ void
testKernel( float* g_idata, float* g_odata, char *errormsg)
{
  const unsigned int tid = threadIdx.x;
  if (tid==9) ERROR_REPORT(errormsg, "Howdy There");
}

int
main( int argc, char** argv)
{
  CUT_DEVICE_INIT(argc, argv);

  unsigned int num_threads = 32;
  unsigned int mem_size = sizeof( float) * num_threads;

  float* h_idata = (float*) malloc( mem_size);
  for( unsigned int i = 0; i < num_threads; ++i)
    h_idata[i] = (float) i;

  float* d_idata;
  CUDA_SAFE_CALL( cudaMalloc( (void**) &d_idata, mem_size));
  CUDA_SAFE_CALL( cudaMemcpy( d_idata, h_idata, mem_size,
                cudaMemcpyHostToDevice) );

  float* d_odata;
  CUDA_SAFE_CALL( cudaMalloc( (void**) &d_odata, mem_size));

  char *h_errormsg = (char *) calloc(MAX_ERROR_LEN, 1);
  char *d_errormsg;
  CUDA_SAFE_CALL( cudaMalloc( (void**) &d_errormsg, MAX_ERROR_LEN));
  CUDA_SAFE_CALL( cudaMemcpy( d_errormsg, h_errormsg, MAX_ERROR_LEN,
                cudaMemcpyHostToDevice) );

  // setup execution parameters
  dim3  grid( 1, 1, 1);
  dim3  threads( num_threads, 1, 1);

  // execute the kernel
  testKernel<<< grid, threads, mem_size >>>( d_idata, d_odata, d_errormsg);

  // check if kernel execution generated an error
  CUT_CHECK_ERROR("Kernel execution failed");

  // allocate mem for the result on host side
  float* h_odata = (float*) malloc( mem_size);

  // copy result from device to host
  CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, sizeof( float) * num_threads,
                cudaMemcpyDeviceToHost) );

  CUDA_SAFE_CALL( cudaMemcpy( h_errormsg, d_errormsg, MAX_ERROR_LEN,
                cudaMemcpyDeviceToHost) );

  printf("Error msg: %s\n", h_errormsg);

  CUT_EXIT(argc, argv);
}

SPWorley,

Thank you for a very informative reply!

I haven't had a chance to try your code yet, but I almost certainly will.

Two questions:

  1. Is there a way to turn your ERROR_REPORT macro into a function (obviously, a device one)? I know that nvcc inlines device functions, so there's not much difference between a macro and a device function, but from a style perspective a function would look nicer (IMHO).

  2. Regarding the usage of PAD_NULL in the ERROR_REPORT macro, why not just add the following as a second statement inside the loop over i:

if (!errptr[i]) break;

With that if(), the padding seems no longer necessary (see the sketch below).
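That is, the macro would become something like (just a sketch of what I mean):

#define ERROR_REPORT(errptr, errmsg) \
  do { for (int i=0; i<MAX_ERROR_LEN; i++) { errptr[i]=(errmsg)[i]; if (!errptr[i]) break; } } while (0)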

Thank you again!

Probably you could, since it’s all inlined and unrolled anyway. Give it a try!
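A minimal, untested sketch of what I'd try; whether it still boils down to compile-time constant stores depends entirely on the compiler inlining the call and folding the string-literal indexing:

__device__ void devReportError(char *errptr, const char *errmsg)
{
  for (int i = 0; i < MAX_ERROR_LEN; i++)   // same fixed-length copy as the macro
    errptr[i] = errmsg[i];
}

// used as: devReportError(errormsg, "Howdy There" PAD_NULL);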

Give it a try, but this likely won’t work. Remember on the device you don’t really have an array you’re reading from!

I believe the above hack works only because the loop is unrolled, and becomes a bunch of simple statements like

errorpntr[0]='H';

errorpntr[1]='e';

errorpntr[2]='l';

...

This all depends on the compiler doing the array indexing at compile time; it likely won't work at runtime.

But give it a try, you have working code right there to start with.

This is definitely undocumented, compiler-specific behavior; CUDA certainly doesn't promise what a device function should do when given a host pointer to a constant array (is it even really a host pointer?).