Problems on 650M, code works on C2050.

This may not be quite the right place for this problem; if not, please let me know so I can post to the right subforum.

I’m trying to develop a framework for maximum-likelihood fits, in which the event probability evaluations are done on a GPU. I have some code that works on two separate C2050 boxes; I’ve also had a chance to test it on the new Kepler, and it worked there as well.

However, when I try running the same code on a laptop with a 650M, it crashes - both on my own laptop with Ubuntu 12.04, and on my colleague’s MacBook. Does anyone know of a difference between the laptop and desktop cards that might account for this?

I’ve had a look with cuda-gdb, but I must confess I don’t find the information very enlightening:

Error code 13 (invalid device symbol) at /home/rolfa/release_09Nov2012/FPOINTER/ThrustPdfFunctor.cu, 663
========= Program hit error 13 on CUDA API call to cudaMemcpyFromSymbol
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/nvidia-current/libcuda.so [0x24e199]
=========     Host Frame:/usr/local/cuda/lib/libcudart.so.5.0 (cudaMemcpyFromSymbol + 0x31a) [0x3a0ca]

The code which causes this invalid symbol error looks like so:

metricIndex = num_device_functions;
   void* dummy[1];
   std::cout << "Copying to " << localPtr << std::endl;
   cutilSafeCall(cudaMemcpyFromSymbol(dummy, localPtr.c_str(), sizeof(void*))); // Line 663
   host_function_table[num_device_functions] = dummy[0];
   functionNameToDeviceIndexMap[localPtr] = num_device_functions;
   num_device_functions++;
   cutilSafeCall(cudaMemcpyToSymbol(device_function_table, host_function_table, num_device_functions*sizeof(void*)));

where ‘localPtr’ is a string naming the symbol to be copied. In this case it has the value “ptr_to_NLL”, and this variable is declared earlier in the same file:

typedef fptype (*device_metric_ptr) (fptype, fptype*, unsigned int);
__device__ fptype calculateNLL (fptype rawPdf, fptype* evtVal, unsigned int par) {
  rawPdf *= normalisationFactors[par];
  return rawPdf > 0 ? -LOG(rawPdf) : 0;
}
// (...)
__device__ device_metric_ptr ptr_to_NLL          = calculateNLL;

cuda-gdb reports that the value of this pointer is zero. Could there be some difference in the way that global device variables are treated between the mobile and desktop versions of the drivers?

If anyone would like to try to reproduce this on their own systems, here is the code, for Mac and Ubuntu:

http://www.physics.uc.edu/~rolfa/GooFit_05Dec2012_standalone_Mac.tar.gz
http://www.physics.uc.edu/~rolfa/GooFit_05Dec2012_standalone.tar.gz

To install, just unpack, possibly edit the Makefile so ‘CUDALOCATION’ points to your install, run ‘gmake’, set LD_LIBRARY_PATH to include the subdirectory ‘rootstuff’, and run ‘gtest’.

Are all these platforms running the same version of CUDA? I notice that the cudaMemcpyFromSymbol() call is using a string argument. This was deprecated at least as far back as CUDA 4.2, and is no longer supported at all in CUDA 5.0. I am a bit puzzled that you write the code works on Kepler, as I would assume Kepler to run with CUDA 5.0.

In any event, the removal of string argument support in cudaMemcpyFromSymbol() should result in a compile time error rather than a run time issue. “error 13” looks like a segfault to me, i.e. an out-of-bounds access on the host side.

Does the code check the status of every CUDA API call leading up to the cudaMemcpyFromSymbol() ? In particular, did the allocation of the device variable that you are trying to copy from succeed? From what you write above, the code fails because it is trying to copy from a null pointer, which would indicate that the device-side variable was never allocated.

Hello,

Please post also the cuda toolkit versions.

Sorry, I was reading too fast. Apparently the error code 13 is a CUDA error code (“invalid device symbol”), which would seem to point to the string argument to cudaMemcpyToSymbol() as the source of the problem.

I am wondering: Is there a particular reason to use cudaMemcpytoSymbol() here? For memory allocated with cudaMalloc() etc, a simple cudaMemcpy() on the pointer would suffice. The only instance in which I have ever used cudaMemcpyToSymbol() is for updating constant data.

A quick look at your Makefile reveals:

CXXFLAGS = -O3 -arch=sm_20

That’s going to be a problem for the Kepler sm_30 650M.

Shouldn’t there be backwards compatibility?

Thanks for the tips; sorry for my slow response.

@pasoleatis: It is 5.0 for the laptop (where I see the crash); 4.1 for the desktop (where it works).

@allanmac: I tried changing the target architecture to sm_30. This changed the character of the crash. Instead of an error message, I now get a black screen requiring a full reboot of the laptop. Impressive! I’m actually somewhat happy with this, because now my Ubuntu laptop’s behaviour matches that of the MacBook belonging to my colleague.

@njuffa: I’m using cudaMemcpyFromSymbol because my target is in constant memory - it wasn’t cudaMalloc’ed.

On the removal of string support for CUDA 5.0, that’s kind of a problem. I have an architecture that relies on this kind of code:

__device__ fptype device_Gaussian // ...

__device__ device_function_ptr ptr_to_Gaussian = device_Gaussian;

// ...

  initialise(pindices, "ptr_to_Gaussian");

The call to initialise occurs in many different places; I would prefer not to have to do lookups of the device-side pointer before every such call. That’s what the initialise method is for! Is there a reasonable way to duplicate this functionality without using the string? (That aside, I also need to check whether this is in fact what’s causing the problem. If not, I may need to change my code, but I’d like to fix the crash first.)

The API variants with string arguments were removed in CUDA 5.0, and I have also had to adjust some code to account for this. In general the workarounds will differ on a case-by-case basis, I am not aware of a universal recipe. In one app that created the strings dynamically, but was limited to a finite number of strings, I used what amounts to a big switch statement as I recall.

You may want to consider filing a bug regarding your use case. I assume there were strong reasons to remove the string-based API variants, so not sure whether this would help. I don’t know how the string-based APIs were implemented, but presumably they did a string-based lookup of a pointer, so I would not expect a noticable performance difference to doing the same thing at app level (as part of a wrapper function maybe).

After some testing I can say with pretty high confidence that those strings are the issue. Thanks for the help. :)