CUDA 8 - Thrust bug(?)

Since CUDA 8, I am having trouble with Thrust. I am using a “sort” kernel, launched through the driver API, to sort a buffer of data according to some keys. The whole kernel is as follows:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

#include "CudaCommon.h" //some defines, basic math, etc... MyDataType is defined there

extern "C"
__global__ void sortKernel(
   uint64_t* keys,
   MyDataType* data,
   unsigned int dataSize
)
{
   thrust::sort_by_key(thrust::device, keys, keys + dataSize, data);
}

The size of MyDataType is 48 bytes (12 ints).

The kernel itself is launched as just one instance (grid and block size both equal to 1).
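(sortArgs is set up in the usual driver-API way; d_keys, d_data and dataSize here are placeholders for my actual handles, not the exact names in my code:)

CUdeviceptr d_keys = 0, d_data = 0; // from cuMemAlloc
unsigned int dataSize = 0;          // number of key/value pairs

// cuLaunchKernel expects an array of pointers to the individual kernel arguments
void* sortArgs[] = { &d_keys, &d_data, &dataSize };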

checkCudaErrors(cuLaunchKernel(_sortKernel, 1, 1, 1, 1, 1, 1, 0, 0, sortArgs, nullptr));
checkCudaErrors(cuEventRecord(_kernelSyncEvent, 0));
checkCudaErrors(cuEventSynchronize(_kernelSyncEvent));

This code works OK on CUDA 7.5; on CUDA 8 (both RC and release) it causes CUDA_ERROR_UNKNOWN (on the cuEventSynchronize call).

System specs: W10 x64, i7 4770K, VS2015 (without any further update), 16GB RAM, GTX 780, drivers 369.30 (shipped with CUDA 8), CUDA installer “cuda_8.0.44_win10.exe”.

What is wrong? An error in my code, or a bug in Thrust?

Thanks in advance.

If you want to provide a complete app that I can compile and run and see the error, I’ll take a look as time permits.

What is the purpose of calling sort from the kernel as you have?

Why not call it directly from host code?

Sorry, I cannot provide the whole app. I think the information I’ve provided is pretty much sufficient: the kernel in dispute runs just the thrust code, nothing else, and the data to be sorted is literally just 12 ints wrapped together.

BTW I have not found a way to call thrust::sort from the driver API directly; that’s why it’s wrapped in a kernel…

Thrust depends on the runtime API.

You can intermix driver API and runtime API into the same project.

You cannot use thrust from host code in a driver API-only project. It requires linking with the runtime API libraries, at a minimum.
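For reference, host-side use looks like this minimal sketch, in a project compiled with nvcc and linked against the runtime (MyDataType taken from your post):

#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include "CudaCommon.h" // defines MyDataType, per the original post

void sortOnHost(thrust::device_vector<uint64_t>& keys,
                thrust::device_vector<MyDataType>& data)
{
    // Host-side thrust launches its own kernels through the runtime API,
    // which is why the project must link the runtime libraries.
    thrust::sort_by_key(keys.begin(), keys.end(), data.begin());
}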

Well, that’s why I am calling it from a kernel; that seemed to work OK… at least it did until now…

Yeah, how are you compiling your code, Jofo? I had weird and seemingly random Thrust linking errors because I wasn’t linking against cudadevrt.

In CMAKE (debug info turned on…)

CUDA_COMPILE_PTX(PTX_FILES ${SRC_CUDA} OPTIONS -arch compute_35 -lineinfo -G)

SRC_CUDA contains all the kernels I have in my project. This is then followed by a bunch of scripts copying the kernels where I need them, wrapped in a custom target.

I agree with MutantJohn: calling thrust from device code may impose additional compile requirements.

You can look at any CUDA sample project that calls kernels from device code to get an idea of the needed switches.
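For example, those samples typically compile with switches along these lines (file names illustrative): -rdc=true produces relocatable device code, and -lcudadevrt links the device runtime that device-side kernel launches need.

nvcc -arch=sm_35 -rdc=true mykernels.cu -lcudadevrt -o myapp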

This code was working well in CUDA 7.5, but no longer does in CUDA 8… so my linking must be fine (I guess)

OK, I’ve run into a weird situation… I’ve hacked together a simple demo project which uses similar mechanics to my production code, but it throws CUDA_ERROR_ILLEGAL_INSTRUCTION even on CUDA 7.5, whereas my production code works… although in debug, the sorting kernel runs really slowly even on, let’s say, 35000 items.

Screw it, I will probably switch to a different library for sorting. Any ideas?

Side remark: For CUDA debug builds, the compiler turns off all optimizations, and the resulting code may run ten times slower than the code from a release build, which defaults to -O3, that is, maximum optimization.

I wouldn’t just abandon thrust. Looking at what you have now, that’s not a strong candidate for dynamic parallelism. Just call it from the host.

I cannot call it from the host because I use driver API…

Yeah, I forgot about the debug flag; that explains the speed…

Moreover - correct me if I am wrong, but I think Thrust manages the kernel execution (“dynamic parallelism”) by itself in terms of grid/block size, so it should be sufficient to call just one thread which launches the whole sub-kernel. Or not?

When you call thrust algorithms from device code using thrust::device execution policy, you are leaving the decision up to thrust as to how best to run this code. It means that:

  1. the implementation thrust chooses may vary depending on target architecture and compilation switches
  2. the thrust implementation may vary from thrust version to thrust version

(note that you can override this ambiguity using other thrust execution policies)
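For example, thrust::seq forces the sort to run sequentially in the calling thread, which removes the dynamic-parallelism dependency altogether (at the cost of using a single thread); a minimal sketch based on your kernel:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include "CudaCommon.h" // defines MyDataType, per the original post

extern "C"
__global__ void sortKernelSeq(
   uint64_t* keys,
   MyDataType* data,
   unsigned int dataSize
)
{
   // thrust::seq runs the algorithm sequentially in this one thread:
   // no child kernel launches, so no -rdc=true / cudadevrt requirement.
   thrust::sort_by_key(thrust::seq, keys, keys + dataSize, data);
}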

A little more of an update.

I’ve filed this as a bug with NVIDIA support, and the response so far:

  • first of all, if I want to use dynamic parallelism, I need to compile the PTX with the -rdc=true flag, otherwise it runs sequentially (I suppose that means in a single GPU thread?). But doing this requires linking against cudadevrt.lib using cuLink, which would be insanely stupid - I would then need a path to the library at run time. An example is actually in the documentation here:

http://docs.nvidia.com/cuda/nvrtc/index.html#example-dynamic-parallelism

So I wonder if there is a possible solution using CMake that handles it at program compile time; I will try to experiment with something like the sketch below…
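(An untested guess on my part: just adding the flag to the existing CUDA_COMPILE_PTX call. As far as I can tell this only changes how the PTX is generated - the cudadevrt link would still have to happen when the module is loaded.)

CUDA_COMPILE_PTX(PTX_FILES ${SRC_CUDA} OPTIONS -arch compute_35 -rdc=true -lineinfo -G)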

  • secondly, it is possible to mix the runtime and driver APIs:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#interoperability-between-runtime-and-driver-apis

but from my initial experiments I’ve found that this will not work with thrust: when I modified the code from the example I had provided, it would not compile.

I’ve included thrust and cuda_runtime_api.h in the host code, then called

thrust::sort_by_key(thrust::device, (uint64_t*)g_bufferKeys, ((uint64_t*)g_bufferKeys) + numElements, (MyDataType*)g_bufferData);

The NVIDIA guys even left a note in thrust/system/cuda/detail/sort.inl(203) that the code has to be compiled by nvcc first. So this is probably a no-go.
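What might work instead (my own untested assumption, not something NVIDIA confirmed) is isolating the thrust call in a separate .cu file that nvcc compiles and links into the otherwise driver-API host app, exposing a plain wrapper; the wrapper name is hypothetical:

// thrustSort.cu - compiled by nvcc, linked into the host application
#include <cstdint>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include "CudaCommon.h" // defines MyDataType, per the original post

// Callable from driver-API host code; the pointers are raw device
// addresses cast from the CUdeviceptr handles returned by cuMemAlloc.
extern "C" void thrustSortByKey(uint64_t* keys, MyDataType* data,
                                unsigned int dataSize)
{
    // Raw device pointers are legal with thrust::device on the host side.
    thrust::sort_by_key(thrust::device, keys, keys + dataSize, data);
}

The driver-API side would then call thrustSortByKey((uint64_t*)(uintptr_t)g_bufferKeys, (MyDataType*)(uintptr_t)g_bufferData, numElements) on the same context. I haven’t tried this yet.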

So I’ve altered the example case and added the -rdc=true flag to the PTX compilation, then linked with cudadevrt.lib at run time. It actually linked, but the kernel won’t run; the error is CUDA_ERROR_LAUNCH_FAILED, and there is a message in the console: “temporary_buffer::allocate: get_temporary_buffer failed”. Now what?
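One thing that might be worth trying (an educated guess, not a confirmed fix): when thrust runs inside a kernel, its temporary sort buffer comes from the device malloc heap, which defaults to only 8 MB, so a larger input can make get_temporary_buffer fail. The driver API lets you enlarge that heap before the launch:

// Enlarge the device malloc heap that device-side thrust allocates its
// temporary sort buffer from; 256 MB is only an illustration - size it
// to roughly what sorting your keys plus 48-byte values needs.
checkCudaErrors(cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, (size_t)256 * 1024 * 1024));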