CUDA 8 - Thrust bug(?)

Jofo · September 29, 2016, 3:02pm

Since CUDA 8, I have been having trouble with Thrust. I launch a “sort” kernel through the driver API to sort a buffer of data by key. The whole kernel is as follows:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

#include "CudaCommon.h" //some defines, basic math, etc... MyDataType is defined there

extern "C"
__global__ void sortKernel(
   uint64_t* keys,
   MyDataType* data,
   unsigned int dataSize
)
{
   thrust::sort_by_key(thrust::device, keys, keys + dataSize, data);
}

The size of MyDataType is 48 bytes (12 ints).

The kernel is launched as a single instance (grid size and block size both equal to 1):

checkCudaErrors(cuLaunchKernel(_sortKernel, 1, 1, 1, 1, 1, 1, 0, 0, sortArgs, nullptr));
checkCudaErrors(cuEventRecord(_kernelSyncEvent, 0));
checkCudaErrors(cuEventSynchronize(_kernelSyncEvent));

This code works on CUDA 7.5, but on CUDA 8 (both RC and release) it fails with CUDA_ERROR_UNKNOWN, reported by cuEventSynchronize.

System specs: W10 x64, i7 4770K, VS2015 (without any further update), 16GB RAM, GTX 780, drivers 369.30 (shipped with CUDA 8), CUDA installer “cuda_8.0.44_win10.exe”.

What is wrong? An error in my code, or a bug in Thrust?

Thanks in advance.

Robert_Crovella · September 29, 2016, 4:52pm

If you want to provide a complete app that I can compile and run and see the error, I’ll take a look as time permits.

What is the purpose of calling sort from the kernel as you have?

Why not call it directly from host code?

Jofo · September 29, 2016, 5:05pm

Sorry, I cannot provide the whole app. I think the information I’ve provided is pretty much sufficient: the kernel in question runs just the Thrust code, nothing else, and the data to be sorted is literally just 12 ints wrapped together.

BTW, I have not found a way to call thrust::sort from the driver API directly; that’s why it’s wrapped in a kernel…

Robert_Crovella · September 29, 2016, 5:26pm

Thrust depends on the runtime API.

You can intermix driver API and runtime API into the same project.

You cannot use thrust from host code in a driver API-only project. It requires linking with the runtime API libraries, at a minimum.
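
A minimal sketch of the mixed setup described above, assuming the Thrust call lives in its own .cu file that is compiled with nvcc and linked against the runtime library, while the rest of the application keeps using the driver API. The function name sortOnDevice and the layout of MyDataType are placeholders (the thread only says it is 12 ints), and the sketch assumes the driver-API context that owns both buffers is current on the calling thread.

```cuda
// sort_wrapper.cu (hypothetical file name): compile with nvcc, link with cudart.
#include <cuda.h>                      // CUdeviceptr
#include <cstdint>
#include <cstdio>
#include <exception>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Placeholder for the real type from CudaCommon.h: 12 ints, 48 bytes.
struct MyDataType { int v[12]; };

// Host-side entry point that driver-API code can call with its own
// CUdeviceptr buffers. Returns 0 on success, 1 on failure.
extern "C" int sortOnDevice(CUdeviceptr keys, CUdeviceptr data, unsigned int count)
{
    // A CUdeviceptr is an integer holding a device address; reinterpret it
    // as a typed raw pointer so Thrust can use it with thrust::device.
    uint64_t*   k = reinterpret_cast<uint64_t*>(keys);
    MyDataType* d = reinterpret_cast<MyDataType*>(data);

    try {
        // Host-side call: Thrust picks and launches its own kernels.
        thrust::sort_by_key(thrust::device, k, k + count, d);
    } catch (const std::exception& e) {
        // Thrust reports failures by throwing; do not let the exception
        // cross the extern "C" boundary.
        std::fprintf(stderr, "sort_by_key failed: %s\n", e.what());
        return 1;
    }
    return 0;
}
```

With something like this in place, the driver-API side would call sortOnDevice(keysPtr, dataPtr, count) instead of launching a wrapper kernel, so no device-side Thrust call is involved.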

Jofo · September 29, 2016, 5:36pm

Well, that’s why I am calling it from a kernel. That seems to work OK… at least it did until now…

MutantJohn · September 29, 2016, 10:15pm

Yeah, how are you compiling your code, Jofo? I had weird and seemingly random Thrust linking errors because I wasn’t linking against cudadevrt.

Jofo · September 30, 2016, 6:03am

With CMake (debug info turned on…):

CUDA_COMPILE_PTX(PTX_FILES ${SRC_CUDA} OPTIONS -arch compute_35 -lineinfo -G)

SRC_CUDA contains all the kernels I have in my project. This is followed by a bunch of scripts, wrapped in a custom target, that copy the kernels to where I need them.

Robert_Crovella · September 30, 2016, 1:18pm

I agree with MutantJohn: calling Thrust from device code may require additional compile and link options.

You can look at any CUDA sample project that calls kernels from device code to get an idea of the needed switches.

Jofo · September 30, 2016, 11:28pm

This code was working well in CUDA 7.5, but no longer does in CUDA 8… so my linking must be fine (I guess)

Jofo · October 1, 2016, 1:21pm

OK, I’ve run into a weird situation… I hacked together a simple demo project that uses mechanics similar to my production code, but it throws CUDA_ERROR_ILLEGAL_INSTRUCTION even on CUDA 7.5, whereas my production code works there… although in a debug build the sorting kernel runs really slowly, even on, say, 35000 items.

Screw it, I will probably switch to a different library for sorting, any ideas?

njuffa · October 1, 2016, 1:58pm

Side remark: For CUDA debug builds, the compiler turns off all optimizations, and the resulting code may run ten times slower than the code from a release build, which defaults to -O3, that is, maximum optimization.

MutantJohn · October 1, 2016, 4:00pm

I wouldn’t just abandon thrust. Looking at what you have now, that’s not a strong candidate for dynamic parallelism. Just call it from the host.

Jofo · October 1, 2016, 4:22pm

I cannot call it from the host because I use driver API…

Jofo · October 1, 2016, 4:31pm

Yeah, I forgot about the debug flag, that sorts the speed…

Moreover, correct me if I am wrong, but I think Thrust manages the kernel execution (“dynamic parallelism”) by itself in terms of grid/block size, so it should be sufficient to launch a single thread that calls the whole sub-kernel. Or not?

Robert_Crovella · October 1, 2016, 4:50pm

When you call thrust algorithms from device code using thrust::device execution policy, you are leaving the decision up to thrust as to how best to run this code. It means that:

- the implementation thrust chooses may vary depending on target architecture and compilation switches
- the thrust implementation may vary from thrust version to thrust version

(note that you can override this ambiguity using other thrust execution policies)
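
As a sketch of that last remark, the same kernel could name a specific policy instead of thrust::device. This version uses thrust::seq, which asks Thrust to run the algorithm sequentially in the calling thread, so no child kernels are launched. The kernel name sortKernelSeq and the MyDataType layout are placeholders, and whether a single-threaded sort is fast enough for the real data is untested here.

```cuda
#include <cstdint>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Placeholder for the real type from CudaCommon.h: 12 ints, 48 bytes.
struct MyDataType { int v[12]; };

// Same signature as the original kernel; intended for a 1x1 launch.
extern "C"
__global__ void sortKernelSeq(
   uint64_t* keys,
   MyDataType* data,
   unsigned int dataSize
)
{
   // thrust::seq: sort in this one GPU thread, with no nested kernel launch.
   // thrust::device would instead leave the choice of strategy to Thrust.
   thrust::sort_by_key(thrust::seq, keys, keys + dataSize, data);
}
```

Because the policy is explicit, the behaviour should not depend on whether the build enables dynamic parallelism, at the cost of using a single thread. Thrust may still need scratch memory from the device heap for the sort.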

Jofo · October 2, 2016, 11:16am

A small update.

I’ve filed this as a bug with NVIDIA support, and this is the response so far.

First of all, if I want to use dynamic parallelism, I need to compile the PTX with the -rdc=true flag; otherwise the sort runs sequentially (I suppose that means in a single GPU thread?). But doing this requires linking against cudadevrt.lib using cuLink, which would be insanely stupid, because I would then need a path to the library at run time. An example is in the documentation here:

http://docs.nvidia.com/cuda/nvrtc/index.html#example-dynamic-parallelism

So I wonder whether there is a solution that uses CMake and does the linking at build time. I will experiment with it…
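
For reference, the run-time linking step described above might look roughly like this with the driver API. The helper names check and loadLinkedModule and both path parameters are placeholders; the PTX is assumed to have been built with -rdc=true, and a context is assumed to be current on the calling thread.

```cuda
#include <cuda.h>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: print the driver error name and report failure.
static bool check(CUresult result, const char* what)
{
    if (result == CUDA_SUCCESS) return true;
    const char* name = nullptr;
    cuGetErrorName(result, &name);
    std::fprintf(stderr, "%s failed: %s\n", what, name ? name : "unknown error");
    return false;
}

// Link relocatable PTX against the device runtime library, then load the
// linked image as a module. Both paths are supplied by the caller.
bool loadLinkedModule(const char* ptxPath, const char* devrtPath, CUmodule* moduleOut)
{
    CUlinkState link = nullptr;
    if (!check(cuLinkCreate(0, nullptr, nullptr, &link), "cuLinkCreate"))
        return false;

    void*  image     = nullptr;   // owned by the link state
    size_t imageSize = 0;

    bool ok =
        check(cuLinkAddFile(link, CU_JIT_INPUT_PTX, ptxPath, 0, nullptr, nullptr),
              "cuLinkAddFile(ptx)") &&
        check(cuLinkAddFile(link, CU_JIT_INPUT_LIBRARY, devrtPath, 0, nullptr, nullptr),
              "cuLinkAddFile(cudadevrt)") &&
        check(cuLinkComplete(link, &image, &imageSize), "cuLinkComplete") &&
        // Load before destroying the link state, which frees the image.
        check(cuModuleLoadData(moduleOut, image), "cuModuleLoadData");

    cuLinkDestroy(link);
    return ok;
}
```

After a successful call, cuModuleGetFunction can look up sortKernel in the returned module as before. The path to cudadevrt.lib still has to be known when this runs, which is exactly the drawback raised above.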

Secondly, it is possible to mix the runtime and driver APIs:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#interoperability-between-runtime-and-driver-apis

However, my initial experiments suggest that this does not work with Thrust: when I modified the example I had provided accordingly, it no longer compiled.

I included Thrust and cuda_runtime_api.h in the host code, then called:

thrust::sort_by_key(thrust::device, (uint64_t*)g_bufferKeys, ((uint64_t*)g_bufferKeys) + numElements, (MyDataType*)g_bufferData);

The NVIDIA developers even left a note in thrust/system/cuda/detail/sort.inl(203) saying that the code has to be compiled with nvcc. So this is probably a no-go.

Jofo · October 2, 2016, 12:58pm

So I’ve altered the example case and added the -rdc=true flag to the PTX compilation, then linked with cudadevrt.lib at run time. It actually linked, but the kernel won’t run: the error is CUDA_ERROR_LAUNCH_FAILED, and there was a message in the console, “temporary_buffer::allocate: get_temporary_buffer failed”. Now what?
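
One thing that may be worth checking, although nothing in this thread confirms it is the cause: when Thrust runs in device code, its temporary buffers likely come from the device-side heap, whose default size is 8 MB. Below is a sketch of raising that limit through the driver API before the first kernel launch. The helper name growDeviceHeap is hypothetical, and the right size for this workload is unknown.

```cuda
#include <cuda.h>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: request a larger device-side malloc heap for the
// current context. Call it before launching any kernel that allocates
// from the device heap.
bool growDeviceHeap(size_t bytes)
{
    CUresult result = cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, bytes);
    if (result != CUDA_SUCCESS) {
        const char* name = nullptr;
        cuGetErrorName(result, &name);
        std::fprintf(stderr, "cuCtxSetLimit failed: %s\n", name ? name : "unknown error");
        return false;
    }

    // Read the limit back; the driver may adjust the requested value.
    size_t actual = 0;
    if (cuCtxGetLimit(&actual, CU_LIMIT_MALLOC_HEAP_SIZE) == CUDA_SUCCESS)
        std::printf("device heap limit is now %zu bytes\n", actual);
    return true;
}
```

For example, growDeviceHeap(256u * 1024u * 1024u) would request 256 MB, an arbitrary figure chosen only for illustration. If the error persists with a generous limit, the heap is probably not the problem.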
