CUDA thrust compilation error

Hello. I am new to CUDA. I have a working C++ code that I want to port to CUDA. The GitHub repository for the project is here: https://github.com/Julien-Livet/TestIntegerCuda. Compilation gives me three errors (see below) that I don't understand, so I cannot manage to solve them. Can you help me, please?

/usr/include/thrust/system/cuda/detail/internal/copy_cross_system.h(107): error: reinterpret_cast cannot cast away const or other type qualifiers
          detected during:
            instantiation of "OutputIt thrust::cuda_cub::__copy::cross_system_copy_n(thrust::execution_policy<System1> &, thrust::execution_policy<System2> &, InputIt, Size, OutputIt, thrust::detail::true_type) [with System1=thrust::system::cpp::detail::tag, System2=thrust::cuda_cub::tag, InputIt=const unsigned long long *, Size=std::ptrdiff_t, OutputIt=thrust::device_ptr<const unsigned long long>]" 
(222): here
            instantiation of "OutputIt thrust::cuda_cub::__copy::cross_system_copy_n(thrust::cuda_cub::cross_system<System1, System2>, InputIt, Size, OutputIt) [with System1=thrust::system::cpp::detail::tag, System2=thrust::cuda_cub::tag, InputIt=const unsigned long long *, Size=std::ptrdiff_t, OutputIt=thrust::device_ptr<const unsigned long long>]" 
(238): here
            instantiation of "OutputIterator thrust::cuda_cub::__copy::cross_system_copy(thrust::cuda_cub::cross_system<System1, System2>, InputIterator, InputIterator, OutputIterator) [with System1=thrust::system::cpp::detail::tag, System2=thrust::cuda_cub::tag, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/system/cuda/detail/copy.h(162): here
            instantiation of "OutputIterator thrust::cuda_cub::copy(thrust::cuda_cub::cross_system<System1, System2>, InputIterator, InputIterator, OutputIterator) [with System1=thrust::system::cpp::detail::tag, System2=thrust::cuda_cub::tag, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/system/cuda/detail/assign_value.h(72): here
            instantiation of "void thrust::cuda_cub::assign_value(thrust::cuda_cub::cross_system<System1, System2> &, Pointer1, Pointer2) [with System1=thrust::cuda_cub::tag, System2=thrust::system::cpp::detail::tag, Pointer1=thrust::device_ptr<const unsigned long long>, Pointer2=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(370): here
            [ 7 instantiation contexts not shown ]
            instantiation of "Integer<T, Vector, void> operator*(Integer<T, Vector, void>, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(3454): here
            instantiation of "std::pair<Integer<T, Vector, void>, Integer<T, Vector, void>> computeQrBurnikelZiegler(const Integer<T, Vector, void> &, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(978): here
            instantiation of "Integer<T, Vector, std::enable_if<<expression>, void>::type> &Integer<T, Vector, std::enable_if<<expression>, void>::type>::operator%=(const Integer<T, Vector, std::enable_if<<expression>, void>::type> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(2544): here
            instantiation of "Integer<T, Vector, void> operator%(Integer<T, Vector, void>, const S &) [with T=unsigned long long, S=unsigned int, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(116): here
            instantiation of "void Integer_isPrime_trialDivision(const unsigned int *, size_t, const T *, size_t, const T *, size_t, __nv_bool *) [with T=unsigned long long]" 
Integer.cuh(1868): here

/usr/include/thrust/system/detail/sequential/copy.inl(61): error: no instance of function template "thrust::system::detail::sequential::trivial_copy_n" matches the argument list
            argument types are: (const unsigned long long *, const Size, const unsigned long long *)
          detected during:
            instantiation of "OutputIterator thrust::system::detail::sequential::copy_detail::copy(InputIterator, InputIterator, OutputIterator, thrust::detail::true_type) [with InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
(122): here
            instantiation of "OutputIterator thrust::system::detail::sequential::copy(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, InputIterator, OutputIterator) [with DerivedPolicy=thrust::detail::seq_t, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/detail/copy.inl(36): here
            instantiation of "OutputIterator thrust::copy(const thrust::detail::execution_policy_base<System> &, InputIterator, InputIterator, OutputIterator) [with System=thrust::detail::seq_t, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/system/cuda/detail/copy.h(123): here
            instantiation of "OutputIterator thrust::cuda_cub::copy(thrust::cuda_cub::execution_policy<System> &, InputIterator, InputIterator, OutputIterator) [with System=thrust::cuda_cub::tag, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/system/cuda/detail/assign_value.h(42): here
            instantiation of "void thrust::cuda_cub::assign_value(thrust::cuda_cub::execution_policy<DerivedPolicy> &, Pointer1, Pointer2) [with DerivedPolicy=thrust::cuda_cub::tag, Pointer1=thrust::device_ptr<const unsigned long long>, Pointer2=const unsigned long long *]" 
/usr/include/thrust/system/cuda/detail/assign_value.h(80): here
            [ 8 instantiation contexts not shown ]
            instantiation of "Integer<T, Vector, void> operator*(Integer<T, Vector, void>, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(3454): here
            instantiation of "std::pair<Integer<T, Vector, void>, Integer<T, Vector, void>> computeQrBurnikelZiegler(const Integer<T, Vector, void> &, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(978): here
            instantiation of "Integer<T, Vector, std::enable_if<<expression>, void>::type> &Integer<T, Vector, std::enable_if<<expression>, void>::type>::operator%=(const Integer<T, Vector, std::enable_if<<expression>, void>::type> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(2544): here
            instantiation of "Integer<T, Vector, void> operator%(Integer<T, Vector, void>, const S &) [with T=unsigned long long, S=unsigned int, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(116): here
            instantiation of "void Integer_isPrime_trialDivision(const unsigned int *, size_t, const T *, size_t, const T *, size_t, __nv_bool *) [with T=unsigned long long]" 
Integer.cuh(1868): here

/usr/include/thrust/system/cuda/detail/assign_value.h(47): error: expression must be a modifiable lvalue
          detected during:
            instantiation of "void thrust::cuda_cub::assign_value(thrust::cuda_cub::execution_policy<DerivedPolicy> &, Pointer1, Pointer2) [with DerivedPolicy=thrust::cuda_cub::tag, Pointer1=thrust::device_ptr<const unsigned long long>, Pointer2=const unsigned long long *]" 
(80): here
            instantiation of "void thrust::cuda_cub::assign_value(thrust::cuda_cub::cross_system<System1, System2> &, Pointer1, Pointer2) [with System1=thrust::cuda_cub::tag, System2=thrust::system::cpp::detail::tag, Pointer1=thrust::device_ptr<const unsigned long long>, Pointer2=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(370): here
            instantiation of "void thrust::reference<Element, Pointer, Derived>::strip_const_assign_value(const System &, OtherPointer) [with Element=const unsigned long long, Pointer=thrust::device_ptr<const unsigned long long>, Derived=thrust::device_reference<const unsigned long long>, System=thrust::cuda_cub::cross_system<thrust::cuda_cub::tag, thrust::system::cpp::detail::tag>, OtherPointer=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(348): here
            instantiation of "void thrust::reference<Element, Pointer, Derived>::assign_from(System0 *, System1 *, OtherPointer) [with Element=const unsigned long long, Pointer=thrust::device_ptr<const unsigned long long>, Derived=thrust::device_reference<const unsigned long long>, System0=thrust::device_system_tag, System1=thrust::host_system_tag, OtherPointer=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(360): here
            instantiation of "void thrust::reference<Element, Pointer, Derived>::assign_from(OtherPointer) [with Element=const unsigned long long, Pointer=thrust::device_ptr<const unsigned long long>, Derived=thrust::device_reference<const unsigned long long>, OtherPointer=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(156): here
            [ 4 instantiation contexts not shown ]
            instantiation of "Integer<T, Vector, void> operator*(Integer<T, Vector, void>, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(3454): here
            instantiation of "std::pair<Integer<T, Vector, void>, Integer<T, Vector, void>> computeQrBurnikelZiegler(const Integer<T, Vector, void> &, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(978): here
            instantiation of "Integer<T, Vector, std::enable_if<<expression>, void>::type> &Integer<T, Vector, std::enable_if<<expression>, void>::type>::operator%=(const Integer<T, Vector, std::enable_if<<expression>, void>::type> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(2544): here
            instantiation of "Integer<T, Vector, void> operator%(Integer<T, Vector, void>, const S &) [with T=unsigned long long, S=unsigned int, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(116): here
            instantiation of "void Integer_isPrime_trialDivision(const unsigned int *, size_t, const T *, size_t, const T *, size_t, __nv_bool *) [with T=unsigned long long]" 
Integer.cuh(1868): here

3 errors detected in the compilation of "main.cu".

thrust::device_vector<...> is generally not usable in device code.

OK, but is there an alternative for managing a vector in device code? Should I adapt my code to use plain arrays instead? In fact, my operator* needs to be usable from device code, and it uses the Karatsuba algorithm, which splits the bit arrays into smaller ones. So I'm wondering how to proceed. If needed, the original C++ code is here: https://github.com/Julien-Livet/TestInteger.

I'm not aware of any class infrastructure provided by NVIDIA as part of the CUDA toolkit that offers a pseudo-vector class in device code. I can't really advise how best to adapt your code, but using plain arrays is certainly feasible in CUDA.
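Just for illustration (this is not from your project; the kernel and variable names are made up), the usual pattern is to keep the container in host code and hand the kernel only a raw pointer plus a size:

#include <cstdio>
#include <thrust/device_vector.h>

// Device code only ever sees a raw pointer and a length.
__global__ void scaleLimbs(unsigned long long* data, size_t size, unsigned long long factor)
{
    size_t const i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < size)
        data[i] *= factor;
}

int main()
{
    // The thrust::device_vector lives in host code; only its raw pointer
    // is passed to the kernel.
    thrust::device_vector<unsigned long long> v(8, 3ULL);

    scaleLimbs<<<1, 32>>>(thrust::raw_pointer_cast(v.data()), v.size(), 2ULL);
    cudaDeviceSynchronize();

    unsigned long long const first = v[0]; // read back through a device_reference
    printf("%llu\n", first);               // expected: 6

    return 0;
}

The same shape works with plain cudaMalloc'd buffers if you drop Thrust on the host side as well.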


I finally wrote my own custom device vector, and it works well now.
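Roughly, the idea is a small class that manages a buffer on the device heap. A simplified sketch of the general shape of such a class (this is not the actual cu::vector code, just an illustration):

#include <cstdio>

template <typename T>
class DeviceVector // illustrative name only
{
public:
    __device__ DeviceVector() : data_(nullptr), size_(0), capacity_(0) {}
    __device__ ~DeviceVector() { if (data_) free(data_); }

    __device__ bool push_back(T const& value)
    {
        if (size_ == capacity_)
        {
            // Grow using the device heap (malloc/free are available in device code).
            size_t const newCapacity(capacity_ ? 2 * capacity_ : 4);
            T* const p(static_cast<T*>(malloc(newCapacity * sizeof(T))));

            if (!p) // device heap exhausted
                return false;

            for (size_t i(0); i < size_; ++i)
                p[i] = data_[i];

            if (data_)
                free(data_);

            data_ = p;
            capacity_ = newCapacity;
        }

        data_[size_++] = value;

        return true;
    }

    __device__ T& operator[](size_t i) { return data_[i]; }
    __device__ T const& operator[](size_t i) const { return data_[i]; }
    __device__ size_t size() const { return size_; }

private:
    T* data_;
    size_t size_;
    size_t capacity_;
};

__global__ void demo()
{
    DeviceVector<unsigned long long> v;

    for (unsigned long long i(0); i < 5; ++i)
        v.push_back(i * i);

    printf("v[4] = %llu\n", v[4]); // expected: 16
}

int main()
{
    demo<<<1, 1>>>();
    cudaDeviceSynchronize();

    return 0;
}

Note that the device heap defaults to 8 MB; it can be enlarged from host code with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) if allocations start failing.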

I made some mistakes and have fixed them. Now I have two warnings that I don't understand:

Integer.cuh(270): warning #179-D: right operand of "%" is zero
Integer.cuh(271): warning #39-D: division by zero

I don't understand why the compiler computes a value of zero for the right operand.

If you're asking for an interpretation of code you haven't shown, I won't be able to help with that. When I look at lines 277/278 of Integer.cuh in your GitHub repository, I don't see anything lining up with those messages.

Sorry, I have corrected the line numbers; it was a mistake.
It is about:

                    bits_.push_back(n % shift);
                    n /= shift;

It certainly seems like the compiler is telling you that it can determine shift to be zero, and that shouldn't be the case for either the modulo or the division. Since shift appears to be calculated here, the approach I would take is to work out what would have to be true if the compiler is not lying to you, and then see whether/how that is possible. It probably depends on the instantiated types of T and S, which I cannot see from the two lines of compiler output you have excerpted.

If that doesn’t seem fruitful, since these are just warnings, you could print out the value of shift, and see what it evaluates to at runtime, to see if it lines up with what the compiler is saying.

I think at this point the debugging strategy is whatever you would use for C++. There is nothing specific to CUDA that I can see here.

AFAIK, formally std::min is not supported in CUDA device code, but someone else may point out to me that I am wrong. If I were really grasping, I might swap that out for another min-finding construct.
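(If it came to that, a replacement is trivial to write. A throwaway sketch, not from the project:)

// A tiny host/device min, usable where relying on std::min in device code
// might be questionable. Purely illustrative.
template <typename T>
__host__ __device__ constexpr T const& my_min(T const& a, T const& b)
{
    return b < a ? b : a;
}

__global__ void useIt(size_t* out, size_t a, size_t b)
{
    *out = my_min(a, b); // writes the smaller of a and b
}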

OK. For information, T is unsigned long long and S is longest_type = uintmax_t.

            instantiation of "Integer<T, Vector, std::enable_if<<expression>, void>::type>::Integer(S) [with T=unsigned long long, Vector=cu::vector<unsigned long long>, S=longest_type, <unnamed>=(void *)nullptr]" 

At runtime, shift has a value of 4294967296.
I also notice that compilation is quite long (about 5-15 minutes; I didn't time it exactly).

So you are taking the value of 1 (in an 8-byte unsigned integer type) and shifting it left by 64 bits (8x8)?

I think that produces a value of zero (in my brain it does, anyway). Seems to me the compiler is not lying to you.

In C++, a left shift by an amount greater than or equal to the width of the underlying type results in undefined behavior.

Given that, we can discount other observations (such as the value seen at runtime).


Yes, that's exactly it. Except that I had an if/else on whether the types are equal, but I forgot to make it constexpr. Thanks for your comment!
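(For the record, the guarded pattern presumably looks something like the sketch below; the names are illustrative, not the actual Integer.cuh code. With a plain if, the shifting branch is still compiled when the widths are equal, which is exactly what produced the warnings; if constexpr discards it at compile time.)

template <typename T, typename S>
__host__ __device__ S limbShift() // illustrative helper, not from the project
{
    if constexpr (sizeof(S) > sizeof(T))
    {
        // Safe: the shift amount (the number of bits in T) is strictly
        // smaller than the width of S.
        return S(1) << (8 * sizeof(T));
    }
    else
    {
        // With a non-constexpr "if", the branch above would still be
        // instantiated when sizeof(S) == sizeof(T), i.e. a shift by the full
        // width of S, which is undefined behavior and what the compiler was
        // warning about. Equal widths have to be handled without the shift.
        return S(0);
    }
}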

So I reimplemented some std usages with device functionality and it seems to work now. However, I get an error that I don't understand (below is the backtrace from cuda-gdb).

CUDA Exception: Lane User Stack Overflow

Thread 1 "TestIntegerCuda" received signal CUDA_EXCEPTION_2, Lane User Stack Overflow.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 1, lane 0]
0x000001000031b178 in Integer<unsigned long, void>::isPrime (this=<unavailable>, primes=<unavailable>, primesSize=<unavailable>, reps=<unavailable>) at /home/julien/Documents/GitHub/TestIntegerCuda/Integer.cuh:1761
1761	        __device__ CONSTEXPR int isPrime(unsigned int const* primes, size_t primesSize, size_t reps = 25) const
(cuda-gdb) bt
#0  0x000001000031b178 in Integer<unsigned long, void>::isPrime (
    this=<unavailable>, primes=<unavailable>, primesSize=<unavailable>, 
    reps=<unavailable>)
    at /home/julien/Documents/GitHub/TestIntegerCuda/Integer.cuh:1761
#1  0x000001000030b3a8 in isPrime<unsigned long><<<(1,1,1),(1,1,1)>>> (
    nData=0x7fffd1200400, nDataSize=1, p=0x7fffd1200000, primesSize=100, 
    isPrime=0x7fffd1200200, reps=25) at main.cu:10

main.cu

#include <iostream>

#include "Integer.cuh"

template <typename T>
__global__ void isPrime(T const* nData, size_t nDataSize, unsigned int const* p, size_t primesSize, int* isPrime, size_t reps = 25)
{printf("yo0\n");
    Integer<T> const n(nData, nData + nDataSize);
printf("yo1\n");
    *isPrime = n.isPrime(p, primesSize, reps);printf("yo2\n");
}

int main()
{
    unsigned int* p;
    cudaMalloc(&p, sizeof(unsigned int) * primes.size());
    cudaMemcpy(p, primes.data(), sizeof(unsigned int) * primes.size(), cudaMemcpyHostToDevice);

    int* prime;
    cudaMalloc(&prime, sizeof(int));

    using T = uint64_t;
    
    {
        auto const n(23 * 29_z);

        auto t{std::chrono::steady_clock::now()};
        std::cout << n.isPrime() << std::endl;
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t).count() << " ms" << std::endl;

        T* nData(nullptr);    
        cudaMalloc(&nData, sizeof(T) * n.bits().size());
        cudaMemcpy(nData, n.bits().data(), sizeof(T) * n.bits().size(), cudaMemcpyHostToDevice);
        
        t = std::chrono::steady_clock::now();std::cout << "hey6" << std::endl;
        isPrime<T><<<1, 1>>>(nData, n.bits().size(), p, primes.size(), prime);
std::cout << "hey7" << std::endl;
        cudaDeviceSynchronize();
std::cout << "hey8" << std::endl;
        int pr;
        cudaMemcpy(&pr, prime, sizeof(int), cudaMemcpyDeviceToHost);

        std::cout << pr << std::endl;
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t).count() << " ms" << std::endl;

        cudaFree(nData);

        return 0;
    }

    cudaFree(p);
    cudaFree(prime);

    return 0;
}

It seems to crash at line 10 of main.cu, but I don't understand what is wrong. Any ideas?

Probably some debugging is in order. A "stack overflow" report is not unique or specific to CUDA in terms of its general meaning. Each CUDA thread has a stack, just as each CPU thread does. "Stack overflow" means that the management of that stack has somehow become problematic.

Taking a look at isPrime (and, more generally, at the code in that file), I note usage of CDP (CUDA Dynamic Parallelism) as well as usage of the device runtime (e.g. cudaMalloc in device code). Here are a few suggestions I would apply if it were my code:

  1. Any time you are having trouble with a CUDA code, make sure you are doing proper (correctly detect the error and give some immediate indication of it) and rigorous (at every opportunity to do so) CUDA error checking on all host-side CUDA runtime API usage and on all kernel calls from host code; a minimal checking helper is sketched after this list.
  2. Similarly, if you are using the CUDA device runtime, and/or launching kernels from device code, do the same kind of error checking on those (previous link in item 1 has an example).
  3. Any time you are using a device-side allocator (e.g. cudaMalloc, new, or malloc() in device code), I suggest testing the returned pointer for NULL. This is the typical way a device allocation error (e.g. out of memory) is signalled.
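For item 1, one common formulation of such a checking helper (adapt as you see fit; the macro name here is arbitrary):

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime API call; after a kernel launch, check both the
// launch error and the asynchronous execution error.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t const err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage sketch:
//   CUDA_CHECK(cudaMalloc(&p, bytes));
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution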

You might also get some clues by running your code under compute-sanitizer, including its various sub-tools. A longer treatment of CUDA debugging is contained in unit 12 of this online tutorial series.

Stack overflow in CUDA is sometimes associated with recursion. Recursion in general requires appropriate handling of the stack; that is, it presents a possible stack hazard that is not unique or specific to CUDA. I didn't spot any obvious indications of recursion, but I may have missed something, and in general the use of CDP may lead to something akin to recursion, although not necessarily with exactly the same stack hazard.
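(If deep call chains or recursion do turn out to be involved, one diagnostic knob, not a fix for the root cause, is to enlarge the per-thread device stack and see whether the exception moves or disappears. A minimal sketch:)

#include <cstdio>

int main()
{
    // Enlarge the per-thread device stack before launching the kernel under
    // test. 8 KiB is an arbitrary value for the experiment.
    cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);

    size_t stackSize(0);
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("per-thread device stack size: %zu bytes\n", stackSize);

    // ... launch the kernel under test here ...

    return 0;
}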

In studying your code, I also note patterns like this:

            Integer_isPrime_trialDivision<T><<<gridSize, blockSize>>>(primes, primesSize,
                                                                      numberData, bits_.size(),
                                                                      sqrtLimitData, sqrtLimit.bits_.size(),
                                                                      divisible);
            
            cudaFree(numberData);
            cudaFree(sqrtLimitData);

            if (*divisible)

If you were expecting *divisible to have been modified by the kernel at the point at which you test it after the kernel launch, that is incorrect thinking. A kernel launch, even in device code, is asynchronous. So without getting too far into the weeds here, that code pattern looks curious to me. But I haven't tried to study your code extensively, or the algorithm in general. (In a bit more depth: *divisible may or may not have been modified by the point at which you test it. Given that, testing it at that point seems incomprehensible to me; but again, I have not studied this in depth, and I may have missed something or just be confused.)
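To illustrate the point with plain host code (the device-side, CDP case has its own rules, as discussed below): the launch returns immediately, so the flag must not be read until the device work is known to be complete. The kernel and names here are made up for the illustration.

#include <cstdio>

__global__ void markDivisible(bool* flag)
{
    *flag = true;
}

int main()
{
    bool* divisible(nullptr);
    cudaMallocManaged(&divisible, sizeof(bool)); // visible to host and device
    *divisible = false;

    markDivisible<<<1, 1>>>(divisible);

    // Without this synchronization, the read below races with the kernel:
    // the launch returns immediately and *divisible may still be false.
    cudaDeviceSynchronize();

    printf("divisible = %d\n", *divisible); // now reliably 1

    cudaFree(divisible);

    return 0;
}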

OK. I will investigate further with error checking. About *divisible, how do I wait for the asynchronous launch to finish? It is part of a sieve algorithm.

read the links I provided under CDP: Refactoring CDP1 code to use CDP2 - #3 by Robert_Crovella