CUDA thrust compilation error

Hello. I am new to CUDA. I have a working C++ code that I want to port to CUDA. The GitHub repository for the project is here: https://github.com/Julien-Livet/TestIntegerCuda. Compilation gives me three errors (see below) that I don't understand, so I cannot manage to solve them. Can you help me, please?

/usr/include/thrust/system/cuda/detail/internal/copy_cross_system.h(107): error: reinterpret_cast cannot cast away const or other type qualifiers
          detected during:
            instantiation of "OutputIt thrust::cuda_cub::__copy::cross_system_copy_n(thrust::execution_policy<System1> &, thrust::execution_policy<System2> &, InputIt, Size, OutputIt, thrust::detail::true_type) [with System1=thrust::system::cpp::detail::tag, System2=thrust::cuda_cub::tag, InputIt=const unsigned long long *, Size=std::ptrdiff_t, OutputIt=thrust::device_ptr<const unsigned long long>]" 
(222): here
            instantiation of "OutputIt thrust::cuda_cub::__copy::cross_system_copy_n(thrust::cuda_cub::cross_system<System1, System2>, InputIt, Size, OutputIt) [with System1=thrust::system::cpp::detail::tag, System2=thrust::cuda_cub::tag, InputIt=const unsigned long long *, Size=std::ptrdiff_t, OutputIt=thrust::device_ptr<const unsigned long long>]" 
(238): here
            instantiation of "OutputIterator thrust::cuda_cub::__copy::cross_system_copy(thrust::cuda_cub::cross_system<System1, System2>, InputIterator, InputIterator, OutputIterator) [with System1=thrust::system::cpp::detail::tag, System2=thrust::cuda_cub::tag, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/system/cuda/detail/copy.h(162): here
            instantiation of "OutputIterator thrust::cuda_cub::copy(thrust::cuda_cub::cross_system<System1, System2>, InputIterator, InputIterator, OutputIterator) [with System1=thrust::system::cpp::detail::tag, System2=thrust::cuda_cub::tag, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/system/cuda/detail/assign_value.h(72): here
            instantiation of "void thrust::cuda_cub::assign_value(thrust::cuda_cub::cross_system<System1, System2> &, Pointer1, Pointer2) [with System1=thrust::cuda_cub::tag, System2=thrust::system::cpp::detail::tag, Pointer1=thrust::device_ptr<const unsigned long long>, Pointer2=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(370): here
            [ 7 instantiation contexts not shown ]
            instantiation of "Integer<T, Vector, void> operator*(Integer<T, Vector, void>, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(3454): here
            instantiation of "std::pair<Integer<T, Vector, void>, Integer<T, Vector, void>> computeQrBurnikelZiegler(const Integer<T, Vector, void> &, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(978): here
            instantiation of "Integer<T, Vector, std::enable_if<<expression>, void>::type> &Integer<T, Vector, std::enable_if<<expression>, void>::type>::operator%=(const Integer<T, Vector, std::enable_if<<expression>, void>::type> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(2544): here
            instantiation of "Integer<T, Vector, void> operator%(Integer<T, Vector, void>, const S &) [with T=unsigned long long, S=unsigned int, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(116): here
            instantiation of "void Integer_isPrime_trialDivision(const unsigned int *, size_t, const T *, size_t, const T *, size_t, __nv_bool *) [with T=unsigned long long]" 
Integer.cuh(1868): here

/usr/include/thrust/system/detail/sequential/copy.inl(61): error: no instance of function template "thrust::system::detail::sequential::trivial_copy_n" matches the argument list
            argument types are: (const unsigned long long *, const Size, const unsigned long long *)
          detected during:
            instantiation of "OutputIterator thrust::system::detail::sequential::copy_detail::copy(InputIterator, InputIterator, OutputIterator, thrust::detail::true_type) [with InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
(122): here
            instantiation of "OutputIterator thrust::system::detail::sequential::copy(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, InputIterator, OutputIterator) [with DerivedPolicy=thrust::detail::seq_t, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/detail/copy.inl(36): here
            instantiation of "OutputIterator thrust::copy(const thrust::detail::execution_policy_base<System> &, InputIterator, InputIterator, OutputIterator) [with System=thrust::detail::seq_t, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/system/cuda/detail/copy.h(123): here
            instantiation of "OutputIterator thrust::cuda_cub::copy(thrust::cuda_cub::execution_policy<System> &, InputIterator, InputIterator, OutputIterator) [with System=thrust::cuda_cub::tag, InputIterator=const unsigned long long *, OutputIterator=thrust::device_ptr<const unsigned long long>]" 
/usr/include/thrust/system/cuda/detail/assign_value.h(42): here
            instantiation of "void thrust::cuda_cub::assign_value(thrust::cuda_cub::execution_policy<DerivedPolicy> &, Pointer1, Pointer2) [with DerivedPolicy=thrust::cuda_cub::tag, Pointer1=thrust::device_ptr<const unsigned long long>, Pointer2=const unsigned long long *]" 
/usr/include/thrust/system/cuda/detail/assign_value.h(80): here
            [ 8 instantiation contexts not shown ]
            instantiation of "Integer<T, Vector, void> operator*(Integer<T, Vector, void>, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(3454): here
            instantiation of "std::pair<Integer<T, Vector, void>, Integer<T, Vector, void>> computeQrBurnikelZiegler(const Integer<T, Vector, void> &, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(978): here
            instantiation of "Integer<T, Vector, std::enable_if<<expression>, void>::type> &Integer<T, Vector, std::enable_if<<expression>, void>::type>::operator%=(const Integer<T, Vector, std::enable_if<<expression>, void>::type> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(2544): here
            instantiation of "Integer<T, Vector, void> operator%(Integer<T, Vector, void>, const S &) [with T=unsigned long long, S=unsigned int, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(116): here
            instantiation of "void Integer_isPrime_trialDivision(const unsigned int *, size_t, const T *, size_t, const T *, size_t, __nv_bool *) [with T=unsigned long long]" 
Integer.cuh(1868): here

/usr/include/thrust/system/cuda/detail/assign_value.h(47): error: expression must be a modifiable lvalue
          detected during:
            instantiation of "void thrust::cuda_cub::assign_value(thrust::cuda_cub::execution_policy<DerivedPolicy> &, Pointer1, Pointer2) [with DerivedPolicy=thrust::cuda_cub::tag, Pointer1=thrust::device_ptr<const unsigned long long>, Pointer2=const unsigned long long *]" 
(80): here
            instantiation of "void thrust::cuda_cub::assign_value(thrust::cuda_cub::cross_system<System1, System2> &, Pointer1, Pointer2) [with System1=thrust::cuda_cub::tag, System2=thrust::system::cpp::detail::tag, Pointer1=thrust::device_ptr<const unsigned long long>, Pointer2=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(370): here
            instantiation of "void thrust::reference<Element, Pointer, Derived>::strip_const_assign_value(const System &, OtherPointer) [with Element=const unsigned long long, Pointer=thrust::device_ptr<const unsigned long long>, Derived=thrust::device_reference<const unsigned long long>, System=thrust::cuda_cub::cross_system<thrust::cuda_cub::tag, thrust::system::cpp::detail::tag>, OtherPointer=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(348): here
            instantiation of "void thrust::reference<Element, Pointer, Derived>::assign_from(System0 *, System1 *, OtherPointer) [with Element=const unsigned long long, Pointer=thrust::device_ptr<const unsigned long long>, Derived=thrust::device_reference<const unsigned long long>, System0=thrust::device_system_tag, System1=thrust::host_system_tag, OtherPointer=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(360): here
            instantiation of "void thrust::reference<Element, Pointer, Derived>::assign_from(OtherPointer) [with Element=const unsigned long long, Pointer=thrust::device_ptr<const unsigned long long>, Derived=thrust::device_reference<const unsigned long long>, OtherPointer=const unsigned long long *]" 
/usr/include/thrust/detail/reference.h(156): here
            [ 4 instantiation contexts not shown ]
            instantiation of "Integer<T, Vector, void> operator*(Integer<T, Vector, void>, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(3454): here
            instantiation of "std::pair<Integer<T, Vector, void>, Integer<T, Vector, void>> computeQrBurnikelZiegler(const Integer<T, Vector, void> &, const Integer<T, Vector, void> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(978): here
            instantiation of "Integer<T, Vector, std::enable_if<<expression>, void>::type> &Integer<T, Vector, std::enable_if<<expression>, void>::type>::operator%=(const Integer<T, Vector, std::enable_if<<expression>, void>::type> &) [with T=unsigned long long, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(2544): here
            instantiation of "Integer<T, Vector, void> operator%(Integer<T, Vector, void>, const S &) [with T=unsigned long long, S=unsigned int, Vector=thrust::device_vector<unsigned long long, thrust::device_allocator<unsigned long long>>]" 
Integer.cuh(116): here
            instantiation of "void Integer_isPrime_trialDivision(const unsigned int *, size_t, const T *, size_t, const T *, size_t, __nv_bool *) [with T=unsigned long long]" 
Integer.cuh(1868): here

3 errors detected in the compilation of "main.cu".

thrust::device_vector<...> is generally not usable in device code.

OK, but is there an alternative for managing a vector in device code? Should I adapt my code to use plain arrays instead? In fact, my operator* needs to be usable from device code, and it uses the Karatsuba algorithm, which splits the bit arrays into smaller ones. So I'm wondering how to proceed. If needed, the original C++ code is here: https://github.com/Julien-Livet/TestInteger.

I'm not aware of any class infrastructure provided by NVIDIA as part of the CUDA toolkit that offers a pseudo-vector class in device code. I can't really advise how best to adapt your code, but using plain arrays is certainly feasible in CUDA.
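Just for illustration (this is not from your project; the kernel and variable names are made up), the usual pattern is to keep the container in host code and hand the kernel only a raw pointer plus a size:

#include <cstdio>
#include <thrust/device_vector.h>

// Device code only ever sees a raw pointer and a length.
__global__ void scaleLimbs(unsigned long long* data, size_t size, unsigned long long factor)
{
    size_t const i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < size)
        data[i] *= factor;
}

int main()
{
    // The thrust::device_vector lives in host code; only its raw pointer
    // is passed to the kernel.
    thrust::device_vector<unsigned long long> v(8, 3ULL);

    scaleLimbs<<<1, 32>>>(thrust::raw_pointer_cast(v.data()), v.size(), 2ULL);
    cudaDeviceSynchronize();

    unsigned long long const first = v[0]; // read back through a device_reference
    printf("%llu\n", first);               // expected: 6

    return 0;
}

The same shape works with plain cudaMalloc'd buffers if you drop Thrust on the host side as well.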


I finally wrote my own custom device vector, and it works well now.
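Roughly, the idea is a small class that manages a buffer on the device heap. A simplified sketch of the general shape of such a class (this is not the actual cu::vector code, just an illustration):

#include <cstdio>

template <typename T>
class DeviceVector // illustrative name only
{
public:
    __device__ DeviceVector() : data_(nullptr), size_(0), capacity_(0) {}
    __device__ ~DeviceVector() { if (data_) free(data_); }

    __device__ bool push_back(T const& value)
    {
        if (size_ == capacity_)
        {
            // Grow using the device heap (malloc/free are available in device code).
            size_t const newCapacity(capacity_ ? 2 * capacity_ : 4);
            T* const p(static_cast<T*>(malloc(newCapacity * sizeof(T))));

            if (!p) // device heap exhausted
                return false;

            for (size_t i(0); i < size_; ++i)
                p[i] = data_[i];

            if (data_)
                free(data_);

            data_ = p;
            capacity_ = newCapacity;
        }

        data_[size_++] = value;

        return true;
    }

    __device__ T& operator[](size_t i) { return data_[i]; }
    __device__ T const& operator[](size_t i) const { return data_[i]; }
    __device__ size_t size() const { return size_; }

private:
    T* data_;
    size_t size_;
    size_t capacity_;
};

__global__ void demo()
{
    DeviceVector<unsigned long long> v;

    for (unsigned long long i(0); i < 5; ++i)
        v.push_back(i * i);

    printf("v[4] = %llu\n", v[4]); // expected: 16
}

int main()
{
    demo<<<1, 1>>>();
    cudaDeviceSynchronize();

    return 0;
}

Note that the device heap defaults to 8 MB; it can be enlarged from host code with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) if allocations start failing.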

I made some mistakes and have fixed them. Now I have two warnings that I don't understand:

Integer.cuh(270): warning #179-D: right operand of "%" is zero
Integer.cuh(271): warning #39-D: division by zero

I don't understand why the compiler computes a value of zero for the right operand.

If you're asking for an interpretation of code you haven't shown, I won't be able to help with that. When I look at lines 277/278 of Integer.cuh in your GitHub repository, I don't see anything lining up with those messages.

Sorry, I have corrected the line numbers; it was a mistake.
It is about:

                    bits_.push_back(n % shift);
                    n /= shift;

It certainly seems like the compiler is telling you that it can determine shift to be zero, and that shouldn't be the case for either the modulo or the division. Since shift appears to be calculated here, the approach I would take is to work out what would have to be true if the compiler is not lying to you, and then see whether/how that is possible. It probably depends on the instantiated types of T and S, which I cannot see from the two lines of compiler output you have excerpted.

If that doesn’t seem fruitful, since these are just warnings, you could print out the value of shift, and see what it evaluates to at runtime, to see if it lines up with what the compiler is saying.

I think at this point the debugging strategy is whatever you would use for C++. There is nothing specific to CUDA that I can see here.

AFAIK, formally std::min is not supported in CUDA device code, but someone else may point out to me that I am wrong. If I were really grasping, I might swap that out for another min-finding construct.
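(If it came to that, a replacement is trivial to write. A throwaway sketch, not from the project:)

// A tiny host/device min, usable where relying on std::min in device code
// might be questionable. Purely illustrative.
template <typename T>
__host__ __device__ constexpr T const& my_min(T const& a, T const& b)
{
    return b < a ? b : a;
}

__global__ void useIt(size_t* out, size_t a, size_t b)
{
    *out = my_min(a, b); // writes the smaller of a and b
}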

OK. For information, T is unsigned long long and S is longest_type = uintmax_t.

            instantiation of "Integer<T, Vector, std::enable_if<<expression>, void>::type>::Integer(S) [with T=unsigned long long, Vector=cu::vector<unsigned long long>, S=longest_type, <unnamed>=(void *)nullptr]" 

At runtime, shift has a value of 4294967296.
I also notice that compilation is quite long (about 5-15 minutes; I didn't time it exactly).

So you are taking the value of 1 (in an 8-byte unsigned integer type) and shifting it left by 64 bits (8x8)?

I think that produces a value of zero (in my brain it does, anyway). Seems to me the compiler is not lying to you.

In C++, a left shift by an amount greater than or equal to the width of the underlying type results in undefined behavior.

Given that, we can discount other observations (such as the value seen at runtime).


Yes, that's exactly it. Except that I had an if/else on whether the types are equal, but I forgot to make it constexpr. Thanks for your comment!
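(For the record, the guarded pattern presumably looks something like the sketch below; the names are illustrative, not the actual Integer.cuh code. With a plain if, the shifting branch is still compiled when the widths are equal, which is exactly what produced the warnings; if constexpr discards it at compile time.)

template <typename T, typename S>
__host__ __device__ S limbShift() // illustrative helper, not from the project
{
    if constexpr (sizeof(S) > sizeof(T))
    {
        // Safe: the shift amount (the number of bits in T) is strictly
        // smaller than the width of S.
        return S(1) << (8 * sizeof(T));
    }
    else
    {
        // With a non-constexpr "if", the branch above would still be
        // instantiated when sizeof(S) == sizeof(T), i.e. a shift by the full
        // width of S, which is undefined behavior and what the compiler was
        // warning about. Equal widths have to be handled without the shift.
        return S(0);
    }
}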

So I reimplemented some std usages with device functionality and it seems to work now. However, I get an error that I don't understand (below is the backtrace from cuda-gdb).

CUDA Exception: Lane User Stack Overflow

Thread 1 "TestIntegerCuda" received signal CUDA_EXCEPTION_2, Lane User Stack Overflow.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 1, lane 0]
0x000001000031b178 in Integer<unsigned long, void>::isPrime (this=<unavailable>, primes=<unavailable>, primesSize=<unavailable>, reps=<unavailable>) at /home/julien/Documents/GitHub/TestIntegerCuda/Integer.cuh:1761
1761	        __device__ CONSTEXPR int isPrime(unsigned int const* primes, size_t primesSize, size_t reps = 25) const
(cuda-gdb) bt
#0  0x000001000031b178 in Integer<unsigned long, void>::isPrime (
    this=<unavailable>, primes=<unavailable>, primesSize=<unavailable>, 
    reps=<unavailable>)
    at /home/julien/Documents/GitHub/TestIntegerCuda/Integer.cuh:1761
#1  0x000001000030b3a8 in isPrime<unsigned long><<<(1,1,1),(1,1,1)>>> (
    nData=0x7fffd1200400, nDataSize=1, p=0x7fffd1200000, primesSize=100, 
    isPrime=0x7fffd1200200, reps=25) at main.cu:10

main.cu

#include <iostream>

#include "Integer.cuh"

template <typename T>
__global__ void isPrime(T const* nData, size_t nDataSize, unsigned int const* p, size_t primesSize, int* isPrime, size_t reps = 25)
{printf("yo0\n");
    Integer<T> const n(nData, nData + nDataSize);
printf("yo1\n");
    *isPrime = n.isPrime(p, primesSize, reps);printf("yo2\n");
}

int main()
{
    unsigned int* p;
    cudaMalloc(&p, sizeof(unsigned int) * primes.size());
    cudaMemcpy(p, primes.data(), sizeof(unsigned int) * primes.size(), cudaMemcpyHostToDevice);

    int* prime;
    cudaMalloc(&prime, sizeof(int));

    using T = uint64_t;
    
    {
        auto const n(23 * 29_z);

        auto t{std::chrono::steady_clock::now()};
        std::cout << n.isPrime() << std::endl;
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t).count() << " ms" << std::endl;

        T* nData(nullptr);    
        cudaMalloc(&nData, sizeof(T) * n.bits().size());
        cudaMemcpy(nData, n.bits().data(), sizeof(T) * n.bits().size(), cudaMemcpyHostToDevice);
        
        t = std::chrono::steady_clock::now();std::cout << "hey6" << std::endl;
        isPrime<T><<<1, 1>>>(nData, n.bits().size(), p, primes.size(), prime);
std::cout << "hey7" << std::endl;
        cudaDeviceSynchronize();
std::cout << "hey8" << std::endl;
        int pr;
        cudaMemcpy(&pr, prime, sizeof(int), cudaMemcpyDeviceToHost);

        std::cout << pr << std::endl;
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t).count() << " ms" << std::endl;

        cudaFree(nData);

        return 0;
    }

    cudaFree(p);
    cudaFree(prime);

    return 0;
}

It seems to crash at line 10 of main.cu, but I don't understand what is wrong. Any ideas?

Probably some debugging is in order. A "stack overflow" report is not unique or specific to CUDA in terms of its general meaning. Each CUDA thread has a stack, just as each CPU thread does. "Stack overflow" means that the management of that stack has somehow become problematic.

Taking a look at isPrime (and, more generally, at the code in that file), I note usage of CDP (CUDA Dynamic Parallelism) as well as usage of the device runtime (e.g. cudaMalloc in device code). Here are a few suggestions I would apply if it were my code:

  1. Any time you are having trouble with a CUDA code, make sure you are doing proper (correctly detect the error and give some immediate indication of it) and rigorous (at every opportunity to do so) CUDA error checking on all host-side CUDA runtime API usage and on all kernel calls from host code; a minimal checking helper is sketched after this list.
  2. Similarly, if you are using the CUDA device runtime, and/or launching kernels from device code, do the same kind of error checking on those (previous link in item 1 has an example).
  3. Any time you are using a device-side allocator (e.g. cudaMalloc, new, or malloc() in device code), I suggest testing the returned pointer for NULL. This is the typical way a device allocation error (e.g. out of memory) is signalled.
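For item 1, one common formulation of such a checking helper (adapt as you see fit; the macro name here is arbitrary):

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime API call; after a kernel launch, check both the
// launch error and the asynchronous execution error.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t const err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage sketch:
//   CUDA_CHECK(cudaMalloc(&p, bytes));
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution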

You might also get some clues by running your code under compute-sanitizer, including its various sub-tools. A longer treatment of CUDA debugging is contained in unit 12 of this online tutorial series.

Stack overflow in CUDA is sometimes associated with recursion. Recursion in general requires appropriate handling of the stack; that is, it presents a possible stack hazard that is not unique or specific to CUDA. I didn't spot any obvious indications of recursion, but I may have missed something, and in general the use of CDP may lead to something akin to recursion, although not necessarily with exactly the same stack hazard.
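(If deep call chains or recursion do turn out to be involved, one diagnostic knob, not a fix for the root cause, is to enlarge the per-thread device stack and see whether the exception moves or disappears. A minimal sketch:)

#include <cstdio>

int main()
{
    // Enlarge the per-thread device stack before launching the kernel under
    // test. 8 KiB is an arbitrary value for the experiment.
    cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);

    size_t stackSize(0);
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("per-thread device stack size: %zu bytes\n", stackSize);

    // ... launch the kernel under test here ...

    return 0;
}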

In studying your code, I also note patterns like this:

            Integer_isPrime_trialDivision<T><<<gridSize, blockSize>>>(primes, primesSize,
                                                                      numberData, bits_.size(),
                                                                      sqrtLimitData, sqrtLimit.bits_.size(),
                                                                      divisible);
            
            cudaFree(numberData);
            cudaFree(sqrtLimitData);

            if (*divisible)

If you were expecting *divisible to have been modified by the kernel at the point at which you test it after the kernel launch, that is incorrect thinking. A kernel launch, even in device code, is asynchronous. So without getting too far into the weeds here, that code pattern looks curious to me. But I haven't tried to study your code extensively, or the algorithm in general. (In a bit more depth: *divisible may or may not have been modified by the point at which you test it. Given that, testing it at that point seems incomprehensible to me; but again, I have not studied this in depth, and I may have missed something or just be confused.)
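To illustrate the point with plain host code (the device-side, CDP case has its own rules, as discussed below): the launch returns immediately, so the flag must not be read until the device work is known to be complete. The kernel and names here are made up for the illustration.

#include <cstdio>

__global__ void markDivisible(bool* flag)
{
    *flag = true;
}

int main()
{
    bool* divisible(nullptr);
    cudaMallocManaged(&divisible, sizeof(bool)); // visible to host and device
    *divisible = false;

    markDivisible<<<1, 1>>>(divisible);

    // Without this synchronization, the read below races with the kernel:
    // the launch returns immediately and *divisible may still be false.
    cudaDeviceSynchronize();

    printf("divisible = %d\n", *divisible); // now reliably 1

    cudaFree(divisible);

    return 0;
}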

OK. I will investigate further with error checking. About *divisible, how do I wait for the asynchronous launch to finish? It is part of a sieve algorithm.

read the links I provided under CDP: Refactoring CDP1 code to use CDP2 - #3 by Robert_Crovella