Thrust::lower_bound on device_ptr always returns first element?

Cross post from stackoverflow.

When using thrust::lower_bound on a device_vector it returns the correct element, but when using it on a device_ptr I always get the first element. Am I using it incorrectly?

struct myLowerBound
{
    __host__ __device__ myLowerBound(const thrust::device_ptr<float> input) : m_input(input) {}
    __host__ __device__ int operator()(const int& idx) const
    {
        printf("Enter\n");
        auto it = thrust::lower_bound(m_input, m_input + 100, 25);
        printf("after1\n");
        int dist = thrust::distance(m_input, it);
        printf("after2\n");
        printf("dist: %d\n", dist);

        return dist;
    }

    thrust::device_ptr<float> m_input;
};

thrust::device_vector<float> input(100);
thrust::sequence(thrust::device, input.begin(), input.end());
float tmp = input[25];
printf("input[25]: %f\n", tmp);

thrust::device_vector<int> out(input.size());
thrust::transform(thrust::device,
                    thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(input.size()),
                    out.begin(),
                    myLowerBound(input.data()));

int out4 = out[4];
printf("out4: %d\n", out4);

The output I get is:

...
Enter
Enter
input[25]: 25.000000
out4: 0

“after1”, “after2”, and “dist” are not printed.

Its preferable to provide a complete code. When I run the code you have now posted on your SO question under edit3: , I get the following output on CUDA 12.0 on A100:

$ ./t4
input[25]: 25.000000
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
after1
after1
after1
after1
after2
after2
after2
after2
dist: 0
dist: 0
dist: 0
dist: 0
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after1
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
after2
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
dist: 0
out4: 0
$

So if your complaint is:

I don’t seem to be able to reproduce that observation. If I run the code under compute-sanitizer in the same setup, there are also no errors reported.

I haven’t studied your code carefully to draw my own conclusions about what the expected behavior should be, but I don’t seem to be able to reproduce your claim.

Here is the complete code.

// testThrust.cuh
#pragma once

class __declspec(dllexport) testThrust
{
public:
    testThrust();
    ~testThrust();

    void runTest2();

};

// testThrust.cu
#include "testThrust.cuh"

#include <thrust/binary_search.h>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include<thrust/distance.h>

testThrust::testThrust(){}

testThrust::~testThrust(){}

struct myLowerBound
{
    __host__ __device__ myLowerBound(const thrust::device_ptr<float> input) : m_input(input) {}
    __host__ __device__ int operator()(const int& idx) const
    {
        printf("Enter\n");
        auto it = thrust::lower_bound(m_input, m_input + 100, 25);
        printf("after1\n");
        int dist = thrust::distance(m_input, it);
        printf("after2\n");
        printf("dist: %d\n", dist);

        return dist;
    }

    thrust::device_ptr<float> m_input;
};

void testThrust::runTest2()
{
        thrust::device_vector<float> input(100);
        thrust::sequence(thrust::device, input.begin(), input.end());
        float tmp = input[25];
        printf("input[25]: %f\n", tmp);

        thrust::device_vector<int> out(input.size());
        thrust::transform(thrust::device,
                    thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(input.size()),
                    out.begin(),
                    myLowerBound(input.data()));

        int out4 = out[4];
        printf("out4: %d\n", out4);
}

// main.cpp
#include "testThrust.cuh"

int main(int argc, char *argv[])
{
    testThrust t;
    t.runTest2();
    return 0;
}

The output I get using CUDA 11.5, Thrust: 11.4, Video card: Geforce GTX 1650, when running the executable directly.

PS D:\mytest\build-thrust-Debug> .\thrust.exe
input[25]: 25.000000
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
Enter
out4: 0

The expected output would print dist: 25 and out4: 25 (which is what I get when thrust::lower_bound is provided a device_vector), not dist: 0 and out4: 0.

My question is, why is thrust::lower_bound return the first element when I pass it a device_ptr, but the 25th element when I pass it a device_vector?

It seems evident that the code is not running properly in your setup - the missing printf output. I would diagnose that aspect first, before trying to move on to explaining the results.

I don’t work on windows much, but I would start by trying to run the code from the command line as you have but with compute-sanitizer, and see what the output is.

You’ve drawn a distinction between device_ptr and device_vector, but I suspect another difference is calling lower_bound from the host in the device_vector case (since a device_vector is not directly usable in device code) and from the device in the device_ptr case. If I had to guess, I would guess that is the more important distinction.

Aha. When I build a debug version of the code (-G) then I get a problem. If you are building a debug project or using debug switches, I suggest switching to a release project. Long ago thrust had a published warning not to build debug executables (I know, weird) but I had thought that was merely a historical thing. I’m not sure why the difference in behavior here. There may be a thrust mode switch for device dispatch that behaves different in debug vs. release compilation settings.

It appears when I add thrust::seq to the thrust::lower_bound call, the issue is resolved. All print statements are printed, and the output is 25 as expected.

The commenter on the stackoverflow post said

If I understand the internal Thrust code correctly, simple binary searches are basically sequential anyway. That is why there are vectorized searches. You are pretty much doing your own custom vectorized search which would be hard to implement with the Thrust ones. But using thrust::seq should be ideal as there is no parallelism in simple binary search.

Perhaps the issue was because the thrust::transform call had thrust::device as the execution policy, which made thrust::lower_bound default to thrust::device, then the vectorized search failed because I gave it a single search value when it was expecting a list of search values (as shown in the examples in the linked docs).

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.