Using Thrust for stream compaction

X-FrEaK · May 25, 2011, 12:02pm

Hi all.

Very briefly, I have 3 vectors from my CUDA C program and then I call an extern C function which does stream compaction using thrust. My problem starts when I finish the stream compaction algorithm and have to copy the results back to normal C arrays. So I need some help to find a way to do this. Here is the code

extern "C" void sort(float*intx,float*inty,float*intz){

thrust::device_vector<float> x(112*88*1408);

  thrust::device_vector<float> y(112*88*1408);

  thrust::device_vector<float> z(112*88*1408);

	thrust::copy(intx,intx+112*88*1408,x.begin());

	thrust::copy(inty,inty+112*88*1408,y.begin());

	thrust::copy(intz,intz+112*88*1408,z.begin());

	thrust::device_vector<float> x_out(112*88*1408);

	thrust::device_vector<float> y_out(112*88*1408);

	thrust::device_vector<float> z_out(112*88*1408);

	typedef thrust::device_vector<float>::iterator      Iterator;

	typedef thrust::tuple<Iterator, Iterator, Iterator> IteratorTuple;

	typedef thrust::zip_iterator<IteratorTuple>         ZipIterator;

	ZipIterator begin(thrust::make_tuple(x.begin(), y.begin(), z.begin()));

	ZipIterator end  (thrust::make_tuple(x.end(),   y.end(),   z.end()));

	ZipIterator output_begin(thrust::make_tuple(x_out.begin(), y_out.begin(), z_out.begin()));

	ZipIterator output_end = thrust::remove_copy_if(begin, end, output_begin, less_than_zero<float>());

	// compute size of output

	size_t N = output_end - output_begin;

	// resize output

	x_out.resize(N);

	y_out.resize(N);

	z_out.resize(N);

	//Must have a function here to alloc new C output arrays

for(size_t i = 0; i < x_out.size(); i++)

    std::cout << x_out[i] << " " << y_out[i] << " " << z_out[i] << std::endl;

}

Thanks in advance External Image

avidday · May 25, 2011, 12:16pm

Are the inputs to this function device pointers? And what do you mean by “normal C arrays”?

X-FrEaK · May 25, 2011, 12:46pm

No the inputs are host arrays, which I obtain from a CUDA kernel(can I just pass the device arrays?)

What I mean by normal C arrays, is that I want to work with normal C arrays and not thrust::host vectors, I don’t know if I am explaining this right…

devkec · May 25, 2011, 12:55pm

Use either &x[0] or thrust::raw_pointer_cast(x.data())

X-FrEaK · May 25, 2011, 1:27pm

I don’t find that thrust function in the API details =/. Could you give me a simple example?

EDIT: I found it in the QuickstarGuide

// extract raw pointer from device_ptr

int * raw_ptr = thrust::raw_pointer_cast(dev_ptr)

So I just have to do float*intx_out=thrust::raw_pointer_cast(x_out.data()) ?

X-FrEaK · May 25, 2011, 1:36pm

I tested the code with manually initialized arrays and i get this error:

“./thrust/detail/internal_functional.h(48): error: calling a host function from a device/global function is not allowed”

which refers to this part of the code

template <typename T>

struct less_than_zero

{

  bool operator()(const thrust::tuple<T,T,T>& t)

  {

    return thrust::get<0>(t) < 0 || 

           thrust::get<1>(t) < 0 || 

           thrust::get<2>(t) < 0;

  }

};

I saw some thrust examples that did something similar (they must have used host vectors), how can I get around this issue?

EDIT: Just got it, I had to add “host device” to the boolean operator

devkec · May 26, 2011, 5:02am

If you want the pointer to the device memory, yes. If you want to use the data in “plain C” then copy it to a host_vector first, and then take the raw_pointer. I made a little wrapper function for this, I’ll reply it later (it’s not at my home computer).

Exactly.

X-FrEaK · May 26, 2011, 9:47am

Thank you davkec and yes I want the the array in plain C. Just one thing, I can use this stream compaction algorithm with device vector input instead of host vector? If so, i just have to pass the pointer to the device arrays?

apostglen46 · May 26, 2011, 10:47am

also i think that you can do the following if you dont use the functionalities of a host vecrors.
1.Create your plain C arrays.
2.Copy them to a device vector using copy with a raw pointer cast.
3.Perform stream compaction.
4.Copy back the data using the raw pointers of the host vector from step 2.

I hope i understood your problem correctly.

Apostolis

X-FrEaK · May 26, 2011, 11:39am

So you are saying that instead of

thrust::device_vector<float> x(112*88*1408);

	thrust::copy(intx,intx+112*88*1408,x.begin());

I should have

thrust::device_vector<float> x(112*88*1408);

	thrust::copy(thrust::raw_pointer_cast(intx),thrust::raw_pointer_cast(intx+112*88*1408),x.begin());

?

apostglen46 · May 26, 2011, 3:33pm

nope,i meant that in the final copy (from device to host) you can simply use &x[0] as the begining of the array you want to end the data to .
You can even write your own function to copy data starting from &x_out[0] to a new array xout_new[N]
It doesn’t seems that complex to me. (unless i missunderstood your problem)

X-FrEaK · May 26, 2011, 5:38pm

Sorry, I assume it is an easy problem, but I am no programmer =/ so basic things, sometimes are hard for me to understand

I ended up doing this:

thrust::host_vector<float> x_outh=xout;

intxc=thrust::raw_pointer_cast(x_outh.data()) //the intxc pointer is the same as the input one

I print array intxc(INSIDE the thrust function) and the values are correct, however if I print the array with printf in the main function afterwards, the values that are printed are the ones before the stream compaction algorithm =/

EDIT: @apostglen46 I did as you said(at least I think it was what you said), and the results are the same

Basically what i did was

thrust::host_vector<float> x_outh=xout;

intxc=&x_outh[0];

But the intxc array is only modified within the function

apostglen46 · May 27, 2011, 9:12am

could you please write a simple test case (very small input size) so we can help you further?
I thought that what you wrote would work,it was more or less what i suggested.

Apostolis

X-FrEaK · May 27, 2011, 12:12pm

Here is a simple test case: I have 3 arrays(x,y,z) with 10 elements each. The first 5 elements are 0, and the other 5 elements are 5’s. According to my boolean operator, it is supposed to delete the lines which meet the criteria x[i]=y[i]=z[i]=0;

The print function inside the thrust function outputs this:

5 5 5

5 5 5

5 5 5

5 5 5

5 5 5

0 0 0

6.86636e-44 6.86636e-44 6.86636e-44

0 0 0

0 6.12427e-37 6.12417e-37

0 0 0

Array Size = 5

Which is correct. However the printf in the main function outputs this:

Here is the code I used:

#include <stdlib.h>

#include <stdio.h>

#include <thrust/remove.h>

#include <thrust/device_vector.h>

#define blocksize 8

#define iiterations 0

#define jiterations 0

template <typename T>

struct deletekey

{

  __host__ __device__ bool operator()(const thrust::tuple<T,T,T>& t)

  {

    return thrust::get<0>(t) < 0 || thrust::get<0>(t) > 1408 ||

	   thrust::get<1>(t) < 0 || thrust::get<1>(t) > 2*1792 ||

           thrust::get<2>(t) < 0 || thrust::get<2>(t) > 60 ||

	   (thrust::get<0>(t) == 0 && thrust::get<1>(t) == 0 && thrust::get<2>(t) == 0);

  }

};

extern "C" void compact(float*intxc,float*intyc,float*intzc){

	thrust::device_vector<float> x(10);

	thrust::device_vector<float> y(10);

	thrust::device_vector<float> z(10);

	thrust::copy(intxc,intxc+10,x.begin());

	thrust::copy(intyc,intyc+10,y.begin());

	thrust::copy(intzc,intzc+10,z.begin());

	thrust::device_vector<float> x_out(10);

	thrust::device_vector<float> y_out(10);

	thrust::device_vector<float> z_out(10);

	typedef thrust::device_vector<float>::iterator      Iterator;

	typedef thrust::tuple<Iterator, Iterator, Iterator> IteratorTuple;

	typedef thrust::zip_iterator<IteratorTuple>         ZipIterator;

	ZipIterator begin(thrust::make_tuple(x.begin(), y.begin(), z.begin()));

	ZipIterator end  (thrust::make_tuple(x.end(),   y.end(),   z.end()));

	ZipIterator output_begin(thrust::make_tuple(x_out.begin(), y_out.begin(), z_out.begin()));

	ZipIterator output_end = thrust::remove_copy_if(begin, end, output_begin, deletekey<float>());

	// compute size of output

	size_t N = output_end - output_begin;

	// resize output

	x_out.resize(N);

	y_out.resize(N);

	z_out.resize(N);

	thrust::host_vector<float> x_outh = x_out;

	thrust::host_vector<float> y_outh = y_out;

	thrust::host_vector<float> z_outh = z_out;

	intxc=&x_outh[0]; //thrust::raw_pointer_cast(x_outh.data());

	intyc=thrust::raw_pointer_cast(y_outh.data());

	intzc=thrust::raw_pointer_cast(z_outh.data());

	for(size_t i = 0; i < 10; i++)

	std::cout << intxc[i] << " " << intyc[i] << " " << intzc[i] << std::endl;

	std::cout << "Array Size = " << N << std::endl;

}

void printMat(float*P,int uWP,int uHP){

	int i,j;

	for(i=0;i<uHP;i++){

		printf("\n");

		for(j=0;j<uWP;j++)

			printf("%f ",P[i*uWP+j]);

	}

}

int main(){

	

	float *intersectionsx_h=(float*)malloc(10*sizeof(float));

	float *intersectionsy_h=(float*)malloc(10*sizeof(float));

	float *intersectionsz_h=(float*)malloc(10*sizeof(float));

	if (intersectionsx_h == NULL) 

		printf ( "!!!! host memory allocation error (interx)\n");

	if (intersectionsy_h == NULL) 

		printf ( "!!!! host memory allocation error (intery)\n");

	if (intersectionsz_h == NULL) 

		printf ( "!!!! host memory allocation error (interz)\n");

	for(int i=0;i<5;i++){

		intersectionsx_h[i]=0;

		intersectionsy_h[i]=0;

		intersectionsz_h[i]=0;

	}

		for(int i=5;i<10;i++){

		intersectionsx_h[i]=5;

		intersectionsy_h[i]=5;

		intersectionsz_h[i]=5;

	}

	printf("\n Matrix x\n");

	printMat(intersectionsx_h,1,10);

	printf("\n Matrix y \n");

	printMat(intersectionsy_h,1,10);

	printf("\n Matrix z \n");

	printMat(intersectionsz_h,1,10);

	//CALL SORT AND STREAM COMPACTION THRUST FUNCTIONS

	compact(intersectionsx_h,intersectionsy_h,intersectionsz_h);

	printf("\n After thrust function:\n");

	printf("\n Matrix x \n");

	printMat(intersectionsx_h,1,10);

	printf("\n Matrix y \n");

	printMat(intersectionsy_h,1,10);

	printf("\n Matrix z \n");

	printMat(intersectionsz_h,1,10);

return 0;

}

I think you are right, it should be working, I did some tests with CPU functions and yes you copy an array by assigning the array’s first element address

Thank you for helping me External Image

apostglen46 · May 29, 2011, 8:00pm

got it to work,i told you it was simple:

#include <stdlib.h>

#include <stdio.h>

#include <thrust/remove.h>

#include <thrust/device_vector.h>

#define blocksize 8

#define iiterations 0

#define jiterations 0

template <typename T>

struct deletekey

{

  __host__ __device__ bool operator()(const thrust::tuple<T,T,T>& t)

  {

    return thrust::get<0>(t) < 0 || thrust::get<0>(t) > 1408 ||

           thrust::get<1>(t) < 0 || thrust::get<1>(t) > 2*1792 ||

           thrust::get<2>(t) < 0 || thrust::get<2>(t) > 60 ||

           (thrust::get<0>(t) == 0 && thrust::get<1>(t) == 0 && thrust::get<2>(t) == 0);

  }

};

extern "C" void compact(float*intxc,float*intyc,float*intzc){

thrust::device_vector<float> x(10);

        thrust::device_vector<float> y(10);

        thrust::device_vector<float> z(10);

thrust::copy(intxc,intxc+10,x.begin());

        thrust::copy(intyc,intyc+10,y.begin());

        thrust::copy(intzc,intzc+10,z.begin());

thrust::device_vector<float> x_out(10);

        thrust::device_vector<float> y_out(10);

        thrust::device_vector<float> z_out(10);

typedef thrust::device_vector<float>::iterator      Iterator;

        typedef thrust::tuple<Iterator, Iterator, Iterator> IteratorTuple;

        typedef thrust::zip_iterator<IteratorTuple>         ZipIterator;

ZipIterator begin(thrust::make_tuple(x.begin(), y.begin(), z.begin()));

        ZipIterator end  (thrust::make_tuple(x.end(),   y.end(),   z.end()));

ZipIterator output_begin(thrust::make_tuple(x_out.begin(), y_out.begin(), z_out.begin()));

ZipIterator output_end = thrust::remove_copy_if(begin, end, output_begin, deletekey<float>());

// compute size of output

        size_t N = output_end - output_begin;

// resize output

        x_out.resize(N);

        y_out.resize(N);

        z_out.resize(N);

thrust::copy(x_out.begin(),x_out.end(),intxc);

        thrust::copy(y_out.begin(),y_out.end(),intyc);

       thrust::copy(z_out.begin(),z_out.end(),intzc);

for(size_t i = 0; i < 10; i++)

        std::cout << intxc[i] << " " << intyc[i] << " " << intzc[i] << std::endl;

        std::cout << "Array Size = " << N << std::endl;

}

void printMat(float*P,int uWP,int uHP){

        int i,j;

        for(i=0;i<uHP;i++){

printf("\n");

for(j=0;j<uWP;j++)

                        printf("%f ",P[i*uWP+j]);

        }

}

int main(){

float *intersectionsx_h=(float*)malloc(10*sizeof(float));

        float *intersectionsy_h=(float*)malloc(10*sizeof(float));

        float *intersectionsz_h=(float*)malloc(10*sizeof(float));

if (intersectionsx_h == NULL) 

                printf ( "!!!! host memory allocation error (interx)\n");

        if (intersectionsy_h == NULL) 

                printf ( "!!!! host memory allocation error (intery)\n");

        if (intersectionsz_h == NULL) 

                printf ( "!!!! host memory allocation error (interz)\n");

for(int i=0;i<5;i++){

                intersectionsx_h[i]=0;

                intersectionsy_h[i]=0;

                intersectionsz_h[i]=0;

        }

                for(int i=5;i<10;i++){

                intersectionsx_h[i]=5;

                intersectionsy_h[i]=5;

                intersectionsz_h[i]=5;

        }

printf("\n Matrix x\n");

        printMat(intersectionsx_h,1,10);

        printf("\n Matrix y \n");

        printMat(intersectionsy_h,1,10);

        printf("\n Matrix z \n");

        printMat(intersectionsz_h,1,10);

//CALL SORT AND STREAM COMPACTION THRUST FUNCTIONS

        compact(intersectionsx_h,intersectionsy_h,intersectionsz_h);

printf("\n After thrust function:\n");

        printf("\n Matrix x \n");

        printMat(intersectionsx_h,1,10);

        printf("\n Matrix y \n");

        printMat(intersectionsy_h,1,10);

        printf("\n Matrix z \n");

        printMat(intersectionsz_h,1,10);

return 0;

}

You don’t even need to resize the arrays,just change x_out.end() with x_out.begin()+ArraySize.

Apostolis

X-FrEaK · May 30, 2011, 10:08am

You are absolutely right on both things you did External Image. I am really grateful for the help you gave me!

Topic		Replies	Views
Async thrust operation launches appear serially processed in nsight systems CUDA Programming and Performance cuda , performance , parallel-computing	5	89	August 29, 2024
Using Thrust to sort Unified Memory Buffer? GPU-Accelerated Libraries	8	5061	May 7, 2015
Using device pointer in thrust algorithm CUDA Programming and Performance	2	28706	December 17, 2021
How to put specific elements from one array to another array use CUDA? CUDA Programming and Performance cuda	6	1315	October 30, 2022
How to efficiently sort 5 arrays of integers? CUDA Programming and Performance	7	1162	June 19, 2015
Thrust question CUDA Programming and Performance	3	10676	January 22, 2011
Thrust v1.0 release A high-level C++ template library for CUDA CUDA Programming and Performance	11	16773	May 30, 2009
Visual Studio 2012 + OptiX + Thrust = compile errors CUDA Programming and Performance	8	9754	February 17, 2016
GPU Pro Tip: CUDA 7 Streams Simplify Concurrency Technical Blog	51	2072	February 5, 2020
thrust::exclusive_scan with thrust::zip_iterator? CUDA Programming and Performance	9	1542	November 24, 2014

Using Thrust for stream compaction

Related topics