Using Thrust for stream compaction

Hi all.

Very briefly, I have 3 vectors from my CUDA C program and then I call an extern C function which does stream compaction using thrust. My problem starts when I finish the stream compaction algorithm and have to copy the results back to normal C arrays. So I need some help to find a way to do this. Here is the code

extern "C" void sort(float*intx,float*inty,float*intz){

thrust::device_vector<float> x(112*88*1408);

  thrust::device_vector<float> y(112*88*1408);

  thrust::device_vector<float> z(112*88*1408);

	thrust::copy(intx,intx+112*88*1408,x.begin());

	thrust::copy(inty,inty+112*88*1408,y.begin());

	thrust::copy(intz,intz+112*88*1408,z.begin());

	thrust::device_vector<float> x_out(112*88*1408);

	thrust::device_vector<float> y_out(112*88*1408);

	thrust::device_vector<float> z_out(112*88*1408);

	typedef thrust::device_vector<float>::iterator      Iterator;

	typedef thrust::tuple<Iterator, Iterator, Iterator> IteratorTuple;

	typedef thrust::zip_iterator<IteratorTuple>         ZipIterator;

	ZipIterator begin(thrust::make_tuple(x.begin(), y.begin(), z.begin()));

	ZipIterator end  (thrust::make_tuple(x.end(),   y.end(),   z.end()));

	ZipIterator output_begin(thrust::make_tuple(x_out.begin(), y_out.begin(), z_out.begin()));

	ZipIterator output_end = thrust::remove_copy_if(begin, end, output_begin, less_than_zero<float>());

	// compute size of output

	size_t N = output_end - output_begin;

	// resize output

	x_out.resize(N);

	y_out.resize(N);

	z_out.resize(N);

	//Must have a function here to alloc new C output arrays

for(size_t i = 0; i < x_out.size(); i++)

    std::cout << x_out[i] << " " << y_out[i] << " " << z_out[i] << std::endl;

}

Thanks in advance External Image

Are the inputs to this function device pointers? And what do you mean by “normal C arrays”?

No the inputs are host arrays, which I obtain from a CUDA kernel(can I just pass the device arrays?)

What I mean by normal C arrays, is that I want to work with normal C arrays and not thrust::host vectors, I don’t know if I am explaining this right…

Use either &x[0] or thrust::raw_pointer_cast(x.data())

I don’t find that thrust function in the API details =/. Could you give me a simple example?

EDIT: I found it in the QuickstarGuide

// extract raw pointer from device_ptr

int * raw_ptr = thrust::raw_pointer_cast(dev_ptr)

So I just have to do float*intx_out=thrust::raw_pointer_cast(x_out.data()) ?

I tested the code with manually initialized arrays and i get this error:

“./thrust/detail/internal_functional.h(48): error: calling a host function from a device/global function is not allowed”

which refers to this part of the code

template <typename T>

struct less_than_zero

{

  bool operator()(const thrust::tuple<T,T,T>& t)

  {

    return thrust::get<0>(t) < 0 || 

           thrust::get<1>(t) < 0 || 

           thrust::get<2>(t) < 0;

  }

};

I saw some thrust examples that did something similar (they must have used host vectors), how can I get around this issue?

EDIT: Just got it, I had to add “host device” to the boolean operator

If you want the pointer to the device memory, yes. If you want to use the data in “plain C” then copy it to a host_vector first, and then take the raw_pointer. I made a little wrapper function for this, I’ll reply it later (it’s not at my home computer).

Exactly.

Thank you davkec and yes I want the the array in plain C. Just one thing, I can use this stream compaction algorithm with device vector input instead of host vector? If so, i just have to pass the pointer to the device arrays?

also i think that you can do the following if you dont use the functionalities of a host vecrors.
1.Create your plain C arrays.
2.Copy them to a device vector using copy with a raw pointer cast.
3.Perform stream compaction.
4.Copy back the data using the raw pointers of the host vector from step 2.

I hope i understood your problem correctly.

Apostolis

So you are saying that instead of

thrust::device_vector<float> x(112*88*1408);

	thrust::copy(intx,intx+112*88*1408,x.begin());

I should have

thrust::device_vector<float> x(112*88*1408);

	thrust::copy(thrust::raw_pointer_cast(intx),thrust::raw_pointer_cast(intx+112*88*1408),x.begin());

?

nope,i meant that in the final copy (from device to host) you can simply use &x[0] as the begining of the array you want to end the data to .
You can even write your own function to copy data starting from &x_out[0] to a new array xout_new[N]
It doesn’t seems that complex to me. (unless i missunderstood your problem)

Sorry, I assume it is an easy problem, but I am no programmer =/ so basic things, sometimes are hard for me to understand

I ended up doing this:

thrust::host_vector<float> x_outh=xout;

intxc=thrust::raw_pointer_cast(x_outh.data()) //the intxc pointer is the same as the input one

I print array intxc(INSIDE the thrust function) and the values are correct, however if I print the array with printf in the main function afterwards, the values that are printed are the ones before the stream compaction algorithm =/

EDIT: @apostglen46 I did as you said(at least I think it was what you said), and the results are the same

Basically what i did was

thrust::host_vector<float> x_outh=xout;

intxc=&x_outh[0];

But the intxc array is only modified within the function

could you please write a simple test case (very small input size) so we can help you further?
I thought that what you wrote would work,it was more or less what i suggested.

Apostolis

Here is a simple test case: I have 3 arrays(x,y,z) with 10 elements each. The first 5 elements are 0, and the other 5 elements are 5’s. According to my boolean operator, it is supposed to delete the lines which meet the criteria x[i]=y[i]=z[i]=0;

The print function inside the thrust function outputs this:

5 5 5

5 5 5

5 5 5

5 5 5

5 5 5

0 0 0

6.86636e-44 6.86636e-44 6.86636e-44

0 0 0

0 6.12427e-37 6.12417e-37

0 0 0

Array Size = 5

Which is correct. However the printf in the main function outputs this:

X array

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

5.000000 

5.000000 

5.000000 

5.000000 

5.000000

Here is the code I used:

#include <stdlib.h>

#include <stdio.h>

#include <thrust/remove.h>

#include <thrust/device_vector.h>

#define blocksize 8

#define iiterations 0

#define jiterations 0

template <typename T>

struct deletekey

{

  __host__ __device__ bool operator()(const thrust::tuple<T,T,T>& t)

  {

    return thrust::get<0>(t) < 0 || thrust::get<0>(t) > 1408 ||

	   thrust::get<1>(t) < 0 || thrust::get<1>(t) > 2*1792 ||

           thrust::get<2>(t) < 0 || thrust::get<2>(t) > 60 ||

	   (thrust::get<0>(t) == 0 && thrust::get<1>(t) == 0 && thrust::get<2>(t) == 0);

  }

};

extern "C" void compact(float*intxc,float*intyc,float*intzc){

	thrust::device_vector<float> x(10);

	thrust::device_vector<float> y(10);

	thrust::device_vector<float> z(10);

	thrust::copy(intxc,intxc+10,x.begin());

	thrust::copy(intyc,intyc+10,y.begin());

	thrust::copy(intzc,intzc+10,z.begin());

	thrust::device_vector<float> x_out(10);

	thrust::device_vector<float> y_out(10);

	thrust::device_vector<float> z_out(10);

	typedef thrust::device_vector<float>::iterator      Iterator;

	typedef thrust::tuple<Iterator, Iterator, Iterator> IteratorTuple;

	typedef thrust::zip_iterator<IteratorTuple>         ZipIterator;

	ZipIterator begin(thrust::make_tuple(x.begin(), y.begin(), z.begin()));

	ZipIterator end  (thrust::make_tuple(x.end(),   y.end(),   z.end()));

	ZipIterator output_begin(thrust::make_tuple(x_out.begin(), y_out.begin(), z_out.begin()));

	ZipIterator output_end = thrust::remove_copy_if(begin, end, output_begin, deletekey<float>());

	// compute size of output

	size_t N = output_end - output_begin;

	// resize output

	x_out.resize(N);

	y_out.resize(N);

	z_out.resize(N);

	thrust::host_vector<float> x_outh = x_out;

	thrust::host_vector<float> y_outh = y_out;

	thrust::host_vector<float> z_outh = z_out;

	intxc=&x_outh[0]; //thrust::raw_pointer_cast(x_outh.data());

	intyc=thrust::raw_pointer_cast(y_outh.data());

	intzc=thrust::raw_pointer_cast(z_outh.data());

	for(size_t i = 0; i < 10; i++)

	std::cout << intxc[i] << " " << intyc[i] << " " << intzc[i] << std::endl;

	std::cout << "Array Size = " << N << std::endl;

}

void printMat(float*P,int uWP,int uHP){

	int i,j;

	for(i=0;i<uHP;i++){

		printf("\n");

		for(j=0;j<uWP;j++)

			printf("%f ",P[i*uWP+j]);

	}

}

int main(){

	

	float *intersectionsx_h=(float*)malloc(10*sizeof(float));

	float *intersectionsy_h=(float*)malloc(10*sizeof(float));

	float *intersectionsz_h=(float*)malloc(10*sizeof(float));

	if (intersectionsx_h == NULL) 

		printf ( "!!!! host memory allocation error (interx)\n");

	if (intersectionsy_h == NULL) 

		printf ( "!!!! host memory allocation error (intery)\n");

	if (intersectionsz_h == NULL) 

		printf ( "!!!! host memory allocation error (interz)\n");

	for(int i=0;i<5;i++){

		intersectionsx_h[i]=0;

		intersectionsy_h[i]=0;

		intersectionsz_h[i]=0;

	}

		for(int i=5;i<10;i++){

		intersectionsx_h[i]=5;

		intersectionsy_h[i]=5;

		intersectionsz_h[i]=5;

	}

	printf("\n Matrix x\n");

	printMat(intersectionsx_h,1,10);

	printf("\n Matrix y \n");

	printMat(intersectionsy_h,1,10);

	printf("\n Matrix z \n");

	printMat(intersectionsz_h,1,10);

	//CALL SORT AND STREAM COMPACTION THRUST FUNCTIONS

	compact(intersectionsx_h,intersectionsy_h,intersectionsz_h);

	printf("\n After thrust function:\n");

	printf("\n Matrix x \n");

	printMat(intersectionsx_h,1,10);

	printf("\n Matrix y \n");

	printMat(intersectionsy_h,1,10);

	printf("\n Matrix z \n");

	printMat(intersectionsz_h,1,10);

return 0;

}

I think you are right, it should be working, I did some tests with CPU functions and yes you copy an array by assigning the array’s first element address

Thank you for helping me External Image

got it to work,i told you it was simple:

#include <stdlib.h>

#include <stdio.h>

#include <thrust/remove.h>

#include <thrust/device_vector.h>

#define blocksize 8

#define iiterations 0

#define jiterations 0

template <typename T>

struct deletekey

{

  __host__ __device__ bool operator()(const thrust::tuple<T,T,T>& t)

  {

    return thrust::get<0>(t) < 0 || thrust::get<0>(t) > 1408 ||

           thrust::get<1>(t) < 0 || thrust::get<1>(t) > 2*1792 ||

           thrust::get<2>(t) < 0 || thrust::get<2>(t) > 60 ||

           (thrust::get<0>(t) == 0 && thrust::get<1>(t) == 0 && thrust::get<2>(t) == 0);

  }

};

extern "C" void compact(float*intxc,float*intyc,float*intzc){

thrust::device_vector<float> x(10);

        thrust::device_vector<float> y(10);

        thrust::device_vector<float> z(10);

thrust::copy(intxc,intxc+10,x.begin());

        thrust::copy(intyc,intyc+10,y.begin());

        thrust::copy(intzc,intzc+10,z.begin());

thrust::device_vector<float> x_out(10);

        thrust::device_vector<float> y_out(10);

        thrust::device_vector<float> z_out(10);

typedef thrust::device_vector<float>::iterator      Iterator;

        typedef thrust::tuple<Iterator, Iterator, Iterator> IteratorTuple;

        typedef thrust::zip_iterator<IteratorTuple>         ZipIterator;

ZipIterator begin(thrust::make_tuple(x.begin(), y.begin(), z.begin()));

        ZipIterator end  (thrust::make_tuple(x.end(),   y.end(),   z.end()));

ZipIterator output_begin(thrust::make_tuple(x_out.begin(), y_out.begin(), z_out.begin()));

ZipIterator output_end = thrust::remove_copy_if(begin, end, output_begin, deletekey<float>());

// compute size of output

        size_t N = output_end - output_begin;

// resize output

        x_out.resize(N);

        y_out.resize(N);

        z_out.resize(N);

thrust::copy(x_out.begin(),x_out.end(),intxc);

        thrust::copy(y_out.begin(),y_out.end(),intyc);

       thrust::copy(z_out.begin(),z_out.end(),intzc);

for(size_t i = 0; i < 10; i++)

        std::cout << intxc[i] << " " << intyc[i] << " " << intzc[i] << std::endl;

        std::cout << "Array Size = " << N << std::endl;

}

void printMat(float*P,int uWP,int uHP){

        int i,j;

        for(i=0;i<uHP;i++){

printf("\n");

for(j=0;j<uWP;j++)

                        printf("%f ",P[i*uWP+j]);

        }

}

int main(){

float *intersectionsx_h=(float*)malloc(10*sizeof(float));

        float *intersectionsy_h=(float*)malloc(10*sizeof(float));

        float *intersectionsz_h=(float*)malloc(10*sizeof(float));

if (intersectionsx_h == NULL) 

                printf ( "!!!! host memory allocation error (interx)\n");

        if (intersectionsy_h == NULL) 

                printf ( "!!!! host memory allocation error (intery)\n");

        if (intersectionsz_h == NULL) 

                printf ( "!!!! host memory allocation error (interz)\n");

for(int i=0;i<5;i++){

                intersectionsx_h[i]=0;

                intersectionsy_h[i]=0;

                intersectionsz_h[i]=0;

        }

                for(int i=5;i<10;i++){

                intersectionsx_h[i]=5;

                intersectionsy_h[i]=5;

                intersectionsz_h[i]=5;

        }

printf("\n Matrix x\n");

        printMat(intersectionsx_h,1,10);

        printf("\n Matrix y \n");

        printMat(intersectionsy_h,1,10);

        printf("\n Matrix z \n");

        printMat(intersectionsz_h,1,10);

//CALL SORT AND STREAM COMPACTION THRUST FUNCTIONS

        compact(intersectionsx_h,intersectionsy_h,intersectionsz_h);

printf("\n After thrust function:\n");

        printf("\n Matrix x \n");

        printMat(intersectionsx_h,1,10);

        printf("\n Matrix y \n");

        printMat(intersectionsy_h,1,10);

        printf("\n Matrix z \n");

        printMat(intersectionsz_h,1,10);

return 0;

}

You don’t even need to resize the arrays,just change x_out.end() with x_out.begin()+ArraySize.

Apostolis

You are absolutely right on both things you did External Image. I am really grateful for the help you gave me!