This sort of approach is definitely viable; the compiler can handle it. Here's a work in progress: Google Code Archive - Long-term storage for Google Code Project Hosting.
Hey Guys,
I have been using ArrayFire for most of my CUDA projects. It is free and very powerful for everything from basic math to complex operations like FFTs, convolutions, image processing, reductions, BLAS, etc.
You could do the above example using ArrayFire as shown below:
#include <stdio.h>
#include <arrayfire.h>
using namespace af;

int main(int argc, char **argv)
{
    try {
        const int N = 1024;
        // Generate sequence on device: 0 -> (N-1)
        array x = array(seq(N));
        // Element-wise multiplication
        array y = x * x;
        // a*x + y
        float a = 2.0f;
        y = a * x + y;
        // Pull the first 10 elements of the result back to a CPU buffer
        float *hx = (y(seq(10))).host<float>();
        for (int i = 0; i < 10; i++)
            printf("%f\n", hx[i]);
        freeHost(hx); // host<T>() allocates; free the copy when done
    } catch (af::exception &e) {
        fprintf(stderr, "%s\n", e.what());
        return 1;
    }
    return 0;
}
Good job! Seems I've been reinventing the wheel.
ArrayFire is not open source, and I don't think it's very fast. For example,

y = a*x + y;

seems to run like axpy_slow in Thrust, not axpy_fast.
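For context, axpy_fast and axpy_slow refer to Thrust's saxpy example: the fast version fuses a*x + y into a single transform with one functor, while the slow version launches several kernels and materializes a temporary. A sketch along those lines:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/functional.h>

struct saxpy_functor
{
    const float a;
    saxpy_functor(float _a) : a(_a) {}
    __host__ __device__ float operator()(const float &x, const float &y) const
    {
        return a * x + y; // fused: one read of x, one read of y, one write
    }
};

// One fused kernel, one pass over memory
void saxpy_fast(float a, thrust::device_vector<float> &x, thrust::device_vector<float> &y)
{
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(a));
}

// Several kernels plus a temporary: much more global-memory traffic
void saxpy_slow(float a, thrust::device_vector<float> &x, thrust::device_vector<float> &y)
{
    thrust::device_vector<float> temp(x.size());
    thrust::fill(temp.begin(), temp.end(), a);                                                      // temp <- a
    thrust::transform(x.begin(), x.end(), temp.begin(), temp.begin(), thrust::multiplies<float>()); // temp <- a * x
    thrust::transform(temp.begin(), temp.end(), y.begin(), y.begin(), thrust::plus<float>());       // y <- temp + y
}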
ArrayFire is quite fast, and you can expect it to behave like axpy_fast: ArrayFire fuses such operations together into one kernel and avoids any extra reads from or writes to global memory.
Er, what about:

y = sin(x) + cos(x) + 3 * tan(y) ?

Does it access memory three times or more? <load x, load y, set y>
If so, it seems ArrayFire uses device function pointers.
Thrust accesses memory three times because it executes that expression as one entire kernel immediately. ArrayFire accesses memory at most three times because it delays execution until the result of the expression is needed, which gives it the opportunity to combine that expression with other expressions and possibly eliminate (defer) the last "set y" memory access. Hence, ArrayFire accesses memory at most three times for that expression.
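A minimal sketch of what that deferred execution looks like from user code, written against the current ArrayFire API (names and sizes here are illustrative; eval() is what forces the fused kernel to launch):

#include <arrayfire.h>
using namespace af;

int main()
{
    array x = randu(1024); // random device arrays, just for illustration
    array y = randu(1024);

    // Building the expression only records a JIT node; no kernel runs yet.
    array z = sin(x) + cos(x) + 3 * tan(y);

    // eval() forces the fused kernel: one read of x, one read of y, and
    // one write of z, no matter how many element-wise ops the tree holds.
    z.eval();

    af_print(z(seq(5))); // printing the result would also force evaluation
    return 0;
}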
Most definitely an interesting project.
I would like to see it incorporated into Thrust, or at least made compatible with it. For example, I don't see why you don't use Thrust's host and device vectors, which can be cast to raw pointers; see the sketch below.
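A minimal interop sketch (my_kernel here is a made-up kernel, but thrust::raw_pointer_cast and thrust::device_ptr are Thrust's standard mechanisms for crossing between the two worlds):

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>

// Hypothetical CUDA kernel, just to show the interop path
__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int N = 1024;
    thrust::device_vector<float> d_vec(N, 1.0f);

    // device_vector -> raw device pointer, usable by a plain CUDA kernel
    float *raw = thrust::raw_pointer_cast(d_vec.data());
    my_kernel<<<(N + 255) / 256, 256>>>(raw, N);

    // And back: wrap a raw pointer so Thrust algorithms can consume it
    thrust::device_ptr<float> wrapped(raw);
    return 0;
}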
Good job anyways