This sort of approach is definitely viable; the compiler can handle it. Here's a work in progress: Google Code Archive - Long-term storage for Google Code Project Hosting.
Hey Guys,
I have been using ArrayFire for most of my CUDA projects. It is free and very powerful for everything from basic math to complex operations like FFTs, convolutions, image processing, reductions, BLAS, etc.
You could do the above example using ArrayFire as shown below:
#include <stdio.h>
#include <arrayfire.h>
using namespace af;

int main(int argc, char **argv)
{
    try {
        const int N = 1024;
        // Generate sequence on device: 0 -> (N-1)
        array x = array(seq(N));
        // Element-wise multiplication
        array y = x * x;
        // a*x + y
        float a = 2.0f;
        y = a * x + y;
        // Pull the first 10 elements of the result back to a CPU buffer
        float *hx = (y(seq(10))).host<float>();
        for (int i = 0; i < 10; i++)
            printf("%f\n", hx[i]);
        freeHost(hx); // host<T>() allocates; free the copy when done
    } catch (af::exception &e) {
        fprintf(stderr, "%s\n", e.what());
        return 1;
    }
    return 0;
}
Good job! Seems I've been reinventing the wheel.
ArrayFire is not open source, and I don't think it's very fast. For example,

y = a*x + y;

seems to run like axpy_slow in Thrust, not axpy_fast.
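For context, axpy_fast and axpy_slow refer to Thrust's saxpy example: the fast version fuses a*x + y into a single transform with one functor, while the slow version launches several kernels and materializes a temporary. A sketch along those lines:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/functional.h>

struct saxpy_functor
{
    const float a;
    saxpy_functor(float _a) : a(_a) {}
    __host__ __device__ float operator()(const float &x, const float &y) const
    {
        return a * x + y; // fused: one read of x, one read of y, one write
    }
};

// One fused kernel, one pass over memory
void saxpy_fast(float a, thrust::device_vector<float> &x, thrust::device_vector<float> &y)
{
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(a));
}

// Several kernels plus a temporary: much more global-memory traffic
void saxpy_slow(float a, thrust::device_vector<float> &x, thrust::device_vector<float> &y)
{
    thrust::device_vector<float> temp(x.size());
    thrust::fill(temp.begin(), temp.end(), a);                                                      // temp <- a
    thrust::transform(x.begin(), x.end(), temp.begin(), temp.begin(), thrust::multiplies<float>()); // temp <- a * x
    thrust::transform(temp.begin(), temp.end(), y.begin(), y.begin(), thrust::plus<float>());       // y <- temp + y
}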
ArrayFire is quite fast, and you can expect it to behave like axpy_fast: ArrayFire fuses such operations together into one kernel and avoids any extra reads from or writes to global memory.
Er, what about:

y = sin(x) + cos(x) + 3 * tan(y) ?

Does it access memory three times or more? <load x, load y, set y>
If so, it seems ArrayFire uses device function pointers.
Thrust accesses memory three times because it executes that expression as one entire kernel immediately. ArrayFire accesses memory at most three times because it delays execution until the result of the expression is needed, which gives it the opportunity to combine that expression with other expressions and possibly eliminate (defer) the last "set y" memory access. Hence, ArrayFire accesses memory at most three times for that expression.
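A minimal sketch of what that deferred execution looks like from user code, written against the current ArrayFire API (names and sizes here are illustrative; eval() is what forces the fused kernel to launch):

#include <arrayfire.h>
using namespace af;

int main()
{
    array x = randu(1024); // random device arrays, just for illustration
    array y = randu(1024);

    // Building the expression only records a JIT node; no kernel runs yet.
    array z = sin(x) + cos(x) + 3 * tan(y);

    // eval() forces the fused kernel: one read of x, one read of y, and
    // one write of z, no matter how many element-wise ops the tree holds.
    z.eval();

    af_print(z(seq(5))); // printing the result would also force evaluation
    return 0;
}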
Most definitely an interesting project.
I would like to see it incorporated into Thrust, or at least made compatible with it. For example, I don't see why you don't use Thrust's host and device vectors, which can be cast to raw pointers; see the sketch below.
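A minimal interop sketch (my_kernel here is a made-up kernel, but thrust::raw_pointer_cast and thrust::device_ptr are Thrust's standard mechanisms for crossing between the two worlds):

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>

// Hypothetical CUDA kernel, just to show the interop path
__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int N = 1024;
    thrust::device_vector<float> d_vec(N, 1.0f);

    // device_vector -> raw device pointer, usable by a plain CUDA kernel
    float *raw = thrust::raw_pointer_cast(d_vec.data());
    my_kernel<<<(N + 255) / 256, 256>>>(raw, N);

    // And back: wrap a raw pointer so Thrust algorithms can consume it
    thrust::device_ptr<float> wrapped(raw);
    return 0;
}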
Good job anyways