My Thrust code is 10 times slower than the CPU, what did I do wrong?

I practically copied the sum and sum-of-squares example from the Thrust documentation, but doing it naively on the CPU finishes WAY faster (at least 10x; the CPU is almost instant, while on the GPU I can count the seconds). The input array “arr” is roughly 5000 x 6, if that means anything; I figured it should be faster with that many elements.

I’m new to GPU programming and my C++ is rusty (not that I was ever very good; I had one class and have barely touched it since). I turned GPU debugging (the -G flag) off in the compile options.

If it’s not an issue with the code or use case, the only other thing worth mentioning is that I’m using a laptop with a GeForce MX330, which I could not find referenced anywhere on the website that lists which sm_xx I should compile for (the default said 50).

I wrote a small main program to show the timing. For me I got “Time taken CPU: 2845100 nanos / Time taken GPU: 639698400 nanos”, and in the actual program it’s even worse.

#include "cuda_runtime.h"
#include <omp.h>
#include <cmath>
#include <random>
#include <algorithm>
#include <iterator>
#include <chrono>

template <typename T>
struct square
{
    __host__ __device__
        T operator()(const T& x) const {
        return x * x;
    }
};

int biggerKernel(std::vector<std::vector<double>>& arr, int n1, int n2, double constXKSquared, double cosineConst, double* tx1arr) {
	square<double>        unary_op;
	thrust::plus<double> binary_op;
	double init = 0;
	int size = n1;
	for (int i = 0; i < n2; i++) {
		thrust::device_vector<double> d_x(arr[i].begin(), arr[i].end());
		// sum of squares for this row
		double ss1 = thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op);
		// plain sum for this row
		double term = thrust::reduce(d_x.begin(), d_x.end());
		// some other stuff here; I commented it out for testing, it's just standard math.
	}
	return 1;
}

int main() {
	std::vector<std::vector<double>> test(6);

	for (int i = 0; i < 6; i++) {
		test[i].resize((1920 * 3), 0.0);
		std::generate(test[i].begin(), test[i].end(), []() {
			return rand() % 100;
			});
	}
	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < 6; i++) {
		double sum = 0;
		double sos = 0;
		for (int j = 0; j < test[i].size(); j++) {
			sos += pow(test[i][j], 2);
			sum += test[i][j];
		}
		printf("Sum = %.4f  sum-of-squares = %.4f\n", sum, sos);
	}
	auto end = std::chrono::steady_clock::now();
	auto elapsed = end - start;
	printf("Time taken CPU: %d\n", std::chrono::duration_cast<std::chrono::nanoseconds> (end - start));

	start = std::chrono::steady_clock::now();
	biggerKernel(test, 1920 * 3, 6, 0.0, 0.0, nullptr);
	end = std::chrono::steady_clock::now();
	elapsed = end - start;
	printf("Time taken GPU: %d\n", std::chrono::duration_cast<std::chrono::nanoseconds> (end - start));
}

You need to understand what you’re profiling. In your current example, you include the data transfer to the GPU in your timing:
thrust::device_vector<double> d_x(arr[i].begin(), arr[i].end());

When I exclude this on my mobile RTX 2070, my GPU is ~7x faster after warmup.
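For reference, this is roughly how I separated the copy from the compute when timing (a minimal sketch of the idea with made-up names, not the exact code I ran):

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <chrono>
#include <cstdio>
#include <vector>

struct square_op {
    __host__ __device__ double operator()(double x) const { return x * x; }
};

// Time only the reductions; the host-to-device copies happen before the clock starts.
void time_compute_only(const std::vector<std::vector<double>>& arr)
{
    std::vector<thrust::device_vector<double>> d_rows;
    d_rows.reserve(arr.size());
    for (size_t i = 0; i < arr.size(); ++i)   // copies are NOT timed
        d_rows.emplace_back(arr[i].begin(), arr[i].end());
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < d_rows.size(); ++i) {
        double sos = thrust::transform_reduce(d_rows[i].begin(), d_rows[i].end(),
                                              square_op(), 0.0, thrust::plus<double>());
        double sum = thrust::reduce(d_rows[i].begin(), d_rows[i].end());
        (void)sos; (void)sum;   // results ignored; we only care about the timing here
    }
    auto end = std::chrono::steady_clock::now();
    printf("GPU compute only: %lld nanos\n",
           (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
}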

I do understand it includes transfer time; however, transfer time is still time in my case (the end product is supposed to be a real-time application). Unless there is a way to do it without transferring, in which case I’ve got everything wrong.

Is this test simply too small, and does the array have to be bigger to utilize the GPU? For reference, the end goal is to take six pictures (1920x1080 RGB) and get the sum/variance of the raster for each row. Do you think it will work out better if I scale up and do all the lines at once?

Also, would it be better to put my data into one large 1D device_vector instead of splitting it into 6? I assume the compiler might optimize something like that, but this is new territory for me. Something like:

thrust::device_vector<double> d_x(arr.begin(), arr.end()); //arr will be 1d with fuax 2d indexing
for(thrust::device_vector<double>::iterator iter = d_x.begin(); iter != d_x.end(); iter += n1)  {
      double ss1 = thrust::transform_reduce(iter , iter + n1, unary_op, init, binary_op);
      double term = thrust::reduce(iter , iter + n1);
}

Like I said, I’m new to this, so sorry if I’m asking dumb questions.

Unless the data is generated on the GPU, it must be copied over somehow. As for how to design your workflow, it is very problem-dependent. For example, your data sets are too small to saturate the GPU, and you’re not doing enough work per element (i.e., x * x) to be compute bound, so you are memory bound.

One way you might want to do it (I’m not saying it’s the best) is to pipeline your work with streams. Create four streams: stream one copies over data set #1 and starts working on it. While stream one is doing its work, stream two can be copying over data set #2. Next, you copy over data set #3 in stream three while stream two is doing its work and stream one is moving on to the next stage of the workflow.

You should also look into pinned memory for faster transfers.
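In case it helps, here is a rough sketch of that copy/compute pipeline using pinned host memory and cudaMemcpyAsync. The process kernel is just a placeholder for the real work, and all of the names are made up for the illustration; this is the pattern, not a drop-in solution:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void process(const double* in, double* out, int n)
{
    // placeholder kernel: square each element (stand-in for the real work)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main()
{
    const int kStreams = 4, n = 1920 * 3;
    cudaStream_t streams[kStreams];
    double *h_in[kStreams], *d_in[kStreams], *d_out[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost((void**)&h_in[s], n * sizeof(double));   // pinned host memory
        cudaMalloc((void**)&d_in[s],  n * sizeof(double));
        cudaMalloc((void**)&d_out[s], n * sizeof(double));
        for (int i = 0; i < n; ++i) h_in[s][i] = i % 100;
    }

    // Each stream copies its own data set and then works on it; copies and
    // kernels issued to different streams can overlap with each other.
    for (int s = 0; s < kStreams; ++s) {
        cudaMemcpyAsync(d_in[s], h_in[s], n * sizeof(double),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(h_in[s]); cudaFree(d_in[s]); cudaFree(d_out[s]);
    }
    printf("done\n");
    return 0;
}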

Thank you for the responses. It seems like what’s killing me is transferring the sum back for every iteration of the loop. I don’t really see how to get around that with Thrust. The best case would be to transfer all the data once, crunch the numbers, and output the results in one array. I don’t know if there’s really a way to do that in Thrust; I think a custom kernel may be the only way to get around it.

5000 is on the small side. 5000x6 is of course better, but you are really only issuing work in chunks of 5000 (~5700).

That’s probably just about the slowest GPU platform there is.

One of the places you can get that info (the compute capability) is the deviceQuery sample code.

Yes, this is a “small” test in at least 2 ways.

  • data set sizes are small (but you’re on a small GPU at the moment)
  • amount of work per data item is small

We can easily demonstrate that with somewhat more complex arithmetic you can overcome the overhead “cost” of data movement.

It will be better, not sure how much.

Yes, that would be a better approach, and thrust has a segmented reduction (reduce_by_key) that will allow you to do all 6 reductions at once if you present the data this way. It also addresses, to some degree, the “movement” of results after each step; you have control over this with reduce_by_key.

It can be done with thrust.
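To make the reduce_by_key suggestion concrete, here is a minimal sketch of a segmented reduction over the flattened 6 x (1920*3) data, with the segment keys generated on the fly (my own illustration, not code from this thread; the row_of functor and the fill value are just for the example):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cstdio>

struct row_of {
    int row_len;
    __host__ __device__ int operator()(int i) const { return i / row_len; }
};

struct square_val {
    __host__ __device__ double operator()(double x) const { return x * x; }
};

int main()
{
    const int rows = 6, row_len = 1920 * 3;
    thrust::device_vector<double> d_x(rows * row_len, 1.0);   // flat "faux 2D" data

    // key sequence 0,...,0,1,...,1,...,5 computed on the fly (no extra storage)
    auto keys_begin = thrust::make_transform_iterator(
        thrust::counting_iterator<int>(0), row_of{row_len});

    thrust::device_vector<int>    out_keys(rows);
    thrust::device_vector<double> row_sums(rows), row_sos(rows);

    // per-row sums, all 6 segments in one call
    thrust::reduce_by_key(keys_begin, keys_begin + rows * row_len,
                          d_x.begin(), out_keys.begin(), row_sums.begin());

    // per-row sums of squares: feed squared values through a transform_iterator
    auto sq_begin = thrust::make_transform_iterator(d_x.begin(), square_val{});
    thrust::reduce_by_key(keys_begin, keys_begin + rows * row_len,
                          sq_begin, out_keys.begin(), row_sos.begin());

    for (int r = 0; r < rows; ++r)
        printf("row %d: sum = %f  sum-of-squares = %f\n",
               r, (double)row_sums[r], (double)row_sos[r]);
    return 0;
}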

6x1920x3 is only about 35K doubles, i.e. roughly 0.28 Mbytes of data. That is tiny, and at ~3 GB/s the data movement cost should only be on the order of 0.28 MB / 3000 MB/s ≈ 90 microseconds. It will take longer than that, but nothing like the durations you are indicating.

I think the biggest problem here is just one of benchmarking practice.

When I take your code and run it as is, I get output like this:

$ ./t24
Sum = 287413.0000  sum-of-squares = 19094901.0000
Sum = 285290.0000  sum-of-squares = 19014100.0000
Sum = 285626.0000  sum-of-squares = 18914624.0000
Sum = 286840.0000  sum-of-squares = 19079068.0000
Sum = 284501.0000  sum-of-squares = 18918199.0000
Sum = 286434.0000  sum-of-squares = 19015530.0000
Time taken CPU: 3144199
Time taken GPU: 289180592

which looks bad in comparison. However, if I modify your code like this:

  biggerKernel(test, 1920 * 3, 6, 0.0, 0.0, nullptr);  // add warm up
  cudaDeviceSynchronize(); // add
  start = std::chrono::steady_clock::now();
  biggerKernel(test, 1920 * 3, 6, 0.0, 0.0, nullptr);

And recompile and run, I get output like this:

$ ./t24
Sum = 287413.0000  sum-of-squares = 19094901.0000
Sum = 285290.0000  sum-of-squares = 19014100.0000
Sum = 285626.0000  sum-of-squares = 18914624.0000
Sum = 286840.0000  sum-of-squares = 19079068.0000
Sum = 284501.0000  sum-of-squares = 18918199.0000
Sum = 286434.0000  sum-of-squares = 19015530.0000
Time taken CPU: 3143625
Time taken GPU: 1126288

So it makes a big difference. This is CUDA start-up overhead. In a “real-time” or streaming application, you’ll experience it once at the beginning, but thereafter subsequent ops won’t be hindered, as this demonstrates.

reduce_by_key seems to be what I was missing, as well as the warmup. I’ll have to look into it; the name doesn’t make a whole lot of sense to me, but I don’t know a whole lot about reduction algorithms or GPU lingo, so I’ll take your word for it.

And thank you both for the responses.

reduce_by_key along with sort_by_key did help out quite a bit (several times faster). I think a custom kernel, where I can cut out the overhead of launching separate kernels for the sort and the two different reductions, is the only way to get any better.
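As an aside (my own sketch, not something suggested above): if the rows are already stored contiguously, the two reduction calls can be collapsed into a single reduce_by_key pass by reducing (x, x*x) tuples, which avoids one of the separate launches without writing a custom kernel. The functor names here are made up for the illustration:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cstdio>

typedef thrust::tuple<double, double> pair_t;   // (sum, sum of squares)

struct row_of {
    int row_len;
    __host__ __device__ int operator()(int i) const { return i / row_len; }
};

struct sum_and_sq {
    __host__ __device__ pair_t operator()(double x) const {
        return thrust::make_tuple(x, x * x);
    }
};

struct tuple_plus {
    __host__ __device__ pair_t operator()(const pair_t& a, const pair_t& b) const {
        return thrust::make_tuple(thrust::get<0>(a) + thrust::get<0>(b),
                                  thrust::get<1>(a) + thrust::get<1>(b));
    }
};

int main()
{
    const int rows = 6, row_len = 1920 * 3;
    thrust::device_vector<double> d_x(rows * row_len, 2.0);

    auto keys = thrust::make_transform_iterator(
        thrust::counting_iterator<int>(0), row_of{row_len});
    auto vals = thrust::make_transform_iterator(d_x.begin(), sum_and_sq{});

    thrust::device_vector<int>    out_keys(rows);
    thrust::device_vector<double> sums(rows), sos(rows);
    auto out_vals = thrust::make_zip_iterator(
        thrust::make_tuple(sums.begin(), sos.begin()));

    // one segmented reduction producing both quantities per row
    thrust::reduce_by_key(keys, keys + rows * row_len, vals,
                          out_keys.begin(), out_vals,
                          thrust::equal_to<int>(), tuple_plus());

    for (int r = 0; r < rows; ++r)
        printf("row %d: sum = %f  sos = %f\n", r, (double)sums[r], (double)sos[r]);
    return 0;
}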

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.