Starter question: GPU exec time vs. CPU exec time

Hi everyone,

I’m starting with CUDA and I wrote a simple example program that computes the max value of a vector. After implementing the CUDA version using Thrust, I noticed that the CPU version is faster in total time than the GPU version across different data sizes, so I felt quite disappointed with CUDA.

The CUDA implementation is:

#include <iostream>
#include <stdlib.h>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

int main(int argc,char **argv){
	//argv[1] is read below, so make sure it was supplied
	if(argc<2){
		std::cerr<<"usage: "<<argv[0]<<" n_data"<<std::endl;
		return 1;
	}
	int n_data=atoi(argv[1]);
	if(n_data<=0){ //max_element on an empty range has nothing to dereference
		std::cerr<<"n_data must be positive"<<std::endl;
		return 1;
	}

	std::vector<float> p(n_data);
	for(int i=0;i<n_data;i++){ //fill vector with dummy values
		p[i]=i;
	}

	//copy the host data to the GPU and find the max there
	thrust::device_vector<float> p_device_max(n_data);
	thrust::copy(p.begin(),p.end(),p_device_max.begin());
	std::cout<<*(thrust::max_element(p_device_max.begin(),p_device_max.end()))<<std::endl;

	return 0;
}

Using the Intel profiler to measure the total execution time, I noticed that the CPU version takes around 2 ms with n_data=5000, while the GPU version takes around 71 ms.

Is this behaviour normal when using CUDA for single operations or operations with low arithmetic intensity, considering that Thrust provides several optimizations?

Thanks in advance.

I tried running your code, but I am getting a runtime crash. I'm not sure why and didn’t spend the time to track down the reason (I am using CUDA 4.0).

This operation is very simple in ArrayFire with the seq and max functions. On my laptop, I am getting 2.2 ms for the code below. This is in line with what I would expect for a 5000-element array: 5000 elements is too little data to show a speedup, but it is enough for the GPU to be more or less on par with what a CPU would do. At 50k or 500k elements, you would see great speedups, with ArrayFire and the GPU handily beating the CPU.

#include <arrayfire.h>
#include <iostream>
#include <stdlib.h>

using namespace af;

int main(int argc,char **argv){
	array p = seq(5000);
	eval(p);
	sync();

	timer::tic();
	max(p);
	sync();
	std::cout << "Elapsed Time (in seconds) " << timer::toc() << std::endl;

	print(max(p));
	return 0;
}
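
To set a CPU number beside the GPU timing at 5000, 50k, or 500k elements, a baseline along the following lines could be timed in the same way. This is only a sketch: the CPU version from the question was not posted, so the helpers make_ramp, cpu_max, and time_cpu_max are hypothetical names, and std::max_element with std::chrono is assumed as a stand-in for whatever the original CPU code does.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: build the ramp 0, 1, ..., n-1 used as dummy data in the thread.
std::vector<float> make_ramp(std::size_t n) {
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = static_cast<float>(i);
    }
    return v;
}

// Hypothetical helper: largest element of a non-empty vector, computed on the CPU.
float cpu_max(const std::vector<float>& v) {
    assert(!v.empty());
    return *std::max_element(v.begin(), v.end());
}

// Hypothetical helper: average wall-clock seconds per cpu_max call over `repeats` runs.
// Only the reduction is timed; filling the vector happens before the clock starts.
double time_cpu_max(const std::vector<float>& v, int repeats) {
    volatile float sink = 0.0f;  // volatile store keeps the compiler from discarding the work
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        sink = cpu_max(v);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / repeats;
}
```

Calling time_cpu_max(make_ramp(n), 100) for each n of interest gives seconds per call, which is the same unit the ArrayFire timer prints above. Averaging over several runs is a deliberate choice here, since a single 5000-element pass is short enough for clock resolution to dominate.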

Good luck!