CUDA is really slow - even when doing nothing

After trying CUDA for some kernel computation for the first time, I found it really slow, so I did a little test, as below:

float* m = new float[1000];
float* n = new float[1000];
for(int i = 0; i < 200; i++)
{
    float* md;
    float* nd;
    cudaMalloc((void**)&md, 1000 * sizeof(float));
    cudaMalloc((void**)&nd, 1000 * sizeof(float));
    cudaMemcpy(md, m, 1000 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(nd, n, 1000 * sizeof(float), cudaMemcpyHostToDevice);
    cudaFree(md);
    cudaFree(nd);
}

This code does nothing meaningful - no kernel computation at all; it just allocates device memory, copies to it, and frees it. I compile it with nvcc and run it, and it takes 7 seconds. If I do a similar thing in pure host code, with malloc and free, it takes almost no time. My video card is a Quadro FX 4000. Not a great one, but it cannot be this slow.
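To be concrete, the "similar thing in pure host code" I compared against is roughly this (the same loop with plain malloc/memcpy/free; needs <cstdlib> and <cstring>):

for(int i = 0; i < 200; i++)
{
    float* md = (float*)malloc(1000 * sizeof(float));
    float* nd = (float*)malloc(1000 * sizeof(float));
    memcpy(md, m, 1000 * sizeof(float));
    memcpy(nd, n, 1000 * sizeof(float));
    free(md);
    free(nd);
}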

The same thing happens with any of my C code: as soon as I rename it to .cu and compile it with nvcc, the resulting executable runs significantly slower than the one created by Visual Studio 2008. I would expect the same speed, because both are executed on the CPU only.

Any ideas?

Lol @ topic title :)

Would this be on Windows Vista or 7?

Are you sure you are running on your graphics card and not in emulation mode?

Have you installed the SDK and compiled the examples? If so, run deviceQuery and see whether it reports your card or the emulator device.
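If you don't want to build the whole SDK, a minimal check along these lines should also tell you which device the runtime sees (a quick sketch; under emulation you would expect it to report an emulation device rather than the Quadro):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) found\n", count);
    for (int i = 0; i < count; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }
    return 0;
}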

Also, there is always some overhead in initialising the CUDA h/w and s/w, especially the first time after a reboot. Not 7 seconds, but a noticeable fraction of a second on the machines I have used. And copying data is an overhead. So for a very small amount of work, yes, it is slower on the GPU than on the CPU.

Remember a GPU is only of benefit for specific, highly parallelizable problems; it is not just a faster CPU. You need to run a realistic kernel, e.g. the matrix multiplication example, to see the advantages.
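Even something as small as the following vector-add kernel (a minimal sketch, much simpler than the SDK matrix-multiply example) only beats the CPU once n is large enough to amortise the transfer and launch overhead:

__global__ void add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// launched as, e.g.: add<<<(n + 255) / 256, 256>>>(da, db, dc, n);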

I am sure it is running on the GPU, not the CPU. It is on Windows XP 32-bit. But I figured out what the problem is:
cudaMalloc is a really expensive call. My code looks like it does nothing, just malloc and free 200 times, but that is indeed really slow: hundreds or thousands of times slower than malloc/free on host RAM. If I allocate once before the loop and then reuse the same device memory inside the loop, it is much faster.
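In other words, restructuring it roughly like this (a sketch of the fix) brings the time right down:

float* md;
float* nd;
// allocate device memory once, outside the loop
cudaMalloc((void**)&md, 1000 * sizeof(float));
cudaMalloc((void**)&nd, 1000 * sizeof(float));
for(int i = 0; i < 200; i++)
{
    // reuse the same device buffers; only the copies stay in the loop
    cudaMemcpy(md, m, 1000 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(nd, n, 1000 * sizeof(float), cudaMemcpyHostToDevice);
}
cudaFree(md);
cudaFree(nd);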

I find it useful to put timers around operations such as cudaMalloc, cudaMemcpy, and kernel calls, to find out what is really taking all the time. Here is a little class I wrote to help out. (This is for Linux; it would need modification for Windows.)

[font="Courier New"]
#include <sys/times.h>   // times()
#include <unistd.h>      // sysconf()
#include <iomanip>
#include <ostream>

// Simple wall-clock timer; Linux accepts a null tms pointer in times().
class Timer
{
public:
    Timer() : clocks_per_sec(sysconf(_SC_CLK_TCK)) { start(); }
    void start()  { m_start = times(0); }
    void stop()   { m_stop  = times(0); }
    double read() const { return double(m_stop - m_start) / clocks_per_sec; }
private:
    long clocks_per_sec;
    clock_t m_start, m_stop;
};

std::ostream& operator<<(std::ostream& os, const Timer& timer)
{
    os.setf(std::ios_base::fixed, std::ios_base::floatfield);
    os << std::setprecision(2) << timer.read();
    os.unsetf(std::ios_base::floatfield);  // restore default float formatting
    return os;
}
[/font]

and use it like this:

[font="Courier New"]
Timer t1;
// whatever code
t1.stop();
std::cout << "*** whatever code took " << t1 << " seconds" << std::endl;
[/font]
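One caveat when timing kernel calls this way: a kernel launch returns to the host immediately, so if you stop the timer right after the launch you only measure the launch overhead, not the kernel itself. You need to synchronize first, roughly like this (myKernel, grid and block are placeholders for your own launch):

[font="Courier New"]
Timer t2;
myKernel<<<grid, block>>>(/* args */);
cudaThreadSynchronize();  // wait for the kernel to finish before stopping the timer
t2.stop();
std::cout << "*** kernel took " << t2 << " seconds" << std::endl;
[/font]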
