CUDA is really slow - even when doing nothing

After trying CUDA for some kernel computation for the first time, I found it really slow, so I did a little test, as below:

float* m = new float[1000];
float* n = new float[1000];
for(int i = 0; i < 200; i++)
{
    float* md;
    float* nd;
    cudaMalloc((void**)&md, 1000 * sizeof(float));
    cudaMalloc((void**)&nd, 1000 * sizeof(float));
    cudaMemcpy(md, m, 1000 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(nd, n, 1000 * sizeof(float), cudaMemcpyHostToDevice);
    cudaFree(md);
    cudaFree(nd);
}

This code does nothing meaningful - no kernel computation at all; it just allocates device memory, copies to it, and frees it. I compile it with nvcc and run it, and it takes 7 seconds. If I do a similar thing in pure host code, with malloc and free, it takes almost no time. My video card is a Quadro FX 4000. Not a great one, but it cannot be this slow.
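To be concrete, the "similar thing in pure host code" I compared against is roughly this (the same loop with plain malloc/memcpy/free; needs <cstdlib> and <cstring>):

for(int i = 0; i < 200; i++)
{
    float* md = (float*)malloc(1000 * sizeof(float));
    float* nd = (float*)malloc(1000 * sizeof(float));
    memcpy(md, m, 1000 * sizeof(float));
    memcpy(nd, n, 1000 * sizeof(float));
    free(md);
    free(nd);
}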

The same thing happens with any of my C code: as soon as I rename it to .cu and compile it with nvcc, the resulting executable runs significantly slower than the one created by Visual Studio 2008. I would expect the same speed, because both are executed on the CPU only.

Any ideas?

Lol @ topic title :)

Would this be on Windows Vista or 7?

Are you sure you are running on your graphics card and not in emulation mode?

Have you installed the SDK and compiled the examples? If so, run deviceQuery and see whether it reports your card or the emulator device.
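If you don't want to build the whole SDK, a minimal check along these lines should also tell you which device the runtime sees (a quick sketch; under emulation you would expect it to report an emulation device rather than the Quadro):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) found\n", count);
    for (int i = 0; i < count; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }
    return 0;
}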

Also, there is always some overhead in initialising the CUDA h/w and s/w, especially the first time after a reboot. Not 7 seconds, but a noticeable fraction of a second on the machines I have used. And copying data is an overhead. So for a very small amount of work, yes, it is slower on the GPU than on the CPU.

Remember a GPU is only of benefit for specific, highly parallelizable problems; it is not just a faster CPU. You need to run a realistic kernel, e.g. the matrix multiplication example, to see the advantages.
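Even something as small as the following vector-add kernel (a minimal sketch, much simpler than the SDK matrix-multiply example) only beats the CPU once n is large enough to amortise the transfer and launch overhead:

__global__ void add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// launched as, e.g.: add<<<(n + 255) / 256, 256>>>(da, db, dc, n);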

I am sure it is running on the GPU, not the CPU. It is on Windows XP 32-bit. But I figured out what the problem is:
cudaMalloc is a really expensive call. My code looks like it does nothing, just malloc and free 200 times, but that is indeed really slow: hundreds or thousands of times slower than malloc/free on host RAM. If I allocate once before the loop and then reuse the same device memory inside the loop, it is much faster.
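In other words, restructuring it roughly like this (a sketch of the fix) brings the time right down:

float* md;
float* nd;
// allocate device memory once, outside the loop
cudaMalloc((void**)&md, 1000 * sizeof(float));
cudaMalloc((void**)&nd, 1000 * sizeof(float));
for(int i = 0; i < 200; i++)
{
    // reuse the same device buffers; only the copies stay in the loop
    cudaMemcpy(md, m, 1000 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(nd, n, 1000 * sizeof(float), cudaMemcpyHostToDevice);
}
cudaFree(md);
cudaFree(nd);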

I find it useful to put timers around operations such as cudaMalloc, cudaMemcpy, and kernel calls, to find out what is really taking all the time. Here is a little class I wrote to help out. (This is for Linux; it would need modification for Windows.)

[font="Courier New"]
#include <sys/times.h>   // times()
#include <unistd.h>      // sysconf()
#include <iomanip>
#include <ostream>

// Simple wall-clock timer; Linux accepts a null tms pointer in times().
class Timer
{
public:
    Timer() : clocks_per_sec(sysconf(_SC_CLK_TCK)) { start(); }
    void start()  { m_start = times(0); }
    void stop()   { m_stop  = times(0); }
    double read() const { return double(m_stop - m_start) / clocks_per_sec; }
private:
    long clocks_per_sec;
    clock_t m_start, m_stop;
};

std::ostream& operator<<(std::ostream& os, const Timer& timer)
{
    os.setf(std::ios_base::fixed, std::ios_base::floatfield);
    os << std::setprecision(2) << timer.read();
    os.unsetf(std::ios_base::floatfield);  // restore default float formatting
    return os;
}
[/font]

and use it like this:

[font="Courier New"]
Timer t1;
// whatever code
t1.stop();
std::cout << "*** whatever code took " << t1 << " seconds" << std::endl;
[/font]
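One caveat when timing kernel calls this way: a kernel launch returns to the host immediately, so if you stop the timer right after the launch you only measure the launch overhead, not the kernel itself. You need to synchronize first, roughly like this (myKernel, grid and block are placeholders for your own launch):

[font="Courier New"]
Timer t2;
myKernel<<<grid, block>>>(/* args */);
cudaThreadSynchronize();  // wait for the kernel to finish before stopping the timer
t2.stop();
std::cout << "*** kernel took " << t2 << " seconds" << std::endl;
[/font]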
