Time usage of cudaMalloc

Hi!

I played a bit with CUDA, and several time measurements show that cudaMalloc takes about 60 ms, independent of the size (I tried 5,000 bytes, 50,000 bytes and 5,000,000 bytes). This means that a vector addition of two vectors (float a[xxx] + float b[xxx]) is always slower with CUDA. Is that right? Without CUDA, the CPU solves the problem in less than 60 ms.

Is there any alternative that improves the duration of the allocation? Or a special technique to avoid cudaMalloc?

Please help me! It’s very important for me.

Thanks in advance!

Do more things on the GPU, so that you don’t get choked by cudaMalloc (and probably cudaMemcpy as well). Adding two vectors together is trivial, so the extra overhead of the mallocs, frees and copies will eat up any speed gain offered by the GPU. The ultimate goal is to place the data on the GPU at the start of the program and collect the results from it at the end. Not always possible, but it’s where you want to be going.
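To illustrate the point, here is a minimal sketch of that pattern: the cudaMalloc and cudaMemcpy costs are paid once, and the same device buffers can then be reused for as many kernel launches as the program needs. The kernel name vecAdd and the size N are made up for the example, not taken from the original post.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Trivial element-wise addition kernel.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Pay the cudaMalloc cost once, at the start of the program.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // One upload at the start...
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // ...then reuse the same device buffers for every launch.
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);

    // One download at the end.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

If the program launches the kernel in a loop, only the launch itself sits inside the loop; allocation and transfer stay outside it.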

Hi!

Thank you for this answer! But this leads me to another question: is there any way to load file data DIRECTLY into a __global__ or __device__ function? I don’t want to use cudaMemcpy or cudaMalloc more than required. I can’t find any advice in the documentation.

Thanks in advance!

One of the recent nVidia seminars talks about “zero copy”. It supposedly allows the GPU to read from the CPU memory - provided it is page-locked. If you try it, please let me know.
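For reference, a hedged sketch of what zero-copy looks like with the runtime API, assuming a device that supports mapped host memory (you should check the canMapHostMemory device property first). The kernel name scale and the size N are illustrative.

```cuda
#include <cuda_runtime.h>

// The GPU reads and writes the page-locked host buffer directly,
// so no cudaMalloc or cudaMemcpy is needed for this data.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int N = 1024;

    // Must be set before any CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked, mapped host allocation.
    float *h_data;
    cudaHostAlloc(&h_data, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    // Get the device-side alias of the host pointer.
    float *d_ptr;
    cudaHostGetDevicePointer(&d_ptr, h_data, 0);

    scale<<<(N + 255) / 256, 256>>>(d_ptr, N);
    cudaDeviceSynchronize();  // kernel writes land in host memory

    cudaFreeHost(h_data);
    return 0;
}
```

Note that every access from the kernel goes over the PCIe bus, so this only pays off when the data is touched once or the transfer would otherwise dominate anyway.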

No (although the above mention of zero-copy could be a way around this). However, I think you need to describe what you’re really trying to do, so people can help you more easily. If shifting data from the disc to the GPU is your bottleneck, then you could probably dump your GPU and perform all computations using a 486 with minimal performance impact. I suspect that this isn’t the case, so… what problem are you really trying to solve?