I played around a bit with CUDA, and several timing measurements show that cudaMalloc takes about 60 ms regardless of the allocation size (I tried 5,000 bytes, 50,000 bytes, and 5,000,000 bytes). This would mean that adding two vectors (float a[xxx] + float b[xxx]) is always slower with CUDA, because without CUDA the CPU solves the problem in less than 60 ms. Is that right?
Is there any alternative that reduces the time spent in cudaMalloc? Or a special technique to avoid cudaMalloc altogether?
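
For reference, here is a minimal sketch of how such a measurement might look. It is an assumption about the setup, not the exact code behind the numbers above: the helper name time_malloc_ms and the use of std::chrono are illustrative choices. It times each of the three sizes twice in the same process, so that a one-time startup cost, if there is one, would show up separately from the per-call cost.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time one cudaMalloc call in milliseconds.
// Returns -1.0 if the allocation fails.
static double time_malloc_ms(size_t bytes) {
    void* ptr = nullptr;
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&ptr, bytes);
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(ptr);  // release the buffer; the free is not part of the timing
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    // The three sizes mentioned in the question.
    const size_t sizes[] = {5000, 50000, 5000000};
    // Two rounds: compare the very first call with the later ones.
    for (int round = 0; round < 2; ++round) {
        for (size_t bytes : sizes) {
            std::printf("round %d, %zu bytes: %.3f ms\n",
                        round, bytes, time_malloc_ms(bytes));
        }
    }
    return 0;
}
```

Built with nvcc, this prints six timings. If only the first line is close to 60 ms, the cost would belong to the first CUDA call in the process rather than to every cudaMalloc.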
Please help me! It’s very important to me.
Thanks in advance!