Hi all, I have a question. We all know that cudaMalloc is very slow, so I am wondering whether it is possible to access device memory directly without calling malloc. I have a source of data and my goal is to put the data directly into device memory without going through memcpy, malloc, etc. as in classic code…
There is something called UVA (Unified Virtual Addressing). When UVA is active, the host and the device share a single virtual address space. Run the deviceQuery example from the SDK to see if your card supports it. You need CUDA Toolkit 4.0 or later.
Even with UVA, you still need to allocate memory. If you just start writing to random memory addresses on the GPU, your program will crash. I don’t know what you mean by “without pass through memcopy malloc etc like in classic code”. In host code, you still need to allocate memory for the same reason.
If you need higher performance allocation than cudaMalloc, then you can cudaMalloc one large chunk and split it up with your own allocator.
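As a rough illustration of the "one large chunk plus your own allocator" idea, here is a minimal bump (linear) sub-allocator sketch. The class name `BumpPool`, its methods, and the 256-byte default alignment are assumptions made for this example, not part of any CUDA API. It only does offset bookkeeping, so it runs without a GPU; in real code the region would come from a single cudaMalloc call and each returned offset would be added to that base pointer.

```cpp
#include <cstddef>

// Hypothetical bump (linear) sub-allocator over one large region.
// It hands out byte offsets only. In CUDA code the region itself would be
// obtained once with cudaMalloc, and (base + offset) would be the device pointer.
class BumpPool {
public:
    // Sentinel returned when a request does not fit.
    static constexpr std::size_t kFailed = static_cast<std::size_t>(-1);

    // `alignment` must be a power of two; 256 is an assumed, conservative default.
    explicit BumpPool(std::size_t capacity, std::size_t alignment = 256)
        : capacity_(capacity), alignment_(alignment), used_(0) {}

    // Returns the offset of a block of `bytes` bytes, or kFailed if it does not fit.
    std::size_t allocate(std::size_t bytes) {
        // Round the current watermark up to the next aligned offset.
        std::size_t start = (used_ + alignment_ - 1) & ~(alignment_ - 1);
        if (start > capacity_ || bytes > capacity_ - start) return kFailed;
        used_ = start + bytes;
        return start;
    }

    // Releases every block at once; the underlying region stays allocated.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::size_t capacity_;
    std::size_t alignment_;
    std::size_t used_;
};
```

A bump allocator cannot free individual blocks, which is the trade-off for being this cheap. It fits workloads where all temporary buffers can be released together, for example by calling `reset()` at the end of each iteration.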
Though, I’m guessing that you are timing the very first call to cudaMalloc if you are finding it slow. The very first CUDA call initializes the context, which can take up to several seconds on some systems. Time later calls, and you’ll find that they are much quicker.
My code runs in 10-15 milliseconds, plus 50-60 milliseconds for the cudaMalloc, so yes, for me it’s a very expensive function… UVA seems interesting, but if I still need to use the classic functions, it’s just not for me… If the very first CUDA call initializes the context, I guess there is no escape from that :)
50-60 milliseconds is a very fast context creation time! Our multi-GPU systems take 1-2 seconds or more to initialize a context. To work around this, you need to amortize the cost: create the context and allocate the memory only once in your code, then reuse it throughout the entire program execution.
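The "allocate once and reuse" advice can be sketched as a grow-only buffer that only calls the allocator when the current block is too small. This is a hedged illustration: `ReusableBuffer` and its methods are hypothetical names, and std::malloc/std::free stand in for cudaMalloc/cudaFree so the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical grow-only scratch buffer.
// std::malloc/std::free are stand-ins for cudaMalloc/cudaFree (assumption for
// this sketch); the reuse logic would be the same with the CUDA calls.
class ReusableBuffer {
public:
    ReusableBuffer() : ptr_(nullptr), capacity_(0), allocations_(0) {}
    ~ReusableBuffer() { std::free(ptr_); }

    // Non-copyable: the buffer owns its allocation.
    ReusableBuffer(const ReusableBuffer&) = delete;
    ReusableBuffer& operator=(const ReusableBuffer&) = delete;

    // Returns a buffer of at least `bytes` bytes, allocating only when the
    // current one is too small. Old contents are not preserved on growth.
    void* get(std::size_t bytes) {
        if (bytes > capacity_) {
            void* fresh = std::malloc(bytes);
            if (fresh == nullptr) return nullptr;  // keep the old buffer on failure
            std::free(ptr_);
            ptr_ = fresh;
            capacity_ = bytes;
            ++allocations_;
        }
        return ptr_;
    }

    // Number of times the underlying allocator was actually called.
    int allocations() const { return allocations_; }

private:
    void* ptr_;
    std::size_t capacity_;
    int allocations_;
};
```

If such a buffer is created once at program start and sized for the largest expected request, every later iteration gets the same pointer back and pays no allocation cost.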