A while back I was pondering why a CUDA implementation of a sequential app was slower than the sequential original for small problem sizes. A thorough performance analysis showed that the actual computation was in fact much, much faster in the CUDA version; what took longer was all the housekeeping associated with device management. No surprise there, since the programming guide emphasizes the cost of host-device memory transfers, but when I dug deeper I found it was device memory allocation that was taking all the time. I don't have the numbers in front of me now, but I think it was somewhere in the region of 100 ms regardless of the size of the allocation, which was far longer than the actual transfer from the host to that memory took.
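For reference, here is a minimal sketch of the kind of measurement I mean, separating the allocation cost from the transfer cost. The buffer size is arbitrary and the `cudaFree(0)` warm-up is my addition, not part of my original measurements:

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wall-clock helper for timing synchronous host-side CUDA calls.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary small size

    // Optional: uncomment to force lazy context creation up front,
    // so it is not folded into the cudaMalloc measurement below.
    // cudaFree(0);

    char *h_buf = (char *)malloc(bytes);
    char *d_buf = nullptr;

    auto t0 = std::chrono::steady_clock::now();
    cudaMalloc(&d_buf, bytes);
    printf("cudaMalloc:     %.3f ms\n", ms_since(t0));

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    printf("cudaMemcpy H2D: %.3f ms\n", ms_since(t0));

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```

One thing worth noting: the CUDA runtime creates its context lazily on the first API call, so with the warm-up line commented out, the first `cudaMalloc` is also charged the one-time context-creation cost.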
Is this normal?