I’m trying to optimize a program with CUDA and I’m not getting the improvements I was expecting at the start.
My program is basically a convolutional neural net with lots of small 5x5 kernel convolutions over images of varying sizes (32x32, 14x14, …). So far, only the 5x5 convolution over the 32x32 images is faster than the CPU version; the smaller ones lose to the overhead (and that's not counting data transfer time).
I’ve been using cudaprof, and I’ve noticed that the CPU time is usually 20us + GPU time (I suppose that’s the launch overhead), but sometimes it’s as large as 1000us + GPU time.
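For reference, this is roughly how I isolate the launch overhead outside the profiler; a minimal sketch with an empty kernel and CUDA events, not my actual code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}  // does nothing; timing it isolates launch overhead

int main() {
    // Warm up: the first launch also pays context-creation cost.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int N = 1000;
    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        empty_kernel<<<1, 1>>>();        // back-to-back launches of a no-op kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg per-launch time: %.2f us\n", ms * 1000.0f / N);
    return 0;
}
```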
Is that normal behaviour?
Can I expect improvements using the Driver API instead of CUDA C? (I think that even 20us is overkill for work that takes 5us.)
Should I restructure my work with CUDA and give it different kinds of work? (I mean, right now I’m passing the convolutions to CUDA (batched, not individual); should I put the whole network on the GPU?)
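To make the batching concrete, here is a sketch of what I mean by one launch per batch: a single kernel convolves a whole batch of 32x32 images with a 5x5 filter, using the grid's z-dimension to index images. All names and sizes here are illustrative assumptions, not my real code:

```cuda
#include <cuda_runtime.h>

#define IMG 32
#define KSZ 5
#define OUT (IMG - KSZ + 1)   // 28x28 "valid" output

__constant__ float d_filter[KSZ * KSZ];  // small filter fits in constant memory

__global__ void conv5x5_batched(const float* images, float* out, int batch)
{
    int b = blockIdx.z;                            // one z-slice per image
    int x = blockIdx.x * blockDim.x + threadIdx.x; // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y; // output row
    if (b >= batch || x >= OUT || y >= OUT) return;

    const float* img = images + b * IMG * IMG;
    float acc = 0.0f;
    for (int ky = 0; ky < KSZ; ++ky)
        for (int kx = 0; kx < KSZ; ++kx)
            acc += img[(y + ky) * IMG + (x + kx)] * d_filter[ky * KSZ + kx];
    out[b * OUT * OUT + y * OUT + x] = acc;
}

// Launch: one grid covers the whole batch, so the 20us overhead is paid once.
// dim3 block(16, 16);
// dim3 grid((OUT + 15) / 16, (OUT + 15) / 16, batch);
// conv5x5_batched<<<grid, block>>>(d_images, d_out, batch);
```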