optimization help

I’m trying to optimize a program with CUDA and I’m not getting the improvements I was expecting.

My program is basically a convolutional neural net with lots of small 5x5 kernel convolutions over images of varying size (32x32, 14x14, …). So far, only the 5x5 convolution over the 32x32 images is faster than the CPU version, due to overhead (not counting data transfer time).

I’ve been using cudaprof and I’ve noticed that the CPU time is usually 20us + GPU time (I suppose that’s the launch overhead), but sometimes it’s as large as 1000us + GPU time.
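To double-check what the profiler reports, I’ve been timing a single launch myself, roughly like this (a sketch; `conv5x5` stands in for my actual kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void conv5x5() { /* stand-in for the real ~5us kernel */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, so first-call driver initialization cost is excluded.
    conv5x5<<<1, 64>>>();
    cudaThreadSynchronize();

    // GPU-side time between the two events.
    cudaEventRecord(start, 0);
    conv5x5<<<1, 64>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    printf("GPU time: %.1f us\n", gpu_ms * 1000.0f);

    // Host wall time around the same launch (e.g. gettimeofday) minus
    // gpu_ms approximates the per-launch overhead cudaprof shows as
    // "CPU time - GPU time".
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```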

Is that normal behaviour?

Can I expect improvements using the Driver API instead of CUDA C? (I think that even 20us of overhead is excessive for work that takes 5us.)

Should I restructure my work with CUDA and give it a different kind of work? (I mean, right now I’m passing convolutions to CUDA (batched, not individual); should I put the whole network on the GPU?)


There are a lot of things to check; first, try to enlarge the workload.

Unless you are running a realtime operating system (none of which are supported by CUDA), an occasional millisecond of latency is not something to worry about.

The driver API should not really make any difference. You should always batch up enough work so that the call overhead is negligible anyway.
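For example, instead of one launch per image, all images of one size can go into a single launch by adding a batch dimension to the grid. A sketch, with hypothetical names and a contiguous one-plane-per-image layout assumed:

```cuda
// Batched 5x5 convolution: all N images of one size in a single launch.
// Grid: (num_images, height) blocks; one row of one image per block,
// one pixel per thread, so it stays within any card's thread-per-block limit.
__global__ void conv5x5_batched(const float* images, const float* kernel5x5,
                                float* out, int width, int height)
{
    int img = blockIdx.x;   // which image in the batch
    int y   = blockIdx.y;   // which row
    int x   = threadIdx.x;  // which pixel in the row
    if (x >= width || y >= height) return;

    const float* src = images + img * width * height;
    float*       dst = out    + img * width * height;

    float sum = 0.0f;
    for (int ky = -2; ky <= 2; ++ky)
        for (int kx = -2; kx <= 2; ++kx) {
            // Clamp to the image border.
            int sx = min(max(x + kx, 0), width - 1);
            int sy = min(max(y + ky, 0), height - 1);
            sum += src[sy * width + sx] * kernel5x5[(ky + 2) * 5 + (kx + 2)];
        }
    dst[y * width + x] = sum;
}

// Launch for a batch of 32x32 images:
// conv5x5_batched<<<dim3(num_images, 32), 32>>>(d_images, d_kernel, d_out, 32, 32);
```

One launch then amortizes the ~20us overhead over the whole batch instead of paying it per image.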

Have you optimized your memory access patterns to take optimal advantage of coalescing? What card are you using? Can you post some of your code?
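As a reminder of what coalescing means: threads with consecutive indices in a warp should read consecutive addresses. A minimal sketch of the difference:

```cuda
// Coalesced: thread i reads element i, so a warp's reads fall in one
// contiguous segment and can be served in few memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, scattering the warp's reads
// across memory; each thread can cost its own transaction.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];  // caller must size 'in' accordingly
}
```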