Mutual Information on a small data set

Hello everyone,

I am trying to calculate mutual information (MI) for a small data set (64 x 64 x 64) using CUDA.
But in all my attempts so far, the CPU version outperforms the GPU :(
I use 64-bin histograms for calculating MI, and the CPU version takes as little as 0.5 ms.
As a newbie to CUDA, should I still be hopeful of a faster CUDA implementation with a better design, or am I expecting too much from the GPU for such a small data set?

From my initial analysis, even writing the 64 x 64 joint histogram result to global memory takes long enough that my target of computing MI in under 0.5 ms is not feasible.
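
For reference, here is a minimal C++ sketch of the kind of CPU computation described above: build a 64 x 64 joint histogram from two volumes, then sum p(a,b) * log(p(a,b) / (p(a) * p(b))) over the non-empty cells. It is only an illustration under stated assumptions, not the actual code being timed: the names jointHistogram, mutualInformation and kBins are hypothetical, the intensities are assumed to be already quantized to 0..63, and the result is in nats.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 64 bins per image, as stated in the post.
constexpr std::size_t kBins = 64;

// Build the joint histogram of two volumes whose voxel values are
// assumed to be already quantized to the range [0, kBins).
std::vector<std::uint32_t> jointHistogram(const std::vector<std::uint8_t>& a,
                                          const std::vector<std::uint8_t>& b) {
    std::vector<std::uint32_t> joint(kBins * kBins, 0);
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        assert(a[i] < kBins && b[i] < kBins);
        ++joint[a[i] * kBins + b[i]];
    }
    return joint;
}

// Mutual information (in nats) from a kBins x kBins joint histogram.
double mutualInformation(const std::vector<std::uint32_t>& joint) {
    assert(joint.size() == kBins * kBins);
    std::array<double, kBins> rowSum{};  // marginal counts of the first image
    std::array<double, kBins> colSum{};  // marginal counts of the second image
    double total = 0.0;
    for (std::size_t i = 0; i < kBins; ++i) {
        for (std::size_t j = 0; j < kBins; ++j) {
            const double c = joint[i * kBins + j];
            rowSum[i] += c;
            colSum[j] += c;
            total += c;
        }
    }
    if (total == 0.0) return 0.0;  // empty histogram: define MI as 0

    double mi = 0.0;
    for (std::size_t i = 0; i < kBins; ++i) {
        for (std::size_t j = 0; j < kBins; ++j) {
            const double c = joint[i * kBins + j];
            if (c == 0.0) continue;  // empty cells contribute nothing
            // p(a,b) * log(p(a,b) / (p(a) * p(b))), written with raw counts
            mi += (c / total) * std::log(c * total / (rowSum[i] * colSum[j]));
        }
    }
    return mi;
}
```

Calling mutualInformation(jointHistogram(volA, volB)) on two vectors of 64 * 64 * 64 quantized voxels gives the MI value. The second stage only touches 4096 cells, so for a data set this small most of the work is in the histogram pass.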

Looking forward to hearing what the experts think about this.

Thanks!

Why not calculate the mutual information on the GPU rather than on the CPU once the histogram has been built?