My code is as follows (simplified):
__device__ void calcvalue(calcvariables cv, int np) {
    for (int i = 0; i < cv.timenodenumber; i++) {
        // do some same work
    }
}

__global__ void calcprice(calcvariables* cv, int np) {
    int tid = threadIdx.x;
    if (tid < np) {
        // ...
    }
}

// host code
int np = /* number I choose */;
// some calculations about structure pointer calcvariables* cv
// some memory allocation work about cv
int a = 1, b = np;
calcprice<<<a, b>>>(cv, np);
The problem is that as np goes from 1 to 32, performance slows down.
If np = 1, the calculation takes about 19 s.
But if np > 32, no matter how large np is, performance remains the same.
cv.timenodenumber ranges from 400 to 670.
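A minimal sketch of how timings like these can be measured with CUDA events, in case someone wants to reproduce the pattern. The kernel `dummy_work` is a made-up stand-in, not my real calcprice: each active thread just runs a loop of `steps` iterations, similar in shape to the loop over cv.timenodenumber. `time_launch` and `print_sweep` are also names invented for this example.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: every active thread runs the
// same loop of `steps` iterations, like the loop over cv.timenodenumber.
__global__ void dummy_work(double* out, int np, int steps)
{
    int tid = threadIdx.x;
    if (tid < np) {
        double acc = 0.0;
        for (int i = 0; i < steps; i++)
            acc += sqrt((double)(i + tid + 1));
        out[tid] = acc;
    }
}

// Returns the elapsed time in milliseconds of one <<<1, np>>> launch,
// measured with CUDA events, or -1.0f if the launch failed.
float time_launch(double* d_out, int np, int steps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummy_work<<<1, np>>>(d_out, np, steps);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Prints one timing per np for np = 1, 2, 4, ..., up to max_np.
void print_sweep(int max_np, int steps)
{
    assert(max_np >= 1 && max_np <= 1024);  // one block holds at most 1024 threads
    double* d_out = nullptr;
    if (cudaMalloc(&d_out, max_np * sizeof(double)) != cudaSuccess)
        return;
    time_launch(d_out, 1, steps);  // warm-up launch, timing ignored
    for (int np = 1; np <= max_np; np *= 2)
        printf("np = %4d: %.3f ms\n", np, time_launch(d_out, np, steps));
    cudaFree(d_out);
}
```

Calling `print_sweep(64, 670)` from `main` prints one line per np for np = 1, 2, 4, ..., 64. The stand-in kernel is far lighter than the real one, so its absolute times will not match the 19 s above.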
There is also a lot of memory allocated on the device (I don't know exactly, maybe around 80 million * sizeof(double)).
I am using a GT 740 with 2 GB of memory, compute capability 3.0.
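In numbers, the estimate of 80 million doubles is 80,000,000 * 8 = 640,000,000 bytes (about 610 MiB), which should fit in 2 GB but takes a good part of it. Below is a minimal sketch of how this can be checked against what the CUDA runtime reports for the card. The element count is only the rough guess from above, and `estimate_bytes` and `report_device` are helper names made up for this example. It also prints the warp size, which is 32 on compute capability 3.0 devices.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Bytes needed for `count` doubles (host-side arithmetic only).
size_t estimate_bytes(size_t count)
{
    assert(count <= SIZE_MAX / sizeof(double));  // guard against overflow
    return count * sizeof(double);
}

// Prints basic properties of device 0 and whether `needed` bytes fit into
// the currently free device memory. Returns false if any query fails.
bool report_device(size_t needed)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;

    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
        return false;

    printf("device: %s, compute capability %d.%d, warp size %d\n",
           prop.name, prop.major, prop.minor, prop.warpSize);
    printf("memory: %zu MiB free of %zu MiB total\n",
           free_bytes >> 20, total_bytes >> 20);
    printf("needed: %zu MiB (%s)\n", needed >> 20,
           needed <= free_bytes ? "fits" : "does not fit");
    return true;
}
```

Calling `report_device(estimate_bytes(80000000))` from `main` prints the three lines for device 0. The free-memory figure depends on what else is using the card at that moment.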
Can anyone tell me what the problem is, or how I can solve it?
Thanks in advance.