My code (simplified) is as follows.

    __device__ void calcvalue(calcvariables cv, int np)
    {
        for (int i = 0; i < cv.timenodenumber; i++)
        {
            // do some same work
        }
    }

    __global__ void calcprice(calcvariables* cv, int np)
    {
        int tid = threadIdx.x;
        if (tid < np)
        {
            calcvalue(cv[tid], np);
        }
    }

    int main()
    {
        int np = /* number I choose */;
        for (int i = 0; i < np; i++)
        {
            // some calculations about structure pointer calcvariables* cv
            // some memory allocation work about cv
        }
        int a = 1, b = np;  // one block of np threads
        calcprice<<<a, b>>>(cv, np);
        cudaDeviceSynchronize();
        return 0;
    }

The problem is that as np increases from 1 to 32, the run time keeps growing:

np = 1: about 19 s

np = 2: 20 s

np = 3: 22 s

np = 4: 24 s

…

np = 32: 77 s

But for np > 32, no matter how large np is, the run time stays the same.
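To check whether these times come from the kernel itself rather than from the setup work in main, the kernel launch could be timed on its own with CUDA events. This is only a minimal sketch under assumptions: `dummyKernel`, `timeKernelMs`, and the loop body are hypothetical stand-ins, not my real code, and only the single-block launch shape and the loop length are taken from the code above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for calcprice: each active thread runs a fixed-length loop.
__global__ void dummyKernel(double* out, int np, int steps)
{
    int tid = threadIdx.x;
    if (tid < np)
    {
        double acc = 0.0;
        for (int i = 0; i < steps; i++)
        {
            acc += sqrt((double)(i + tid + 1));
        }
        out[tid] = acc;
    }
}

// Time one launch of dummyKernel with np threads in a single block.
// Returns the elapsed time in milliseconds, or -1.0f on a CUDA error.
float timeKernelMs(int np, int steps)
{
    double* out = nullptr;
    if (cudaMalloc(&out, np * sizeof(double)) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyKernel<<<1, np>>>(out, np, steps);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return ok ? ms : -1.0f;
}

int main()
{
    const int steps = 670;  // upper end of cv.timenodenumber in my runs
    for (int np = 1; np <= 64; np *= 2)
    {
        printf("np = %2d: %.3f ms\n", np, timeKernelMs(np, steps));
    }
    return 0;
}
```

The first launch may include one-time startup overhead, so it is probably safer to ignore the first measurement or to run each np several times.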

cv.timenodenumber ranges from 400 to 670.

There is also a lot of memory allocated on the device (I don't know exactly how much, maybe around 80 million * sizeof(double)).

I am using a GT 740 with 2 GB of memory, compute capability 3.0.

Can anyone tell me what the problem is, or how I can solve it?

Thanks in advance.