My code is as follows (simplified):
__device__ void calcvalue(calcvariables cv, int np)
{
    for (int i = 0; i < cv.timenodenumber; i++)
    {
        // do the same work on each iteration
    }
}
__global__ void calcprice(calcvariables* cv, int np)
{
    // one thread per element of cv, all in a single block
    int tid = threadIdx.x;
    if (tid < np)
    {
        calcvalue(cv[tid], np);
    }
}
int main()
{
    int np = /* number I choose */;
    for (int i = 0; i < np; i++)
    {
        // some calculations about the structure pointer calcvariables* cv
        // some memory allocation work for cv
    }
    // launch 1 block of np threads
    int a = 1, b = np;
    calcprice<<<a, b>>>(cv, np);
    cudaDeviceSynchronize();
    return 0;
}
The problem is that as np goes from 1 to 32, the calculation takes longer and longer:
np=1: about 19 s
np=2: 20 s
np=3: 22 s
np=4: 24 s
…
np=32: 77 s
But once np > 32, the run time stays the same no matter how large np is.
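For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of timing a single-block launch for several values of np with CUDA events. It is not my real code: dummyKernel, timeLaunch and the sqrt loop are hypothetical stand-ins for calcprice and calcvalue, and the fixed 400 iterations is just the low end of my cv.timenodenumber range.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-thread work: a loop of
// timeNodeNumber iterations, one result per thread.
__global__ void dummyKernel(double *out, int timeNodeNumber, int np)
{
    int tid = threadIdx.x;
    if (tid < np) {
        double acc = 0.0;
        for (int i = 0; i < timeNodeNumber; i++) {
            acc += sqrt((double)(i + tid + 1));
        }
        out[tid] = acc;
    }
}

// Times one launch of dummyKernel with np threads in a single block.
// Returns elapsed milliseconds, or -1.0f if any CUDA call fails.
float timeLaunch(int np, int timeNodeNumber)
{
    double *out = nullptr;
    if (cudaMalloc(&out, np * sizeof(double)) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) { cudaFree(out); return -1.0f; }
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        cudaFree(out);
        return -1.0f;
    }

    cudaEventRecord(start);
    dummyKernel<<<1, np>>>(out, timeNodeNumber, np);
    cudaError_t launchErr = cudaGetLastError();  // catches bad launch configs
    cudaEventRecord(stop);
    cudaError_t syncErr = cudaEventSynchronize(stop);  // wait for the kernel

    float ms = -1.0f;
    if (launchErr == cudaSuccess && syncErr == cudaSuccess) {
        if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return ms;
}

int main()
{
    // Sweep np across the warp size (32) to see where the time changes.
    for (int np = 1; np <= 64; np *= 2) {
        float ms = timeLaunch(np, 400);
        printf("np=%d: %f ms\n", np, ms);
    }
    return 0;
}
```

This can be built with nvcc (for example `nvcc timing.cu -o timing`, where the file name is arbitrary). Checking cudaGetLastError() right after the launch matters here because the launch uses a single block, so it will fail outright once np exceeds the per-block thread limit of the device.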
cv.timenodenumber ranges from 400 to 670.
There is also a lot of memory allocated on the device (I don't know exactly, maybe around 80 million * sizeof(double)).
I'm using a GT 740 with 2 GB of memory, compute capability 3.0.
Can anyone tell me what the problem is, or how I can solve it?
Thanks in advance.