Memory problem? Thread divergence?

My code is as follows (simplified):

__device__ void calcvalue(calcvariables cv, int np)
{
    for (int i = 0; i < cv.timenodenumber; i++)
    {
        // do some same work
    }
}

__global__ void calcprice(calcvariables* cv, int np)
{
    int tid = threadIdx.x;

    if (tid < np)
        calcvalue(cv[tid], np);
}

int main()
{
    int np = /* number I choose */;

    for (int i = 0; i < np; i++)
    {
        // some calculations about structure pointer calcvariables* cv
        // some memory allocation work about cv
    }

    int a = 1, b = np;

    calcprice<<<a, b>>>(cv, np);

    return 0;
}


The problem is that as np goes from 1 to 32, the run time keeps increasing.

If np=1, the calculation takes about 19s
np=2, 20s
np=3, 22s
np=4, 24s

np=32, 77s

But if np>32, no matter how large np is, the performance remains the same.

cv.timenodenumber ranges from 400 to 670.

There are also a lot of memory allocations on the device (I don't know exactly, maybe around 80 million * sizeof(double)).

I'm using a GT 740 with 2 GB of memory, compute capability 3.0.

Can anyone tell me what the problem is, or how I can solve it?

Thanks in advance.

from a hypothetical point of view, make sure that the structure contains valid data on the device side, i.e. consider the deep-copy and alignment requirements of copying structures from the host to the device
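to illustrate the deep-copy point, here is a minimal sketch. The CalcVariables layout, the values member and the sumValues kernel are all made up for illustration, since the real structure was not posted. The idea is only that any pointer inside the structure has to hold a device address before the structure itself is copied over.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's structure: one scalar and one array.
struct CalcVariables {
    int timenodenumber;
    double* values;   // must hold a DEVICE address when used inside a kernel
};

__global__ void sumValues(const CalcVariables* cv, double* out, int np)
{
    int tid = threadIdx.x;
    if (tid < np) {
        double s = 0.0;
        for (int i = 0; i < cv[tid].timenodenumber; i++)
            s += cv[tid].values[i];
        out[tid] = s;
    }
}

int main()
{
    const int np = 4, n = 8;         // small assumed sizes
    CalcVariables h_cv[np];          // host-side structs that will carry device pointers
    double h_vals[n];
    for (int i = 0; i < n; i++) h_vals[i] = 1.0;

    for (int p = 0; p < np; p++) {
        h_cv[p].timenodenumber = n;
        // Step 1: allocate and fill the inner array on the device.
        cudaMalloc(&h_cv[p].values, n * sizeof(double));
        cudaMemcpy(h_cv[p].values, h_vals, n * sizeof(double), cudaMemcpyHostToDevice);
    }

    // Step 2: copy the structs themselves, which now contain device pointers.
    CalcVariables* d_cv;
    double* d_out;
    cudaMalloc(&d_cv, np * sizeof(CalcVariables));
    cudaMalloc(&d_out, np * sizeof(double));
    cudaMemcpy(d_cv, h_cv, np * sizeof(CalcVariables), cudaMemcpyHostToDevice);

    sumValues<<<1, np>>>(d_cv, d_out, np);

    double h_out[np];
    cudaMemcpy(h_out, d_out, np * sizeof(double), cudaMemcpyDeviceToHost);
    for (int p = 0; p < np; p++) printf("cv[%d] sum = %f\n", p, h_out[p]);

    for (int p = 0; p < np; p++) cudaFree(h_cv[p].values);
    cudaFree(d_cv);
    cudaFree(d_out);
    return 0;
}
```

copying a struct whose values member still points at host memory would compile fine and then fail, or return garbage, once the kernel dereferences it.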

practically, easiest would perhaps be to place a breakpoint at the 1st line of

__device__ void calcvalue(calcvariables cv, int np)

catch a thread > 32, step it, and note what it does/ does not do

how big is the structure itself?
are you sure the kernel even runs when np > 32?
the breakpoint should also indicate this
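besides a breakpoint, a quick way to confirm that a launch actually ran is to check the error state after it. A minimal sketch follows; dummyKernel and the flags array are placeholders, not the poster's code, and np = 64 is just an assumed value above one warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: every thread that runs marks its own slot.
__global__ void dummyKernel(int* flags, int np)
{
    int tid = threadIdx.x;
    if (tid < np) flags[tid] = 1;
}

int main()
{
    const int np = 64;               // assumed value, larger than one warp
    int* d_flags;
    cudaMalloc(&d_flags, np * sizeof(int));
    cudaMemset(d_flags, 0, np * sizeof(int));

    dummyKernel<<<1, np>>>(d_flags, np);

    // cudaGetLastError reports launch problems (e.g. an invalid configuration);
    // cudaDeviceSynchronize reports errors raised while the kernel was running.
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t runErr = cudaDeviceSynchronize();
    if (launchErr != cudaSuccess || runErr != cudaSuccess) {
        printf("launch: %s, run: %s\n",
               cudaGetErrorString(launchErr), cudaGetErrorString(runErr));
        return 1;
    }

    int h_flags[np];
    cudaMemcpy(h_flags, d_flags, np * sizeof(int), cudaMemcpyDeviceToHost);
    int ran = 0;
    for (int i = 0; i < np; i++) ran += h_flags[i];
    printf("%d of %d threads ran\n", ran, np);

    cudaFree(d_flags);
    return 0;
}
```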

Thanks for the reply.

cudaMalloc and cudaMemcpy all worked fine. The results come out as I expected.
The only issue is performance.

And there are lots of variables in lots of structures. The code above is significantly simplified.

And yes, it also works for np > 32.
As far as I remember, cv[0].timenodenumber is about 400 and cv[31].timenodenumber is about 650. Since np = 1 takes about 19s, I would expect np = 32 to take about 19s * (650/400) ≈ 31s if the threads within a warp run in parallel.
(The calculations done within the cv.timenodenumber loop are the same for every thread. There is a tiny difference, but it is negligible.)

Ah… I suspect that some threads take a slightly different path during the calculations in the cv.timenodenumber loop.
I guess I need to check that out.
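One way to check this would be to time a stripped-down kernel in which the only difference between threads is the loop count. The sketch below is an assumption-laden stand-in, not my real code: loopKernel, the sqrt workload and the reps parameter are made up, and the loop counts simply run from 400 to 650 like timenodenumber. If the threads of a warp follow the same path, the time should track the longest loop count rather than the number of threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread loops counts[tid] times; a warp finishes only when its
// slowest thread has finished.
__global__ void loopKernel(const int* counts, double* out, int np, int reps)
{
    int tid = threadIdx.x;
    if (tid < np) {
        double x = 0.0;
        for (int r = 0; r < reps; r++)
            for (int i = 0; i < counts[tid]; i++)
                x += sqrt((double)(i + r));
        out[tid] = x;
    }
}

int main()
{
    const int maxNp = 32, reps = 1000;   // assumed sizes for the experiment
    int h_counts[maxNp];
    for (int t = 0; t < maxNp; t++)
        h_counts[t] = 400 + (250 * t) / (maxNp - 1);   // 400 up to 650

    int* d_counts;
    double* d_out;
    cudaMalloc(&d_counts, maxNp * sizeof(int));
    cudaMalloc(&d_out, maxNp * sizeof(double));
    cudaMemcpy(d_counts, h_counts, maxNp * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not included in the timings.
    loopKernel<<<1, 1>>>(d_counts, d_out, 1, 1);
    cudaDeviceSynchronize();

    for (int np = 1; np <= maxNp; np *= 2) {
        cudaEventRecord(start);
        loopKernel<<<1, np>>>(d_counts, d_out, np, reps);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("np = %2d: %.3f ms\n", np, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_counts);
    cudaFree(d_out);
    return 0;
}
```

If this toy version scales roughly with the loop count while the real code scales with np, that would point at divergent branches or memory access in the real loop body rather than at the loop counts.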

from the initial post i understood that the code does not work for cases where np > 32

if the code indeed works, but is found to lack performance, i would suggest using the profiler
in particular, i would pay attention to the reports on memory utilization efficiency
structures may read better in the source, but in my mind they pose difficulties for the device, because neighbouring threads end up reading addresses that are far apart
also, you pass whole structures by value between functions - i wonder what the impact or footprint of that is
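as a rough sketch of both points: the first kernel below keeps an array of structures but passes each element by const reference instead of by value, and the second uses separate arrays so that neighbouring threads read neighbouring elements. The CalcVariables layout, the rate and payload members, and the function names are hypothetical, since the real structures were not posted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical structure; payload only makes a by-value copy visibly large.
struct CalcVariables {
    int timenodenumber;
    double rate;
    double payload[32];
};

// Takes the structure by const reference, so no per-call copy is made.
__device__ double calcvalueRef(const CalcVariables& cv)
{
    double s = 0.0;
    for (int i = 0; i < cv.timenodenumber; i++) s += cv.rate;
    return s;
}

// Array-of-structures version: thread tid reads cv[tid].
__global__ void calcpriceAoS(const CalcVariables* cv, double* out, int np)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < np) out[tid] = calcvalueRef(cv[tid]);
}

// Structure-of-arrays version: each field lives in its own contiguous array.
__global__ void calcpriceSoA(const int* timenodenumber, const double* rate,
                             double* out, int np)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < np) {
        double s = 0.0;
        for (int i = 0; i < timenodenumber[tid]; i++) s += rate[tid];
        out[tid] = s;
    }
}

int main()
{
    const int np = 64;                       // assumed thread count
    static CalcVariables h_cv[np];           // static, so payload is zero-filled
    int h_n[np];
    double h_rate[np];
    for (int p = 0; p < np; p++) {
        h_cv[p].timenodenumber = h_n[p] = 400 + p;
        h_cv[p].rate = h_rate[p] = 0.5;
    }

    CalcVariables* d_cv;
    int* d_n;
    double *d_rate, *d_outA, *d_outS;
    cudaMalloc(&d_cv, np * sizeof(CalcVariables));
    cudaMalloc(&d_n, np * sizeof(int));
    cudaMalloc(&d_rate, np * sizeof(double));
    cudaMalloc(&d_outA, np * sizeof(double));
    cudaMalloc(&d_outS, np * sizeof(double));
    cudaMemcpy(d_cv, h_cv, np * sizeof(CalcVariables), cudaMemcpyHostToDevice);
    cudaMemcpy(d_n, h_n, np * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_rate, h_rate, np * sizeof(double), cudaMemcpyHostToDevice);

    calcpriceAoS<<<1, np>>>(d_cv, d_outA, np);
    calcpriceSoA<<<1, np>>>(d_n, d_rate, d_outS, np);

    double a[np], s[np];
    cudaMemcpy(a, d_outA, np * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(s, d_outS, np * sizeof(double), cudaMemcpyDeviceToHost);
    printf("thread 0: AoS %f, SoA %f\n", a[0], s[0]);

    cudaFree(d_cv); cudaFree(d_n); cudaFree(d_rate);
    cudaFree(d_outA); cudaFree(d_outS);
    return 0;
}
```

both kernels compute the same values; which layout is faster for the real code is something the profiler's memory efficiency reports would have to show.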