Hello all,

I am writing a small discrete-event molecular dynamics application using CUDA, but I am not getting the speedups I was expecting.

The "discrete event" part refers to the fact that the time steps of the simulation are not fixed, as in most simulations, but depend on the events that occur in the system. Time only advances when something happens, so at each step we need to find the shortest time until the next event and then advance the overall time by that amount.
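To make the time-advance rule concrete, here is a minimal host-only sketch in plain C++ (no CUDA). The names `NextEvent`, `find_next_event` and `advance_time` are hypothetical and exist only for this illustration. It assumes that only the earliest collision is needed at each step, so it uses a linear scan rather than the full sort that my actual loop performs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record describing the next event in the system.
struct NextEvent {
    std::size_t index;  // position of the earliest collision in the list
    float dt;           // time remaining until that collision
};

// Find the shortest time until the next event with one linear scan.
// Assumption: the list is non-empty and holds times relative to "now".
NextEvent find_next_event(const std::vector<float>& collision_times)
{
    assert(!collision_times.empty());
    const auto it = std::min_element(collision_times.begin(),
                                     collision_times.end());
    return { static_cast<std::size_t>(it - collision_times.begin()), *it };
}

// Advance the overall simulation time by the time to the next event.
float advance_time(float time, const std::vector<float>& collision_times)
{
    return time + find_next_event(collision_times).dt;
}
```

For example, with collision times `{0.5f, 0.125f, 2.0f}` and a current time of `1.0f`, the next event is the entry at index 1 and the clock moves to `1.125f`.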

Anyway, the main loop of the program is this one:

[codebox]

```
while (time < total_time)
{
    ++step;

    // sort the list of collision times
    radixsort.sort((float *)tc_d, values_d, n_collisions, sizeof(float) * 8, true);

    // copy back the next collision time so we can update our time
    cudaMemcpy(&tc_local, tc_d, sizeof(float), cudaMemcpyDeviceToHost);
    time += tc_local;

    // move all the particles forward
    call_move_kernel();

    // calculate collision dynamics
    call_dynamics_kernel();

    // recalculate collision times
    call_collision_time_kernel();
}
```

[/codebox]

After timing each of the five steps (sorting, copying shortest time back, moving particles forward, dynamics and calculating collision times) I got the following:

[codebox]

Loop: 2871.37 (ms)

Sorting: 1466.1 (ms), 51.0594%

Copying: 1250.29 (ms), 43.5432%

Moving: 45.7717 (ms), 1.59407%

Dynamics: 0.544 (ms), 0.0165426%

Collision: 49.6603 (ms), 1.7295%

[/codebox]
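For reference, per-step totals and percentages like the ones above can be accumulated with a small host-side helper. The sketch below is plain C++ with `std::chrono`. `PhaseTimer` is a hypothetical class written for this post, not part of CUDA or of my application. One caveat worth stating: CUDA kernel launches return to the host before the kernel finishes, so a host timer like this may charge unfinished kernel work to the next blocking call (such as `cudaMemcpy`) unless the device is synchronized before the timer stops.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper that accumulates wall-clock time per named phase.
class PhaseTimer {
public:
    // Run `work` and add its elapsed wall-clock time (in ms) to `phase`.
    template <typename Work>
    void measure(const std::string& phase, Work&& work)
    {
        const auto start = std::chrono::steady_clock::now();
        work();
        // Assumption: in the CUDA program a device synchronization
        // (e.g. cudaDeviceSynchronize()) would go here, so that pending
        // kernels are counted in this phase. Omitted to stay host-only.
        const auto stop = std::chrono::steady_clock::now();
        add(phase, std::chrono::duration<double, std::milli>(stop - start).count());
    }

    // Add an already measured duration (in ms) to `phase`.
    void add(const std::string& phase, double ms)
    {
        assert(ms >= 0.0);
        totals_[phase] += ms;
        total_ += ms;
    }

    // Accumulated time for `phase` in ms (0 if the phase was never recorded).
    double total_ms(const std::string& phase) const
    {
        const auto it = totals_.find(phase);
        return it == totals_.end() ? 0.0 : it->second;
    }

    // Share of the summed phase times spent in `phase`, in percent.
    double percent(const std::string& phase) const
    {
        return total_ > 0.0 ? 100.0 * total_ms(phase) / total_ : 0.0;
    }

private:
    std::map<std::string, double> totals_;
    double total_ = 0.0;
};
```

Usage would look like `timer.measure("Sorting", [&] { /* sort call */ });` inside the loop, followed by `timer.percent("Sorting")` once the loop has finished. Note that the percentages here are relative to the sum of the recorded phases, which is an assumption of this sketch and may differ slightly from a separately measured loop total.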

These measurements were taken for a system of about 300 hard spheres, which is quite small. For bigger systems I expect the Collision and Moving steps to have a more serious impact on performance, while the cost of the copy should stay the same, since it always transfers a single float.

Nevertheless, is there a way to do this kind of copy more efficiently? I tried using cudaMemcpyFromSymbol, but the performance was basically the same.

Thanks,

Gustavo