Copying a single value from device

Hello all,

I am writing a small discrete-event molecular dynamics application using CUDA, but I am not getting the speedups I was expecting.

The discrete-event part refers to the fact that the time steps of the simulation are not fixed, as in most simulations, but depend on the events that occur in the system. Time only advances when something happens, so we need to find the shortest time until the next event and then advance the overall time by that amount.

Anyway, the main loop of the program is this one:


while (time < total_time)
{
	// sort the list of collision times
	radixsort.sort((float *)tc_d, values_d, n_collisions, sizeof(float) * 8, true);

	// copy back the next collision time so we can update our time
	cudaMemcpy(&tc_local, tc_d, sizeof(float), cudaMemcpyDeviceToHost);
	time += tc_local;

	// move all the particles forward

	// calculate collision dynamics

	// recalculate collision times
}

After timing each of the five steps (sorting, copying shortest time back, moving particles forward, dynamics and calculating collision times) I got the following:


Loop: 2871.37 (ms)

Sorting: 1466.1 (ms), 51.0594%

Copying: 1250.29 (ms), 43.5432%

Moving: 45.7717 (ms), 1.59407%

Dynamics: 0.544 (ms), 0.0165426%

Collision: 49.6603 (ms), 1.7295%


Those measurements were taken for a system of about 300 hard spheres, which is quite a small system. For bigger systems I expect the Collision and Moving parts to have a larger impact on performance, while the cost of the copy should stay the same.

Nevertheless, is there a way to do this kind of copy more efficiently? I tried using cudaMemcpyFromSymbol, but the performance was basically the same.



I’m not an expert in any way, but if you have a small kernel, I would expect the memory copying time to be significant.

If I understand your problem correctly, is it really necessary for you to copy back that often? Can't you calculate several time steps at once?
Sometimes rethinking the way the problem is implemented can yield great speedups.

You could also try zero-copy, and see if it makes any difference for you.
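For reference, here is a minimal sketch of what zero-copy could look like for a single value. It is only an illustration under assumptions: `publish_first` and `read_first_zero_copy` are hypothetical names, the sample times are made up, and error checking is left out. A small kernel writes `tc_d[0]` into mapped, page-locked host memory, so no explicit `cudaMemcpy` is issued. The host still has to synchronize before reading the value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper kernel: writes the first (smallest) collision time
// into a slot that is mapped into host memory.
__global__ void publish_first(const float *tc_d, float *next_time)
{
    *next_time = tc_d[0];
}

// Returns tc_d[0] through a mapped host buffer instead of cudaMemcpy.
// In a real loop the buffer would be allocated once and reused.
float read_first_zero_copy(const float *tc_d)
{
    float *next_time_h = nullptr;  // host view of the mapped slot
    float *next_time_d = nullptr;  // device view of the same memory
    cudaHostAlloc((void **)&next_time_h, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&next_time_d, next_time_h, 0);

    publish_first<<<1, 1>>>(tc_d, next_time_d);
    cudaDeviceSynchronize();  // the host must still wait for the kernel

    float result = *next_time_h;
    cudaFreeHost(next_time_h);
    return result;
}

int main()
{
    // Enable mapped host memory before any other CUDA call.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const float sample[3] = {0.25f, 0.5f, 1.0f};  // made-up, already sorted
    float *tc_d = nullptr;
    cudaMalloc((void **)&tc_d, sizeof(sample));
    cudaMemcpy(tc_d, sample, sizeof(sample), cudaMemcpyHostToDevice);

    printf("next collision time: %f\n", read_first_zero_copy(tc_d));

    cudaFree(tc_d);
    return 0;
}
```

Whether this helps depends on where the time really goes. If the copy is slow because it waits for earlier kernels, the synchronize call here will wait just as long.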

Letharion, thanks for your answer.

You are right, it is not necessary to copy the time in every cycle. I tried copying it only once every few hundred cycles or so, but some surprising results showed up.
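To make the idea of copying less often concrete, here is a rough sketch, not the actual simulation code. It assumes the running time is kept in device memory and updated by a tiny kernel; `accumulate_time`, `run_steps`, `copy_interval` and the constant stand-in collision time are all hypothetical. The sort and the other kernels are left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds the shortest collision time (first element of the sorted list)
// to a running total that lives in device memory.
__global__ void accumulate_time(const float *tc_d, float *time_d)
{
    *time_d += tc_d[0];
}

// Runs the loop and returns the number of steps taken. The host copy of
// the time is refreshed only every copy_interval steps.
int run_steps(float total_time, int copy_interval, float fake_tc)
{
    float *tc_d = nullptr, *time_d = nullptr;
    cudaMalloc((void **)&tc_d, sizeof(float));
    cudaMalloc((void **)&time_d, sizeof(float));
    // Stand-in for the sorted collision list: one constant value.
    cudaMemcpy(tc_d, &fake_tc, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(time_d, 0, sizeof(float));

    float time_h = 0.0f;
    int step = 0;
    while (time_h < total_time) {
        // (sort, move, dynamics and collision kernels would go here)
        accumulate_time<<<1, 1>>>(tc_d, time_d);
        if (++step % copy_interval == 0) {
            cudaMemcpy(&time_h, time_d, sizeof(float), cudaMemcpyDeviceToHost);
        }
    }

    cudaFree(tc_d);
    cudaFree(time_d);
    return step;
}

int main()
{
    int steps = run_steps(100.0f, 256, 0.01f);
    printf("steps taken: %d\n", steps);
    return 0;
}
```

The price is that the loop can overshoot `total_time` by up to `copy_interval - 1` steps, because the host only notices the end of the simulation at the next copy.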

First I ran the simulation copying the time once every cycle, as I was doing before, and this is what I got (case A):

[codebox] Total time steps: 33436

Total time: 100.004

Loop:       73251 (ms)

Sorting:    12289.5 (ms), 16.7772%

Copying:    60106.6 (ms), 82.0556%

Moving:     242.873 (ms), 0.331563%

Dynamics:   263.234 (ms), 0.359359%

Collision:  273.451 (ms), 0.373307%[/codebox]

Then I noted how many cycles it took (33436) and ran it again using this number to control the loop, with a while (step <= 33435), copying the total time only once after the simulation had finished. The timing info follows (case B):

[codebox] Total time steps: 33436

Total time: 100.004

Loop:       73258.1 (ms)

Sorting:    40776.8 (ms), 55.6619%

Copying:    42.93 (ms), 0.058601%

Moving:     251.433 (ms), 0.343215%

Dynamics:   503.185 (ms), 0.686866%

Collision:  31623.2 (ms), 43.1669%[/codebox]

In the first case (A), the copying took a lot of time, while in the second case (B) only one copy was made, so it took very little time. The overall execution time of the loop, though, remained the same: the kernels simply appeared to take longer to execute. The results were exactly the same in both cases. What's the explanation for this?

My uneducated guess would be that the setup time for the four kernels was hidden behind the many transfers being made in case A. Is that correct?

Edit: Those times are for a simulation of a system somewhat bigger than the one I was testing in the first post.
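One thing that may be worth checking with numbers like these: kernel launches return to the host immediately, while a plain `cudaMemcpy` waits for all earlier work on the device to finish. Host-side timers without a synchronization point can therefore charge kernel time to whichever call blocks next. The sketch below times a kernel and a one-float copy with CUDA events, which measure on the device timeline. `busy_kernel` and `time_kernel_and_copy` are hypothetical stand-ins, not the simulation's real kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that just burns time, standing in for the real ones.
__global__ void busy_kernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Measures kernel and copy durations separately using CUDA events.
void time_kernel_and_copy(float *kernel_ms, float *copy_ms)
{
    const int n = 1 << 16;
    float *data_d = nullptr;
    cudaMalloc((void **)&data_d, n * sizeof(float));
    cudaMemset(data_d, 0, n * sizeof(float));

    cudaEvent_t k_start, k_stop, c_start, c_stop;
    cudaEventCreate(&k_start);
    cudaEventCreate(&k_stop);
    cudaEventCreate(&c_start);
    cudaEventCreate(&c_stop);

    float first = 0.0f;

    cudaEventRecord(k_start);
    busy_kernel<<<(n + 255) / 256, 256>>>(data_d, n, 10000);
    cudaEventRecord(k_stop);

    cudaEventRecord(c_start);
    cudaMemcpy(&first, data_d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(c_stop);
    cudaEventSynchronize(c_stop);  // wait until everything above is done

    cudaEventElapsedTime(kernel_ms, k_start, k_stop);
    cudaEventElapsedTime(copy_ms, c_start, c_stop);

    cudaEventDestroy(k_start);
    cudaEventDestroy(k_stop);
    cudaEventDestroy(c_start);
    cudaEventDestroy(c_stop);
    cudaFree(data_d);
}

int main()
{
    float kernel_ms = 0.0f, copy_ms = 0.0f;
    time_kernel_and_copy(&kernel_ms, &copy_ms);
    printf("kernel: %f ms, copy: %f ms\n", kernel_ms, copy_ms);
    return 0;
}
```

An alternative with the same effect on host timers is to call `cudaDeviceSynchronize()` before starting and before stopping each timer. If the per-step numbers then look alike in cases A and B, the earlier breakdown was most likely an artifact of where the host happened to block.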