Slow data transfer and memory allocation

Hi,

I am working on an N-body problem. The original program is written in Fortran, so what I have to do is rewrite some of its functions in CUDA Fortran.
The program has a main loop, and inside that loop I must call a CUDA kernel at each step. The problem is that, due to the data transfer (at each step) and memory allocation, the program is about 6 times slower than the original.
For each step's kernel, I copy some float and integer arrays (about 10 arrays) from the host.

I am using the PGI compiler 10.6 and my graphics card is an NVIDIA GT 330M.

Thank you

Sotiris

Hi Sotiris,

The cost to move data between the host and device can be quite high and minimizing data transfer is critical to performance.

Can you allocate your device data and copy any data before the loop?
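
For example, something along these lines (only a rough sketch; the names pos, vel, and step_kernel are made up, and the placeholder update would be replaced by your real per-step work):

module step_mod
  use cudafor
  implicit none
contains
  attributes(global) subroutine step_kernel(pos, vel, n)
    integer, value :: n
    real :: pos(n), vel(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    ! Placeholder for the real per-step work
    if (i <= n) pos(i) = pos(i) + vel(i)
  end subroutine step_kernel
end module step_mod

program nbody_driver
  use cudafor
  use step_mod
  implicit none
  integer, parameter :: n = 1000, nsteps = 1000
  real :: pos(n), vel(n)
  real, device, allocatable :: pos_d(:), vel_d(:)
  integer :: istep

  pos = 0.0
  vel = 1.0

  ! Allocate the device arrays once, before the main loop
  allocate(pos_d(n), vel_d(n))

  ! Copy the data to the device once, before the loop
  pos_d = pos
  vel_d = vel

  do istep = 1, nsteps
     ! The arrays stay resident on the device between steps,
     ! so no host/device transfers happen inside the loop
     call step_kernel<<<(n+255)/256, 256>>>(pos_d, vel_d, n)
  end do

  ! Copy the results back once, after the loop
  pos = pos_d
  vel = vel_d

  deallocate(pos_d, vel_d)
end program nbody_driver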

  • Mat

Hi and thanks for the reply Mat,

Allocation and deallocation can be placed before and after the loop. But inside the main loop my arrays need to be changed by some host functions. I also need them in my kernel to do some calculations. So I have to copy them from host to device and back. Do you think that the slow performance depends on the fact that these arrays are large (about 1000 elements)?

I also want to ask you if the version of my compiler (10.6) could be a possible problem.

Thanks again,
Sotiris

Hi Sotiris,

Do you think that the slow performance depends on the fact that these arrays are large (about 1000 elements)?

No. (Note that 1000 elements isn’t large.) The size of the arrays matters less than the number of copies.

Your strategy will need to be to reduce the number of data copies, increase the computation done on the device, or both. Basically, you need to increase your compute intensity, which is the ratio of computation to data movement, since right now you don’t have enough work to outweigh the cost of copying the data.

Are you copying sections from a multi-dimensional array? If so, consider gathering the sections into a single contiguous array. Copying array sections often requires multiple transfers, while a single contiguous array requires only one copy.
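
For example, an assignment from a strided section can be broken into many small transfers behind the scenes, while a contiguous buffer moves in one copy. A rough illustration with made-up shapes (here the gather is done on the host side):

program section_copy
  use cudafor
  implicit none
  integer, parameter :: n = 1000, m = 1000
  real :: a(n, m), buf(m)
  real, device, allocatable :: sec_d(:)
  integer :: i

  allocate(sec_d(m))
  a = 0.0
  i = 1

  ! a(i,1:m) strides through memory (Fortran is column-major), so this
  ! assignment may be split into many small host-to-device copies
  sec_d = a(i, 1:m)

  ! Gathering the section into a contiguous host buffer first lets the
  ! same data move in a single transfer
  buf = a(i, 1:m)
  sec_d = buf

  deallocate(sec_d)
end program section_copy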

Can the data size be increased? Assuming you have one thread per element, 1000 threads is quite small. Having many more threads, 10000+, would offer more work and make it more likely that the computational benefits outweigh the cost of moving data.

But inside the main loop my arrays need to be changed by some host functions. I also need them in my kernel to do some calculations.

Can these functions and calculations be ported to the device? Even if the calculations are not well suited for a GPU, it may be better to move them to the GPU just so you don’t need to move data.
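
For example (a sketch only; scale_positions is a made-up stand-in for one of your host routines), a simple host loop can often be turned into a kernel almost line for line, and even if the GPU version is no faster than the CPU loop, it removes the need to bring the arrays back to the host:

module update_mod
  use cudafor
  implicit none
contains
  ! Host version: calling it inside the main loop forces the arrays to
  ! be copied back to the host and then out to the device again
  subroutine scale_positions_host(pos, n, s)
    integer :: n
    real :: pos(n), s
    integer :: i
    do i = 1, n
       pos(i) = pos(i) * s
    end do
  end subroutine scale_positions_host

  ! Device version: the same loop body, one element per thread; the
  ! arrays can now stay resident on the device between kernel launches
  attributes(global) subroutine scale_positions(pos, n, s)
    integer, value :: n
    real, value :: s
    real :: pos(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) pos(i) = pos(i) * s
  end subroutine scale_positions
end module update_mod

Inside the main loop you would then call scale_positions<<<(n+255)/256, 256>>>(pos_d, n, s) in place of the host routine.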

I also want to ask you if the version of my compiler (10.6) could be a possible problem.

While 10.6 is a bit old and we have made many improvements since then, in this case, no, the compiler version is not the issue.

Hope this helps,
Mat

Hi Mat and thank you so much for the help,

I copy one-dimensional arrays to the device. I will try to pack them into a single contiguous array and compare the results.
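
I was thinking of something along these lines (just a rough sketch; my real arrays have different names):

program pack_copy
  use cudafor
  implicit none
  integer, parameter :: n = 1000
  real :: mass(n), posx(n), posy(n), buf(3*n)
  real, device, allocatable :: buf_d(:)

  allocate(buf_d(3*n))
  mass = 1.0; posx = 0.0; posy = 0.0

  ! Pack the separate 1-D arrays into one contiguous host buffer
  buf(1:n)       = mass
  buf(n+1:2*n)   = posx
  buf(2*n+1:3*n) = posy

  ! One host-to-device transfer instead of three
  buf_d = buf

  ! In the kernel, the pieces are addressed by offset:
  ! mass(i) -> buf_d(i), posx(i) -> buf_d(n+i), posy(i) -> buf_d(2*n+i)

  deallocate(buf_d)
end program pack_copy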

My program consists of about 100 functions, so it’s difficult to convert all of them into kernels.

Thank you so much for your help. If I need anything else I will contact you.

Sotiris