I have been working on a fortran code and I was forced to use c in order to use cuda programming for I don’t have access to CUDA Fortran which is not for free by PGI whatsoever.
However, I’ve been having problems and questions for the fortran + C bundle.
I decided to use 1D arrays on C and CUDA for avoiding pointers arrays which generally goes to a less efficient, less secure programming.
However, most of my data structures are declared on the Fortran side and I’m using extern on the C code. Thus I think the data would still be in fortran fashion (column-major order), but my question is mainly on the cudaMemcpy function.
When I do this, is the data remapped because CUDA C compiler would rather do row-major order? If this is the case, it doesn’t really make a difference that I’m using 1D for everything since the data will not be contiguous. Should I be using 2D, 3D and 4D instead for keeping at least the same structure? (I have 4D arrays).
Does anyone know how efficient is CUDA with higher order arrays?
And, what should be using for transferring 4D arrays to Global Memory and keeping the same structure to the data?
You can use column major order storage (FORTRAN ordered) easily in CUDA. In some ways it is a little more natural than row major ordering because the natural CUDA block and grid ordering scheme is actually column major ordered, so there is a direct affinity between block and grid indices and the equivalent indices in column major ordered arrays. The only difference is zero versus one starting indices, but that is only a semantic difference, it has no real effect on anything. cudaMemcpy() just copies blocks of one-dimensional linear memory, so it works irrespective of which order the data uses. There is no “remapping” or “transformation” of data at any stage.
CUBLAS, the BLAS for CUDA, also uses column major ordered storage and is effectively a complete working example of what you are trying to achieve - CUDA device code, written in C, and working internally using column major ordered storage.
Doublechecking, so you’re saying that cudaMemcpy would only copy bytes as such, without caring anything about the data structure; therefore, it would do sequential copying (1D) regardless of how the memory has been set, in this case, since the array was declared on Fortran and the memory has a column-major ordering, this would be the way itll be done, right?
Another question is: what should I be using for higher order arrays like 4D? can I do 4D arrays in shared memory? is texture memory strictly limited for 3D?
I would recommend using appropriately pitched linear memory in every case. Integer indexing calculations in higher dimensions are simple and computationally inexpensive relative to the alternatives (like multiple levels of pointer indirection, for example). As for textures, the texture/filtering hardware only does up to 3D AFAIK (certainly the API only exposes 1D, 2D and 3D), so you are limited in what you can do there. It should be possible in some cases to use 1D textures for higher dimensional lookups if you design the layout carefully, but that can be limited in its usefulness. The GPU memory controller handles float2 and float4 reads quite well, so that offers another possibility for tensor quantities (although there are not so many native float2 and float4 arithmetic primitives as for the scalar floating point types).
cudaMemcpy does NOT change the ordering… It it is 1D, the data will be available as the same 1D array for your kernel.
Note that FORTRAN indexing starts from 1, C indexing starts from 0. All your FoRTRAN variables that are “array indices” need to be subtracted with 1 before using them to access…
We are using it in our work. We have absolutely no problems.
We do use ISO C Binding. We don’t use pinned memory or streams. But we consider using pinned memory in a short while.
Do you think, there is an issue out there with pinned memory? I plan to use it only in C-CUDA portion — no idea of interfacing it with Fortran.
My interfaces are like this: FORTRAN calls a C interface – which in turn calls a CUDA interface… FORTRAN does NOT see CUDA - just an info if that helps…