Question 2D matrix operation CUDA Programming

I have an N (= 512) by N matrix. What is the best way to have all the odd columns perform one operation and all the even columns perform a different operation? Thanks.
By the way, the operation simply reads the corresponding data from the matrix, does a simple calculation, and writes the value back.

If you use row-major order and the matrix is sufficiently aligned, the fastest option is probably to cast the pointer to float2* (or double2*, int2*, …) and have each thread operate on two adjacent values.

If you use column-major order, just have each thread operate on two adjacent values without further ado. :smile:

Thanks, Tera. I am using row-major order as in C. How do I cast the pointer to float2*? (Suppose my data in global memory is g_data — do I define float2 *g_data2 = (float2 *) g_data and then use g_data2 to read and write data in global memory?)

By the way, what is the best way to organize the threads? My current thought is a two-dimensional grid of blocks, where each block covers one sub-row of the matrix. So if the matrix is 1024×1024, my grid dimension is 2×1024 with each block having 512 threads (I am changing it to 256 threads so that each thread handles two adjacent elements, as you mentioned). I was wondering whether this is the best way to lay out the grid and blocks?

Yes, that is exactly what I meant.

Keep in mind though that this only works if the pointer is 8-byte aligned, otherwise it will produce incorrect results. All allocations made through cudaMalloc() are properly aligned.
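A minimal sketch of what this looks like in a kernel, assuming a row-major N×N float matrix allocated with cudaMalloc() (so the float2 cast is safe); even_op() and odd_op() are placeholder names for your two calculations:

```cuda
// Placeholder operations -- substitute your actual calculations here.
__device__ float even_op(float v) { return v * 2.0f; }
__device__ float odd_op(float v)  { return v + 1.0f; }

// Each thread loads one float2, i.e. one even/odd column pair,
// in a single 8-byte transaction, and stores it back the same way.
__global__ void process_pairs(float2 *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * n / 2) {
        float2 v = data[idx];
        v.x = even_op(v.x);   // .x falls in an even column (0-based)
        v.y = odd_op(v.y);    // .y falls in an odd column
        data[idx] = v;
    }
}
```

Because consecutive threads access consecutive float2 values, the loads and stores coalesce into full-width memory transactions.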

Yes, that seems a good way to organize it. Just make sure blockDim.x is a multiple of 32 for optimal transaction size and alignment.
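For concreteness, a possible host-side launch for the 2×1024 grid discussed above (the kernel name process_pairs and the pointer d_data are placeholders for your own):

```cuda
// 1024 x 1024 row-major float matrix, viewed as 512 float2 values per row.
// Grid: 2 blocks per row (2 * 256 threads = 512 float2s), 1024 rows.
// blockDim.x = 256 is a multiple of 32, as recommended.
float2 *d_data2 = reinterpret_cast<float2 *>(d_data);  // d_data from cudaMalloc()
dim3 block(256, 1);
dim3 grid(2, 1024);
process_pairs<<<grid, block>>>(d_data2, 1024);
```

Inside the kernel the flat index for this layout would be (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x.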

Thanks for your help, Tera.

Is there any way I can use shared memory to speed things up in this situation? Or should I just read, compute, and write to global memory directly?

Read and write global memory directly. The automatic mapping of data implied by using float2 is faster than any manual shuffling in shared memory.

Thanks so much. I got an improvement by doing this. Speed almost doubled using the float2 pointer, which makes sense.

Why does it only work for 8-byte-aligned pointers? Does that mean that using float3 so that one thread processes 3 elements will produce incorrect results?

Because the hardware throws away the lowest bits of the unaligned address, accessing the wrong memory location instead of splitting the memory transaction into two.

There is no hardware support for float3, meaning that the compiler will just generate three single float accesses and it will work regardless of alignment. Going through shared memory should be faster there for compute capability 1.x devices. On 2.x devices the cache will probably prevent the worst.

float4 again has hardware support, meaning it will be fast, but it again needs proper alignment (16 bytes) for correct results.

Thanks.

Consider that I have a mesh file with a lot of nodes (a point structure with 3 floats). What is the best way to put them in memory, and how do I access them efficiently? The data is read-only. I was considering putting it into either constant memory or texture memory. What do you think?

In the case of float3, you are probably better off setting up three separate arrays for x, y and z, because there is no hardware support for float3. Constant memory isn’t a good idea unless all threads in a warp read the same array element, because accesses to different elements will get serialized.
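A sketch of the structure-of-arrays layout this suggests (use_nodes and the summing computation are just illustrative):

```cuda
// Instead of an array of float3 (x,y,z interleaved), keep three separate
// float arrays so that adjacent threads read adjacent addresses and each
// of the three loads coalesces across the warp.
__global__ void use_nodes(const float *x, const float *y, const float *z,
                          float *out, int num_nodes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_nodes) {
        out[i] = x[i] + y[i] + z[i];  // placeholder computation
    }
}
```

With interleaved float3 data, thread i would touch bytes 12*i..12*i+11, so the warp's accesses straddle segment boundaries; with separate arrays each load is a clean contiguous stride-1 pattern.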

Texture memory is a good option if you want to exploit 2D locality or get caching on compute capability 1.x devices. Even on 2.x devices it sometimes turns out to be faster than global memory because it has extra resources (a separate cache and fetch units).
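A sketch of the texture route using the texture-reference API current for these device generations (deprecated in later CUDA versions in favor of texture objects); note the data is padded to float4 because textures have no float3 format:

```cuda
// Read-only node data bound to a 1D texture; float4 (x,y,z + padding)
// because there is no texture support for float3.
texture<float4, 1, cudaReadModeElementType> node_tex;

__global__ void use_nodes_tex(float *out, int num_nodes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_nodes) {
        float4 p = tex1Dfetch(node_tex, i);  // cached read-only fetch
        out[i] = p.x + p.y + p.z;            // .w is padding, ignored
    }
}

// Host side, with d_nodes allocated by cudaMalloc():
// cudaBindTexture(0, node_tex, d_nodes, num_nodes * sizeof(float4));
```

The float4 padding costs a third more memory and bandwidth than three separate arrays, so it is worth benchmarking both layouts.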

Thanks so much for all your advice.