Question 2D matrix operation CUDA Programming

I have an N (= 512) by N matrix. What is the best way to have all the odd columns perform one operation and all the even columns perform a different operation? Thanks.
By the way, the operation simply reads the corresponding data from the matrix, does a simple calculation, and writes the value back.

If you use row-major order and the matrix is sufficiently aligned, the fastest option is probably to cast the pointer to float2* (or double2*, int2*, …) and have each thread operate on two adjacent values.

If you use column-major order, just have each thread operate on two adjacent values without further ado. :smile:

Thanks, Tera. I am using row-major order as in C. How do I cast the pointer to float2*? (Suppose my data in global memory is g_data — do I define float2 *g_data2 = (float2 *) g_data and then use g_data2 to read and write data in global memory?)

By the way, what is the best way to organize the threads? My current thought is a two-dimensional grid of blocks, where each block covers one sub-row of the matrix. So if the matrix is 1024×1024, my grid dimension is 2×1024 with each block having 512 threads (I am changing it to 256 threads so that each thread handles two adjacent elements, as you mentioned). I was wondering whether this is the best way to lay out the grid and blocks?

Yes, that is exactly what I meant.

Keep in mind though that this only works if the pointer is 8-byte aligned, otherwise it will produce incorrect results. All allocations made through cudaMalloc() are properly aligned.
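A minimal sketch of what this looks like in a kernel, assuming a row-major N×N float matrix allocated with cudaMalloc() (so the float2 cast is safe); even_op() and odd_op() are placeholder names for your two calculations:

```cuda
// Placeholder operations -- substitute your actual calculations here.
__device__ float even_op(float v) { return v * 2.0f; }
__device__ float odd_op(float v)  { return v + 1.0f; }

// Each thread loads one float2, i.e. one even/odd column pair,
// in a single 8-byte transaction, and stores it back the same way.
__global__ void process_pairs(float2 *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * n / 2) {
        float2 v = data[idx];
        v.x = even_op(v.x);   // .x falls in an even column (0-based)
        v.y = odd_op(v.y);    // .y falls in an odd column
        data[idx] = v;
    }
}
```

Because consecutive threads access consecutive float2 values, the loads and stores coalesce into full-width memory transactions.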

Yes, that seems a good way to organize it. Just make sure blockDim.x is a multiple of 32 for optimal transaction size and alignment.
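For concreteness, a possible host-side launch for the 2×1024 grid discussed above (the kernel name process_pairs and the pointer d_data are placeholders for your own):

```cuda
// 1024 x 1024 row-major float matrix, viewed as 512 float2 values per row.
// Grid: 2 blocks per row (2 * 256 threads = 512 float2s), 1024 rows.
// blockDim.x = 256 is a multiple of 32, as recommended.
float2 *d_data2 = reinterpret_cast<float2 *>(d_data);  // d_data from cudaMalloc()
dim3 block(256, 1);
dim3 grid(2, 1024);
process_pairs<<<grid, block>>>(d_data2, 1024);
```

Inside the kernel the flat index for this layout would be (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x.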

Thanks for your help, Tera.

Is there any way I can use shared memory to speed things up in this situation? Or should I just read, compute, and write to global memory directly?

Read and write global memory directly. The automatic mapping of data implied by using float2 is faster than any manual shuffling in shared memory.

Thanks so much. I got an improvement by doing this. Speed almost doubled using the float2 pointer, which makes sense.

Why does it only work for 8-byte-aligned pointers? Does that mean that using float3 so that one thread processes 3 elements will produce incorrect results?

Because the hardware throws away the lowest bits of the unaligned address, accessing the wrong memory location instead of splitting the memory transaction into two.

There is no hardware support for float3, meaning that the compiler will just generate three single float accesses and it will work regardless of alignment. Going through shared memory should be faster there for compute capability 1.x devices. On 2.x devices the cache will probably prevent the worst.

float4 again has hardware support, meaning it will be fast, but it again needs proper alignment (16 bytes) for correct results.

Thanks.

Consider that I have a mesh file with a lot of nodes (a point structure with 3 floats). What is the best way to put them in memory, and how do I access them efficiently? The data is read-only. I was considering putting it into either constant memory or texture memory. What do you think?

In the case of float3, you are probably better off setting up three separate arrays for x, y and z, because there is no hardware support for float3. Constant memory isn’t a good idea unless all threads in a warp read the same array element, because accesses to different elements will get serialized.
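A sketch of the structure-of-arrays layout this suggests (use_nodes and the summing computation are just illustrative):

```cuda
// Instead of an array of float3 (x,y,z interleaved), keep three separate
// float arrays so that adjacent threads read adjacent addresses and each
// of the three loads coalesces across the warp.
__global__ void use_nodes(const float *x, const float *y, const float *z,
                          float *out, int num_nodes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_nodes) {
        out[i] = x[i] + y[i] + z[i];  // placeholder computation
    }
}
```

With interleaved float3 data, thread i would touch bytes 12*i..12*i+11, so the warp's accesses straddle segment boundaries; with separate arrays each load is a clean contiguous stride-1 pattern.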

Texture memory is a good option if you want to exploit 2D locality or get caching on compute capability 1.x devices. Even on 2.x devices it sometimes turns out to be faster than global memory because it has extra resources (a separate cache and fetch units).
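A sketch of the texture route using the texture-reference API current for these device generations (deprecated in later CUDA versions in favor of texture objects); note the data is padded to float4 because textures have no float3 format:

```cuda
// Read-only node data bound to a 1D texture; float4 (x,y,z + padding)
// because there is no texture support for float3.
texture<float4, 1, cudaReadModeElementType> node_tex;

__global__ void use_nodes_tex(float *out, int num_nodes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_nodes) {
        float4 p = tex1Dfetch(node_tex, i);  // cached read-only fetch
        out[i] = p.x + p.y + p.z;            // .w is padding, ignored
    }
}

// Host side, with d_nodes allocated by cudaMalloc():
// cudaBindTexture(0, node_tex, d_nodes, num_nodes * sizeof(float4));
```

The float4 padding costs a third more memory and bandwidth than three separate arrays, so it is worth benchmarking both layouts.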

Thanks so much for all your advice.