Passing an array to the kernel in constant memory

WonwooLee · March 17, 2011, 5:16am

Hello.

I have an array, to which all the threads access simultaneously.

As I remember the efficiency of the constant memory is maximized when all threads access the same memory,

so I put them in a constant memory, like below.

__constant__ float array [12] ;

__global__ void move(

float3 *ap_src_pts, int a_src_pts_w, 

float3 *ap_dst_pts

) 

{

    int gid_x = blockIdx.x * blockDim.x + threadIdx.x ;

    int gid_y = blockIdx.y * blockDim.y + threadIdx.y ;

int l_index = gid_x + gid_y * a_src_pts_w ;

float3 l_from, l_to ;

l_from = ap_src_pts[l_index] ;

l_to.x = p_dc_trans_mat[0] * l_from.x 

           + p_dc_trans_mat[1] * l_from.y 

           + p_dc_trans_mat[2] * l_from.z 

           + p_dc_trans_mat[9] ;

l_to.y = p_dc_trans_mat[3] * l_from.x 

            + p_dc_trans_mat[4] * l_from.y 

            + p_dc_trans_mat[5] * l_from.z 

            + p_dc_trans_mat[10] ;

l_to.z = p_dc_trans_mat[6] * l_from.x 

            + p_dc_trans_mat[7] * l_from.y 

            + p_dc_trans_mat[8] * l_from.z 

            + p_dc_trans_mat[11] ;

ap_dst_pts[l_index] = l_to ;

}

However, the array’s length is very short and it has only 12 elements…

I’m wondering that how it is to pass the array as an argument to the kernel

instead of storing it in the constant memory, like below.

__global__ void move(

float3 *ap_src_pts, int a_src_pts_w, 

float3 *ap_dst_pts, 

float p_dc_trans_mat[12]

) 

{

    int gid_x = blockIdx.x * blockDim.x + threadIdx.x ;

    int gid_y = blockIdx.y * blockDim.y + threadIdx.y ;

int l_index = gid_x + gid_y * a_src_pts_w ;

float3 l_from, l_to ;

l_from = ap_src_pts[l_index] ;

l_to.x = p_dc_trans_mat[0] * l_from.x 

           + p_dc_trans_mat[1] * l_from.y 

           + p_dc_trans_mat[2] * l_from.z 

           + p_dc_trans_mat[9] ;

l_to.y = p_dc_trans_mat[3] * l_from.x 

            + p_dc_trans_mat[4] * l_from.y 

            + p_dc_trans_mat[5] * l_from.z 

            + p_dc_trans_mat[10] ;

l_to.z = p_dc_trans_mat[6] * l_from.x 

            + p_dc_trans_mat[7] * l_from.y 

            + p_dc_trans_mat[8] * l_from.z 

            + p_dc_trans_mat[11] ;

ap_dst_pts[l_index] = l_to ;

}

Which one do you think a better approach is ?

wlangdon · March 18, 2011, 11:51am

There is a discussion on contant memory at
http://forums.nvidia.com/index.php?showtopic=184724
Essentially there are many pitfals.
With 12 elements you should be ok, but you could check the warp_serialize profile counter.
Bill

hyqneuron · March 19, 2011, 3:28pm

On Compute Capability 2.0 and above, you can have at most 128 constant symbols at one time. So if this is not running out, a small size of the constant object shouldn’t bother you.

Constant memory, accessed with fixed (constant) indices, is as fast as register. As long as your constant set is no larger than 8kb, you’ll be able to have them all staying on the cache. But when the first time you access it you still have to go through global memory, because the constant data is not in the on-chip constant cache yet… so if you use the constant array only once, there wouldn’t be much difference. But if you use the constant array for many times in the same kernel launch with constant indices, you are likely to get some gain in speed.

Or… you may not, because global memory has L1 cache too. Please try your code and tell us the result!

hyqneuron · March 19, 2011, 3:46pm

Just tried it for you. I used a small array of 10 floats, which has certainly no problem fitting into either L1 cache or constant cache. Both the global array and the constant array I used were accessed with fixed indices, and the timing were exactly the same.

Though, you might want to use global memory when the indices you use to access the array are not fixed and when you do not have to write to that global array.

Topic		Replies	Views
passing an array to a kenel ? CUDA Programming and Performance	9	13242	June 10, 2009
Small const array accessable globally? Is it easy and possible? CUDA Programming and Performance	6	1410	April 16, 2009
Constant memory usage and comparison against textures CUDA Programming and Performance	9	4043	December 24, 2008
Global arrays? CUDA Programming and Performance	24	10624	August 18, 2010
Accessing/caching access to global/pinned memory CUDA Programming and Performance	10	746	May 29, 2023
Passing array CUDA Programming and Performance	4	6946	December 7, 2009
Fast constant memory access from all threads in CUDA CUDA Programming and Performance	0	381	September 11, 2019
Passing a pointer to constant memory CUDA Programming and Performance	0	6277	January 3, 2009
Optimizing a Kernel with a lot of variables - memory allocation. CUDA Programming and Performance	6	736	August 5, 2016
Am I using the constant memory properly? CUDA Programming and Performance	4	1773	May 15, 2013

Passing an array to the kernel in constant memory

Related topics