Passing an array to the kernel in constant memory

Hello.

I have an array, to which all the threads access simultaneously.

As I remember the efficiency of the constant memory is maximized when all threads access the same memory,

so I put them in a constant memory, like below.

__constant__ float array [12] ;

__global__ void move(

float3 *ap_src_pts, int a_src_pts_w, 

float3 *ap_dst_pts

) 

{

    int gid_x = blockIdx.x * blockDim.x + threadIdx.x ;

    int gid_y = blockIdx.y * blockDim.y + threadIdx.y ;

int l_index = gid_x + gid_y * a_src_pts_w ;

float3 l_from, l_to ;

l_from = ap_src_pts[l_index] ;

l_to.x = p_dc_trans_mat[0] * l_from.x 

           + p_dc_trans_mat[1] * l_from.y 

           + p_dc_trans_mat[2] * l_from.z 

           + p_dc_trans_mat[9] ;

l_to.y = p_dc_trans_mat[3] * l_from.x 

            + p_dc_trans_mat[4] * l_from.y 

            + p_dc_trans_mat[5] * l_from.z 

            + p_dc_trans_mat[10] ;

l_to.z = p_dc_trans_mat[6] * l_from.x 

            + p_dc_trans_mat[7] * l_from.y 

            + p_dc_trans_mat[8] * l_from.z 

            + p_dc_trans_mat[11] ;

ap_dst_pts[l_index] = l_to ;

}

However, the array’s length is very short and it has only 12 elements…

I’m wondering that how it is to pass the array as an argument to the kernel

instead of storing it in the constant memory, like below.

__global__ void move(

float3 *ap_src_pts, int a_src_pts_w, 

float3 *ap_dst_pts, 

float p_dc_trans_mat[12]

) 

{

    int gid_x = blockIdx.x * blockDim.x + threadIdx.x ;

    int gid_y = blockIdx.y * blockDim.y + threadIdx.y ;

int l_index = gid_x + gid_y * a_src_pts_w ;

float3 l_from, l_to ;

l_from = ap_src_pts[l_index] ;

l_to.x = p_dc_trans_mat[0] * l_from.x 

           + p_dc_trans_mat[1] * l_from.y 

           + p_dc_trans_mat[2] * l_from.z 

           + p_dc_trans_mat[9] ;

l_to.y = p_dc_trans_mat[3] * l_from.x 

            + p_dc_trans_mat[4] * l_from.y 

            + p_dc_trans_mat[5] * l_from.z 

            + p_dc_trans_mat[10] ;

l_to.z = p_dc_trans_mat[6] * l_from.x 

            + p_dc_trans_mat[7] * l_from.y 

            + p_dc_trans_mat[8] * l_from.z 

            + p_dc_trans_mat[11] ;

ap_dst_pts[l_index] = l_to ;

}

Which one do you think a better approach is ?

There is a discussion on contant memory at
http://forums.nvidia.com/index.php?showtopic=184724
Essentially there are many pitfals.
With 12 elements you should be ok, but you could check the warp_serialize profile counter.
Bill

On Compute Capability 2.0 and above, you can have at most 128 constant symbols at one time. So if this is not running out, a small size of the constant object shouldn’t bother you.

Constant memory, accessed with fixed (constant) indices, is as fast as register. As long as your constant set is no larger than 8kb, you’ll be able to have them all staying on the cache. But when the first time you access it you still have to go through global memory, because the constant data is not in the on-chip constant cache yet… so if you use the constant array only once, there wouldn’t be much difference. But if you use the constant array for many times in the same kernel launch with constant indices, you are likely to get some gain in speed.

Or… you may not, because global memory has L1 cache too. Please try your code and tell us the result!

Just tried it for you. I used a small array of 10 floats, which has certainly no problem fitting into either L1 cache or constant cache. Both the global array and the constant array I used were accessed with fixed indices, and the timing were exactly the same.

Though, you might want to use global memory when the indices you use to access the array are not fixed and when you do not have to write to that global array.