Hi everyone
I am working on a CUDA program, the visual profiler data of it is here:
[codebox]
CUDA_PROFILE_LOG_VERSION 1.5
CUDA_PROFILE_CSV 1
TIMESTAMPFACTOR 117029e935ba6826
CUDA_DEVICE 0 GeForce 8600 GT
gpustarttimestamp method gputime cputime occup gridSizeX gridSizeY blockSizeX blockSizeY blockSizeZ dynSmem staSmem register streamID sm_cta_lnch branch divergent instructs warpSerial cta_launch gld_incoh gld_coh gst_incoh gst_coh local_load local_store tlb_hit tlb_miss
11702b007fced9a0 findKey_kernel 152954 156866 0.333 8192 1 128 1 1 0 24 22 0 2048 944128 2048 50677402 597964 4096 0 0 0 0 1146880 10223616 24 4
Occupancy analysis for kernel ‘findKey_kernel’ for context ‘Session6 : Device_0 : Context_0’ :
Kernel details : Grid size: 8192 x 1, Block size: 128 x 1 x 1
Register Ratio = 0.6875 ( 5632 / 8192 ) [22 registers per thread]
Shared Memory Ratio = 0.0625 ( 1024 / 16384 ) [24 bytes per Block]
Active Blocks per SM = 2 : 8
Active threads per SM = 256 : 768
Occupancy = 0.333333 ( 8 / 24 )
Occupancy limiting factor = Registers
[/codebox]
There is a high “warp serialize” value. I have no shared memory defined or used in the program, then what could be the source of the “warp serialize” ? And why there are 24 “static shared memory” per block ?
What’s more, I have lots of constant values in the program, and each thread in a warp will retrive the same value at the same time, will this lead to time penalty or confilict ? Constant memory is visited very frequently in my program, will it be the bottle-neck of the performance ?
And in addition, “local load” and “local store” have high value too, because I am using an array of DWORD[16] in the kernel, is there any hint to increase the performace ?
some code part of the kernel here:
[codebox]constant BYTE MetaValues[MAX_META_VALUES][META_VALUE_LENGTH];
constant BYTE DictionaryItems[MAX_DICTIONARY_ITEMS][DICTIONARY_ITEM_LENGTH
];
// total dictionary item count
constant WORD CurrentIndexies[15];
global
static void findKey_kernel(WORD dictionaryItemIndex, DWORD *output)
{
float idx = blockIdx.x * blockDim.x + threadIdx.x;
DWORD currentPos = idx;
DWORD md5source[16];
BYTE *keyBytes = (BYTE *)md5source, keyIndex = 64;
// point to the first meta index
BYTE *dictItem = DictionaryItems[dictionaryItemIndex];
BYTE metaCombines = dictItem[0];
++dictItem;
// loop from the last meta
while (metaCombines-- > 0)
{
// calculate current meta value index
WORD metaStartIndex = MetaInfos[dictItem[metaCombines]].startIndex;
float metaCount = (float)MetaInfos[dictItem[metaCombines]].metaCount;
// MOD and get the current value index
float leftMetaIndex;
WORD metaValueIndex;
leftMetaIndex = floor((idx + (float)CurrentIndexies[metaCombines]) / metaCount);
metaValueIndex = (idx + (float)CurrentIndexies[metaCombines]) - leftMetaIndex * metaCount;
idx = leftMetaIndex;
// pick the meta value out and append to key
BYTE *metaValue = MetaValues[metaStartIndex + metaValueIndex];
// loop and copy bytes
char metaValueLen = metaValue[0];
++metaValue;
while (--metaValueLen >= 0)
keyBytes[--keyIndex] = metaValue[metaValueLen];
}
md5(md5source, md5source);
if (checkKey(md5source))
output[0] = currentPos;
}[/codebox]