Questions about "warp serialize" and constant memory

Hi everyone

I am working on a CUDA program, the visual profiler data of it is here:

[codebox]

CUDA_PROFILE_LOG_VERSION 1.5

CUDA_PROFILE_CSV 1

TIMESTAMPFACTOR 117029e935ba6826

CUDA_DEVICE 0 GeForce 8600 GT

gpustarttimestamp method gputime cputime occup gridSizeX gridSizeY blockSizeX blockSizeY blockSizeZ dynSmem staSmem register streamID sm_cta_lnch branch divergent instructs warpSerial cta_launch gld_incoh gld_coh gst_incoh gst_coh local_load local_store tlb_hit tlb_miss

11702b007fced9a0 findKey_kernel 152954 156866 0.333 8192 1 128 1 1 0 24 22 0 2048 944128 2048 50677402 597964 4096 0 0 0 0 1146880 10223616 24 4

Occupancy analysis for kernel ‘findKey_kernel’ for context ‘Session6 : Device_0 : Context_0’ :

Kernel details : Grid size: 8192 x 1, Block size: 128 x 1 x 1

Register Ratio = 0.6875 ( 5632 / 8192 ) [22 registers per thread]

Shared Memory Ratio = 0.0625 ( 1024 / 16384 ) [24 bytes per Block]

Active Blocks per SM = 2 : 8

Active threads per SM = 256 : 768

Occupancy = 0.333333 ( 8 / 24 )

Occupancy limiting factor = Registers

[/codebox]

There is a high “warp serialize” value. I have no shared memory defined or used in the program, then what could be the source of the “warp serialize” ? And why there are 24 “static shared memory” per block ?

What’s more, I have lots of constant values in the program, and each thread in a warp will retrive the same value at the same time, will this lead to time penalty or confilict ? Constant memory is visited very frequently in my program, will it be the bottle-neck of the performance ?

And in addition, “local load” and “local store” have high value too, because I am using an array of DWORD[16] in the kernel, is there any hint to increase the performace ?

some code part of the kernel here:

[codebox]constant BYTE MetaValues[MAX_META_VALUES][META_VALUE_LENGTH];

constant BYTE DictionaryItems[MAX_DICTIONARY_ITEMS][DICTIONARY_ITEM_LENGTH

];

// total dictionary item count

constant WORD CurrentIndexies[15];

global

static void findKey_kernel(WORD dictionaryItemIndex, DWORD *output)

{

float idx = blockIdx.x * blockDim.x + threadIdx.x;

DWORD currentPos = idx;

DWORD md5source[16];

BYTE *keyBytes = (BYTE *)md5source, keyIndex = 64;

// point to the first meta index

BYTE *dictItem = DictionaryItems[dictionaryItemIndex];

BYTE metaCombines = dictItem[0];

++dictItem;

// loop from the last meta

while (metaCombines-- > 0)

{

	// calculate current meta value index

	WORD metaStartIndex = MetaInfos[dictItem[metaCombines]].startIndex;

	float metaCount = (float)MetaInfos[dictItem[metaCombines]].metaCount;

	// MOD and get the current value index

	float leftMetaIndex;

	WORD metaValueIndex;

	leftMetaIndex = floor((idx + (float)CurrentIndexies[metaCombines]) / metaCount);

	metaValueIndex = (idx + (float)CurrentIndexies[metaCombines]) - leftMetaIndex * metaCount;

	idx = leftMetaIndex;

	// pick the meta value out and append to key

	BYTE *metaValue = MetaValues[metaStartIndex + metaValueIndex];

	// loop and copy bytes

	char metaValueLen = metaValue[0];

	++metaValue;

	while (--metaValueLen >= 0)

		keyBytes[--keyIndex] = metaValue[metaValueLen];

}

md5(md5source, md5source);

if (checkKey(md5source))

	output[0] = currentPos;

}[/codebox]

There is a high “warp serialize” value. I have no shared memory defined or used in the program, then what could be the source of the “warp serialize” ?
We have 3 sources make “warp serialize” occurs and that are

  1. Shared memory (you already know)
  2. Constant memory
  3. Eventually, Register can make “warp serialize” occurs.
    And why there are 24 “static shared memory” per block
    That shared memory use to allocate your kernel parameters.
    And in addition, “local load” and “local store” have high value too, because I am using an array of DWORD[16] in the kernel, is there any hint to increase the performance?
    It’s quite hard to answer this question completely and properly. To optimize this field you must have knowledge about coalesced accessing pattern on global memory (because The structure of local memory is same as global memory).
    I hope someone will give you some good advices.

Thank you very much :) I think I am going to look for the “serialize” problem somewhere in the constant memory.

And some more questions, dose every function defined as “device” use the shared memory to store the parameters? Because I have only 1 WORD (2 bytes) and 1 pointer (4 bytes) in the kernel parameters, sum to 6 and far below 24.

And as you said, auto value array in local memory also need to be used carefully because of the coalesced matter, then incorrect use should only increase the “gld uncoalesced” value while not “warp serialize” value if I am right, since it is in the global memory, not shared, constant or register. Is that true?

Any hints to optimize the program will be greatly appreciated! :)