Problem with cudaProf counters: I can't get a correct value for "warp serialize"

I use the following code to determine which values (32-bit unsigned integers) exist in both pList1 and pList2:

[codebox]157 __global__ void BinarySearchSmem(unsigned int *pList1, unsigned int nList1Length, unsigned int *pList2, unsigned int nList2Length, unsigned int *pIsCommon) {
158
159     __shared__ unsigned int s_pList2[SMEM_SIZE];
160
161     unsigned int nThreadId = blockIdx.x * blockDim.x + threadIdx.x;
162     if(nThreadId >= nList1Length) return;
163
164     //load all docIDs in the long list into smem
165     for (unsigned int i = 0; i < SMEM_SIZE; i += blockDim.x)
166     {
167         s_pList2[i + threadIdx.x] = pList2[i + threadIdx.x];
168     }
169     __syncthreads();
170
171     int nLeft = 0;
172     int nRight = nList2Length - 1;
173     unsigned int nTarget = pList1[nThreadId];
174
175     int found = 0;
176
177     while (nLeft <= nRight)
178     {
179         unsigned int nMid = (nLeft + nRight) / 2;
180         unsigned int nMidValue = s_pList2[nMid];
181         if(nMidValue > nTarget) {
182             nRight = nMid - 1;
183         }
184         else if(nMidValue < nTarget) {
185             nLeft = nMid + 1;
186         }
187         else {
188             found = 1;
189             break;
190         }
191     }
192
193     pIsCommon[nThreadId] = found;
194
195
196 }
[/codebox]

The values of pList1 are 0, 1, 2, 3, ..., 2047; the values of pList2 are 0, 2, 4, 6, ..., 4094.

nList1Length and nList2Length are both 2048, which equals SMEM_SIZE.

The problem is: under cudaProf 2.3 (Windows) and cudaProf 3.0 (Linux), on a Tesla C1060, the "warp serialize" value for this kernel is ZERO:

[codebox]Method                                  BinarySearchSmem
#Calls                                  1
GPU usec                                27.36
CPU usec                                37.66
%GPU time                               28.44
sm cta launched                         0
branch                                  0
divergent branch                        0
instructions                            0
warp serialize                          0
cta launched                            1
local load                              0
local store                             0
gld 32b                                 0
gld 64b                                 144
gld 128b                                0
gst 32b                                 256
gst 64b                                 0
gst 128b                                0
gld request                             0
gst request                             0
tlb hit                                 0
tlb miss                                1
glob mem read throughput (GB/s)         3.32
glob mem write throughput (GB/s)        2.95
glob mem overall throughput (GB/s)      6.26
gld efficiency                          0
gst efficiency                          0
instruction throughput                  0

[/codebox]

I have also tried creating an 8-way bank conflict: I commented out lines 177 to 193 and inserted the following code:

[codebox]

pIsCommon[nThreadId] = s_pList2[threadIdx.x % 2];

[/codebox]

However, the “warp serialize” counter is still 0. orz…

[codebox]

gpustarttimestamp    11a7bfb765ee5040
method               Z16BinarySearchSmemPjjS_jS
gputime              9.09
cputime              19.31
occupancy            0.25
gridSizeX            9
gridSizeY            1
blockSizeX           256
blockSizeY           1
blockSizeZ           1
dynSmemPerBlock      0
staSmemPerBlock      8228
registerPerThread    7
streamID             0
memTransferSize
sm_cta_launched      0
branch               0
divergent_branch     0
instructions         0
warp_serialize       0
cta_launched         1
local_load           0
local_store          0
gld_32b              0
gld_64b              128
gld_128b             0
gst_32b              0
gst_64b              16
gst_128b             0
gld_request          0
gst_request          0
tlb_hit              0
tlb_miss             1

[/codebox]

My questions are:

  1. Why can’t I get the correct “warp serialize” value?

  2. How does cudaProf measure the global memory throughput (read/write)?

    cudaProf 3.0 (Linux) showed me that the kernel has a throughput of over 300 GB/s. Aha, that's quite strange. We don't need Fermi any more :-)

    cudaProf 2.3 (Windows) showed me that the throughput is about 5 GB/s, which is quite normal.