Problem with scanLargeArray: I get different results :(

Well, I’d say comment out all the compute code you can, leaving just the gmem reads/writes, and time it again. That should give you an idea of the real I/O bandwidth you’re achieving. My guess is that it’ll be higher than 33 GB/s.

Paulius
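For illustration, a stripped-down timing kernel along the lines Paulius suggests might look like this (a minimal sketch; the kernel name, the float element type, and the bandwidth formula are assumptions, not the poster’s actual code):

// Hypothetical I/O-only kernel: all compute removed, gmem traffic kept.
__global__ void ioOnly(const float *d_in, float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_in[i];
}

// Effective bandwidth for n floats, counting both reads and writes:
// (n * 4 * 2) / (elapsed_seconds * 1e9) GB/s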

Thought I found a bug using a new test case… but the bug turned out to be in the new test case -_-

With the calculation commented out:
n=16000001 Time: 4.58103 ms
That’s somewhere around 40 GB/s. Still not at the peak :(

Deferred a shared memory write and improved a little bit.
n=16000001 Time: 5.23027 ms
no error
TEST PASSED!

By the way, does __syncthreads wait for global memory operations to complete?

I don’t think so. Global memory access instructions don’t block. A thread may block if an instruction’s argument is not ready (for example, hasn’t arrived from gmem yet).

However, if what you have is something like

sa[...] = ga[...];

__syncthreads();

where sa and ga are shared and global memory arrays, respectively, I think __syncthreads will happen after the values have arrived from gmem. The reason is that at the assembly level you first read from gmem into a register, then copy the register to smem, and that copy blocks and waits for the value to appear in the register. To be sure, though, I’d check the .ptx to see where exactly __syncthreads gets put.

Paulius
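A minimal sketch of the pattern being described (array names follow the snippet above; the 256-thread block size and the kernel shape are made up for illustration):

__global__ void stage(const float *ga, float *out)
{
    __shared__ float sa[256];   // launch with 256-thread blocks
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // At the assembly level this is ld.global reg <- ga[i], then
    // st.shared sa[tx] <- reg. The shared store waits until reg has
    // arrived, so by the time __syncthreads() releases the block, every
    // thread's value is already sitting in shared memory.
    sa[threadIdx.x] = ga[i];
    __syncthreads();

    // Now safe to read values staged by other threads of the block.
    out[i] = sa[blockDim.x - 1 - threadIdx.x];
}

Compiling with nvcc -ptx shows where the bar.sync for __syncthreads lands relative to the loads.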

… the time you get is really good. I have an array of 16M elements, and I use each thread to read 17 numbers from global memory to shared memory, 96 threads per block, and it takes 9.5 ms just for the read from global to shared!!! Why does it take so long? I use an 8800 GTS. The reason I chose 17 is that in this case there are no bank conflicts, I think.

Can someone tell me why it is so slow? paulius, asadafag?

That’s only ~7 GB/s. It would seem that your global reads are not coalesced. You should be able to get up to 70 GB/s, regardless of shared memory bank conflicts.

unsigned int bx = blockIdx.x;
unsigned int tx = threadIdx.x;
unsigned int ai=__mul24(tx,17); // 17 means each thread reads 17 numbers
unsigned int bi=__mul24(bx,1632)+ai; // 1632 means each block reads 1632 numbers, 96 threads per block
// start to read: copy this thread's 17 numbers from global to shared
for (int k = 0; k < 17; ++k)
    s_data[ai + k] = d_input[bi + k];

Can someone tell me why it is slow?

It is as I said: your global memory reads are not coalesced.

Why is it not coalesced? :(

Look in the guide: coalescing requires that entire warps access contiguous memory regions. There are a few other requirements (an additional alignment requirement, for instance), but that is the gist of it. You have each thread in a warp accessing locations 17 elements apart.

If you really, really need to access memory in that pattern, you could use a texture. But you could instead change each thread to read 16 values and read contiguous rows with the proper alignment from d_input. You could still read your 17, but then you would need padding inside d_input. That would also avoid bank conflicts, because each thread i would be writing shared memory location base+i.
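A sketch of that 16-values-per-thread variant (reusing dingshuai1985’s names; the 1536-word block chunk and the loop are assumptions):

unsigned int tx = threadIdx.x;                        // 96 threads per block
unsigned int blockStart = __mul24(blockIdx.x, 1536);  // 96 threads * 16 numbers
// In each iteration the threads of a half-warp read 16 consecutive,
// properly aligned words from gmem: coalesced. And each thread i writes
// shared location (k * 96) + i, so the stores are conflict-free as well.
for (int k = 0; k < 16; ++k)
    s_data[k * 96 + tx] = d_input[blockStart + k * 96 + tx];

After this loop the block’s 1536 words sit in shared memory in global order, so thread tx’s own 16 values live at s_data[tx*16 .. tx*16+15]; accessing those later is where the bank-conflict question comes back (see further down).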

What do you mean by “contiguous memory regions”? The first thread reads the first 17 numbers in global mem, and the next thread reads the next 17; isn’t that contiguous?
Also, I changed my code so that each thread reads one number from global to shared; for 16M numbers it takes 2.7 ms. Is that also slow?

Thanks MisterAnderson42 :P

Read pages 44 and 45 of the Programming Guide carefully. That’s where the requirements for coalescing are described.

Paulius
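To spell out the “contiguous” point with the posted code: the requirement applies per load instruction across a half-warp, not to the total range each thread covers over time. The first read, s_data[ai] = d_input[bi], touches (with b the block’s base offset):

thread 0 reads d_input[b + 0]
thread 1 reads d_input[b + 17]
thread 2 reads d_input[b + 34]
…
thread 15 reads d_input[b + 255]

Sixteen addresses 17 words apart cannot be merged into one contiguous access, so the half-warp issues 16 separate memory transactions, and the same happens for each of the other 16 reads.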

To Paulius:
Strangely, I found that consuming each value in an add right after its texture load is faster than saving them all in registers and adding after all the loads (I checked the ptx to be sure).
Since I’m running out of registers, it’s unlikely I can do much about the reads without changing the algorithm. However, maybe something can be done with the writes. As you say, global operations are non-blocking. When would a write block an instruction, then? When the source register is overwritten?

To dingshuai1985:
Coalescing is really important. I paid two syncs and a lot of shared memory I/O for coalescing, and it’s well worth it.
Also, the usual way to avoid bank conflicts is to have each thread use 16 dwords and waste 1, instead of using all 17, since it’s nearly impossible to coalesce a 17-dword write.
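A sketch of that padded layout, combined with the coalesced loop shown earlier (the transposed store and the 17-slot row stride are the assumptions here):

__shared__ int s_data[96 * 17];   // 17-slot rows: 16 data words + 1 wasted
unsigned int tx = threadIdx.x;    // 96 threads per block
unsigned int blockStart = __mul24(blockIdx.x, 1536);
for (int k = 0; k < 16; ++k)
{
    unsigned int g = k * 96 + tx;   // offset inside the block's 1536-word chunk
    // Coalesced: consecutive threads fetch consecutive global words.
    int v = d_input[blockStart + g];
    // Transposed store: word g belongs to thread g/16, slot g%16. With a
    // row stride of 17, a half-warp's 16 stores land in 16 different banks.
    s_data[(g >> 4) * 17 + (g & 15)] = v;
}
__syncthreads();
// Thread tx now owns 16 consecutive values at s_data[tx*17 .. tx*17+15],
// and accesses to them sweep all 16 banks instead of hammering one.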

I believe a write is always non-blocking. I’m not sure when the source register can be reused without a block. Basically, from an assembly point of view, it’s safe for the succeeding instruction to reuse it, since register conflicts are prevented by the hardware. However, this may mean that the succeeding instruction blocks until the register value has been handed off to the “write circuitry.” If I had to make an educated guess, having 192 concurrent threads (25% occupancy) should be plenty to hide this latency (it is enough for RAW dependencies, as pointed out in the programming guide).

Paulius

Oh, this topic is getting hot… My ignorance must be on display for many people :)

I tried the best prescan from UIUC, and it seems the copy from global to shared takes 2.5 ms for 16M numbers… that’s only about 26 GB/s. Why is it so slow? Also, it does not use cudaThreadSynchronize(). Why?

Profiling seems to have confirmed that…

method=[ e023a16e8_devgrandscan0 ] gputime=[ 1719.808 ] cputime=[ 1780.394 ] occupancy=[ 0.250 ]

method=[ e023a1940_devgrandscan1 ] gputime=[ 2730.016 ] cputime=[ 2784.711 ] occupancy=[ 0.250 ]

My reduction pass (read + compute) seems to be spending too much time. If the thing is entirely memory bound, the scan pass (read + compute + write) should take roughly 2x the time of the reduction pass, right?
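Taking the profiler numbers at face value (gputime is in microseconds, and assuming 4-byte elements): the reduction pass moves about 64 MB (one read of 16M words) in 1.72 ms, roughly 37 GB/s, while the scan pass moves about 128 MB (a read plus a write) in 2.73 ms, roughly 47 GB/s. A purely memory-bound pair would indeed show a 2x time ratio; the observed 2730/1720 ≈ 1.6x means the reduction pass is the one running below par.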

It seems the problem with my current algorithm is that it can’t overlap computation with memory latency, due to shared memory and syncs.

Maybe I could try a syncless one… I’ll do some redesigning when I have time.
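For reference, the usual syncless trick of that era (a guess at the approach, not necessarily what was implemented here): keep all communication inside a warp, whose threads execute in lockstep on this hardware, so no __syncthreads is needed.

// Hypothetical warp-synchronous inclusive scan. s points to this warp's
// 64-entry shared memory region; the lower 32 entries are zero padding.
__device__ int warpScan(volatile int *s, int v)
{
    unsigned int lane = threadIdx.x & 31;
    s[lane] = 0;                 // zero the padding half
    unsigned int i = lane + 32;
    s[i] = v;
    // Lockstep execution makes each store visible to higher lanes without
    // a barrier; volatile keeps the compiler from caching s in registers.
    s[i] = v = v + s[i - 1];
    s[i] = v = v + s[i - 2];
    s[i] = v = v + s[i - 4];
    s[i] = v = v + s[i - 8];
    s[i] = v = v + s[i - 16];
    return v;                    // inclusive prefix sum across the warp
}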

Edit: the syncless version is surprisingly easy to do…

n=16000001 Time: 4.02900 ms

no error

TEST PASSED!

However, bank conflicts have increased and slowed down the 8M case. By the way, bank conflicts between the two half-warps of a warp matter, but bank conflicts between warps don’t, right?

Bank conflicts matter only within half-warps. Conflicts between threads from different half-warps of the same warp should not affect performance (plus, there’s no way to avoid them; it’s the good old pigeonhole problem: you have 32 accesses and only 16 banks to spread them over).

Paulius

It’s rather embarrassing to have forgotten this…:(

I modified the thing into a short-to-int scan today (simply changing the int2 tex fetch to an int tex fetch plus a few bit operations; all other parts are unmodified). But surprisingly, the speed remains exactly the same!?

Does that imply the actual bottleneck is tex1Dfetch’s call overhead? Or rather, that the memory latency is already completely hidden behind arithmetic?
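For reference, the change described probably amounts to something like this (a guess at the details; the texture name and helper are hypothetical):

texture<int, 1, cudaReadModeElementType> texInput;

__device__ void fetchTwoShorts(int i, int &lo, int &hi)
{
    int packed = tex1Dfetch(texInput, i);  // one int fetch carries two shorts
    lo = (short)packed;                    // low 16 bits, sign-extended
    hi = (short)(packed >> 16);            // high 16 bits (arithmetic shift)
}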