 # Theoretical ON-CHIP Bandwidth how to determine?

Welcome everyone to my second thread on memory bandwidth! (first one on cmem here)

smem := shared memory

cmem := constant memory

tmem := texture memory

gmem := global memory

MP := multiprocessor

(i’m using 9800 GT)

Theoretical Discussion:

The Programming Manual will tell us how to calculate global memory bandwidth, but what about ON-CHIP bandwidths? (i.e. smem, cmem cache, tmem cache, registers)

Global Bandwidth (OFF-CHIP) = (memory clock rate) * (data bus size) GB/s

ON-CHIP Bandwidth = ???

Now if we’re looking inside an MP, I’m guessing the 8 ALUs don’t have anything to do with mem transfer… I’ve got no idea how the data bus is configured… Does it still operate in half warps like a gmem transfer? how long does a half warp take to complete a data transfer, 1 clock cycle?

For a given MP, if we assume 1 clock cycle to transfer a half warp (16 threads), it would then take 16 integers/floats per clock cycle or 64 bytes / clk cycle

furthermore, suppose we have a clock rate of 1400 MHz then 1 clock cycle = 1/clock_rate seconds or 89.6 GB/s (but now i think i’m confusing shader clock w/ memory clock)

Measured Results:

after transferring ~5KB of data from smem to smem (see code below) I got 4.52 GB/s

or 5120 bytes / 1836 clk cycles = 2.8 bytes / clk cycle (not even an integer(4bytes) per clk cycle!)

(again clk cycle is referenced to the shader clock)

For comparison, I found that when transferring ~5kb of data from gmem to smem i got 1.36 GB/s (only ONE MP running 1 block of 128 threads)

or 0.88 bytes / clk cycle (code can be found in my first thread)

IS IT TRUE THAT YOU ONLY EXPERIENCE A SPEED UP OF 3x TIMES??? I thought ON-CHIP memory would be significantly faster… :blink:

thanks for the read! (below you can find smem to smem output & code)

Output:

[codebox]Your GPU clock rate = 1.620000 GHz

Smem test used 10000 test iterations

Total time of 11.333570 ms (18360384 clk cycles)

Average time of 1.133333 us (1836 clk cycles)

Transfered data = 5120 bytes

—> Bandwidth = 4.517647 GB/s <—

[/codebox]

Code:

[codebox]#include <stdio.h>

#include <cuda.h>

extern shared char smem;

global void mem3_kernel( int *data, int *output, int num_elements, int test_iterations )

{

``````int start, stop, i, s;

int* smem1 = (int*)smem;

int* smem2 = (int*)&smem1[num_elements];

// INITIAL transfer gmem to smem

for ( s = 0 ; s < num_elements ; s += blockDim.x )

smem1[tid + s] = data[tid + s] ;

if ( tid == 0 )

start = clock();

for ( i = 0 ; i < test_iterations ; i ++ )

{

// transfer smem to smem

for ( s = 0 ; s < num_elements ; s += blockDim.x )

smem2[tid + s] = smem1[tid + s];

}

if  (tid == 0 )

stop = clock();

// transfer smem to gmem

for ( s = 0 ; s < num_elements ; s += blockDim.x )

output[tid + s] = smem2[tid + s];

if ( tid == 0  )

output = stop - start;
``````

}

int main ()

{

``````int num_elements, data_size;

int grid_size, block_size;

int test_iterations, i;

int total_clk_cycles, ave_clk_cycles;

int *h_data, *d_data, *h_output, *d_output;

float bandwidth, gpu_clk_rate;

float total_time, ave_time;

// get GPU clk rate

cudaGetDeviceProperties(&deviceProp, 0);

gpu_clk_rate = deviceProp.clockRate * 1e-6;

printf("\nYour GPU clock rate = %f GHz\n", gpu_clk_rate);

grid_size  = 1;

block_size = 128;

test_iterations = 10000;

num_elements = block_size*10; // 1280 ints (5.12kb)

data_size    = num_elements*sizeof(int);

// allocate arrays

h_data   = (int*)malloc(data_size);

h_output = (int*)malloc(data_size);

// allocate CUDA arrays

cudaMalloc((void **) &d_data,   data_size);

cudaMalloc((void **) &d_output, data_size);

// fill data

for ( i = 0 ; i < num_elements ; i ++ )

h_data[i] = i+1;

// transfer data from cpu to gpu CONSTANT MEMORY

cudaMemcpy(d_data, h_data, data_size, cudaMemcpyHostToDevice);

// kernel invocation

mem3_kernel <<< grid_size, block_size, data_size*2.3 >>> ( d_data, d_output,  num_elements, test_iterations  );

// trasnfer output from gpu to cpu

cudaMemcpy(h_output, d_output, data_size, cudaMemcpyDeviceToHost);

// calculate bandwidth

total_clk_cycles = h_output;

ave_clk_cycles   = total_clk_cycles / test_iterations;

total_time = total_clk_cycles / gpu_clk_rate / 1e9; // (seconds)

ave_time   = ave_clk_cycles   / gpu_clk_rate / 1e9; // (seconds)

bandwidth  = data_size / ave_time; // (byte/second)

// display results

printf("\nSmem test used %d test iterations\n", test_iterations);

printf("Total time of %f ms (%d clk cycles)\n", (float)total_time*1e3, total_clk_cycles);

printf("Average time of %f us (%d clk cycles)\n", (float)ave_time*1e6, ave_clk_cycles);

printf("Transfered data = %d bytes\n", data_size);

printf("---> Bandwidth = %f GB/s <---\n\n", bandwidth*1e-9);

// free memory

free(h_data);

free(h_output);

cudaFree(d_data);

cudaFree(d_output);
``````

}

[/codebox]

Have you ran this code through the profiler?

yeah, okay! but i’m not sure which signals you want to look at:

[codebox]extern __shared__ char smem[];

global void mem3_kernel( int *data, int *output, int num_elements, int test_iterations )

{

``````int start, stop, i, s;

int* smem1 = (int*)smem;

int* smem2 = (int*)&smem1[num_elements];

// INITIAL transfer gmem to smem

for ( s = 0 ; s < num_elements ; s += blockDim.x )

smem1[tid + s] = data[tid + s] ;

if ( tid == 0 )

start = clock();

for ( i = 0 ; i < test_iterations ; i ++ )

{

// transfer smem to smem

for ( s = 0 ; s < num_elements ; s += blockDim.x )

smem2[tid + s] = smem1[tid + s];

}

if  (tid == 0 )

stop = clock();

// transfer smem to gmem

for ( s = 0 ; s < num_elements ; s += blockDim.x )

output[tid + s] = smem2[tid + s];

if ( tid == 0  )

output = stop - start;
``````

}[/codebox]

I’m also surprised about a single divergent warp but the way the profiler works is a mystery. The thing I wanted to see was warp serialization (ie. bank conflicts) and there’s none.

Perhaps you’re not saturating the device enough. I’ve there’s not enough threads per MP there will be RAW latencies in registers (and in smem too I think) and probably other issues that don’t come to my mind now.
In a toy benchmark (basically full saturation and no global memory reads) I was able to achieve around 400 GFLOPS with MADs on my card (8800 GTS). MADs read three operands and store one. Since a MAD counts as two flops, we can say I had 200 GMADS.
That evaluates into 800 GB/s write throughput and 2400 GB/s read throughput per whole GPU within registers. Divide by 16 to get per-MP result.

Can somebody correct me on those calculations?

hmmm… I’m not sure what you mean by “not saturating the device enough”

we’re talking about smem to smem transfer (data movement on a given MP), this measurement is independent of what the other MPs are doing so we need only measure 1 MP.

as for thread saturation on 1 MP, i’ve tried the following combinations

32 threads -> 1.634043 GB/s (5k data transfer)

64 threads -> 3.099552 GB/s (5k data transfer)

128 threads -> 4.517647 GB/s (5k data transfer)

256 threads -> 4.358591 GB/s (5k data transfer)

512 threads -> 3.851888 GB/s (6k data transfer)

as for varying data size, we have to remember we have a limit of 16k for smem (in my method i copy the first half of smem to the second so i’m limited to 8k)

I found that the bandwidth improves dramatically w/ data size

7.6k -> 4.811137 GB/s

5.6k -> 4.573353 GB/s

4.0k -> 4.297616 GB/s

2.5k -> 3.790859 GB/s

1.5k -> 3.133904 GB/s

1.0k -> 2.567926 GB/s

0.5k -> 1.728000 GB/s

if you graph (and you can kind off see from the numbers) you’ll find the slope is decreasing as you add more data (i.e. it probably saturates around 5.5 GB/s)

as for flops, I think you’ve made a mistake somewhere, you’ve far exceeded theoretical flop for using MAD…

this thread has all the details: http://forums.nvidia.com/lofiversion/index.php?t92140.html

(someone there has tried their 8800 GTS and only achieved 222 Gflops)

Although it is interesting to try to derive bandwidth from flops, but the theoretical values are calculated under the assumption that cores are performing flops every clock cycle, and thus ignore memory latency (i believe they try to mimic this by pre-loading the registers before the measurements begin)

Getting back to main topic… Is there anyone out there that knows how the ON-CHIP memory transfer works? theoretical bandwidth calculation? did anyone else try to measure their effective ON-CHIP bandwidth?(!)(!)(!)

GPU’s are throughput machines, meaning they are designed to handle LOTS of data and threads. They use the abundance of threads/blocks/warps etc. to hide various latencies and achieve greater average bandwidth.

No I didn’t, the theoretical maximum FLOPS of my 8800 GTS 512 is around 600 GFLOPS. This is only theoretically achievable with perfect dual issue, one MAD + one MUL per clock. With MADs alone, you can get about 400 GFLOPS (2/3 of max). I’ve reached 97% of this peak performance. Here’s the code and a relevant topic

http://forums.nvidia.com/index.php?showtop…st&p=581105

The person with a 8800 GTS from the thread you’ve mentioned has an older version of the card. Other people from that thread also back my results of reaching ~95% of MAD-only peak performance.

The hardware should be designed to handle the theoretical peak performance, not just the practical we were able to reach, so my estimations on register throughput are probably conservative.

Cores are performing flops every clock cycle (or at least are able to) if they can work on registers alone. That’s how we count FLOPS on those benchmarks by the way (taking into account shader clock Hz from cudaGetProp). That imposes certain register throughput requirements which led me to those numbers (800, 2400).

Cores cannot do flops directly from shared memory, data from smem needs to be extracted into registers first. I’m not sure what’s the throughput of smem compared to registers. The register file and the smem block are physically very similar AFAIK so I’d suspect similar theoretical throughput (ignoring any scheduling and issuing problems that may lower it significantly) but I’ve never really tested it.

EDIT: I’ve started modifying my flops benchmark to use shared memory where registers would make more sense ;) Initial results show halved performance but I’m getting some weird driver failures and I suspect I screwed something up.

Arithmetic instructions can have one operand coming from shared memory. However they often have a lower throughput in this case.
This is discussed in Vasily Volkov’s paper: http://mc.stanford.edu/cgi-bin/images/6/65…_Volkov_GPU.pdf (section 3.7).

Peak throughput of shared memory is 32B/clock (16 banks, 32-bit wide each running at half the SP clock).

Peak throughput from registers should be around 128B/clock on G80, and probably higher on GT200.
Note that dual issuing MAD+MUL with all inputs coming from different registers requires 160B/clock (5 registers read at the same time). This is my personal explanation for the difficulty to reach peak flops on G80…

Can someone upload the complete source code (Kernel.cu + main.c ) here to test that also ?
I think kernel.cu code is shown here but not the main.c to run/fire that kernel and show the results.
Thanks

hey mitchde… which source code are you referring to? if you’re referring to the code i posted at the very beginning of this thread, it does include kernel+main (it’s all in one file, main() starts about half way down)

(i still plan on replying to Big_Mac & Sylvain Collange, i’ve just been very busy )

Upps, sorry - good if someone can scroll such an codebox ;)

I now see that its already complete. I will share my expierence with that later.

Thanks + sorry

cc = clock cycle

@ Big_Mac

okay my bad, w/ 8800 gts 512 i agree with you, i’ve also ran your code (which i thought was pretty good ) and got 97.108173% for MAD & 73.834145% for MAD+MUL

now for your bandwidth calculations based on the results of your program, i too got the same value if you calculate total bytes and divide by total time

however if you load 3 values and store only 1, then you’re really only transferring 1/3 of the total bytes from one register to another register

for smem we know they transfer in groups of half warps but for registers… well i’m not quite sure what goes on there

@ Sylvain Collange & everyone

Thank you for the paper

it seems like you’ve answered my question of theoretical on-chip bandwidth

From programming manual

where n = 16 for devices of 1.x compute capability (i.e. 16 banks)

which from there gives (32 bit / 2cc) * 16 banks = 32 Bytes/cc (just like you said)

BUT strangely enough the “Best Practices Manual 2.3” says it’s twice as much:

however Vasily Volkov’s paper seems to agree w/ the programming manual

assuming an instruction requires 1 clock cycle then we get 32 bits/2cc

did the Best Practices Manual make a mistake?

so let’s follow Sylvain, Vasily & programming manual and say bw is 32 Bytes/cc, then for my 9800 GT (sp clock 1.62 GHz)

theoretical on-chip bandwidth = 32 Bytes/ cc * 1.62 Gcc/s = 52 GB/s (right?)

hey now we’re getting somewhere! But I still have a problem… at the very beginning of this thread, i measured 4.5 GB/s (it’s almost 32 bit/2cc)

so it looks like the performance of code that has a 16 way bank conflict! but i looked at the profiler (as Big_Mac suggested) and the warp_serialize = 0

also, looking at my kernel (at top of thread) i don’t see how there could be any bank conflicts…

Do I have any bank conflicts?

is this simple diagram correct of how the banks are divided?

a couple side questions for Sylvain Collange

how do you know it’s referencing the SP clock? so the memory gets moved according to SP clock on chip, but gmem data gets moved w/ mem clock, is that right?

how did you get 128 B/clock for registers?

@Nikolai: Try to unloop the innermost loop. The compiler does this if you declare num_elements and the block size as constants. Further, it seems that there should be at least 256 threads for the kernel to occupy one multi-processor fully.

The confusion comes from the fact that shared memory, registers and most of the Streaming Multiprocessor run on a different, twice slower, clock than the SPs. So the bandwidth is equivalently 32B / SP clock or 64B / SM clock.

An instruction takes at least 4 clock cycles. One instruction is needed to read from shared mem and one to write to shared mem.

Since you can’t both read and write shared mem at the same time, bandwidth is halved when copying from shared to shared, compared to shared->registers or registers->shared.

Yes, 52 GB/s one way or 26 GB/s both ways.

Do as suggested by Bloodhunt and look at the produced code using decuda. Your current code uses 9 instructions per loop iteration.

To reach peak bandwidth, the inner loop should look like (2 instructions per transfer):

``````mov.b32 \$r1, s[\$ofs1 += 0x0020]

mov.b32 s[\$ofs1 + 0x0000], \$r1
``````

It’s a bit more complex than that. Inside a SM, data moves at half the SP clock. Across the chip, data moves at the Core clock (around 650 MHz on your (overclocked?) 9800 GT). Between the DRAM and the chip, data moves at the memory clock (1800 MHz).

Of course clocks are meaningless when you don’t know the bus widths…

Speculation considering a register file split into either four 1024-bit banks, or more likely eight 512-bit banks.

thank you Bloodhunt, you’re quite right about unrolling the inner loop and also about declaring num_elements and the block_size as constants

Case 1:

when i transfer data from the first half of smem to the second half (smem1 & smem2 in my code) i get:

15.950769 GB/s or 9.846154 Bytes/clock (61.538464% of theoretical)

(i transfered 7168 bytes of integers using 256 threads)

Case 2:

since there is only 16kb of room on smem, i can only transfer a max of 8k data from smem1 to smem2

so in an attempt to increase the amount of data movement, i filled up almost 16k then just shuffled the data around in groups of 256 ints and get:

(!) 24.784062 GB/s or 15.298805 Bytes/clock (95.617531% of theoretical) (!)

(I transfered 15360 bytes using 256 threads)

but wait! I think it cheated… there’s 15360 bytes which which equals 15 groups of 256 integers

i take data from the first group and use it to overwrite each of the other 14 groups (first group is copied twice to make 15 reads & 15 writes)

but maybe i think it might somehow be just reading the first group ONCE then copying it to all the other group yielding 1 read & 15 writes

Sylvain Collange, i’ve downloaded decuda and used it for my 2 cases and attached them below (first case i unroll the inner loop 2 ways, so there is 2 files)

it looks like there is an add instruction for case 1… you’ll have to forgive me, i’m not very familiar with ptx code… should that be there?

i also see that you helped in the development of decuda, very impressive Anyway, i just realized that the smem bandwidth is lower than the gmem bandwidth (52 GB/s vs 60 GB/s or on a gtx 285 47 GB/s vs 160 GB/s)

yet gmem is suppose to be much slower than smem, well this is b/c of latency… which the theoretical bandwidth formula doesn’t take into account

so i want to make a formula that includes latency by making it a function of data size as well… (i’ll read around and post back what i come up with)

Case 1:

[codebox]#include <stdio.h>

#include <cuda.h>

#define BLOCK_SIZE 256

#define NUM_ELEMENTS 1792 // 256 * 7

extern shared char smem;

global void mem3_kernel( int *data, int *output, int test_iterations )

{

``````int start, stop, i, s;

int* smem1 = (int*)smem;

int* smem2 = (int*)&smem1[NUM_ELEMENTS];

// INITIAL transfer gmem to smem

for ( s = 0 ; s < NUM_ELEMENTS ; s += blockDim.x )

smem1[tid + s] = data[tid + s] ;

if ( tid == 0 )

start = clock();

#pragma unroll 75

for ( i = 0 ; i < test_iterations ; i ++ )

{

// transfer smem to smem

for ( s = 0 ; s < NUM_ELEMENTS ; s += BLOCK_SIZE )

smem2[ tid + s ] = smem1[ tid + s];

/* // manual unroll

smem2[tid] = smem1[tid];

smem2[tid +  256] = smem1[tid +  256];

smem2[tid +  512] = smem1[tid +  512];

smem2[tid +  768] = smem1[tid +  768];

smem2[tid + 1024] = smem1[tid + 1024];

smem2[tid + 1280] = smem1[tid + 1280];

smem2[tid + 1536] = smem1[tid + 1536];

*/

}

if  (tid == 0 )

stop = clock();

// transfer smem to gmem

for ( s = 0 ; s < NUM_ELEMENTS ; s += blockDim.x )

output[tid + s] = smem2[tid + s];

if ( tid == 0  )

output = stop - start;
``````

}

int main ()

{

``````int num_elements, data_size;

int grid_size, block_size;

int test_iterations, i;

int total_clk_cycles, ave_clk_cycles;

int *h_data, *d_data, *h_output, *d_output;

float bandwidth, bandwidth2, gpu_clk_rate;

float total_time, ave_time;

// get GPU clk rate

cudaGetDeviceProperties(&deviceProp, 0);

gpu_clk_rate = deviceProp.clockRate * 1e-6;

printf("\nYour GPU clock rate = %f GHz\n", gpu_clk_rate);

grid_size  = 1;

block_size = BLOCK_SIZE;

test_iterations = 100000;

num_elements = NUM_ELEMENTS;

data_size    = num_elements*sizeof(int);

// allocate arrays

h_data   = (int*)malloc(data_size);

h_output = (int*)malloc(data_size);

// allocate CUDA arrays

cudaMalloc((void **) &d_data,   data_size);

cudaMalloc((void **) &d_output, data_size);

// fill data

for ( i = 0 ; i < num_elements ; i ++ )

h_data[i] = i+1;

//printf("h_data[%d] = %d\n", 1535, h_data);

// transfer data from cpu to gpu CONSTANT MEMORY

cudaMemcpy(d_data, h_data, data_size, cudaMemcpyHostToDevice);

// kernel invocation

mem3_kernel <<< grid_size, block_size, data_size*2 >>> ( d_data, d_output, test_iterations );

// trasnfer output from gpu to cpu

cudaMemcpy(h_output, d_output, data_size, cudaMemcpyDeviceToHost);

// calculate bandwidth

total_clk_cycles = h_output;

ave_clk_cycles   = total_clk_cycles / test_iterations;

total_time = total_clk_cycles / gpu_clk_rate / 1e9; // (seconds)

ave_time   = ave_clk_cycles   / gpu_clk_rate / 1e9; // (seconds)

bandwidth  = data_size / ave_time; // (byte/second)

bandwidth2 = float(data_size)/ave_clk_cycles;

// display results

printf("\nSmem test used %d test iterations\n", test_iterations);

printf("Total time of %f ms (%d clk cycles)\n", (float)total_time*1e3, total_clk_cycles);

printf("Average time of %f us (%d clk cycles)\n", (float)ave_time*1e6, ave_clk_cycles);

printf("Transfered data = %d bytes\n\n", data_size);

printf("%f Bytes/clock (%f%% of theoretical)\n", bandwidth2, bandwidth2*100/16);

printf("---> Bandwidth = %f GB/s <---\n\n", bandwidth*1e-9);

// free memory

free(h_data);

free(h_output);

cudaFree(d_data);

cudaFree(d_output);
``````

}[/codebox]

Case 2:

[codebox]#include <stdio.h>

#include <cuda.h>

#define BLOCK_SIZE 256

#define NUM_ELEMENTS 3840 // 256 * 7

extern shared char smem;

global void mem3_kernel( int *data, int *output, int num_elements, int test_iterations )

{

``````int start, stop, i, s;

int* smem1 = (int*)smem;

// INITIAL transfer gmem to smem

for ( s = 0 ; s < num_elements ; s += blockDim.x )

smem1[tid + s] = data[tid + s] ;

if ( tid == 0 )

start = clock();

#pragma unroll 130

for ( i = 0 ; i < test_iterations ; i ++ )

{

// transfer smem to smem

for ( s = 0 ; s < NUM_ELEMENTS ; s += BLOCK_SIZE )

if ( s == 0 )

smem1[tid+BLOCK_SIZE ] = smem1[tid];

else

smem1[tid+s] = smem1[tid];

}

if  (tid == 0 )

stop = clock();

// transfer smem to gmem

for ( s = 0 ; s < num_elements ; s += blockDim.x )

output[tid + s] = smem1[tid + s];

if ( tid == 0  )

output = stop - start;
``````

}

int main ()

{

``````int num_elements, data_size;

int grid_size, block_size;

int test_iterations, i;

int total_clk_cycles, ave_clk_cycles;

int *h_data, *d_data, *h_output, *d_output;

float bandwidth, bandwidth2, gpu_clk_rate;

float total_time, ave_time;

// get GPU clk rate

cudaGetDeviceProperties(&deviceProp, 0);

gpu_clk_rate = deviceProp.clockRate * 1e-6;

printf("\nYour GPU clock rate = %f GHz\n", gpu_clk_rate);

grid_size  = 1;

block_size = BLOCK_SIZE;

test_iterations = 100000;

num_elements = NUM_ELEMENTS;

data_size    = num_elements*sizeof(int);

// allocate arrays

h_data   = (int*)malloc(data_size);

h_output = (int*)malloc(data_size);

// allocate CUDA arrays

cudaMalloc((void **) &d_data,   data_size);

cudaMalloc((void **) &d_output, data_size);

// fill data

for ( i = 0 ; i < num_elements ; i ++ )

h_data[i] = i+1;

//printf("h_data[%d] = %d\n", 1535, h_data);

// transfer data from cpu to gpu CONSTANT MEMORY

cudaMemcpy(d_data, h_data, data_size, cudaMemcpyHostToDevice);

// kernel invocation

mem3_kernel <<< grid_size, block_size, data_size >>> ( d_data, d_output,  num_elements, test_iterations );

// trasnfer output from gpu to cpu

cudaMemcpy(h_output, d_output, data_size, cudaMemcpyDeviceToHost);

// calculate bandwidth

total_clk_cycles = h_output;

ave_clk_cycles   = total_clk_cycles / test_iterations;

total_time = total_clk_cycles / gpu_clk_rate / 1e9; // (seconds)

ave_time   = ave_clk_cycles   / gpu_clk_rate / 1e9; // (seconds)

bandwidth  = data_size / ave_time; // (byte/second)

bandwidth2 = float(data_size)/ave_clk_cycles;

// display results

printf("\nSmem test used %d test iterations\n", test_iterations);

printf("Total time of %f ms (%d clk cycles)\n", (float)total_time*1e3, total_clk_cycles);

printf("Average time of %f us (%d clk cycles)\n", (float)ave_time*1e6, ave_clk_cycles);

printf("Transfered data = %d bytes\n\n", data_size);

printf("%f Bytes/clock (%f%% of theoretical)\n", bandwidth2, bandwidth2*100/16);

printf("---> Bandwidth = %f GB/s <---\n\n", bandwidth*1e-9);

// free memory

free(h_data);

free(h_output);

cudaFree(d_data);

cudaFree(d_output);
``````

}[/codebox]
cheating_95percent.txt (113 KB)
compiler_unroll_smem1TOsmem2.txt (51.4 KB)
manual_unroll_smem1TOsmem2.txt (49 KB)

Seems the compiler fails to generate post-incrementation of the address register inside the load instruction, and needs a third instruction to increment it. Hence, it only reaches 2/3 of the peak bandwidth.

This doesn’t happen here (but it might in a future compiler version). Here, two instructions are emitted for each transfer, so bandwidth occupation gets close to 100%.

Wait, no!..

Quoted peak gmem bandwidth can only be reached when all SMs access global mem in parallel. If you run a memory benchmark on just one SM, you’ll saturate that part of the on-chip interconnect and get only a fraction of peak bandwidth.

For an apple-to-apple comparison, you need to consider all SMs accessing shared mem in parallel, which gives 5216=832 GB/s on your board and 4730=1.4 TB/s on the GTX 285… Isn’t that better? :)

With enough blocks running coalesced accesses concurrently, the gmem latency should be hidden as well. That is why you need so many threads in the first place…

Sylvain, thank you for your analysis on my ptx code & for correcting my statement on gmem vs smem bandwidth

Arbitrary smem to smem transfer:

The ADD operation that the compiler likes to sneak in drops the bandwidth significantly… (23 GB/s -> 15 GB/s) If I use 1 offset (like Case 2) I don’t experience the extra ADD

i.e. smem1[tid] = smem1[tid + ofs]

but if i use 2 offsets (like Case 1) I DO experience the extra ADD

i.e. smem1[tid + ofs1] = smem2[tid + ofs2]

Therefore in general I will incur a drop in bandwidth whenever i perform an arbitrary transfer (there’s a picture below to help)

and the add instruction is really ridiculous :wacko: it introduces a useless variable ofs2 which triggers the add operation

``````mov.b32 \$r2, s[\$ofs1+0x0030]

mov.b32 s[\$ofs1+0x1c30], \$r2

add.b32 \$ofs2, \$ofs1, 0x00000430   <---  ofs2 = ofs1 + 0x0430

mov.b32 \$r2, s[\$ofs2+0x0000]		  <--- adds 0 to ofs2 !!! (wasted computation)

mov.b32 s[\$ofs1+0x2030], \$r2

...

WHEN IT SHOULD BE:

mov.b32 \$r2, s[\$ofs1+0x0030]

mov.b32 s[\$ofs1+0x1c30], \$r2

mov.b32 \$r2, s[\$ofs1+0x0430]		 <--- just add ofs1 w/ 0x0430 in the address

mov.b32 s[\$ofs1+0x2030], \$r2

...
``````

Surely there must be some trick to get the compiler to behave properly… or else you lose A LOT (23 GB/s -> 15 GB/s) of bandwidth every time an arbitrary smem transfer occurs here are the simplified transfer cases (A1, A2, A3): (full code is attached below)

``````// suppose 256 threads moving 2*256 elements

// A1.  no extra ADD (using 1 smem var) (like case 2)

smem1[tid] = smem1[tid+256];

smem1[tid] = smem1[tid+512];

// A2.  extra ADD (using 2 smem vars) (like case 1)

smem1[tid] = smem2[tid];

smem1[tid+256] = smem2[tid+256]

// A3.  extra ADD (using 1 smem var) (like case 1)

smem1[tid + 512] = smem1[tid];

smem1[tid + 768] = smem1[tid + 256];
``````

Here’s a picture illustrating the data transfer of the above code (for cases A1, A2, A3): Correction to Case 2: (just a statement, no questions here)

Case 2 (the one that achieved 95% theoretical) was cheating! I looked at the ptx code and found it was doing 14 loads & 15 stores (it should be doing 15 of both)

``````mov.b32 \$r1, s[\$ofs1+0x0030]

mov.b32 s[\$ofs1+0x0430], \$r1

mov.b32 s[\$ofs1+0x0430], \$r1		   <--- double store but only single load

mov.b32 \$r1, s[\$ofs1+0x0030]

mov.b32 s[\$ofs1+0x0830], \$r1

mov.b32 \$r1, s[\$ofs1+0x0030]

mov.b32 s[\$ofs1+0x0c30], \$r1

...
``````

which is easily fixed by declaring the smem as volatile to ensure that every reference to smem results in a read instruction.

but it drops my effective bw to 22 GB/s (85% theoretical)… 10% drop! and it only had to do 1 more load instruction out of 30 load/stores (1/30 = 3.3% more)

(if i use 192 threads (instead of 256) i get 23 GB/s (89% theoretical))
A3__text_file_.txt (3.09 KB)
A2__text_file_.txt (3.08 KB)
A1__text_file_.txt (3.28 KB)
A3.cu (2.69 KB)
A2.cu (2.63 KB)
A1.cu (2.84 KB)