Strange result comparing a Tesla C1060 against a GTS 250

Hello.

I’m running exactly the same program with the same execution configuration on both a Tesla C1060 and a GTS 250. The execution configuration is: 32 thread blocks, each with 96 threads, 2212 KB of shared memory per block and 14 registers per thread. Strangely, the execution time on the GTS 250 is approximately 2 seconds shorter than on the Tesla. Could some good soul give me a hint about this?

Is this somehow related to latency hiding?

John.

The Tesla C1060 has 30 multiprocessors (SMs). So most SMs will be loaded with just 1 block; only 2 SMs will get 2 blocks, because you are launching 32 blocks.
To hide register-to-register dependencies effectively, you need at least 192 active threads per multiprocessor. Since 1 block has only 96 threads, you get very poor performance on the Tesla.

The GTS 250 has 16 MPs. So each should get 2 blocks and hence 192 threads, which is good enough to hide register latencies.
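
The block counts in this reply can be checked with a few lines of host-side arithmetic. This is only a sketch: it assumes the scheduler spreads blocks as evenly as possible over the SMs, which the hardware does not guarantee, and the function names are made up for illustration.

```c
/* Sketch: how a grid is spread over the multiprocessors (SMs),
   ASSUMING the blocks are distributed as evenly as possible. */

/* Largest number of blocks that any single SM receives. */
int max_blocks_per_sm(int num_blocks, int num_sms) {
    return (num_blocks + num_sms - 1) / num_sms;   /* ceiling division */
}

/* Number of SMs that receive that largest share. */
int sms_with_max_blocks(int num_blocks, int num_sms) {
    int remainder = num_blocks % num_sms;
    return remainder == 0 ? num_sms : remainder;
}

/* Threads resident on the least loaded SM. */
int min_threads_per_sm(int num_blocks, int num_sms, int threads_per_block) {
    return (num_blocks / num_sms) * threads_per_block;
}
```

With 32 blocks of 96 threads, 30 SMs give a minimum of 96 threads per SM and only 2 SMs holding 2 blocks, while 16 SMs give 192 threads on every SM.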

Hi!

2.2 MB of shared memory per block? The maximum shared memory per multiprocessor is 16 KB on CC 1.x and 48 KB on CC 2.0. The strange thing is that it works!
The occupancy of the SM should be 0, because not even one block could be launched per SM.

Regards!

Hello Sarnath. thanks for your reply.

I need to put all this in a paper. Could you please cite an NVIDIA document that states this explicitly?

thanks in advance.

¬¬ 2212 bytes.

How are you measuring the execution time and how are you compiling the code for the Tesla?

CUDA 3.2 - “CUDA C Best Practices Guide” - Search for “192”

$ nvcc nn_par_batch.cu -o nn_par_batch

I’m measuring the execution time with CUDA events. Precisely:

int main(int argc, char *argv[]) {
    cudaEvent_t start, stop;
    float elapsedTime,
          maior = -1,          /* longest time seen (ms) */
          menor = 1000000000,  /* shortest time seen (ms) */
          soma = 0;            /* running total (ms) */

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < N_ITERACOES; i++) {
        configura();

        cudaEventRecord(start, 0);
        /* this function does the work */
        treina();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        cudaEventElapsedTime(&elapsedTime, start, stop);

        if (elapsedTime > maior) maior = elapsedTime;
        if (elapsedTime < menor) menor = elapsedTime;
        soma += elapsedTime;
    }

    printf("%f, %f, %f", menor, maior, soma/N_ITERACOES);
    return 0;
}

You should use -arch=sm_13 when you compile for the Tesla, otherwise the PTX code in your program will be recompiled at runtime for the different architecture. That runtime JIT compilation might be affecting your time measurements, especially if the total kernel execution time is not very long.

But the GTS250 is CC1.2, so it has to recompile as well. Anyway, the more likely explanation is that the GTS250 runs its cores at a higher frequency. While it has much less memory bandwidth and fewer SMs, each SM is faster (1.8GHz vs 1.3GHz). As Sarnath wrote, since 32 blocks are started, each SM on the GTS250 has to run 2 blocks. The C1060 has 30 SMs, so two of its SMs have to run 2 blocks as well. Hence on both cards the maximum number of blocks run per SM is 2. And since the GTS250 SMs work at a higher frequency, they will finish the job sooner as long as you are compute bound.

Cheers

Ceearem

Data: http://www.nvidia.com/object/product_geforce_gtx_280_us.html , http://www.nvidia.com/object/product_geforce_gts_250_us.html , http://www.nvidia.com/object/product_tesla_c1060_us.html

No, the GTS250 is compute capability 1.1, the last evolution of the venerable G92b.

Right, I’ll try it.

Avidday, if I compile the program with -arch=sm_13 it uses 33 registers, which limits the number of active blocks per multiprocessor…

I’m using the Visual Profiler to analyse the execution and I’m getting very weird results:

shared_mem_per_block = 2212
block_x = 96
register_per_thread = 14

occupancy=0
cta_launched=4 or 3

Does an occupancy of zero mean that the kernel isn’t running? But I’m getting correct results! Besides that, with the CUDA Occupancy Calculator I get 56% occupancy. Is this a problem with the Visual Profiler?
cta_launched = 4. But with 30 SMs and 32 blocks, how can one SM receive so many blocks?

I’m a bit confused and would be thankful for any advice.

John.
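
For reference, the 56% figure from the Occupancy Calculator can be roughly reproduced by hand. The sketch below is a simplified model and not NVIDIA's exact algorithm: the per-SM limits are the documented ones for compute capability 1.3, but the 512-unit allocation granularities and the way registers are rounded are assumptions, and all names are hypothetical.

```c
/* Simplified occupancy estimate for one SM. NOT the official calculator. */
typedef struct {
    int max_warps;     /* resident warps per SM */
    int max_blocks;    /* resident blocks per SM */
    int registers;     /* 32-bit registers per SM */
    int shared_bytes;  /* shared memory per SM, in bytes */
    int reg_unit;      /* register allocation granularity (assumed) */
    int smem_unit;     /* shared memory allocation granularity (assumed) */
} SmLimits;

/* Limits for compute capability 1.3 (Tesla C1060); granularities assumed. */
SmLimits cc13_limits(void) {
    SmLimits sm = {32, 8, 16384, 16384, 512, 512};
    return sm;
}

static int round_up(int value, int unit) {
    return ((value + unit - 1) / unit) * unit;
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocks that fit on one SM, limited by warps, registers and shared memory.
   Expects threads > 0. */
int resident_blocks(SmLimits sm, int threads, int regs_per_thread, int smem_per_block) {
    int warps = (threads + 31) / 32;
    int blocks = min_int(sm.max_blocks, sm.max_warps / warps);
    if (regs_per_thread > 0)
        blocks = min_int(blocks, sm.registers / round_up(regs_per_thread * threads, sm.reg_unit));
    if (smem_per_block > 0)
        blocks = min_int(blocks, sm.shared_bytes / round_up(smem_per_block, sm.smem_unit));
    return blocks;
}

/* Resident warps as a percentage of the SM maximum, rounded down. */
int occupancy_percent(SmLimits sm, int threads, int regs_per_thread, int smem_per_block) {
    int warps = (threads + 31) / 32;
    return resident_blocks(sm, threads, regs_per_thread, smem_per_block) * warps * 100 / sm.max_warps;
}
```

For 96 threads, 14 registers per thread and 2212 bytes of shared memory, this model gives 6 resident blocks (shared memory is the limit), i.e. 18 of 32 warps, or 56%. That is the theoretical upper bound; with only 32 blocks over 30 SMs, far fewer blocks are actually resident on each SM.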

Guys,

I don’t even understand why we are taking this so far.

If there are 32 blocks, the scheduler will schedule 30 blocks on 30 SMs. Thus each SM would have 1 block on it == 96 active threads.
If an SM can hold 2 blocks, then 2 SMs alone would run 2 blocks == 192 active threads.

All SMs that run 96 threads will underperform on a Tesla.

However, the GTS250 has 16 SMs. If the occupancy allows 2 blocks per SM, then all 16 would run 2 active blocks == 192 threads. Performance is good.

That’s all. This is what I highlighted in my original reply. I have also pointed John to the official documentation on “192”.

John, are you not convinced by this answer? Am I missing something here?

Best regards,
Sarnath

Sarnath, your answer makes sense and agrees with NVIDIA’s official documentation. But how can I confirm that register data dependencies are actually what is happening? Besides this, I’m getting results that diverge from what we expect.

Again, I’m running the same program with the same configuration on both the GTS and the Tesla. The configuration and results are as follows:

Thread blocks = 103
threads per block = 128
Mean execution time in GTS = 2714 ms
Mean execution time in Tesla = 2187 ms

Okay, the Tesla is faster, but only by about 530 ms? I’m attaching the program. If someone could run it on their hardware and tell me the results, I’d be grateful!

To run the code, just unzip it and run ‘./do’

John.
nn_128.zip (128 KB)

Does the lower clock frequency of the Tesla’s SPs explain the smaller speedup?

This makes total sense. If you are compute bound, your total performance (in arbitrary units) can be estimated by multiplying the number of processors or processor blocks by the frequency:

GTS250: 16SM x 1.8GHz = 28.8 [arb.u.]

C1060: 30SM x 1.3GHz = 39 [arb.u.]

This gives a performance ratio of:

GTS250 / C1060 = 0.74

Your timings give 2187/2714 ≈ 0.81, which is not far off the theoretical compute performance ratio calculated above.

Regards

Ceearem
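
The back-of-the-envelope estimate above can be written out as follows. It is only a sketch under the same assumptions as the post: the kernel is compute bound, every SM is kept busy, and the clocks are the rounded 1.8 GHz and 1.3 GHz figures quoted in this thread. The function names are hypothetical.

```c
/* Compute-bound throughput estimate in arbitrary units: SMs x clock (GHz). */
double compute_throughput(int num_sms, double clock_ghz) {
    return num_sms * clock_ghz;
}

/* Expected throughput of card A relative to card B. */
double throughput_ratio(int sms_a, double clock_a, int sms_b, double clock_b) {
    return compute_throughput(sms_a, clock_a) / compute_throughput(sms_b, clock_b);
}

/* Measured speed of the slower card relative to the faster one
   (times are inversely proportional to throughput). */
double measured_ratio(double time_fast_ms, double time_slow_ms) {
    return time_fast_ms / time_slow_ms;
}
```

Calling throughput_ratio(16, 1.8, 30, 1.3) gives about 0.74 for GTS250 / C1060, and measured_ratio(2187.0, 2714.0) gives about 0.81 for the timings reported with 103 blocks of 128 threads.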