# Deep understanding of how a block is actually processed in an MP

Hi there.
I’m working with several CUDA machines: a GTX 260 at home, and a 9800 GT plus 4x Tesla C1060 at work. I’ve read the NVIDIA CUDA Programming Guide all the way through.

I’ve written a simple test kernel: it calculates around a million sines without any accesses to global or shared memory. In other words, this kernel is pure math. It uses 1 register per thread. I launch it with different launch configurations, measure its run time (using cudaStreamSynchronize(0); to wait for the kernel, of course) and calculate performance (number of sines per second).

Things are clear for blocks: one or more blocks (up to 8) can reside on a single MP, while one block can’t be split across 2 MPs. So peak performance is achieved when the number of blocks is a multiple of the number of MPs. The graph of “performance vs. block count” is a “saw” whose “tooth” period equals the number of MPs (14 for my 9800 GT):

This is clear: the GPU processes up to 14 blocks at once, so, for example, with 21 blocks, half the MPs are idle during the second block-processing step, reducing performance.

I still can’t understand how threads are distributed and processed inside a multiprocessor. I ran the same test, varying the number of threads per block, with 14 and 28 blocks total, and this is what I see:

(for 14 blocks)

(for 28 blocks)

I see that 32*8=256 threads is the hardware limit on the number of threads processed at once, since that is where performance peaks. But the official Programming Guide says this limit should be 768 threads, which is 3 times more (Compute Capability 1.2).
This would also mean that every MP core processes a whole warp at once, since the MP processes 8 warps at once; but in that case 9 warps should take 2 cycles, similar to block processing, whereas they are actually processed in around 1.125 cycles.

On the other hand, these graphs also contain a “saw” with a period of 32 threads, i.e. 1 warp. I can hardly understand this on a Friday evening :)

So, if anyone knows, please explain:

1. How many threads can a single MP physically process at once?
2. What does “Maximum number of resident warps per multiprocessor” mean in the Programming Guide?
3. How many threads does a single MP core actually process at once?
4. How is a single block actually split between multiprocessor cores?

Thanks.

If you don’t write back anything, the compiler will completely optimize away your calculation (explaining why the kernel only needs a single register). So you are measuring thread setup throughput, not the throughput of sin().
Do a significant amount of arithmetic in each thread, and then write back something that depends on those calculations.

Hi,

This is an interesting post. I don’t have any answers to your questions, but I would like to repeat it on some of my hardware. What size blocks are you using? What are the units for “performance” on your graphs, and how did you calculate it? Any chance you would post the code for this test?

Ken D.

tera, I know that optimization is taking place. My test is really very synthetic, and it is deliberately made to write nothing out. I know about memory coalescing. For example, a “read->sin->write” test was also run on the same hardware: it took around 2 ms with fully coalesced access (read and write), ~120 ms fully uncoalesced, and ~60 ms half-coalesced (read or write). The idea of this test is to find out how threads are distributed in hardware, in order to find the best launch configuration for a particular GPU.

kaberdude, I’ve varied block sizes and block counts. The “perf vs. block count” test uses blocks of 512 threads, for example.

Performance is roughly calculated as the number of sines divided by the time the kernel took. The number of sines is blockDim * gridDim * SinesPerThread, where SinesPerThread is fixed at 1e+6, while blockDim and gridDim are varied.

Here’s the code. It’s written for Linux and compiled with nvcc on Ubuntu 10.04 and Gentoo, using the Linux gettimeofday() call for time measurement. This version runs 30 blocks of 1…512 threads (the last test).

```
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

#define NUM_PASSES 1000000

<...>

void __global__ mathKernel()
{
    float c[32];
    for(int a=0;a<NUM_PASSES;a++)
        c[a&31]=sinf(c[a&31]+a); // loop body lost in forum formatting; reconstructed sine computation
}

<...>

int main()
{
    <...> //some initialization and device-probe code
    struct timeval tv1,tv2; // needed by the gettimeofday() calls below
    float elapsedTime=0;
    dim3 grid,block;
    block.x=512;
    block.y=1;
    block.z=1;
    grid.x=1;
    grid.y=1;
    grid.z=1;
    FILE *F=fopen("speedinfo.txt","w");
    float max=0;
    int maxBlock=0;
    for(int a=1;a<2;a++)
    {
        grid.x=30; //gives 1 block per MP for C1060
        for(int b=0;b<512;b++)
        {
            block.x=(b+1);
            gettimeofday(&tv1,NULL);
            mathKernel<<<grid,block>>>();
            cudaStreamSynchronize(0);
            gettimeofday(&tv2,NULL);
            elapsedTime=(tv2.tv_sec-tv1.tv_sec)*1000+0.001f*(tv2.tv_usec-tv1.tv_usec); // ms
            printf("pass %d: %.3f ms: %d blocks of %d threads, %d threads total\n",
                   b,elapsedTime,grid.x,block.x,block.x*grid.x);
            float sps=1.0f*block.x*grid.x*NUM_PASSES/elapsedTime/1000;
            if(sps>max) //simple "remember the maximum"
            {
                max=sps;
                maxBlock=block.x;
            }
            printf("%.3f million sines per second\n",sps);
            printf("----------------\n");
            fflush(stdout);
            fprintf(F,"%d %.3f\n",b+1,sps);
            fflush(F);
        }
    }
    printf("Peak performance is %f sines/sec at %d threads\n",max,maxBlock);
    fclose(F);
    return 0;
}
```

A number of people (thinking about the HOOMD-Blue folks here) have found that the ideal launch configuration depends on the combination of your particular GPU and your particular kernel code. Kernel properties like register usage and your distribution of memory I/O vs. floating point operations can affect throughput in non-obvious ways. You might want to try a fully empirical approach of testing a range of sensible block sizes for your real kernel rather than using a no-op or a simple kernel.

This approach is good for a particular kernel, but I don’t want to repeat it every time I change my hardware or write a new kernel. I’d like to make a scalable algorithm for estimating the launch configuration based on the data returned by cudaGetDeviceProperties(), but to build it, I need to know what actually happens when I launch a kernel.
It could also give me some hints on how to distribute shared memory, and what limit on register usage is worth setting.

That’s why I use such synthetic kernels: to pin the kernel against one particular bottleneck (thread distribution, memory coalescing, etc.) and “feel” how it affects total throughput.

Still no idea?
Where can I look for some info to think over? (manual, datasheet, etc.)

It’s me again.
I was able to enable profiling to measure occupancy of the GPU and I didn’t see any correlation between occupancy and performance at all.
The attachment contains the profiler output (cuda_profile_o.log), the program output (dump.log), which includes the GPU parameters as well, the performance output (speedinfo.txt) and a visualization file made in MathCad.

As far as I can see, occupancy is not directly related to performance. Higher occupancy tends to lead to higher performance, but doesn’t guarantee it. What’s wrong?

It is advisable to have good occupancy,
but having good occupancy does not necessarily guarantee performance.

This is documented in the occupancy calculator (at least for the older generations; not sure if something changed in the newer ones).

Those two sentences are mutually contradictory. :)

High occupancy is good for one thing: hiding latency. At some point, latency stops being your bottleneck and you are limited by memory bandwidth, FLOPS or something else. This is why aiming for 100% occupancy is usually counterproductive. Diminishing returns kick in for most kernels between 30 and 50% occupancy.

I still don’t get it… The manual says the 9800 has CC 1.2, which means 768 resident threads per MP.
I know that threads are processed in packs. But what size is this pack? 8 (one thread per core)? 32*8=256 (warp size per core)? 768=32*3*8 (three warps per core)? How many threads are being processed at once in one MP?
What is the “number of resident threads” responsible for? The number of threads kept in the MP’s registers? I don’t get it :(

This depends a bit on 1.x or 2.0/2.1 devices.

For 1.x devices, 8 threads are processed at a given time on one multiprocessor. But the same instruction is run for 4 clock cycles in a row, so you can see it as 32 threads processed in one go at 1/4 of the clock speed.

2.0 devices process 32 threads at a time in an MP: 16 from one warp and 16 from another. 2.1 devices can even run 16 from a third warp.

So for 2.0 devices you can say 32 threads run at a given time.