Slow pointer initialization in kernel

Hi,

Could anybody explain the following? When I compile (with nvcc -std=c++11) and run the following code:

#include <iostream>
#include <chrono>
#include <cstdlib>

class TestClass
{
public:

   __device__ inline
   TestClass( int a )
   : num1( a ), num2( a ), num3( a )
   // ,ptr1( this ), ptr2( this ), ptr3( this )
   {
   }

   int num1, num2, num3;

   TestClass *ptr1, *ptr2, *ptr3;

};

__global__ void testKernel()
{
   TestClass t( 2 );
}

int main( int argc, char* argv[] )
{
   dim3 cudaBlockSize( 256 );
   dim3 cudaGridSize( 256 );

   int iteration( 0 );
   auto t_start = std::chrono::high_resolution_clock::now();
   while( iteration < 10000 )
   {
      testKernel<<< cudaGridSize, cudaBlockSize >>>();
      cudaThreadSynchronize();
      iteration++;
   }
   auto t_stop = std::chrono::high_resolution_clock::now();

   std::cout << "Elapsed time = "
             << std::chrono::duration<double, std::milli>(t_stop-t_start).count() << std::endl;

   return EXIT_SUCCESS;
}

I get a message like

Elapsed time = 204.765

When I uncomment the one commented line, I get

Elapsed time = 1218.97

though I would expect something like 400 ms, since initialization of a pointer should take about the same time as initialization of an integer. I tested it with:

Cuda compilation tools, release 7.5, V7.5.17
on GeForce GT 430
and with g++ (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5.

Thanks for any help.

I assume you are not compiling with debug switch (-G)

I ran it on a Quadro5000 gpu, Fedora 20, CUDA 7.5, and got 708.32 and 915.28 respectively.

Is your GT 430 servicing a display?

Also, to remove any CUDA start-up/initialization time overhead, you might want to call the kernel once before starting the timing of the loop. However, for a measurement of ~200 ms I don’t think start-up overhead is affecting you much.
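A minimal sketch of that warm-up (same kernel and launch configuration as in the code above; just one untimed launch before the loop):

```cpp
// Untimed warm-up launch: absorbs CUDA context creation and module
// load, so the timed loop measures only launch + execution cost.
testKernel<<< cudaGridSize, cudaBlockSize >>>();
cudaThreadSynchronize();

auto t_start = std::chrono::high_resolution_clock::now();
// ... timed loop as before ...
```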

Without the extra pointer initialization, the CUDA 7.5 SASS code when compiling for the default cc is:

code for sm_20
                Function : _Z10testKernelv
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];  /* 0x2800440400005de4 */
        /*0008*/         EXIT;                   /* 0x8000000000001de7 */

                ................................

which means everything is fully optimized away.

With the extra pointer initialization, I see:

code for sm_20
                Function : _Z10testKernelv
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];        /* 0x2800440400005de4 */
        /*0008*/         ISUB R1, R1, 0x28;            /* 0x4800c000a0105d03 */
        /*0010*/         MOV32I R3, 0x2;               /* 0x180000000800dde2 */
        /*0018*/         MOV32I R2, 0x2;               /* 0x1800000008009de2 */
        /*0020*/         LOP.OR R4, R1, c[0x0][0x4];   /* 0x6800400010111c43 */
        /*0028*/         MOV R5, RZ;                   /* 0x28000000fc015de4 */
        /*0030*/         LOP32I.AND R0, R4, 0xffffff;  /* 0x3803fffffc401c02 */
        /*0038*/         STL.64 [R0], R2;              /* 0xc800000000009ca5 */
        /*0040*/         STL [R0+0x8], R2;             /* 0xc800000020009c85 */
        /*0048*/         STL.64 [R0+0x10], R4;         /* 0xc800000040011ca5 */
        /*0050*/         STL.64 [R0+0x18], R4;         /* 0xc800000060011ca5 */
        /*0058*/         STL.64 [R0+0x20], R4;         /* 0xc800000080011ca5 */
        /*0060*/         EXIT;                         /* 0x8000000000001de7 */
                ................................

So in this case the compiler isn’t smart enough to remove everything (although it could/should, since your code affects no global state in either case).

Therefore I would expect the extra-pointer case to be noticeably/significantly longer. The assumption of scaling based on integer initialization is not valid, since no integer initialization actually happens in the first case.

Thanks for the quick reply. Yes, I compile it as

nvcc --std=c++11 -O3 …

My GeForce is servicing a display. Running the kernel once before measuring the time gives more stable timing; it is 130 and 1050 now. I tried it also on a Tesla K40 which is not used for an X server. The times are 224 and 814. So do you think it might be a problem with the compiler? Btw, how can I see the disassembled code?

I don’t think it’s a significant problem with the compiler. I personally would not spend my time analyzing the performance of an empty kernel, or comparing the performance of two codes where one compiles to an empty kernel.

I also would not spend any time analyzing the performance of code that should compile to an empty kernel, regardless of what the compiler actually does with it.

Regarding disassembly of code, start by reading the documentation:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#abstract
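The SASS listings above are in the format produced by cuobjdump from the CUDA binary utilities. Assuming your executable is named a.out, something like:

```shell
# disassemble the embedded GPU machine code (SASS)
cuobjdump -sass a.out

# or dump the PTX intermediate representation instead
cuobjdump -ptx a.out
```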

The thing is, this actually slows down my real code. We are developing a library where we need to initialize a small object in each CUDA thread. In many cases the thread then performs just a few arithmetic operations, and so this slow pointer initialization takes a significant portion of the computation time :-(. The code I am showing here is just the simplest code exhibiting this problem.

That’s admirable, and for the case of a code behavioral issue (“why isn’t this code working?”) it is exactly what I would want, if I were working on it.

But in the case of a code performance issue or analysis, I don’t recommend stripping everything out. I certainly don’t recommend reducing the code to a point where it affects no global state, because as we’ve already seen, the compiler would or should reduce that code to an empty kernel, and IMO that is just not a logical point of comparison for perf analysis. (You should also never do perf analysis on debug-compiled code.)

Perf analysis requires a little bit more finesse than that. Either you should analyze the actual code you care about or else you need to carefully validate (perhaps with kernel results analysis, or else binary analysis as I have done) that the actual performance comparison of modified code is relevant (for example comparable, because 2 kernels are producing the same results, i.e. the same global state).

In this case I claim it is not. You may disagree; that’s fine; it’s the nature of community. I’m merely stating my opinion.

If you really want to focus on just the code comparison you have provided, my only response would be “Yes, I’m not surprised the 2nd case is 6x longer. It’s executing at least 6x the instructions of the first case.” But for me, that sort of response would be unsatisfying, as this is not the correct way to gain insight into your real problem. It would be a silly response anyway, since kernel launch overhead is usually at least on the order of a few microseconds, and it should swamp the effect of instruction execution for such short kernels. That doesn’t seem to be happening in this case, but again, this is just not the way I would analyze a perf comparison.

In many cases the thread then performs just a few arithmetic operations

You may look into increasing the amount of work performed by each thread; it’s a standard optimization for such cases.

The situation is as follows. We are developing a library for numerical computing. The aim is that the user writes the code just once and we can build from it a solver for both CPU and GPU. I test the performance on a simple heat equation, where we approximate the Laplace operator by the classical five-point stencil as described here

https://en.wikipedia.org/wiki/Discrete_Laplace_operator

I compare the performance of the kernel generated by our library with the performance of a kernel written in pure C. The pure C implementation is 50% faster. If I extract the code from the library into one file, it is more than 5000 lines, which I just did not want to submit here :). So I kept removing all unnecessary code until I reached what I have submitted. I cannot increase the amount of work in the kernel since it evaluates a very simple Laplace operator. And I cannot avoid the initialization of the three pointers. After a few years of development I have a solver which is 50% slower on the GPU, and it seems that I cannot do anything about it except hope that the compiler will be smarter in a new version of CUDA. Or is there a way to report this as an nvcc “bug”? I cannot find a forum for nvcc.
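For reference, the five-point stencil itself is only a few operations per point. A host-side C++ sketch (grid layout and names are illustrative, not our library’s API):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Classical five-point stencil approximation of the Laplace operator
// on a uniform grid with spacing h, stored row-major with row length nx:
//   (u_W + u_E + u_S + u_N - 4*u_C) / h^2
double laplace5( const std::vector<double>& u, int nx, int i, int j, double h )
{
   return ( u[ j * nx + ( i - 1 ) ]      // west
          + u[ j * nx + ( i + 1 ) ]      // east
          + u[ ( j - 1 ) * nx + i ]      // south
          + u[ ( j + 1 ) * nx + i ]      // north
          - 4.0 * u[ j * nx + i ] ) / ( h * h );
}
```

For u(x,y) = x² + y² the stencil is exact and returns the Laplacian 4 at every interior point, which makes a handy sanity check.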

You can report a bug.

Go to developer.nvidia.com, become a registered developer, and use the bug submission web page/form that is available there.

What I’m suggesting is that you actually analyze the kernels you care about, rather than these stripped-down ones. Your hand-written C Laplace kernel shouldn’t be more than a dozen or so lines of code. If your machine-generated kernel is less than 100 lines of code, start there.

Your paring of the kernel down to an “empty” kernel is not useful for understanding performance issues in your machine-generated kernel.

Yes, you are right. It is a good idea to submit the kernels. I will try it.

I cannot increase the amount of work in the kernel since it evaluates a very simple Laplace operator.

Why can’t one kernel compute data for multiple points? I don’t know your code, but it looks like you are asking the compiler to optimize something it cannot optimize, and which you should instead do on your side.

Our experience when writing numerical kernels without a C++ framework is that mapping one CUDA thread to one point is best. It is in good agreement with all CUDA performance recommendations. I just do not understand why initialization of pointers is so much slower than initialization of integers. It does not make sense to me.

I don’t believe that to be a supportable claim. Increasing the work per thread can have a variety of benefits, and is part of typical CUDA performance suggestions, such as here:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

(slide 13)

and here:
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

(slide 31)

By contrast, I know of no considered CUDA optimization treatments that suggest that “one thread per point” is in any way optimal.

There’s nothing magical about one thread per point, from an optimization perspective, and it is certainly not optimal for many problems. The principal reason that people consider it is that code formulation around that strategy is oftentimes relatively simple.

And regarding this:

“initialization of pointers is so much slower than initialization of integers.”

I don’t believe you’ve demonstrated that, at least not in this thread. Your provided code compares a kernel that does some initializations to a kernel that does nothing.

I think the bigger aspect that bears on this particular experiment is that GPUs are designed as throughput devices, at the expense of latency. So optimizations need to focus on achieving high throughput. By contrast, typical CPU software optimization is often concerned with reducing latency. The experiment performed here could be useful when programming for the CPU, but it provides no relevant information in a GPU programming context, as it primarily serves to expose latency in a contrived scenario. Clearly a null kernel is going to have lower latency than a kernel that has read-write dependencies.

Sometimes having each thread of a thread block handle just one destination data item is a sensible strategy, especially when there is much computation per thread, and when it significantly simplifies addressing arithmetic. But this is by no means a hard-and-fast recommendation, as txbob points out, and in many real-life codes each thread will handle multiple destination data items.
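To make the “multiple items per thread” suggestion concrete, here is a sketch of the standard grid-stride loop pattern (kernel and array names are illustrative, not from the library under discussion):

```cpp
// Grid-stride loop: each thread handles several points, so any fixed
// per-thread setup cost (such as constructing a small object) is
// amortized over more useful work.
__global__ void processPoints( double* out, const double* in, int n )
{
   for( int idx = blockIdx.x * blockDim.x + threadIdx.x;
        idx < n;
        idx += blockDim.x * gridDim.x )
   {
      out[ idx ] = 2.0 * in[ idx ] + 1.0;  // placeholder per-point work
   }
}
```

This also decouples the grid size from the problem size: the same launch configuration works for any n.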

I hope this code will be more convincing:

#include <iostream>
#include <chrono>
#include <cstdlib>

#include <cuda.h>

class TestClass
{
public:

   __device__ inline
   TestClass( int a )
   : num1( a ), num2( a ), num3( a )
   , ptr1( this ), ptr2( this ), ptr3( this ) // comment this
   {
   }

   double num1, num2, num3;

   TestClass *ptr1, *ptr2, *ptr3;

};

__global__ void testKernel( double* res )
{
   TestClass t( 2.0 );
   res[ blockIdx.x * 256 + threadIdx.x ] = t.num1 + 2.0 * t.num2 + t.num3 / 2.0 + 5.0;
}

int main( int argc, char* argv[] )
{
   dim3 cudaBlockSize( 256 );
   dim3 cudaGridSize( 256 );

   double* res;
   cudaMalloc( ( void ** ) &res, 65536 * sizeof( double ) );
   testKernel<<< cudaGridSize, cudaBlockSize >>>( res );
   cudaThreadSynchronize();
   int iteration( 0 );
   auto t_start = std::chrono::high_resolution_clock::now();
   while( iteration < 10000 )
   {
      testKernel<<< cudaGridSize, cudaBlockSize >>>( res );
      cudaThreadSynchronize();
      iteration++;
   }
   auto t_stop = std::chrono::high_resolution_clock::now();

   std::cout << "Elapsed time = "
             << std::chrono::duration<double, std::milli>(t_stop-t_start).count() << std::endl;

   double aux[ 65536 ];
   cudaMemcpy( aux, res, 65536 * sizeof( double ), cudaMemcpyDeviceToHost );
   for( int i = 0; i < 65536; i++ )
      if( aux[ i ] != 12.0 )
         std::cerr << "x";
   cudaFree( res );

   return EXIT_SUCCESS;
}

I have added some work with double-precision arithmetic and a write to global memory. The times are 315 ms without pointers and 1755 ms with pointers.

And ok, your idea of mapping more points to one CUDA thread is interesting. I will try it.

It is in good agreement with all CUDA performance recommendations.

Show us even one such recommendation :) CUDA learning examples usually limit themselves to processing one item per thread in order to simplify the code.

dim3 cudaBlockSize( 256 );
dim3 cudaGridSize( 256 );

while( iteration < 10000 )
{
testKernel<<< cudaGridSize, cudaBlockSize >>>( res );
cudaThreadSynchronize();
iteration++;
}
  1. There is a CODE tag here - it’s the last button in the toolbox above the edit box.
  2. Each kernel call in your code runs only 64K short threads, so you spend most of the time in cudaThreadSynchronize() rather than in computation. The synchronization has to happen anyway, so the only way to spend less time in it is to increase the number of threads 10-100 times.

I am getting the same results even when I take cudaThreadSynchronize() out of the loop.

Yes, of course. As I said, you need to give more work to every kernel, otherwise you are mostly waiting for the kernel to finish, either with an explicit cudaThreadSynchronize call or with an implicit one. Do you know that the GPU completely finishes execution of one kernel before starting the next one?

Ok, but this is not what I want to solve in this discussion. The main topic is the extremely slow initialization of pointers in an object constructor. I have finally reported it as an nvcc bug. It seems that the nvcc developers successfully reproduced the same behaviour and have started investigating it. So we will see ;).