Compiling under CUDA 5.5 uses unnecessary global memory

Hi,

compiling the following code snippet

extern "C" __global__ void
test_kernel(double * C, const double * A)
{
   const int tix = threadIdx.x;
   float tmp = (float)tix;
   const double *trackA = A + (int)tmp;
   *C = trackA[0];
}

saved as test.cu with

nvcc -arch sm_35 -v -Xptxas="-v" -keep -ftz=false -cubin test.cu

I get the following strange output

ptxas info    : 11 bytes gmem
ptxas info    : Compiling entry function 'test_kernel' for 'sm_35'
ptxas info    : Function properties for test_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 6 registers, 336 bytes cmem[0]

Inspecting the ptx file I find

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Fri May 10 02:50:05 2013 (1368147005)
// Cuda compilation tools, release 5.5, V5.5.0
//

.version 3.2
.target sm_35
.address_size 64

	.file	1 ".../test.cu", 1375867951, 197
	.file	2 "/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_device_runtime_api.h", 1369186301, 7655
	.file	3 "/usr/local/cuda/bin/../targets/x86_64-linux/include/device_functions.h", 1369186301, 185228
.global .align 1 .b8 $str[11] = {95, 95, 67, 85, 68, 65, 95, 70, 84, 90, 0};

i.e. there is a string variable ‘__CUDA_FTZ’ placed in global memory. The trigger seems to be the
expression “A + (int)tmp”. Setting the compiler option -ftz=false or -ftz=true does not make any difference. I did not see this behaviour under CUDA 5.0. I also do not understand why cuda_device_runtime_api.h is included at all. Thanks for any help on this.
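
A quick control experiment (just a sketch, I have not run it yet) would be to compile the same kernel without the float round-trip, indexing directly with the integer thread index, and check whether the $str entry disappears from the PTX:

extern "C" __global__ void
test_kernel_int(double * C, const double * A)
{
   // same load as above, but without the float-to-int conversion
   const int tix = threadIdx.x;
   *C = A[tix];
}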

If these are changes between the CUDA 5.0 and CUDA 5.5 toolchains, they seem harmless. What makes you concerned about this?

I always find it difficult to transition from one CUDA version to the next. I see lots of performance variations in my Kepler-optimized DGEMM kernels. I do not think that this is the source of the performance fluctuations, I just want to rule it out.
Besides, why on earth would someone write a string (11 bytes) to global memory when a single byte would be more than enough space to store a global state?

Are the DGEMM calls in question CUBLAS DGEMM calls? In general, performance for those should only go up between versions (provided there is any change at all). If you observe a performance regression in CUBLAS DGEMM (beyond a noise level of a couple of percent), I would suggest filing a bug report.

If you are not using CUBLAS DGEMM, I would suggest giving CUBLAS a try, as it is unlikely locally written code could be much faster, especially on Kepler platforms.

No. The cuBLAS DGEMM is not optimized for the matrix dimensions that I must handle. The software I am working on must execute countless small DGEMM calls (typical dimensions between 64 and 256 in each dimension). Many of these DGEMM calls are independent, so we can use streams to execute them concurrently. The matrix dimensions are not uniform, so using the batched cuBLAS DGEMM interface does not work either.
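
To illustrate the dispatch scheme, here is a simplified sketch (dgemm_32x32 is just a stand-in for our actual kernel, the launch configuration assumes the 32 x 32 tile / 2 warp setup described below, and error checking is omitted):

#include <cuda_runtime.h>

#define NSTREAMS 32

// stand-in for our hand-written 32 x 32 tile kernel
__global__ void dgemm_32x32(int m, int n, int k,
                            const double *A, const double *B, double *C)
{
    /* ... actual tile computation omitted ... */
}

void run_batch(int count, const int *m, const int *n, const int *k,
               const double **A, const double **B, double **C)
{
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    // round-robin: the i-th DGEMM goes to stream i % NSTREAMS,
    // so independent calls can overlap on the device
    for (int i = 0; i < count; ++i) {
        dim3 grid((m[i] + 31) / 32, (n[i] + 31) / 32);
        dim3 block(32, 2);   // 64 threads = 2 warps per thread block
        dgemm_32x32<<<grid, block, 0, streams[i % NSTREAMS]>>>(
            m[i], n[i], k[i], A[i], B[i], C[i]);
    }

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}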

The cuBLAS DGEMM was optimized for large matrix sizes, where one kernel utilizes the whole GPU.
The DGEMM kernel we wrote uses a smaller tile size (32 x 32, executed by 2 warps) compared to the 64 x 128 tile of the cuBLAS DGEMM (which uses 8 warps). The idea behind this approach is to fit more thread blocks per SMX so that they execute concurrently. At the same time you must limit the amount of needless computation you incur when your matrix dimensions are not a multiple of the tile size.
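
To make the tile-overhead argument concrete, a small helper like this (my own illustration, not taken from our code base) computes how much of the launched work is actually useful for a given tile size:

#include <stdio.h>

// Fraction of useful work when an m x n output is covered by
// full tiles of size tm x tn.
static double useful_fraction(int m, int n, int tm, int tn)
{
    int rows = ((m + tm - 1) / tm) * tm;   // m rounded up to a multiple of tm
    int cols = ((n + tn - 1) / tn) * tn;   // n rounded up to a multiple of tn
    return (double)m * n / ((double)rows * cols);
}

int main(void)
{
    // m = n = 160: the 32 x 32 tile wastes nothing,
    // while a 64 x 128 tile covers a 192 x 256 area (~52% useful)
    printf("32x32 : %.2f\n", useful_fraction(160, 160, 32, 32));
    printf("64x128: %.2f\n", useful_fraction(160, 160, 64, 128));
    return 0;
}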

For small matrix dimensions (anything below 192) it is surprisingly easy to get better performance than the cuBLAS DGEMM. But we only reach about 60% of peak performance.

The platform we are targeting is K20c or K20x.

E.g. for matrices of size 160 (m = n = k), using round-robin scheduling across 32 compute streams
on a K20c with CUDA 5.5, I get

420.57 GFlop/s (19.5 us) cuBLAS
596.84 GFlop/s (13.7 us) my handwritten version

The numbers in parentheses are average kernel execution times in microseconds.
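
(For reference, assuming the usual 2·m·n·k flop count, one call amounts to 2·160³ ≈ 8.19 Mflop; 8.19 Mflop / 19.5 µs ≈ 420 GFlop/s and 8.19 Mflop / 13.7 µs ≈ 598 GFlop/s, so the two columns are consistent with each other.)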

OK, so it seems that CUBLAS does not cover your particular use case (the batched operations were specifically added to deal with many independent small matrices, but require uniform dimensions, as you noted).

You may want to consider filing an enhancement request for functionality not currently in CUBLAS, or for increased performance of particular DGEMM configurations already supported. In general, much of CUDA software development is driven by customer input, so enhancement requests (filed via the bug reporting mechanism) are one way of making these needs known to the development teams.

As for performance regressions in your custom kernels, I would encourage you to file bugs against the compiler. Compilers are complex collections of optimization stages driven by various heuristics, and thus constitute a compromise across a large body of code. They are unlikely to be optimal for all code idioms, but submitting bug reports for regressions on important use cases can lead to overall improvement.

I would have to respectfully disagree with this point. DGEMM with a very large inner dimension and small outer dimensions, e.g.

dim(A) = (20,10^6)
dim(B) = (10^6, 30)

runs particularly slowly (<5% of peak) in CUBLAS. A fairly naive implementation can get that up to 25% of peak, but better implementations get towards 50-60% of peak.
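
For concreteness, the configuration I have in mind corresponds to a call roughly like the following sketch (column-major, no transposes; allocation and error checking omitted):

#include <cublas_v2.h>

// C (20 x 30) = A (20 x 1e6) * B (1e6 x 30), double precision
void tall_skinny_dgemm(cublasHandle_t handle,
                       const double *dA, const double *dB, double *dC)
{
    const int m = 20, n = 30, k = 1000000;
    const double alpha = 1.0, beta = 0.0;
    // column-major storage, leading dimensions equal to the row counts
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m, dB, k,
                &beta,  dC, m);
}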

Maybe we can open a new topic concerning the expected or desired performance of the cuBLAS DGEMM
for various matrix dimensions. A most complex and extremely important topic in itself, as DGEMM is the most important building block for linear algebra.

I modified my original kernel a bit and it still shows the same behaviour.

__global__ void test_kernel(double * C, const double * A)
{
  *C = A[(int)((float)threadIdx.x)];
}

The reason for such a construct is that for DGEMM with small matrix dimensions (it should work for anything below 1024) I can do the offset calculation using floats instead of integers. The float units are not used anyhow, and for integer values that fit into 24 bits the arithmetic should be much faster if the floating-point units are used. The final float-to-int conversion seems to be the part that triggers the insertion of the string variable.
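
To sketch what I mean by a float-based offset calculation (test_kernel2 and the lda parameter are made up for illustration; all values stay far below 2^24, where floats represent integers exactly), something along these lines:

__global__ void test_kernel2(double * C, const double * A, int lda)
{
  // offset = threadIdx.y * lda + threadIdx.x, computed on the float units;
  // the exact result is an integer below 2^24, so no precision is lost
  float off = __fmaf_rz((float)threadIdx.y, (float)lda, (float)threadIdx.x);
  *C = A[__float2int_rz(off)];
}

I do not know yet whether the explicit __float2int_rz intrinsic takes a different code path than the C-style cast, i.e. whether it avoids dragging in the __CUDA_FTZ string.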

@sBc-Random: “Unlikely” specifically includes the possibility of counter-examples :-). Have you had a chance to try this particular use case with CUDA 5.5? I am aware that CUBLAS in CUDA 5.5 improved the performance for various cases of matrices with extreme aspect ratios, but don’t have a detailed overview of the configurations that were improved.

If you find the performance of specific use cases still lacking in CUDA 5.5, I would suggest filing an enhancement request via the bug reporting form linked from the registered developer website. Thanks!

;) I actually haven’t had a chance to try out 5.5 at all yet due to compatibility issues with a library I’m using, but there have been some very relevant improvements added to cuBLAS and cuSPARSE (particularly cuSPARSE) that I’m eager to try out, this being one of them (though I can’t actually see where this was documented in the changelog?). I will update when I manage to compile under 5.5. :)

I would assume that the public change logs cover only certain highlights, as a comprehensive listing of every improvement to every piece of CUDA software would result in lists too long for anybody to want to read.

It is also possible that I am mistaken about the extent of the improvements for matrices with extreme aspect ratios, so it would be best to simply try the cases that are of interest to you. In general, best GEMM performance requires that matrix dimensions are multiples of the tile size used by the underlying kernel, where tile dimensions are usually some multiple of 16 (details vary between the many different GEMM kernels used by CUBLAS).