Tesla K80 size problem

Mcewen · September 9, 2015, 1:45pm

Hi all,

I am using Matlab for GPU computing (single precision). To do so, I use a .ptx file compiled from a .cu file with nvcc. It is working well for reasonable problem sizes but errors occur when I am increasing the problem size. I will show a set of Matlab command displays describing my issue :

Here is my device properties obtained by ‘gpuDevice’ function :

gpu = 
  CUDADevice with properties:
                      Name: 'Tesla K80'
                     Index: 2
         ComputeCapability: '3.7'
            SupportsDouble: 1
             DriverVersion: 7.5000
            ToolkitVersion: 5.5000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 1.2079e+10
                FreeMemory: 1.1953e+10
       MultiprocessorCount: 13
              ClockRateKHz: 823500
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

This is my kernel properies:

kernel = 
  CUDAKernel with properties:
       ThreadBlockSize: [1024 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [12274 1 1]
      SharedMemorySize: 0
            EntryPoint: '_Z6kernelfffiPfS_S_iS_S_S_iS_S_S_iS_S_S_S_S_S_S_S_S_S_'
    MaxNumLHSArguments: 19
       NumRHSArguments: 26
         ArgumentTypes: {1x26 cell}

And this is the returned errors:

Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_UNKNOWN 
...
Error using gpuArray/gather
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_UNKNOWN

Error in TOFcomputation_3media_FermatPrinc_GPU3 (line 89)
TOF_final = gather(g_tof);

Error in MAIN_..._3Dview (line 450)
[TOF_reco,RSU_final,samp_final] =
TOFcomputation_3media_FermatPrinc_GPU3(element_XYZ,RSU_XYZ,sample_XYZ,reco_XYZ,c1,c2,c3,precMode);

I guess it is only a size problem since it is working well with smaller GridSizes. But I do not understand because the max GridSize is 2.1475e+09 and my Grid size is only 12274…
Any idea would be appreciated !
Thanks!

njuffa · September 9, 2015, 3:04pm

If you need help with debugging, I would suggest posting self-contained example code that demonstrates the issue.

Are you compiling for the appropriate architecture (-arch=sm_37)? Does the code check the status return of all API calls and kernel launches? When you run the app under the control of cuda-memcheck, are any errors reported by the memory checker or the race checker?

Mcewen · September 9, 2015, 4:10pm

Actually, I believe it is not a debugging problem since it is working with smaller sizes…

The cuda kernel is like this:

__global__ void kernel( float c1, float c2, float c3,
                        int Nelem, float *elem_x, float *elem_y, float *elem_z,
                        int Nint1, float *int1_x, float *int1_y, float *int1_z,
                        int Nint2, float *int2_x, float *int2_y, float *int2_z,
                        int Nreco, float *reco_x, float *reco_y, float *reco_z,
                        float *TOF, 
                        float *int1best_x, float *int1best_y, float *int1best_z, 
                        float *int2best_x, float *int2best_y, float *int2best_z )
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	int it_elem, it_int1, it_int2, it_reco;
        float x1, x2, x3, y1, y2, y3, z1, z2, z3;
	float TOF1, TOF2, TOF3;
        float TOFbest = 100000.0;
        float TOFcurr;

	if (i<(Nreco*Nelem))
	{
        it_reco = i/Nelem;   
        it_elem = i-it_reco*Nelem;
        TOFbest = 100000.0;

        for(it_int1=0;it_int1<Nint1;it_int1++)
        {
            x1 = elem_x[it_elem]-int1_x[it_int1];
            y1 = elem_y[it_elem]-int1_y[it_int1];
            z1 = elem_z[it_elem]-int1_z[it_int1];
            TOF1 = sqrt( x1*x1 + y1*y1 + z1*z1 )/c1;

            for(it_int2=0;it_int2<Nint2;it_int2++)
            {
                x2 = int1_x[it_int1]-int2_x[it_int2];
                y2 = int1_y[it_int1]-int2_y[it_int2];
                z2 = int1_z[it_int1]-int2_z[it_int2];
                x3 = int2_x[it_int2]-reco_x[it_reco];
                y3 = int2_y[it_int2]-reco_y[it_reco];
                z3 = int2_z[it_int2]-reco_z[it_reco];
                TOF2 = sqrt( x2*x2 + y2*y2 + z2*z2 )/c2;
                TOF3 = sqrt( x3*x3 + y3*y3 + z3*z3 )/c3;
        
                TOFcurr = TOF1 + TOF2 + TOF3;
                
                if (TOFcurr<TOFbest)
                {
                    TOF[i] = TOFcurr;
                    TOFbest = TOFcurr;
                    int1best_x[i] = int1_x[it_int1];
                    int1best_y[i] = int1_y[it_int1];
                    int1best_z[i] = int1_z[it_int1];
                    int2best_x[i] = int2_x[it_int2];
                    int2best_y[i] = int2_y[it_int2];
                    int2best_z[i] = int2_z[it_int2];
                }   
            }
        }
    }
}

I have the errors when I increase the “Nreco” value…

I did not compile with specific architecture specifications since it is working with all my GPUs (Quadro K2100M, GeForce Titan X and Testla K80) with smaller sizes.

No it doesn’t… I don’t know if I can with the PTX use… I will check.

Thank you

njuffa · September 9, 2015, 5:03pm

You are trying to debug an issue with your code. You are running into trouble doing so and turn to this forum for help with the debugging process. You therefore have a debugging problem.

In order to get help with debugging problems, it is usually necessary to have access to buildable and runnable code. An analogy would be having car trouble: you typically have to physically present the car to the mechanic to get a diagnosis. I am not sure what you mean by “no, it doesn’t”.

Just looking at the kernel without knowing the launch configuration three working hypotheses come to mind: (1) the launch configuration doesn’t match the array sizing (2) the memory allocations don’t match the kernel usage (3) there is an out-of-bounds access in the kernel.

Mcewen · September 11, 2015, 2:29pm

I understand your point njuffa. I am sorry. I will give more details.

The kernel described above is launched by:

kernel = parallel.gpu.CUDAKernel('name.ptx', 'name.cu');
kernel.GridSize = [ceil(Ntotal/kernel.MaxThreadsPerBlock) 1 1];
kernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock 1 1];

My GPU has a maximum threads per block of 1024.

The memory allocation on the GPU is performed to fit my kernel inputs and outputs:

s_c1 = single(c1);
s_c2 = single(c2);
s_c3 = single(c3);
i_Nelem = int32(Nelem);
i_Nint1 = int32(Nint1);
i_Nint2 = int32(Nint2);
i_Nreco = int32(Nreco);
g_elx = gpuArray(single(elem(:,1)));
g_ely = gpuArray(single(elem(:,2)));
g_elz = gpuArray(single(elem(:,3))); 
g_in1x = gpuArray(single(int1(:,1)));
g_in1y = gpuArray(single(int1(:,2)));
g_in1z = gpuArray(single(int1(:,3))); 
g_in2x = gpuArray(single(int2(:,1)));
g_in2y = gpuArray(single(int2(:,2)));
g_in2z = gpuArray(single(int2(:,3))); 
g_rex = gpuArray(single(reco(:,1)));
g_rey = gpuArray(single(reco(:,2)));
g_rez = gpuArray(single(reco(:,3))); 
g_tof = gpuArray(zeros(Ntotal,1,'single'));
g_in1bestx = gpuArray(zeros(Ntotal,1,'single'));
g_in1besty = gpuArray(zeros(Ntotal,1,'single'));
g_in1bestz = gpuArray(zeros(Ntotal,1,'single'));
g_in2bestx = gpuArray(zeros(Ntotal,1,'single'));
g_in2besty = gpuArray(zeros(Ntotal,1,'single'));
g_in2bestz = gpuArray(zeros(Ntotal,1,'single'));

The call to the kernel is done by:

% GPU computation %
[~,~,~,~,~,~,~,~,~,~,~,~,g_tof,g_in1bestx,g_in1besty,g_in1bestz,g_in2bestx,g_in2besty,g_in2bestz] = ...
                                        feval(kernel,s_c1,s_c2,s_c3,...
                                        i_Nelem,g_elx,g_ely,g_elz,...
                                        i_Nint1,g_in1x,g_in1y,g_in1z,...
                                        i_Nint2,g_in2x,g_in2y,g_in2z,...  
                                        i_Nreco,g_rex,g_rey,g_rez,...
                                        g_tof, ...
                                        g_in1bestx, g_in1besty, g_in1bestz, ...
                                        g_in2bestx, g_in2besty, g_in2bestz );

The data are then transfered back from the GPU:

TOF_final = gather(g_tof);

It seems to not be a global memory problem since my biggest problem (Ntotal = 12 567 632 Bytes) reaches a total memory of about 77.8MB. And the maximum memory of the Tesla K80 is 12GB…

I have performed several tests and here are the results (my maximum threads per block remains 1024.)

Ntotal = 216 320 => GridSize = [ 211 1 1] => works well
Ntotal = 480 896 => GridSize = [ 470 1 1] => works well
Ntotal = 1 623 024 => GridSize = [ 1585 1 1] => does not work (total memory ~10MB)
Ntotal = 12 567 632 => GridSize = [12274 1 1] => does not work (total memory ~78MB)

Actually, when it does not work the kernel goes very quickly (whereas it is supposed to take minutes…) and there is an error at the “gather” call (see also in my first post):

Error using gpuArray/gather
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_UNKNOWN

Thanks a lot for your help !

Topic		Replies	Views
help to clairfy usage of number of grids and number of blocks in kernal CUDA Programming and Performance	0	611	February 14, 2014
I don't understand the execution time (k40c & GTX580). CUDA Programming and Performance	9	2459	April 23, 2015
This code doesn't work maybe too much threads assigned? CUDA Programming and Performance	8	1083	February 2, 2014
Theano ConvNet + NVIDIA Tesla-K80 - CuBLAS errors -> memcpyAsync CUDA Setup and Installation	5	2641	June 29, 2016
code that crashes unpredictably CUDA Programming and Performance	15	12634	April 28, 2010
gpu computing sdk 4.0 runtime failures build the sdk succesfully, but the run of any exe failed CUDA Programming and Performance	3	2792	August 8, 2011
kernel works on Gtx280/295/480 but not on C2050 unspecified launch failure CUDA Programming and Performance	38	2912	September 23, 2010
Bandwidth Test fails using Cuda 9.1 and Visual Studio 2013 CUDA Setup and Installation	2	1128	February 21, 2018
Simple CUDA program hitting size limits/errors on Windows but not Linux CUDA Programming and Performance	23	1806	January 12, 2019
CUDA peer resources error when running on more than 8 K80s (AWS p2.16xlarge) CUDA Programming and Performance	15	7297	October 12, 2016

Tesla K80 size problem

Related topics