Tesla K80 size problem

Hi all,

I am using Matlab for GPU computing (single precision). To do so, I use a .ptx file compiled from a .cu file with nvcc. It is working well for reasonable problem sizes but errors occur when I am increasing the problem size. I will show a set of Matlab command displays describing my issue :

Here is my device properties obtained by ‘gpuDevice’ function :

gpu = 
  CUDADevice with properties:
                      Name: 'Tesla K80'
                     Index: 2
         ComputeCapability: '3.7'
            SupportsDouble: 1
             DriverVersion: 7.5000
            ToolkitVersion: 5.5000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 1.2079e+10
                FreeMemory: 1.1953e+10
       MultiprocessorCount: 13
              ClockRateKHz: 823500
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

This is my kernel properies:

kernel = 
  CUDAKernel with properties:
       ThreadBlockSize: [1024 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [12274 1 1]
      SharedMemorySize: 0
            EntryPoint: '_Z6kernelfffiPfS_S_iS_S_S_iS_S_S_iS_S_S_S_S_S_S_S_S_S_'
    MaxNumLHSArguments: 19
       NumRHSArguments: 26
         ArgumentTypes: {1x26 cell}

And this is the returned errors:

Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_UNKNOWN 
...
Error using gpuArray/gather
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_UNKNOWN

Error in TOFcomputation_3media_FermatPrinc_GPU3 (line 89)
TOF_final = gather(g_tof);

Error in MAIN_..._3Dview (line 450)
[TOF_reco,RSU_final,samp_final] =
TOFcomputation_3media_FermatPrinc_GPU3(element_XYZ,RSU_XYZ,sample_XYZ,reco_XYZ,c1,c2,c3,precMode);

I guess it is only a size problem since it is working well with smaller GridSizes. But I do not understand because the max GridSize is 2.1475e+09 and my Grid size is only 12274…
Any idea would be appreciated !
Thanks!

If you need help with debugging, I would suggest posting self-contained example code that demonstrates the issue.

Are you compiling for the appropriate architecture (-arch=sm_37)? Does the code check the status return of all API calls and kernel launches? When you run the app under the control of cuda-memcheck, are any errors reported by the memory checker or the race checker?

Actually, I believe it is not a debugging problem since it is working with smaller sizes…

The cuda kernel is like this:

__global__ void kernel( float c1, float c2, float c3,
                        int Nelem, float *elem_x, float *elem_y, float *elem_z,
                        int Nint1, float *int1_x, float *int1_y, float *int1_z,
                        int Nint2, float *int2_x, float *int2_y, float *int2_z,
                        int Nreco, float *reco_x, float *reco_y, float *reco_z,
                        float *TOF, 
                        float *int1best_x, float *int1best_y, float *int1best_z, 
                        float *int2best_x, float *int2best_y, float *int2best_z )
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	int it_elem, it_int1, it_int2, it_reco;
        float x1, x2, x3, y1, y2, y3, z1, z2, z3;
	float TOF1, TOF2, TOF3;
        float TOFbest = 100000.0;
        float TOFcurr;

	if (i<(Nreco*Nelem))
	{
        it_reco = i/Nelem;   
        it_elem = i-it_reco*Nelem;
        TOFbest = 100000.0;

        for(it_int1=0;it_int1<Nint1;it_int1++)
        {
            x1 = elem_x[it_elem]-int1_x[it_int1];
            y1 = elem_y[it_elem]-int1_y[it_int1];
            z1 = elem_z[it_elem]-int1_z[it_int1];
            TOF1 = sqrt( x1*x1 + y1*y1 + z1*z1 )/c1;

            for(it_int2=0;it_int2<Nint2;it_int2++)
            {
                x2 = int1_x[it_int1]-int2_x[it_int2];
                y2 = int1_y[it_int1]-int2_y[it_int2];
                z2 = int1_z[it_int1]-int2_z[it_int2];
                x3 = int2_x[it_int2]-reco_x[it_reco];
                y3 = int2_y[it_int2]-reco_y[it_reco];
                z3 = int2_z[it_int2]-reco_z[it_reco];
                TOF2 = sqrt( x2*x2 + y2*y2 + z2*z2 )/c2;
                TOF3 = sqrt( x3*x3 + y3*y3 + z3*z3 )/c3;
        
                TOFcurr = TOF1 + TOF2 + TOF3;
                
                if (TOFcurr<TOFbest)
                {
                    TOF[i] = TOFcurr;
                    TOFbest = TOFcurr;
                    int1best_x[i] = int1_x[it_int1];
                    int1best_y[i] = int1_y[it_int1];
                    int1best_z[i] = int1_z[it_int1];
                    int2best_x[i] = int2_x[it_int2];
                    int2best_y[i] = int2_y[it_int2];
                    int2best_z[i] = int2_z[it_int2];
                }   
            }
        }
    }
}

I have the errors when I increase the “Nreco” value…

I did not compile with specific architecture specifications since it is working with all my GPUs (Quadro K2100M, GeForce Titan X and Testla K80) with smaller sizes.

No it doesn’t… I don’t know if I can with the PTX use… I will check.

Thank you

You are trying to debug an issue with your code. You are running into trouble doing so and turn to this forum for help with the debugging process. You therefore have a debugging problem.

In order to get help with debugging problems, it is usually necessary to have access to buildable and runnable code. An analogy would be having car trouble: you typically have to physically present the car to the mechanic to get a diagnosis. I am not sure what you mean by “no, it doesn’t”.

Just looking at the kernel without knowing the launch configuration three working hypotheses come to mind: (1) the launch configuration doesn’t match the array sizing (2) the memory allocations don’t match the kernel usage (3) there is an out-of-bounds access in the kernel.

I understand your point njuffa. I am sorry. I will give more details.

The kernel described above is launched by:

kernel = parallel.gpu.CUDAKernel('name.ptx', 'name.cu');
kernel.GridSize = [ceil(Ntotal/kernel.MaxThreadsPerBlock) 1 1];
kernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock 1 1];

My GPU has a maximum threads per block of 1024.

The memory allocation on the GPU is performed to fit my kernel inputs and outputs:

s_c1 = single(c1);
s_c2 = single(c2);
s_c3 = single(c3);
i_Nelem = int32(Nelem);
i_Nint1 = int32(Nint1);
i_Nint2 = int32(Nint2);
i_Nreco = int32(Nreco);
g_elx = gpuArray(single(elem(:,1)));
g_ely = gpuArray(single(elem(:,2)));
g_elz = gpuArray(single(elem(:,3))); 
g_in1x = gpuArray(single(int1(:,1)));
g_in1y = gpuArray(single(int1(:,2)));
g_in1z = gpuArray(single(int1(:,3))); 
g_in2x = gpuArray(single(int2(:,1)));
g_in2y = gpuArray(single(int2(:,2)));
g_in2z = gpuArray(single(int2(:,3))); 
g_rex = gpuArray(single(reco(:,1)));
g_rey = gpuArray(single(reco(:,2)));
g_rez = gpuArray(single(reco(:,3))); 
g_tof = gpuArray(zeros(Ntotal,1,'single'));
g_in1bestx = gpuArray(zeros(Ntotal,1,'single'));
g_in1besty = gpuArray(zeros(Ntotal,1,'single'));
g_in1bestz = gpuArray(zeros(Ntotal,1,'single'));
g_in2bestx = gpuArray(zeros(Ntotal,1,'single'));
g_in2besty = gpuArray(zeros(Ntotal,1,'single'));
g_in2bestz = gpuArray(zeros(Ntotal,1,'single'));

The call to the kernel is done by:

% GPU computation %
[~,~,~,~,~,~,~,~,~,~,~,~,g_tof,g_in1bestx,g_in1besty,g_in1bestz,g_in2bestx,g_in2besty,g_in2bestz] = ...
                                        feval(kernel,s_c1,s_c2,s_c3,...
                                        i_Nelem,g_elx,g_ely,g_elz,...
                                        i_Nint1,g_in1x,g_in1y,g_in1z,...
                                        i_Nint2,g_in2x,g_in2y,g_in2z,...  
                                        i_Nreco,g_rex,g_rey,g_rez,...
                                        g_tof, ...
                                        g_in1bestx, g_in1besty, g_in1bestz, ...
                                        g_in2bestx, g_in2besty, g_in2bestz );

The data are then transfered back from the GPU:

TOF_final = gather(g_tof);

It seems to not be a global memory problem since my biggest problem (Ntotal = 12 567 632 Bytes) reaches a total memory of about 77.8MB. And the maximum memory of the Tesla K80 is 12GB…

I have performed several tests and here are the results (my maximum threads per block remains 1024.)

  • Ntotal = 216 320 => GridSize = [ 211 1 1] => works well
  • Ntotal = 480 896 => GridSize = [ 470 1 1] => works well
  • Ntotal = 1 623 024 => GridSize = [ 1585 1 1] => does not work (total memory ~10MB)
  • Ntotal = 12 567 632 => GridSize = [12274 1 1] => does not work (total memory ~78MB)

Actually, when it does not work the kernel goes very quickly (whereas it is supposed to take minutes…) and there is an error at the “gather” call (see also in my first post):

Error using gpuArray/gather
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_UNKNOWN

Thanks a lot for your help !