threads and blocks

I’m having trouble expanding my code to handle multiple cases. I have been developing with the most common case in mind; now it’s time for testing, and I need to ensure that it all works for the different cases. Currently my kernel is executed within a loop (there are reasons why we aren’t doing one kernel call to do the whole thing) to calculate a value across the row of a matrix. The most common case is 512 columns by 512 rows. I need to consider matrices of size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a rather simple kernel call:

launchKernel<<<1,512>>>(................)

This kernel works fine for the common 512 x 512 and the 512 x 1024 (columns, rows respectively) cases, but not for the 1024 x 512 case, which requires 1024 threads to execute. In my naivety I have been trying different versions of the simple kernel call to launch 1024 threads.

launchKernel<<<2,512>>>(................)  // 2 blocks with 512 threads each ???

launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???

I believe my problem has something to do with my lack of understanding of threads and blocks.
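For reference, here is the indexing pattern I am relying on: each thread computes a unique global index from its block and thread IDs, so <<<2,512>>> and <<<1,1024>>> should cover the same indices 0..1023, just split differently across blocks. A minimal sketch (hypothetical kernel and buffer names, not my real code):

// Minimal sketch (hypothetical names): the global index makes a kernel
// independent of how its 1024 threads are split into blocks.
__global__ void indexDemo(float *out, int n)
{
    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;  // 0..1023 in both configs
    if (idx < n)   // guard so the launch never runs past the data
        out[idx] = (float)idx;
}

// Both of these launches produce the same set of idx values, 0..1023:
//   indexDemo<<<2, 512>>>(d_out, 1024);
//   indexDemo<<<1, 1024>>>(d_out, 1024);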

Here is the output of deviceQuery; as you can see, I can have a maximum of 1024 threads per block:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla C2050"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 2688 MBytes (2818572288 bytes)
  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores
  GPU Clock Speed:                               1.15 GHz
  Memory Clock rate:                             1500.00 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                Yes
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           40 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro 600"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 2) Multiprocessors x (48) CUDA Cores/MP:     96 CUDA Cores
  GPU Clock Speed:                               1.28 GHz
  Memory Clock rate:                             800.00 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 131072 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           15 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600

I am using only the Tesla C2050 device.

Here is a stripped-out version of my kernel, so you have an idea of what it is doing:

#define twoPi          6.283185307179586
#define speed_of_light 3.0E8
#define MaxSize        999

__global__ void launchKernel
(
    const float  *array1,
    const double *array2,
    const float   scalar1,
    const float   scalar2,
    const float   scalar3,
    const float   scalar4,
    const float   scalar5,
    const float   scalar6,
    const int     scalar7,
    const int     scalar8,
    float        *outputArray1,
    float        *outputArray2)
{
    float  scalar9;
    int    idx;
    double scalar10;
    double scalar11;
    float  sumReal, sumImag;
    float  real, imag;
    float  coeff1, coeff2, coeff3, coeff4;

    sumReal = 0.0;
    sumImag = 0.0;

    idx = (blockIdx.x * blockDim.x) + threadIdx.x;

    /* Declare the shared memory parameters */
    __shared__ float  SharedArray1[MaxSize];
    __shared__ double SharedArray2[MaxSize];

    /* populate the arrays on shared memory */
    SharedArray1[idx] = array1[idx];  // first 512 elements
    SharedArray2[idx] = array2[idx];
    if (idx + blockDim.x < MaxSize) {
        SharedArray1[idx + blockDim.x] = array1[idx + blockDim.x];
        SharedArray2[idx + blockDim.x] = array2[idx + blockDim.x];
    }
    __syncthreads();

    // input scalars used here.
    scalar10 = ...;
    scalar11 = ...;

    // kk loop 1 .. 512 (scalar7)
    for (int kk = 0; kk < scalar8; kk++)
    {
        /* some calculations */
        // SharedArray1, SharedArray2 and scalar9 used here
        sumReal = ...;
        sumImag = ...;
    }

    /* calculation of the exponential of a complex number */
    real = ...;
    imag = ...;

    coeff1 = (sumReal * real);
    coeff2 = (sumReal * imag);
    coeff3 = (sumImag * real);
    coeff4 = (sumImag * imag);

    outputArray1[idx] = (coeff1 - coeff4);
    outputArray2[idx] = (coeff2 + coeff3);
}

Because my max threads per block is 1024, I thought I would be able to continue using the simple kernel launch. Am I wrong?

How do I successfully launch each kernel with 1024 threads?

Hello,

In order to launch 1024 threads you can use any combination <<<nofb,tpb>>> with nofb * tpb = 1024. But there is a catch: you must have enough registers and shared memory available to launch the blocks. <<<2,512>>> should work, but maybe also try <<<4,256>>>.
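If in doubt, you can ask the runtime what the compiled kernel will accept. A minimal sketch (assuming launchKernel is visible in the same translation unit):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: query the per-kernel resource usage the compiler produced,
// assuming launchKernel is declared in this translation unit.
void printKernelLimits()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, launchKernel);
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %u bytes\n", (unsigned)attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}

If a block asks for more registers or shared memory than the device has, the launch fails immediately with an error such as "too many resources requested for launch".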

You have these lines:

if (idx + blockDim.x < MaxSize) {
    SharedArray1[idx + blockDim.x] = array1[idx + blockDim.x];
    SharedArray2[idx + blockDim.x] = array2[idx + blockDim.x];
}

They are good for ensuring that you do not access out of bounds, but you must also be sure that all threads are doing the job.
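For example, one common way to make the whole block cooperate in filling shared memory, with the bounds check built in, is a strided loop. A sketch using the names from your kernel:

// Sketch: each thread loads elements blockDim.x apart, so all threads
// share the work and no index ever reaches MaxSize.
for (int i = threadIdx.x; i < MaxSize; i += blockDim.x) {
    SharedArray1[i] = array1[i];
    SharedArray2[i] = array2[i];
}
__syncthreads();   // every load must finish before any thread reads

This works for any blockDim.x, so the same kernel handles 256, 512, or 1024 threads per block.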

I’m not sure what you mean. How do you check that all threads are performing this task? I thought that’s what the __global__ specifier ensured would happen: that the function would be performed by each thread.

I just tried <<<2,512>>> and <<<4,256>>> with the 1024 x 512 and 512 x 1024 cases, and every combination returned an unknown error, presumably from the kernel launch:

...
mexPrintf("%d:: calling kernel\n", thread);
launchKernel<<<2,512>>>(........); // or launchKernel<<<4,256>>>(........);
mexPrintf("1 Last Error: %s\n", cudaGetErrorString(cudaGetLastError()));
cudaThreadSynchronize();
mexPrintf("%d:: synchronised kernel\n", thread);
mexPrintf("2 Last Error: %s\n", cudaGetErrorString(cudaGetLastError()));
...

Output:

0:: calling kernel
0:: returned from kernel
1 Last Error: no error
0:: synchronised kernel
2 Last Error: unknown error
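One thing I notice: the error only appears after cudaThreadSynchronize(), not straight after the launch, which suggests the launch configuration itself was accepted and the kernel faulted while running. A small helper I could use to keep those two cases separate (a sketch, not part of my current code):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: distinguish launch-time errors (bad configuration, reported
// immediately) from run-time faults (e.g. bad memory accesses, only
// visible after synchronisation).
#define CHECK_LAUNCH(msg)                                                  \
    do {                                                                   \
        cudaError_t e = cudaGetLastError();          /* launch error */    \
        if (e != cudaSuccess)                                              \
            printf("%s (launch): %s\n", msg, cudaGetErrorString(e));       \
        e = cudaThreadSynchronize();                 /* run-time error */  \
        if (e != cudaSuccess)                                              \
            printf("%s (run)   : %s\n", msg, cudaGetErrorString(e));       \
    } while (0)

// Usage after any kernel call:
//   launchKernel<<<2,512>>>(...);
//   CHECK_LAUNCH("launchKernel");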

Because the kernel works for the 512 x 512 case with <<<1,512>>>, I still think it’s something to do with not launching enough threads for the other cases. If anyone else has any other possible reasons/solutions, I’d be happy to listen (I’m running out of straws).

deviceQuery shows this for my registers and shared memory. The shared memory I’m using is one array of 999 floats and one array of 999 doubles (999 x 4 + 999 x 8 = 11,988 bytes, well under the limit), so I should have plenty of shared memory, and there are plenty of registers?

Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 32768

I’m not sure what to do from here. Is there a way to debug the kernel as well as the calling host function? I’m working with MATLAB’s MEX function interface to call the kernel. I have CUDA 4.1 and access to VS 2005, but I haven’t had a lot of luck getting CUDA programs to work. I know I can debug the MEX function using VS, but is there a way to debug the kernel as well?
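One lightweight option I could try is device-side printf, which compute capability 2.0 devices like the C2050 support when compiling with -arch=sm_20. A minimal sketch (hypothetical probe kernel, not my real code):

#include <cstdio>

#define MaxSize 999   // same value as in the kernel above

// Sketch: device-side printf (needs compute capability >= 2.0 and
// -arch=sm_20) to spot-check indices from inside the kernel.
__global__ void probeIndices()
{
    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (idx >= MaxSize)   // report any thread that would overrun the arrays
        printf("thread %d would index past MaxSize\n", idx);
}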

I have resolved the problem.

It turns out I was accessing out-of-bounds elements in the shared arrays when I was launching 1024 threads (the arrays hold MaxSize = 999 elements, but idx runs up to 1023). It was this that was causing the unknown error reported after the kernel launch. I was then able to use 1024 threads in 1 block.
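For anyone who hits this later, the fix amounts to guarding every shared-array access with the array bound. A sketch based on the loads shown earlier (not my exact final code):

// Sketch of the guarded loads: with 1024 threads, idx can reach 1023,
// but the shared arrays only hold MaxSize (999) elements.
if (idx < MaxSize) {
    SharedArray1[idx] = array1[idx];
    SharedArray2[idx] = array2[idx];
}
if (idx + blockDim.x < MaxSize) {
    SharedArray1[idx + blockDim.x] = array1[idx + blockDim.x];
    SharedArray2[idx + blockDim.x] = array2[idx + blockDim.x];
}
__syncthreads();   // reached by all threads, since no thread returns early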

Thanks for all the help provided.