kernel invocation parameters

rattias · January 26, 2015, 9:54pm

Hi,
I’m working on a Tesla K20. According to the deviceQuery utility:
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)

However, when I run the program below, it works as long as the number of blocks (the buffer size in bytes divided by 1024) does not exceed 65535. If I make NUM_INTS 16*1024*1024, bringing the number of blocks to 65536, the kernel invocation fails with “Invalid argument”. I assume the argument it’s complaining about is the number of blocks being greater than 65535, but according to the parameters above I should be OK up to 2147483647. Can somebody comment?

#include <string.h>
#include <stdio.h>
#include <stdlib.h>  // exit()

#define NUM_INTS (15*1024*1024)
#define BLOCK_SIZE 1024

int a[NUM_INTS], b[NUM_INTS], r[NUM_INTS];

__global__ void vector_add(
        int *ad,
        int *bd,
        int *rd,
        int offset,
        int  n) {
    unsigned int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_idx < n) {
      global_idx += offset;
      rd[global_idx] = ad[global_idx] + bd[global_idx];
    }
}



int main(int argc, char **argv) {
    int i;
    int *ad, *bd, *rd;

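    // compute number of blocks
    // note: size is in bytes, so nblocks is sizeof(int) times the element
    // count divided by BLOCK_SIZE; the bounds check in the kernel makes the
    // extra threads no-ops
    int size = NUM_INTS*sizeof(int);
    int nblocks = size/BLOCK_SIZE;
    if (nblocks*BLOCK_SIZE < size)
        nblocks++;   // round up so a partial last block is covered
    printf("will use %d blocks\n", nblocks);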


    // initialize input arrays
    for(i=0; i<NUM_INTS; i++) {
        a[i] = i;
        b[i] = i;
    }

    // allocate device memory
    cudaMalloc((void**)&ad, size);
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&rd, size);


    // copy host data to device
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);
    // run kernel
    vector_add<<<nblocks, BLOCK_SIZE>>>(ad, bd, rd, 0, NUM_INTS);
    cudaError_t err = cudaGetLastError();
    if ( cudaSuccess != err ) {
      fprintf( stderr, "kernel invocation failed: %s\n", cudaGetErrorString( err ) );
      exit( -1 );
    }
    // copy data  back
    cudaMemcpy(r, rd, size, cudaMemcpyDeviceToHost);
    return 0;
}
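
For reference, the block count in the program above is derived from the buffer size in bytes, which is why 16*1024*1024 ints produce 65536 blocks rather than 16384. A minimal host-side sketch of the usual round-up division over the element count follows; `blocks_for` is a hypothetical helper name, not something from the original post or the CUDA API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of CUDA): number of blocks needed so that
// every one of n_elems elements gets its own thread. The division rounds
// up so that a partially filled last block is still launched.
inline std::size_t blocks_for(std::size_t n_elems, std::size_t block_size) {
    assert(block_size > 0);
    return (n_elems + block_size - 1) / block_size;
}
```

With a block size of 1024, `blocks_for(16 * 1024 * 1024, 1024)` gives 16384 blocks, while passing the byte count (four bytes per int) gives 65536, one more than the 65535 limit discussed in this thread.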

allanmac · January 26, 2015, 10:01pm

This probably means you’re compiling with the default target architecture, which is sm_20 and is limited to a grid x-dimension of 64K-1 (65535) blocks.

Add a “-arch sm_35” to your compile args and it should work.

rattias · January 26, 2015, 10:09pm

Yep, that was it. I guess I thought those types of checks would take into account the actual device being used, but instead they seem to depend only on the compilation target. Thanks!
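
The conclusion of this thread can be sketched as a small host-side check. It assumes, as found above, that the effective limit follows the architecture the kernel was compiled for rather than the device it runs on. The names `max_grid_x` and `grid_fits` are hypothetical helpers for illustration; the two limit values are the documented maximum grid x-dimensions (65535 below compute capability 3.0, 2147483647 from 3.0 onward).

```cpp
#include <cassert>

// Hypothetical helper: documented maximum grid x-dimension for a given
// compute capability major version (65535 before 3.0, 2^31-1 from 3.0 on).
inline long long max_grid_x(int cc_major) {
    assert(cc_major >= 1);
    return cc_major < 3 ? 65535LL : 2147483647LL;
}

// Hypothetical helper: true if a 1-D launch of nblocks blocks fits the
// limit of the architecture the kernel was compiled for (assumption: the
// compile target, not the physical device, decides the limit).
inline bool grid_fits(long long nblocks, int target_cc_major) {
    return nblocks >= 1 && nblocks <= max_grid_x(target_cc_major);
}
```

Under this model, 65536 blocks are rejected for an sm_20 build (`grid_fits(65536, 2)` is false) and accepted once the code is compiled with “-arch sm_35” (`grid_fits(65536, 3)` is true).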
