Kernel launching incorrectly

I have a kernel whose job it is to fill out the values in two arrays, as part of a larger function:

    //host code only above here
    dim3 bpg;
    bpg.x = 256;
    bpg.y =  16;

    dim3 tpb;
    tpb.x = 256;
    tpb.y = 256;

    long2* d_H_pos;
    cuDoubleComplex* d_H_vals;

    status1 = cudaMalloc(&d_H_pos, dim*stridepos*sizeof(long2));
    status2 = cudaMalloc(&d_H_vals, dim*strideval*sizeof(cuDoubleComplex));

    if ( (status1 != CUDA_SUCCESS) || (status2 != CUDA_SUCCESS) ){
        cout<<"Memory allocation for device Hamiltonian failed! Error: "<<cudaGetErrorString( cudaPeekAtLastError() )<<endl;
        return 1;
    }

    //dim = 256*256
    SetFirst<<<256, 256>>>(d_H_pos, stridepos, dim, 1); //count the diagonal element
    cudaThreadSynchronize();

    FillSparse<<<bpg, tpb>>>(d_basis_Position, d_basis, dim, d_H_vals, d_H_pos, d_Bond, lattice_Size, JJ);
    //function continues

When I run cuda-gdb, SetFirst launches with <<<(256,1,1),(256,1,1)>>> and FillSparse launches with <<<(1,1,1),(1,1,1)>>>. What’s going on here? I don’t get any segfaults or memory allocation problems when I create the arrays or run SetFirst. I’m running CUDA 4.0 on a GTX 460 (with 2GB RAM), using the 64-bit Ubuntu 10.10.

  1. Thou shalt check return codes for errors.

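Fermi supports no more than 1024 threads per block. Your tpb dimensions ask for 256 × 256 = 65,536 threads per block, so the FillSparse launch is rejected with an invalid configuration error and the kernel never runs. You should check for errors after the kernel launch too, not only after cudaMalloc.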
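For what it's worth, here is a minimal sketch of what checking after the launch could look like. DummyKernel and CheckCuda are hypothetical names made up for this example (the real FillSparse takes more arguments), and the 16 x 16 fallback block shape is only an assumption to keep the example launchable; pick whatever shape suits your indexing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for FillSparse; it only writes a flag from one thread.
__global__ void DummyKernel(int* out)
{
    if (blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0 && threadIdx.y == 0) {
        *out = 1;
    }
}

// Hypothetical helper: print the error string and return nonzero on failure.
static int CheckCuda(cudaError_t status, const char* what)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
        return 1;
    }
    return 0;
}

int main()
{
    // Ask the device for its limit instead of hard-coding 1024.
    cudaDeviceProp prop;
    if (CheckCuda(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties")) return 1;

    dim3 bpg(256, 16);
    dim3 tpb(256, 256);  // 65536 threads per block, as in the question

    unsigned int threadsPerBlock = tpb.x * tpb.y * tpb.z;
    if (threadsPerBlock > (unsigned int)prop.maxThreadsPerBlock) {
        fprintf(stderr, "%u threads per block requested, device allows %d\n",
                threadsPerBlock, prop.maxThreadsPerBlock);
        tpb = dim3(16, 16);  // assumed fallback: 256 threads per block
    }

    int* d_out = NULL;
    if (CheckCuda(cudaMalloc(&d_out, sizeof(int)), "cudaMalloc")) return 1;

    DummyKernel<<<bpg, tpb>>>(d_out);

    // A bad launch configuration is reported here, right after the launch.
    if (CheckCuda(cudaGetLastError(), "kernel launch")) return 1;

    // Errors raised while the kernel runs are reported at the next synchronization.
    if (CheckCuda(cudaDeviceSynchronize(), "kernel execution")) return 1;

    cudaFree(d_out);
    return 0;
}
```

With the original 256 x 256 block and no fallback, the cudaGetLastError() check is the one that would fire. cudaDeviceSynchronize() is the CUDA 4.0 replacement for the deprecated cudaThreadSynchronize().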

Wow, I feel dumb now. Thanks!
