error : "too many resources requested for launch"

After updating the CUDA toolkit and the video driver (updated together to stay compatible), my CUDA code no longer works.

// init
checkCudaErrors( cudaMalloc((void**)&verts, num_verts_ * sizeof(float3)) );
checkCudaErrors( cudaMemcpy( verts, verts_, num_verts_ * sizeof(float3), cudaMemcpyHostToDevice ) );
num_verts = num_verts_;
	
checkCudaErrors( cudaMalloc((void**)&indices, num_indices_ * sizeof(int)) );
checkCudaErrors( cudaMemcpy( indices, indices_, num_indices_ * sizeof(int), cudaMemcpyHostToDevice ) );
num_indices = num_indices_;

..

// __global__ kernel
for(int i = 0; i < num_indices/3; i++) {

   int i0 = indices[i*3+0] - 1;
   int i1 = indices[i*3+1] - 1;
   int i2 = indices[i*3+2] - 1;

   float3 v0 = make_float3( 10, 10, i0  ); // ( u,  v, 0 );
   float3 e1 = make_float3( i1, 10, 10 ); // ( i1, u, v );
   float3 e2 = make_float3( 10, i2, 10 ); // ( v, i2, v );
			
   float t =  RayTriangleIntersection( r, v0, e1, e2 ); // the error appears with this line; the function itself was tested and works

..
}

Is this a driver bug?

Could you try the deviceQuery example from the NVIDIA SDK? Usually, if there are (version) problems, you can see them there.

From which version did you update?

The one that still had cutil_inline.h, probably cudatoolkit_4.0.17_win_32.

deviceQuery: Result = PASS.

P.S.: In CUDA toolkit v5.5 there are no .cpp files under ProgramData for the matrixMul sample.

GPU: GeForce 8800 GT

When I run the deviceQuery program, near the beginning of the output I get this:

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 540M"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    2.1

Are the cuda driver version and the runtime version matching?
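If deviceQuery is not at hand, a minimal sketch like the one below (runtime API only) prints both versions; a mismatch here usually means the driver is older than the toolkit:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);   // CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVersion); // CUDA runtime version the app was built against
    printf("Driver CUDA version:  %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("Runtime CUDA version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}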

matching (5.5).

I ran a cut-down version of this CUDA application and it is 5x-7x slower than it was. Also, tex1Dfetch, which used to work, does not work now.

And when I tried to add the remaining code back, the application crashed, regardless of which code I added.

win 7 x64

When I upgraded from CUDA 4 there were some small changes, such as cudaMemcpyToSymbol no longer requiring the symbol name in quotes. Assuming the simulation examples from the SDK work, maybe there is something similar in your case: some obsolete features were removed but do not produce compile errors. That is why you need to find the exact line where the code stops working.
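For example, a minimal sketch (devScale is just an illustrative __constant__ variable, not something from your code):

#include <cuda_runtime.h>

__constant__ float devScale;

void setScale(float value)
{
    // CUDA 4.x style: symbol name as a string. Under 5.x this still compiles
    // (the string converts to the generic symbol parameter) but fails at run
    // time with "invalid device symbol".
    // cudaMemcpyToSymbol("devScale", &value, sizeof(value));

    // CUDA 5.x style: pass the symbol itself.
    cudaMemcpyToSymbol(devScale, &value, sizeof(value));
}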

Regarding the speed: some new tuning might be necessary (threads per block, launch bounds). Each new release has a list of changes.

There is no single exact line where the code fails: adding almost any extra line (code that was there before) makes the launch fail with “too many resources requested for launch” or some other error.
It is a Win32 CUDA application.

And a Win32 OpenCL application that used to work does not work now, although it worked on Win 7 x32.

Is there any CUDA program besides deviceQuery that runs correctly?

CUDA Samples work correctly.

OK, I see. I think if you post the details of your previous and present setups (OS, CUDA version, NVIDIA driver, Visual C++ version) you will get faster results.

Otherwise, uninstall everything including the driver, use something like Driver Sweeper to erase all traces of the previous driver, and try again. I also understand that 5.5 does not work with VC 2010.

According to the CUDA 5.5 release notes, VS2010 is supported by CUDA 5.5:

Release Notes :: CUDA Toolkit Documentation

Table 2. Windows Compilers Supported in CUDA 5.5

Compiler          IDE
Visual C++ 11.0   Visual Studio 2012
Visual C++ 11.0   Visual Studio 2012 Express (32-bit)
Visual C++ 10.0   Visual Studio 2010
Visual C++ 9.0    Visual Studio 2008

I'm not sure, but among my old files I have “cudatoolkit_4.0.17_win_32.msi” and “gpucomputingsdk_4.0.17_win_32”. I don't remember the driver version (2xx).
It was Win 7 x32, MSVS 2008 (SP1?).

Now it is “cuda_5.5.20_winvista_win7_win8_general_64” and “gpucomputingsdk_4.0.19_win_64”.
Win 7 x64, MSVS 2008 (SP1), driver 320.57.


I rewrote the OpenCL oclVolumeRender sample and got these results:

// The OpenCL code was fine when last tested (on Win 7 x32).
// .cl

// With "i = 0; i < 2;" it does not work (error CL_OUT_OF_RESOURCES),
// but it works with "i = 0; i < 1;" and with "i = 1; i < 2;".
for(int i = 0; i < 2; i++) {

    float4 v0 = triangles[i*3];
    float4 e1 = triangles[i*3+1];
    float4 e2 = triangles[i*3+2];

    float t = RayTriangleIntersection(r, (float3)(v0.x,v0.y,v0.z), (float3)(e1.x,e1.y,e1.z), (float3)(e2.x,e2.y,e2.z));

    if(t < hit_r.t && t > 0.001)
    {
        hit_r.t = t;
        hit_r.hit_index = i;
    }
}

Your old platform was a 32-bit platform, while the new platform is a 64-bit platform. You might want to check:

(1) whether your code (host or device) makes any assumptions about the size of pointers, in particular whether it assumes that pointers occupy four bytes

(2) whether all memory allocations are successful (they may have grown if they involve storage for pointers). Does the code check the status of every CUDA API call and every CUDA kernel launch? (see the sketch after this list)

(3) whether there are any structs that contain pointers, which may get padded differently on the host and the device, leading to size differences and ultimately to corrupted data. The CUDA toolchain tries to ensure that structs are portable between host and device code as far as the host platform allows, but this only works if relevant code is compiled with nvcc. If a struct is passed from regular C/C++ host code it could have a different size.
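A minimal sketch of what "check everything" means in practice (CUDA_CHECK is only an illustrative name, not from the posted code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

int main()
{
    CUDA_CHECK(cudaSetDevice(0));

    // After every kernel launch:
    // myKernel<<<grid, block>>>(...);
    CUDA_CHECK(cudaGetLastError());      // catches launch errors such as
                                         // "too many resources requested for launch"
    CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during kernel execution
    return 0;
}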

I suspect register use as the root cause. Use maxrregcount or launch_bounds to make sure you do not run out of registers for launching your desired block size.
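A minimal sketch of both (traceKernel and the 256-thread bound are illustrative, not taken from the posted code):

#include <cstdio>
#include <cuda_runtime.h>

// Promise the compiler this kernel will be launched with at most 256 threads
// per block (16x16); it then limits register usage accordingly, spilling to
// local memory if necessary.
__global__ void __launch_bounds__(256)
traceKernel(const float3 *verts, const int *indices, int num_indices)
{
    // ... kernel body ...
}

int main()
{
    // Check what the kernel actually needs: if maxThreadsPerBlock comes out
    // below 256, a 16x16 launch fails with "too many resources requested for launch".
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, traceKernel);
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);
    return 0;
}

// Alternatively, cap register usage for the whole file at compile time:
//   nvcc -maxrregcount=32 ...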

If I use dim3(8,16,1) (instead of the (16,16,1) that was in the tested app), then it works,
but it is 15x-20x slower than it was.

This code is 25x slower:

for ( int i = 0; i < num_indices/3; ++i )  // i from 0 to 156, num_verts = 104
{
    //float4 v0 = tex1Dfetch(triangle_texture, i*3);
    //float4 e1 = tex1Dfetch(triangle_texture, i*3+1);
    //float4 e2 = tex1Dfetch(triangle_texture, i*3+2);

    float3 v0 = verts[ indices[i*3+0] - 1 ];
    float3 e1 = verts[ indices[i*3+1] - 1 ] - v0;
    float3 e2 = verts[ indices[i*3+2] - 1 ] - v0; // faces[i].i2

    float t = RayTriangleIntersection(r, make_float3(v0.x,v0.y,v0.z), make_float3(e1.x,e1.y,e1.z), make_float3(e2.x,e2.y,e2.z));
    ..
}

than the code below:

for ( int i = 0; i < num_indices/3; ++i ) // i from 0 to 156, num_verts = 104
{
    float4 v0 = tex1Dfetch(triangle_texture, i*3);
    float4 e1 = tex1Dfetch(triangle_texture, i*3+1);
    float4 e2 = tex1Dfetch(triangle_texture, i*3+2);

    //float3 v0 = verts[ indices[i*3+0] - 1 ];
    //float3 e1 = verts[ indices[i*3+1] - 1 ] - v0;
    //float3 e2 = verts[ indices[i*3+2] - 1 ] - v0; // faces[i].i2

    float t = RayTriangleIntersection(r, make_float3(v0.x,v0.y,v0.z), make_float3(e1.x,e1.y,e1.z), make_float3(e2.x,e2.y,e2.z));
    ..
}

So, does anybody know why arrays allocated with cudaMalloc are 25x slower than tex1Dfetch?

Cached vs. uncached access.

Texture access also tolerates uncoalesced access patterns much better.

The CPU computes this code 2x faster than the GPU does (with arrays allocated with cudaMalloc).

“vectorAdd2” (0.74 ms) is 13.5x faster than “vectorAdd1” (10 ms); these are this GPU's limits.

But another thread says the difference is only 30% (1.01 ms vs 1.28 ms). So please test this code!

#include <stdio.h>

#include <helper_math.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

texture<float4, 1, cudaReadModeElementType> tex1, tex2, tex3; // the data is stored in 1D float4 textures

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A, B, and D into C. All vectors have the same
 * number of elements, numElements.
 */
__global__ void
vectorAdd1(const float4 *A, const float4 *B, const float4 *D, float4 *C, int * Ai, int * Bi, int * Di, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        for ( int j=0; j<numElements; ++j ) {// 4000
            C[i] = A[ Ai[j] ] + B[ Bi[j] ] + D[ Di[j] ]; 
        }
    }
}

__global__ void
vectorAdd2(float4 *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        for ( int j=0; j<numElements; ++j ) { // 4000 // 
            float4 Ai = tex1Dfetch( tex1, j );
            float4 Bi = tex1Dfetch( tex2, j );
            float4 Di = tex1Dfetch( tex3, j );

            C[i] = Ai + Bi + Di; 
        }
    }
}

struct Data {

    float4 *d_A;
    float4 *h_A;

    Data( int num_elements ) {

        const int size = num_elements * sizeof ( float4 );

        h_A = (float4 *)malloc(  size);

        for (int i = 0; i < num_elements; ++i)
        {
            h_A[i] = make_float4( 
                rand()/(float)RAND_MAX, rand()/(float)RAND_MAX, 
                rand()/(float)RAND_MAX, rand()/(float)RAND_MAX );
        }

        cudaMalloc((void **)&d_A, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    } ; 

    ~Data() {
        cudaFree(d_A);
        free(h_A);
    } ; 
} ;

struct Data_i {

    int *d_A;
    int *h_A;

    Data_i( int num_elements ) {

        const int size = num_elements * sizeof ( int );

        h_A = (int *)malloc( size);

        for (int i = 0; i < num_elements; ++i)
        {
            h_A[i] = rand() % num_elements; 
        }

        cudaMalloc((void **)&d_A, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    } ; 

    ~Data_i() {
        cudaFree(d_A);
        free(h_A);
    } ; 
} ;

//float4 get_Random_Float4_CUDA_Ptr ( int num_elements );
void bindTex( texture<float4, 1, cudaReadModeElementType> & tex, float4 * verts, int num_verts );

/**
 * Host main routine
 */
int main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 5000;

    Data d_A( numElements ), d_B( numElements ), d_C( numElements ), d_D( numElements ) ;
    Data_i d_Ai(numElements), d_Bi(numElements), d_Di(numElements);

    bindTex ( tex1, d_A.d_A, numElements );
    bindTex ( tex2, d_B.d_A, numElements );
    bindTex ( tex3, d_D.d_A, numElements );

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements/4 + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);

    cudaEventRecord(start);

    vectorAdd1<<<blocksPerGrid, threadsPerBlock>>>(d_A.d_A, d_B.d_A, d_D.d_A, d_C.d_A, d_Ai.d_A, d_Bi.d_A, d_Di.d_A, numElements/4);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf ("\nTime for the kernel: %f ms\n", milliseconds );

    cudaEventRecord(start);

    vectorAdd2<<<blocksPerGrid, threadsPerBlock>>>(d_C.d_A, numElements/4);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf ("\nTime for the kernel (tex1DFetch): %f ms\n", milliseconds );

    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Reset the device and exit
    err = cudaDeviceReset();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    printf("Done\n");

    getchar();

    return 0;
}

void bindTex( texture<float4, 1, cudaReadModeElementType> & tex, float4 * verts, int num_verts )
{
    tex.normalized = false;                      // use unnormalized texture coordinates
    tex.filterMode = cudaFilterModePoint;        // point sampling, no filtering
    tex.addressMode[0] = cudaAddressModeWrap;    // wrap texture coordinates

    size_t size = sizeof(float4)*num_verts;       
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float4>();
    cudaBindTexture(0,tex,verts,channelDesc,size);
}