error : "too many resources requested for launch"

After updating the CUDA toolkit and the video driver (updated together to stay compatible), my CUDA code no longer works.

// init
checkCudaErrors( cudaMalloc((void**)&verts, num_verts_ * sizeof(float3)) );
checkCudaErrors( cudaMemcpy( verts, verts_, num_verts_ * sizeof(float3), cudaMemcpyHostToDevice ) );
num_verts = num_verts_;
	
checkCudaErrors( cudaMalloc((void**)&indices, num_indices_ * sizeof(int)) );
checkCudaErrors( cudaMemcpy( indices, indices_, num_indices_ * sizeof(int), cudaMemcpyHostToDevice ) );
num_indices = num_indices_;

..

// __global__ kernel
for(int i = 0; i < num_indices/3; i++) {

   int i0 = indices[i*3+0] - 1;
   int i1 = indices[i*3+1] - 1;
   int i2 = indices[i*3+2] - 1;

   float3 v0 = make_float3( 10, 10, i0  ); // ( u,  v, 0 );
   float3 e1 = make_float3( i1, 10, 10 ); // ( i1, u, v );
   float3 e2 = make_float3( 10, i2, 10 ); // ( v, i2, v );
			
   float t =  RayTriangleIntersection( r, v0, e1, e2 ); // the error appears with this line; the function itself was tested and works

..
}

Is this a driver bug?

Could you try the deviceQuery example from the NVIDIA SDK? Usually, if there are (version) problems, you can see them there.

From which version did you update?

The one that still had cutil_inline.h, probably cudatoolkit_4.0.17_win_32.

deviceQuery: Result = PASS.

P.S.: In CUDA toolkit v5.5 there are no .cpp files under ProgramData for the matrixMul sample.

GPU: GeForce 8800 GT

When I run the deviceQuery program, near the beginning of the output I get this:

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 540M"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    2.1

Are the cuda driver version and the runtime version matching?
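If deviceQuery is not at hand, a minimal sketch like the one below (runtime API only) prints both versions; a mismatch here usually means the driver is older than the toolkit:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);   // CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVersion); // CUDA runtime version the app was built against
    printf("Driver CUDA version:  %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("Runtime CUDA version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}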

matching (5.5).

I ran a cut-down version of this CUDA application and it is 5x-7x slower than it was. Also, tex1Dfetch, which used to work, does not work now.

And when I tried to add the remaining code back, the application crashed, regardless of which code I added.

win 7 x64

When I upgraded from CUDA 4 there were some small changes, such as cudaMemcpyToSymbol no longer requiring the symbol name in quotes. Assuming the simulation examples from the SDK work, maybe there is something similar in your case: some obsolete features were removed but do not produce compile errors. That is why you need to find the exact line where the code stops working.
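For example, a minimal sketch (devScale is just an illustrative __constant__ variable, not something from your code):

#include <cuda_runtime.h>

__constant__ float devScale;

void setScale(float value)
{
    // CUDA 4.x style: symbol name as a string. Under 5.x this still compiles
    // (the string converts to the generic symbol parameter) but fails at run
    // time with "invalid device symbol".
    // cudaMemcpyToSymbol("devScale", &value, sizeof(value));

    // CUDA 5.x style: pass the symbol itself.
    cudaMemcpyToSymbol(devScale, &value, sizeof(value));
}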

Regarding the speed: some new tuning might be necessary (threads per block, launch bounds). Each new release has a list of changes.

There is no single exact line where the code fails: adding almost any extra line (code that was there before) makes the launch fail with “too many resources requested for launch” or some other error.
It is a Win32 CUDA application.

And a Win32 OpenCL application that used to work does not work now, although it worked on Win 7 x32.

Is there any CUDA program besides deviceQuery that runs correctly?

CUDA Samples work correctly.

OK, I see. I think if you post the details of your previous and present setups (OS, CUDA version, NVIDIA driver, Visual C++ version) you will get faster results.

Otherwise, uninstall everything including the driver, use something like Driver Sweeper to erase all traces of the previous driver, and try again. I also understand that 5.5 does not work with VC 2010.

According to the CUDA 5.5 release notes, VS2010 is supported by CUDA 5.5:

Release Notes :: CUDA Toolkit Documentation

Table 2. Windows Compilers Supported in CUDA 5.5

Compiler          IDE
Visual C++ 11.0   Visual Studio 2012
Visual C++ 11.0   Visual Studio 2012 Express (32-bit)
Visual C++ 10.0   Visual Studio 2010
Visual C++ 9.0    Visual Studio 2008

I'm not sure, but among my old files I have “cudatoolkit_4.0.17_win_32.msi” and “gpucomputingsdk_4.0.17_win_32”. I don't remember the driver version (2xx).
It was Win 7 x32, MSVS 2008 (SP1?).

Now it is “cuda_5.5.20_winvista_win7_win8_general_64” and “gpucomputingsdk_4.0.19_win_64”.
Win 7 x64, MSVS 2008 (SP1), driver 320.57.


I rewrote the OpenCL oclVolumeRender sample and got these results:

// The OpenCL code was fine when last tested (on Win 7 x32).
// .cl

// With "i = 0; i < 2;" it does not work (error CL_OUT_OF_RESOURCES),
// but it works with "i = 0; i < 1;" and with "i = 1; i < 2;".
for(int i = 0; i < 2; i++) {

    float4 v0 = triangles[i*3];
    float4 e1 = triangles[i*3+1];
    float4 e2 = triangles[i*3+2];

    float t = RayTriangleIntersection(r, (float3)(v0.x,v0.y,v0.z), (float3)(e1.x,e1.y,e1.z), (float3)(e2.x,e2.y,e2.z));

    if(t < hit_r.t && t > 0.001)
    {
        hit_r.t = t;
        hit_r.hit_index = i;
    }
}

Your old platform was a 32-bit platform, while the new platform is a 64-bit platform. You might want to check:

(1) whether your code (host or device) makes any assumptions about the size of pointers, in particular whether it assumes that pointers occupy four bytes

(2) whether all memory allocations are successful (they may have grown if they involve storage for pointers). Does the code check the status of every CUDA API call and every CUDA kernel launch? (see the sketch after this list)

(3) whether there are any structs that contain pointers, which may get padded differently on the host and the device, leading to size differences and ultimately to corrupted data. The CUDA toolchain tries to ensure that structs are portable between host and device code as far as the host platform allows, but this only works if relevant code is compiled with nvcc. If a struct is passed from regular C/C++ host code it could have a different size.
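A minimal sketch of what "check everything" means in practice (CUDA_CHECK is only an illustrative name, not from the posted code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

int main()
{
    CUDA_CHECK(cudaSetDevice(0));

    // After every kernel launch:
    // myKernel<<<grid, block>>>(...);
    CUDA_CHECK(cudaGetLastError());      // catches launch errors such as
                                         // "too many resources requested for launch"
    CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during kernel execution
    return 0;
}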

I suspect register use as the root cause. Use maxrregcount or launch_bounds to make sure you do not run out of registers for launching your desired block size.
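A minimal sketch of both (traceKernel and the 256-thread bound are illustrative, not taken from the posted code):

#include <cstdio>
#include <cuda_runtime.h>

// Promise the compiler this kernel will be launched with at most 256 threads
// per block (16x16); it then limits register usage accordingly, spilling to
// local memory if necessary.
__global__ void __launch_bounds__(256)
traceKernel(const float3 *verts, const int *indices, int num_indices)
{
    // ... kernel body ...
}

int main()
{
    // Check what the kernel actually needs: if maxThreadsPerBlock comes out
    // below 256, a 16x16 launch fails with "too many resources requested for launch".
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, traceKernel);
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);
    return 0;
}

// Alternatively, cap register usage for the whole file at compile time:
//   nvcc -maxrregcount=32 ...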

If I use dim3(8,16,1) (instead of the (16,16,1) that was in the tested app), then it works,
but it is 15x-20x slower than it was.

This code is 25x slower:

for ( int i = 0; i < num_indices/3; ++i )  // i from 0 to 156, num_verts = 104
{
    //float4 v0 = tex1Dfetch(triangle_texture, i*3);
    //float4 e1 = tex1Dfetch(triangle_texture, i*3+1);
    //float4 e2 = tex1Dfetch(triangle_texture, i*3+2);

    float3 v0 = verts[ indices[i*3+0] - 1 ];
    float3 e1 = verts[ indices[i*3+1] - 1 ] - v0;
    float3 e2 = verts[ indices[i*3+2] - 1 ] - v0; // faces[i].i2

    float t = RayTriangleIntersection(r, make_float3(v0.x,v0.y,v0.z), make_float3(e1.x,e1.y,e1.z), make_float3(e2.x,e2.y,e2.z));
    ..
}

than the code below:

for ( int i = 0; i < num_indices/3; ++i ) // i from 0 to 156, num_verts = 104
{
    float4 v0 = tex1Dfetch(triangle_texture, i*3);
    float4 e1 = tex1Dfetch(triangle_texture, i*3+1);
    float4 e2 = tex1Dfetch(triangle_texture, i*3+2);

    //float3 v0 = verts[ indices[i*3+0] - 1 ];
    //float3 e1 = verts[ indices[i*3+1] - 1 ] - v0;
    //float3 e2 = verts[ indices[i*3+2] - 1 ] - v0; // faces[i].i2

    float t = RayTriangleIntersection(r, make_float3(v0.x,v0.y,v0.z), make_float3(e1.x,e1.y,e1.z), make_float3(e2.x,e2.y,e2.z));
    ..
}

So, does anybody know why arrays allocated with cudaMalloc are 25x slower than tex1Dfetch?

Cached vs. uncached access.

Texture access also tolerates uncoalesced access patterns much better.

The CPU computes this code 2x faster than the GPU does (with arrays allocated with cudaMalloc).

“vectorAdd2” (0.74 ms) is 13.5x faster than “vectorAdd1” (10 ms); these are this GPU's limits.

But another thread says the difference is only 30% (1.01 ms vs 1.28 ms). So please test this code!

#include <stdio.h>

#include <helper_math.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

texture<float4, 1, cudaReadModeElementType> tex1, tex2, tex3; // the data is stored in 1D float4 textures

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A, B, and D into C. All vectors have the same
 * number of elements, numElements.
 */
__global__ void
vectorAdd1(const float4 *A, const float4 *B, const float4 *D, float4 *C, int * Ai, int * Bi, int * Di, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        for ( int j=0; j<numElements; ++j ) {// 4000
            C[i] = A[ Ai[j] ] + B[ Bi[j] ] + D[ Di[j] ]; 
        }
    }
}

__global__ void
vectorAdd2(float4 *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        for ( int j=0; j<numElements; ++j ) { // 4000 // 
            float4 Ai = tex1Dfetch( tex1, j );
            float4 Bi = tex1Dfetch( tex2, j );
            float4 Di = tex1Dfetch( tex3, j );

            C[i] = Ai + Bi + Di; 
        }
    }
}

struct Data {

    float4 *d_A;
    float4 *h_A;

    Data( int num_elements ) {

        const int size = num_elements * sizeof ( float4 );

        h_A = (float4 *)malloc(  size);

        for (int i = 0; i < num_elements; ++i)
        {
            h_A[i] = make_float4( 
                rand()/(float)RAND_MAX, rand()/(float)RAND_MAX, 
                rand()/(float)RAND_MAX, rand()/(float)RAND_MAX );
        }

        cudaMalloc((void **)&d_A, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    } ; 

    ~Data() {
        cudaFree(d_A);
        free(h_A);
    } ; 
} ;

struct Data_i {

    int *d_A;
    int *h_A;

    Data_i( int num_elements ) {

        const int size = num_elements * sizeof ( int );

        h_A = (int *)malloc( size);

        for (int i = 0; i < num_elements; ++i)
        {
            h_A[i] = rand() % num_elements; 
        }

        cudaMalloc((void **)&d_A, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    } ; 

    ~Data_i() {
        cudaFree(d_A);
        free(h_A);
    } ; 
} ;

//float4 get_Random_Float4_CUDA_Ptr ( int num_elements );
void bindTex( texture<float4, 1, cudaReadModeElementType> & tex, float4 * verts, int num_verts );

/**
 * Host main routine
 */
int main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 5000;

    Data d_A( numElements ), d_B( numElements ), d_C( numElements ), d_D( numElements ) ;
    Data_i d_Ai(numElements), d_Bi(numElements), d_Di(numElements);

    bindTex ( tex1, d_A.d_A, numElements );
    bindTex ( tex2, d_B.d_A, numElements );
    bindTex ( tex3, d_D.d_A, numElements );

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements/4 + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);

    cudaEventRecord(start);

    vectorAdd1<<<blocksPerGrid, threadsPerBlock>>>(d_A.d_A, d_B.d_A, d_D.d_A, d_C.d_A, d_Ai.d_A, d_Bi.d_A, d_Di.d_A, numElements/4);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf ("\nTime for the kernel: %f ms\n", milliseconds );

    cudaEventRecord(start);

    vectorAdd2<<<blocksPerGrid, threadsPerBlock>>>(d_C.d_A, numElements/4);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf ("\nTime for the kernel (tex1DFetch): %f ms\n", milliseconds );

    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Reset the device and exit
    err = cudaDeviceReset();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    printf("Done\n");

    getchar();

    return 0;
}

void bindTex( texture<float4, 1, cudaReadModeElementType> & tex, float4 * verts, int num_verts )
{
    tex.normalized = false;                      // use unnormalized texture coordinates
    tex.filterMode = cudaFilterModePoint;        // point sampling, no filtering
    tex.addressMode[0] = cudaAddressModeWrap;    // wrap texture coordinates

    size_t size = sizeof(float4)*num_verts;       
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float4>();
    cudaBindTexture(0,tex,verts,channelDesc,size);
}