Why is kernel getting failed?

I am generating a 3d array with the dimension(nz, ny, nx) and 1d arrays along 3 dimensions x- axis[nx], y-axis[ny] and z-axis[nz]. Then I am using texture memory to interpolate it.

But I am not getting about how kernel failed?

Here the following code what I am trying to do

#include <iostream>
#include <fstream>
#define NX 50
#define NY 50
#define NZ 10
#define ifact 2

texture<float, 3> tex;

__global__ void getInterpolatedFunctionValue(float *a, float *b, float *c,  float *result, int nx, int ny, int nz)
{
// Calculate normalized texture coordinates
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int z = blockIdx.z * blockDim.z + threadIdx.z;

if ((x < nx) && (y < ny) && (z < nz))
{
	printf("hello \n") ;
	float w = a[x] + 0.5f, v = b[y] + 0.5f, u = c[z] + 0.5f  ;
	result[z * nz * ny + y * nz + x ] = tex3D(tex, u, v, w );
}
}

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

using namespace std;

int main(){

int nx=NX, ny=NY, nz=NZ;
float fff[nz][ny][nx];
float x[nx], y[ny], z[nz] ;

for(int ix=0; ix<nx; ix++)
  for(int iy=0; iy<ny; iy++)
    for(int iz=0; iz<nz; iz++){
		x[ix] =  ix / ifact ;
		y[iz] =  iy / ifact ;
		z[iz] =  iz / ifact ;
      fff[iz][iy][ix] = sin(ix/(float)10)*cos(iy/(float)10)+iz;
    }

float *d_x, *d_y, *d_z, *d_result, *h_result ;


cudaMalloc((void**)&d_x, nx * sizeof(float));
cudaMalloc((void**)&d_y, ny * sizeof(float));
cudaMalloc((void**)&d_z, nz * sizeof(float));
cudaMalloc((void**)&d_result, nx * ny * nz * sizeof(float));


cudaMemcpy( d_x, x, nx * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_y, y, ny * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_z, z, nz * sizeof(float), cudaMemcpyHostToDevice );

cudaCheckErrors("allocating an array is failed");	

cudaArray *d_volumeArray ;

//const cudaExtent extent = make_cudaExtent(nx, ny, nz);
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaMalloc3DArray(&d_volumeArray, &channelDesc, make_cudaExtent(nz, ny, nx));
cudaCheckErrors("cudaMalloc3D error");

cudaMemcpy3DParms copyParams = {0};
copyParams.srcPtr   = make_cudaPitchedPtr((void*)fff, sizeof(float)*nz,ny,nx);
copyParams.dstArray = d_volumeArray;
copyParams.dstPos   = make_cudaPos(0,0,0);
copyParams.srcPos   = make_cudaPos(0,0,0);
copyParams.extent   = make_cudaExtent(nz, ny, nx);
copyParams.kind     = cudaMemcpyHostToDevice;
cudaCheckErrors("copyParams3D fail");
cudaMemcpy3D(&copyParams);
cudaCheckErrors("cudaMemcpy3DParms fail");

tex.normalized = false;
tex.filterMode = cudaFilterModeLinear;
tex.addressMode[0] = cudaAddressModeClamp;
tex.addressMode[1] = cudaAddressModeClamp;
tex.addressMode[2] = cudaAddressModeClamp;

cudaBindTextureToArray(tex, d_volumeArray, channelDesc);
cudaCheckErrors("bind fail");	

const dim3 blockSize(32, 8, 8 );
const dim3 gridSize(((nz + blockSize.z )/blockSize.z),((ny + blockSize.y )/blockSize.y),((nx + blockSize.x)/blockSize.x));

getInterpolatedFunctionValue<<<gridSize, blockSize>>>(d_x, d_y, d_z, d_result, nx, ny, nz ) ;

cudaCheckErrors("kernel fail");
cudaDeviceSynchronize();	
cudaMemcpy( h_result, d_result, nx * ny * nz * sizeof(float),cudaMemcpyDeviceToHost);		
cudaCheckErrors("cudaMemcpy fail");
  
  
printf("success!\n");

cudaUnbindTexture(tex);
cudaCheckErrors("unbind fail");
cudaFreeArray(d_volumeArray);
cudaCheckErrors("free fail");

return 0;
}

Is there any thread issue, or any else?

When I run your code, I get an error output. The error output is instructive, and indicates what you should be looking at. What happens when you run your code?

If you get an error output, include that error output in your question.

Have you run your code with cuda-memcheck as I already suggested to you here?

https://devtalk.nvidia.com/default/topic/1044591/cuda-programming-and-performance/use-cudagetdevicecount-properly-floating-point-exceptions/

If so, include the output from cuda-memcheck in your question, when asking for help.

If not, please do that.

When I compiled that code, I got this :

$ nvcc -o exec  tex3d_array.cu
tex3d_array.cu(109): warning: variable "h_result" is used before its value is set

tex3d_array.cu(109): warning: variable "h_result" is used before its value is set

While executing it

$ ./exec
Fatal error: kernel fail (invalid configuration argument at tex3d_array.cu:107)
*** FAILED - ABORTING

While using cuda-memcheck

$ cuda-memcheck ./exec
========= CUDA-MEMCHECK
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so.1 [0x351c13]
=========     Host Frame:./exec [0x48fd5]
=========     Host Frame:./exec [0x4d2f]
=========     Host Frame:./exec [0x4b73]
=========     Host Frame:./exec [0x4bcb]
=========     Host Frame:./exec [0x4730]
=========     Host Frame:/usr/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22445]
=========     Host Frame:./exec [0x39e9]
=========
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so.1 [0x351c13]
=========     Host Frame:./exec [0x381a3]
=========     Host Frame:./exec [0x4739]
=========     Host Frame:/usr/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22445]
Fatal error: kernel fail (invalid configuration argument at tex3d_array.cu:107)
=========     Host Frame:./exec [0x39e9]
=========
*** FAILED - ABORTING
========= ERROR SUMMARY: 2 errors

Here, Kernel is getting failed.

I started to write a new code from scratch still facing an issue. https://devtalk.nvidia.com/default/topic/1044635/tex3d-lt-float-gt-not-able-to-to-calculate-from-a-code-/#5300141

How could I solve this issue? Am I doing something wrong while using texture memory?

Next time, include this output in your question, before asking others for help. It makes it a lot easier for others to help you.

This error:

Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunchKernel.

means that your kernel launch is misconfigured. The kernel launch configuration is the data between the <<<…>>> symbols.

On your kernel launch it is this:

getInterpolatedFunctionValue<<<gridSize, blockSize>>>(d_x, d_y, d_z, d_result, nx, ny, nz ) ;
                               ^^^^^^^^^^^^^^^^^^^

So, print out the .x, .y, and .z components for each of gridSize and blockSize, and compare those to the hardware limits for each, which can be obtained by running the deviceQuery sample code. In addition, the total number of threads per block cannot exceed 1024.

Let’s cut to the chase:

This is an illegal choice:

const dim3 blockSize(32, 8, 8 );

cuda kernels are limited to a maximum of 1024 threads per block. This is the product of the block dimensions, just like you would do if you were finding the volume of a 3D rectangular prism.

3288 > 1024, therefore your kernel has an invalid configuration argument (namely, blockSize)

As an aside, this looks a little weird:

const dim3 gridSize(((nz + blockSize.z )/blockSize.z),((ny + blockSize.y )/blockSize.y),((nx + blockSize.x)/blockSize.x));

The customary ordering for the constructor for a dim3 is x,y,z not z,y,x

It’s not illegal per-se, but not be giving you the results you are expecting and could lead to possible unexpected indexing in your kernel code.

Regarding the use of 3D texturing, I would start by studying the CUDA sample codes that cover 3D texturing. No, I won’t be able to list them for you. Use a tool like grep to search the sample codes directly, or study the sample code documentation at:

https://docs.nvidia.com/cuda/index.html

I changed as you suggested

const dim3 blockSize(32, 8, 2 );
const dim3 gridSize(((nz + blockSize.x )/blockSize.x),((ny + blockSize.y )/blockSize.y),((nx + blockSize.z)/blockSize.z));

printf("Blocksize.x = %i, blockSize.y = %i, blockSize.z = %i \n", blockSize.x, blockSize.y, blockSize.z);
printf("gridSize.x = %i, gridSize.y = %i, gridSize.z = %i \n", gridSize.x, gridSize.y, gridSize.z);

getInterpolatedFunctionValue<<<gridSize, blockSize>>>(d_x, d_y, d_z, d_result, nx, ny, nz ) ;

cudaCheckErrors("kernel fail");
cudaDeviceSynchronize();	
h_result = (float*) malloc(nx * ny * nz) ;
cudaMemcpy( h_result, d_result, nx * ny * nz * sizeof(float),cudaMemcpyDeviceToHost);		
cudaCheckErrors("cudaMemcpy fail");
  
cudaUnbindTexture(tex);
cudaCheckErrors("unbind fail");

cudaFree(d_x);
cudaFree(d_y);
cudaFree(d_z);
cudaFree(d_result);
cudaCheckErrors("cudaFree fail");
cudaFreeArray(d_volumeArray);
cudaCheckErrors("free fail");

printf("success!\n");
free(x);
free(y);
free(z);

It worked. But still, I am facing an issue

========= CUDA-MEMCHECK
Blocksize.x = 32, blockSize.y = 8, blockSize.z = 2
gridSize.x = 1, gridSize.y = 7, gridSize.z = 26
*** Error in `./sam': free(): corrupted unsorted chunks: 0x0000000003de2270 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x81499)[0x2b81b7dd4499]
/usr/lib64/libcuda.so.1(+0x115602)[0x2b81b9522602]
/usr/lib64/libcuda.so.1(+0x114bae)[0x2b81b9521bae]
/usr/lib64/libcuda.so.1(+0x1240f5)[0x2b81b95310f5]
/usr/lib64/libcuda.so.1(+0x11811d)[0x2b81b952511d]
/usr/lib64/libcuda.so.1(+0x3416c8)[0x2b81b974e6c8]
/usr/lib64/libcuda.so.1(+0x1e45cb)[0x2b81b95f15cb]
/usr/lib64/libcuda.so.1(+0xf8264)[0x2b81b9505264]
/usr/lib64/libcuda.so.1(cuMemFree_v2+0x15a)[0x2b81b9639a1a]
./sam[0x410c44]
./sam[0x443bfc]
./sam[0x40498f]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b81b7d75445]
./sam[0x403a29]
======= Memory map: ========
00400000-0046c000 r-xp 00000000 00:2b 591526813                          /home/bbadal/texccudapractice/sam
0066c000-0066f000 r-xp 0006c000 00:2b 591526813                          /home/bbadal/texccudapractice/sam
0066f000-00670000 rwxp 0006f000 00:2b 591526813                          /home/bbadal/texccudapractice/sam
00670000-00671000 rwxp 00000000 00:00 0
0191a000-04333000 rwxp 00000000 00:00 0                                  [heap]
200000000-200200000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
200200000-200600000 ---p 00000000 00:00 0
200600000-200800000 rwxs 00000000 00:05 12356                            /dev/nvidia0
200800000-200c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
200c00000-200e00000 rwxs 00000000 00:05 12356                            /dev/nvidia0
200e00000-201e00000 ---p 00000000 00:00 0
201e00000-202000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
202000000-202400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
202400000-202800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
202800000-202c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
202c00000-203000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
203000000-203400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
203400000-203800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
203800000-203c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
203c00000-204000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
204000000-204400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
204400000-204800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
204800000-204c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
204c00000-205000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
205000000-205400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
205400000-205800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
205800000-205c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
205c00000-206000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
206000000-206400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
206400000-206800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
206800000-206c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
206c00000-207000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
207000000-207400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
207400000-207800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
207800000-207c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
207c00000-208000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
208000000-208400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
208400000-208800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
208800000-208c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
208c00000-209000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
209000000-209400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
209400000-209800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
209800000-209c00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
209c00000-20a000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
20a000000-20a400000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
20a400000-20a800000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
20a800000-20ac00000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
20ac00000-20ae00000 ---p 00000000 00:00 0
20ae00000-20b000000 rwxs 00000000 00:05 12332                            /dev/nvidiactl
20b000000-600200000 ---p 00000000 00:00 0
10000000000-10004000000 ---p 00000000 00:00 0
2b81b6a4d000-2b81b6a6f000 r-xp 00000000 00:13 165115                     /usr/lib64/ld-2.17.so
2b81b6a6f000-2b81b6a78000 rwxp 00000000 00:00 0
2b81b6a78000-2b81b6a79000 rwxs 00000000 00:05 12332                      /dev/nvidiactl
2b81b6a79000-2b81b6a7a000 rwxs 00000000 00:05 12332                      /dev/nvidiactl
2b81b6a7a000-2b81b6a7b000 rwxs 00000000 00:05 12332                      /dev/nvidiactl
2b81b6a7b000-2b81b6a7c000 rwxs 00000000 00:05 12332                      /dev/nvidiactl
2b81b6a7c000-2b81b6a8c000 -w-s 00000000 00:05 12356                      /dev/nvidia0
2b81b6a8c000-2b81b6a8d000 r-xs 00000000 00:05 12356                      /dev/nvidia0
2b81b6a8d000-2b81b6a9d000 -w-s 00000000 00:05 220198                     /dev/nvidia1
2b81b6a9d000-2b81b6a9e000 r-xs 00000000 00:05 220198                     /dev/nvidia1========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

I want to know about my new threads, grid size, BlockSize division is correct or not?

You should not be attempting to free() stack-based variables. That is a defect in your code and has nothing to do with CUDA.

I have to make Cuda - free that array, either 1D or 3D. The Error is coming across

cudaFreeArray(d_volumeArray);
cudaCheckErrors("free fail");

This is not stack-based variable.

These are stack-based arrays:

float x[nx], y[ny], z[nz] ;

as such, these statements that you posted:

free(x);
free(y);
free(z);

are illegal (in C/C++, this has nothing to do with CUDA).

Another problem is here:

h_result = (float*) malloc(nx * ny * nz) ;

it should be:

h_result = (float*) malloc(nx * ny * nz *sizeof(float)) ;

That coding error is creating memory corruption in your program when you do this:

cudaMemcpy( h_result, d_result, nx * ny * nz * sizeof(float),cudaMemcpyDeviceToHost);
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The resultant memory corruption is giving rise to the error:

*** Error in `./sam': free(): corrupted unsorted chunks: 0x0000000003de2270 ***