Problem with cudaMalloc & indexes

Hi everybody!

Two problems:

  1. I allocate, with cudaMalloc, a float3 array of 8192 elements on the device, and the same array on the host. But when I try to do a memcpy from the host to the device, I get a “core dumped” error. Any idea what could be causing it?

  2. I have been using CUDA for two months, but I still have trouble computing my indexes. For example, I have a 1x16 grid with 32x1x16 blocks. How can I compute my indexes to access the good elements in my arrays?

Thanks for your help!

Have a nice day.

It almost certainly means you are mixing up device and host pointers or getting the sizes wrong. The text I added emphasis to has me intrigued. If you mean that literally, it could well be the source of your problem. Some simple, concise code which illustrates your problem will probably get you a more constructive answer.
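For reference, here is a minimal sketch of the allocate-and-copy pattern for the 8192-element float3 array described in the question. It is not the original poster's code: `check` and `roundTrip` are hypothetical names made up for this illustration. It assumes the usual causes of a crash at this point: passing an element count where a byte count is expected, or touching a device pointer with plain host `memcpy` instead of `cudaMemcpy`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Copies n float3 values host -> device -> host; returns true if they survive.
bool roundTrip(size_t n)
{
    const size_t bytes = n * sizeof(float3);  // sizes are in bytes, not elements

    float3 *h_in  = static_cast<float3 *>(std::malloc(bytes));
    float3 *h_out = static_cast<float3 *>(std::malloc(bytes));
    if (h_in == nullptr || h_out == nullptr) {
        std::free(h_in);
        std::free(h_out);
        return false;
    }
    for (size_t i = 0; i < n; ++i)
        h_in[i] = make_float3(static_cast<float>(i), 0.0f, 0.0f);

    // d_buf is a device pointer: it must only be passed to CUDA calls or
    // kernels, never dereferenced or given to plain memcpy on the host.
    float3 *d_buf = nullptr;
    check(cudaMalloc(reinterpret_cast<void **>(&d_buf), bytes), "cudaMalloc");
    check(cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice), "copy to device");
    check(cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost), "copy to host");

    bool ok = true;
    for (size_t i = 0; i < n; ++i)
        if (h_out[i].x != h_in[i].x)
            ok = false;

    check(cudaFree(d_buf), "cudaFree");
    std::free(h_in);
    std::free(h_out);
    return ok;
}

int main()
{
    std::printf(roundTrip(8192) ? "round trip ok\n" : "mismatch\n");
    return 0;
}
```

Checking the return value of every runtime call tends to turn a bare crash into a readable error message, which usually narrows the problem down quickly.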

Impossible to say. What are you trying to do? What does “good elements” mean? Some simple, concise code which illustrates your problem will probably get you a more constructive answer.

I fixed my “core dumped” problem. It was just an index error in the host code.

Okay! First, I will explain the context. I have a big 3D cube: 33 points on the i-axis, 17 points on the j-axis and 17 points on the k-axis. That means I have S = { 8192 little cubes } inside the big one (32 x 16 x 16 = 8192). Each cube of S has 8 vertices called R1, R2, R3, R4, R5, R6, R7 and R8. All the R1 vertices are stored in a single float3 array called R1 (ordered first along the i-axis, then the j-axis, and then the k-axis). The same goes for R2 → R8.

I need to compute the skewness of each cube of S. So, for my kernel, I use a 1x16 grid with 32x1x16 blocks. So, my index is :

  int gdy = gridDim.y;
  int bdx = blockDim.x, bdz = blockDim.z;
  int by = blockIdx.y;
  int tx = threadIdx.x, tz = threadIdx.z;
  int idx = by * bdx + tx + gdy * bdz * tz;

With this index (which represents a cube), I take the corresponding vertices (in the R1 → R8 arrays) to compute the skewness. But the kernel returns the correct value (i.e. 1.000000) only for the first 4352 cubes, and returns very strange values for the others (like -nan or -35591945656085774336.000000).

I don’t know if it is really clear, but I hope it will help you to help me ^^
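One possible explanation for the 4352 figure, assuming the launch configuration and the index formula are exactly as posted: with gridDim.y = 16, blockDim.x = 32 and blockDim.z = 16, the largest value the formula can produce is 15*32 + 31 + 16*16*15 = 4351, so output slots 4352 to 8191 would never be written. The host-only sketch below enumerates every (blockIdx.y, threadIdx.x, threadIdx.z) combination to check this. `postedIndex` and `countWritten` are hypothetical names used only for this illustration, and the code needs no GPU to run.

```cuda
#include <cstdio>
#include <vector>

// The index formula exactly as posted:
// idx = by * bdx + tx + gdy * bdz * tz
int postedIndex(int by, int tx, int tz, int bdx, int bdz, int gdy)
{
    return by * bdx + tx + gdy * bdz * tz;
}

// Counts how many distinct slots in [0, n) the posted formula reaches
// for a 1 x gdy grid of bdx x 1 x bdz blocks.
int countWritten(int n, int gdy, int bdx, int bdz)
{
    std::vector<char> hit(n, 0);
    for (int by = 0; by < gdy; ++by)
        for (int tz = 0; tz < bdz; ++tz)
            for (int tx = 0; tx < bdx; ++tx) {
                int idx = postedIndex(by, tx, tz, bdx, bdz, gdy);
                if (idx >= 0 && idx < n)
                    hit[idx] = 1;
            }

    int count = 0;
    for (char h : hit)
        count += h;
    return count;
}

int main()
{
    // Prints "slots written: 4352 of 8192"
    std::printf("slots written: %d of %d\n", countWritten(8192, 16, 32, 16), 8192);
    return 0;
}
```

If this reading is right, the strange values for the remaining cubes would simply be uninitialized memory, and several threads would also be writing to the same slot.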

Your indexing scheme looks incorrect to me. Threads within a block are arranged in column-major order, so the thread index within a block should be calculated like this:

tidx = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;

the “global” thread index within the grid is then calculated in the same way as

gidx = tidx + blockIdx.x*(blockDim.x*blockDim.y*blockDim.z) + blockIdx.y*(gridDim.x*blockDim.x*blockDim.y*blockDim.z);

That gidx is a unique, correctly ordered thread index on the 2D grid of 3D blocks. How that relates to your storage depends on how you have ordered it (row or column major order), and whether it is padded or contains alignment bytes.
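As a rough check of the two formulas above, the sketch below launches the configuration from this thread (a 1x16 grid of 32x1x16 blocks, 8192 threads in total) and has every thread increment the slot matching its own gidx. If the index is unique and covers the whole range, every slot ends up at exactly 1. The names `globalThreadIndex`, `markSlots` and `everySlotHitOnce` are hypothetical and exist only for this illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Linear index of the calling thread on a 2D grid of 3D blocks,
// following the tidx / gidx formulas given above.
__device__ int globalThreadIndex()
{
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int tidx = threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    int gidx = tidx
             + blockIdx.x * threadsPerBlock
             + blockIdx.y * (gridDim.x * threadsPerBlock);
    return gidx;
}

// Each thread increments the slot that matches its own global index.
__global__ void markSlots(int *hits, int n)
{
    int gidx = globalThreadIndex();
    if (gidx < n)                    // guard in case the launch overshoots n
        atomicAdd(&hits[gidx], 1);
}

// Returns true if every one of the 8192 slots was hit exactly once.
bool everySlotHitOnce()
{
    const int n = 8192;
    const dim3 grid(1, 16);          // 1 x 16 blocks
    const dim3 block(32, 1, 16);     // 512 threads per block

    int *d_hits = nullptr;
    if (cudaMalloc(reinterpret_cast<void **>(&d_hits), n * sizeof(int)) != cudaSuccess)
        return false;
    cudaMemset(d_hits, 0, n * sizeof(int));

    markSlots<<<grid, block>>>(d_hits, n);

    // The blocking copy waits for the kernel and reports any launch error.
    std::vector<int> hits(n);
    cudaError_t err = cudaMemcpy(hits.data(), d_hits, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_hits);
    if (err != cudaSuccess)
        return false;

    for (int h : hits)
        if (h != 1)
            return false;
    return true;
}

int main()
{
    std::printf(everySlotHitOnce() ? "every slot hit once\n"
                                   : "index is not a one-to-one mapping\n");
    return 0;
}
```

Whether gidx can be used directly as the cube index still depends on how the R1 → R8 arrays are laid out. With i varying fastest, as described earlier in the thread, it would match only if threadIdx.x runs along the i-axis and the remaining dimensions follow j and then k.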

Thank you very much!!

That solved my problem. Now I will optimize my program!!