Accessing the values of a float4 array: trouble understanding and writing a kernel

Hello everybody !!!

I have a float4 array, float4* octree, of size const unsigned int octreeSize. My first problem is writing a kernel that can access any element of the array.

My function launchKernel is declared like this:

int launchKernel( float4* octree, const unsigned int octreeSize )
{
   cudaSetDevice( 0 );

   /* host-side allocation for the result */
   float4* h_resultat = ( float4* ) malloc( sizeof( float4 ) );

   /* GPU memory allocation */
   float4* d_octree;
   const int size = octreeSize * sizeof( float4 );
   float4* d_resultat;

   /* allocate GPU memory and copy the octree into it */
   cudaMalloc( ( void** ) &d_octree, size );
   cudaMemcpy( d_octree, &octree, size, cudaMemcpyHostToDevice );

   /* device-side allocation for the result */
   cudaMalloc( ( void** ) &d_resultat, sizeof( float4 ) );

   mykernel <<< , >>>( d_octree, d_resultat );
   cudaMemcpy( h_resultat, d_resultat, sizeof( float4 ), cudaMemcpyDeviceToHost );

   cudaFree( d_octree );
   cudaFree( d_resultat );
   free( h_resultat );

   return 0;
}

The kernel call and the copy of the result are temporary code (the float4 result is just a single node of the octree).

  1. How can I access any element of the array (octree) from a kernel?

  2. What goes between the triple angle brackets <<< >>>?

  3. I also have another problem: when I generate an octree with a depth of 2, the program runs without errors, but when I increase the depth to 3, I get a segmentation fault. The segfault comes from this line:

cudaMemcpy( d_octree, &octree, size, cudaMemcpyHostToDevice );

Thanks for your help.

Hi,

Answering question 2 first; sorry, I'm leaving question 3 aside.

A GPU with CUDA is a great fit when you have an operation that must be applied to thousands or millions of cells in parallel, with one thread processing the data for each cell. (That's the simplest approach, but many other things can be done.) So if you had 10000 octrees and processed them with 10000 threads, you would split the 10000 threads into blocks of, say, 32 threads, and would want code like this:

dim3 dimBlock( 32 ); 

dim3 dimGrid( (10000+32)/32 );

then when you call the kernel use

mykernel<<< dimGrid, dimBlock >>>(d_octree, d_resultat );

NB: it's usually more efficient to use blocks of 32 threads or a multiple of 32 (the warp size), but other sizes work too.
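A small host-side sketch of the block-count arithmetic above, written in plain C++ so it compiles without nvcc. The helper name blocksFor is hypothetical and not part of CUDA. It uses the common round-up form (n + threadsPerBlock - 1) / threadsPerBlock, which gives the same 313 blocks as (10000+32)/32 for this case but avoids launching a spare block when n is an exact multiple of the block size. Either way the grid can hold more threads than cells, so a kernel should compare its index against the array size before touching memory.

```cpp
#include <cassert>

// Hypothetical helper (not a CUDA API): number of blocks needed so that
// blocks * threadsPerBlock >= n, using integer round-up division.
unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

With such a helper, the grid could be declared as dim3 dimGrid( blocksFor( 10000, 32 ) ); instead of writing the expression by hand.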

Question 1:

Above, a 1D array of 10000 cells was allocated; to access a cell, just use

float4 a = h_resultant[cellNum]; 

// and the reverse

  h_resultant[ cellNum ] = a;

or float a = h_resultant[cellNum].x; to access just the first float of the float4 (.x, .y, .z for the first three components, .w for the fourth)

and h_resultant[ cellNum ].x = a; for the reverse.

Now, if you do have 10000 octrees and the threads are split into blocks of 32 threads each, you will want the following

int octreeNum = threadIdx.x + blockIdx.x * blockDim.x; // global thread index within the entire grid, computed from the block index and the thread index within its block

and then

int cellNum = octreeNum*MaxOctreeDepth + currentDepth; // so if you have 10000 octrees and MaxOctreeDepth is 10, the 1st octree takes the first 10 cells of d_octree,… and d_octree holds 100,000 cells
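To make the index arithmetic concrete, here is a CPU-only emulation in plain C++ (an illustration, not CUDA code). The names globalIndex and emulateLaunch are made up for this sketch, and the nested loops stand in for what the GPU does in parallel. It shows that each (block, thread) pair maps to a distinct global index and that the surplus threads in the last block must be skipped with a bounds check.

```cpp
#include <cassert>
#include <vector>

// Same formula as in the kernel: threadIdx.x + blockIdx.x * blockDim.x.
unsigned int globalIndex(unsigned int threadIdx, unsigned int blockIdx, unsigned int blockDim)
{
    return threadIdx + blockIdx * blockDim;
}

// Hypothetical CPU stand-in for a 1D kernel launch: counts how often each of
// the n cells is visited by a grid of gridDim blocks of blockDim threads.
std::vector<int> emulateLaunch(unsigned int n, unsigned int gridDim, unsigned int blockDim)
{
    assert(gridDim * blockDim >= n);  // the grid must cover every cell
    std::vector<int> visits(n, 0);
    for (unsigned int b = 0; b < gridDim; ++b) {
        for (unsigned int t = 0; t < blockDim; ++t) {
            unsigned int i = globalIndex(t, b, blockDim);
            if (i < n) {              // bounds check: surplus threads do nothing
                ++visits[i];
            }
        }
    }
    return visits;
}
```

For 10000 cells and 313 blocks of 32 threads, the grid holds 10016 threads; the last 16 fail the bounds check and every cell is visited exactly once.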

Hope this helps. I think it's a little off track from what you are doing, sorry.

Pardon any typos I may have made.

kbam

Thanks for your explanations ;)

But I still have the problem: for example, how do I access the 2nd cell of my array?

Here is my sample code:

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <external_dependency.h>

__global__ void kernel_test( float4* octree, float4* d_resultat )
{
   int index = threadIdx.x + blockIdx.x * blockDim.x;

   if ( index == 2 ) {
      *d_resultat = *( octree + index );
   }
}

int launchKernel( float4* octree, const unsigned int octreeSize )
{
   /*for ( unsigned int i = 0; i < octreeSize; ++i )
      printf( "x: %.2f, y: %.2f, z: %.2f, w: %.2f \n", octree[i].x, octree[i].y, octree[i].z, octree[i].w );*/

   cudaSetDevice( 0 );

   dim3 dimBlock( 32 );
   dim3 dimGrid( ( octreeSize + 32 ) / 32 );

   /* host-side allocation for the result */
   float4* h_resultat = ( float4* ) malloc( sizeof( float4 ) );

   /* GPU memory allocation */
   float4* d_octree;
   const int size = octreeSize * sizeof( float4 );
   float4* d_resultat;

   /* allocate GPU memory and copy the octree into it */
   cudaMalloc( ( void** ) &d_octree, size );
   cudaMemcpy( d_octree, &octree, size, cudaMemcpyHostToDevice );

   /* device-side allocation for the result */
   cudaMalloc( ( void** ) &d_resultat, sizeof( float4 ) );

   kernel_test<<<dimGrid, dimBlock>>>( d_octree, d_resultat );
   cudaMemcpy( h_resultat, d_resultat, sizeof( float4 ), cudaMemcpyDeviceToHost );

   printf("x: %f, y: %f, z: %f, w: %f \n", h_resultat->x, h_resultat->y, h_resultat->z, h_resultat->w );

   cudaFree( d_octree );
   cudaFree( d_resultat );
   free( h_resultat );

   return 0;
}

And I still have the seg fault problem when I increase the depth of the octree (when the depth is greater than 2).

Finally, my program works!
In fact, I just noticed that cudaMemcpy( d_octree, &octree, size, cudaMemcpyHostToDevice ); is wrong: I changed &octree to octree.
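The reason &octree fails can be shown on the host alone. In this plain C++ sketch, Float4 is a stand-in struct (an assumption, so the example compiles without the CUDA headers), std::memcpy plays the role of cudaMemcpy, and copyNodes and makeNodes are hypothetical helper names. octree is already the address of the first node, whereas &octree is the address of the pointer variable itself, which is only sizeof(float4*) bytes long. Copying size bytes from there reads past that variable into unrelated memory, which is undefined behaviour and may only crash once size grows large enough; that would be consistent with the segfault showing up only at greater depths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for CUDA's float4 (assumption for this host-only sketch).
struct Float4 { float x, y, z, w; };

// Copies n nodes out of the array that octree points to.
// std::memcpy stands in for cudaMemcpy( d_octree, octree, size, ... ).
std::vector<Float4> copyNodes(const Float4* octree, std::size_t n)
{
    std::vector<Float4> dst(n);
    const std::size_t size = n * sizeof(Float4);
    // Correct source: octree (the array). Passing &octree here would instead
    // read from the pointer variable itself, which holds only one address.
    if (size > 0) std::memcpy(dst.data(), octree, size);
    return dst;
}

// Small helper for the usage example: builds n nodes with x set to the index.
std::vector<Float4> makeNodes(std::size_t n)
{
    std::vector<Float4> nodes(n);
    for (std::size_t i = 0; i < n; ++i) nodes[i] = Float4{ float(i), 0.f, 0.f, 1.f };
    return nodes;
}
```

Copying eight nodes built by makeNodes and reading element 2 of the result gives back the node whose x is 2, which is what the kernel's index == 2 branch is meant to fetch on the device.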

Thanks !!!