Accessing Shared Memory Value by Host Shared Sum Variable

I’m trying to write a simulation of individual particles that interact with a medium with certain properties (the particles do not interact with each other, only with the medium under test). What I want to do is run a number of these simulations in parallel and then write the combined results to a file. At present I’m having trouble getting my head around how to use shared memory. Part of the results is an average of the sum of interactions by the particles, but I don’t know how to compute it. Here is an example of what I’m trying to do:

Host code:


// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil.h>
#include <>

int main (int argc, char** argv)
{
  float h_total;
  dim3 threads (3, 1, 1);
  dim3 grid (1, 1);

  h_total = 0;
  particleSim <<< grid, threads >>> (&h_total);
  printf ("h_total = %.2f\n", h_total);
}

Kernel file:


__shared__ float total;

__global__ void
particleSim(float *h_total)
{
  total = *h_total;

  *h_total = total;
}


I can compile and run the program, but it prints 0.00 where I expected at least 3. I need to figure this out because I’ll be using an overall sum to compute the average at the end of the simulation.

Unlike the example I’ve shown, I thought the way to do this was to call:

   CUDA_SAFE_CALL (cudaMemcpyToSymbol (total, &h_total,
                    sizeof (float), cudaMemcpyHostToDevice));

from the host code, but I get an error saying that the variable has not been declared, although it is clearly a global variable (in the sense that it is declared outside the main function).

I just need to know how to use a variable that can be accessed by all threads in the simulation to keep a running total. There may be a better way to do this, but it’s not clear to me from what I’ve read in the API or in the examples. The only example that looks something like this is the Monte Carlo simulation, but it does not pass the results back to the host.

I could use some suggestions.


A good example is the reduction example.

  • You cannot access shared memory from the host.

  • You can copy global memory back to the host with cudaMemcpy (using cudaMemcpyDeviceToHost) to get the results.

  • Generally, you will have something like the following at the end of a kernel:


if (threadIdx.x == 0)
  global_mem[some_index] = shared_mem[0];

This writes the resulting shared memory value out to global memory. In this case, all threads in one block have cooperated to produce the single value, and thread 0 then writes it to global memory.
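To make the pattern concrete, here is a minimal sketch of a block-level sum that ends with exactly that write to global memory, followed by the copy back to the host. It is not the SDK reduction code. The names particleSum, sumParticles, d_blockTotals and THREADS_PER_BLOCK are made up for illustration, each particle simply contributes 1.0f as a stand-in for its real result, numParticles is assumed to be greater than zero, and error checking is left out for brevity.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 128  // assumed power of two, required by the halving loop

// Hypothetical kernel: each thread contributes one value and the block
// sums those values in shared memory.
__global__ void particleSum(float *d_blockTotals, int numParticles)
{
    __shared__ float partial[THREADS_PER_BLOCK];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Placeholder for the per-particle result: every valid particle counts as 1.
    partial[tid] = (gid < numParticles) ? 1.0f : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads on each pass.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Thread 0 writes the block's result out to global memory.
    if (tid == 0)
        d_blockTotals[blockIdx.x] = partial[0];
}

// Hypothetical host helper: launches the kernel, copies the per-block totals
// back with cudaMemcpy, and adds them up on the host.
float sumParticles(int numParticles)
{
    int numBlocks = (numParticles + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    size_t bytes = numBlocks * sizeof(float);

    float *d_blockTotals = NULL;
    cudaMalloc((void **)&d_blockTotals, bytes);

    particleSum<<<numBlocks, THREADS_PER_BLOCK>>>(d_blockTotals, numParticles);

    // This blocking copy also waits for the kernel to finish.
    float *h_blockTotals = (float *)malloc(bytes);
    cudaMemcpy(h_blockTotals, d_blockTotals, bytes, cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int b = 0; b < numBlocks; ++b)
        total += h_blockTotals[b];

    free(h_blockTotals);
    cudaFree(d_blockTotals);
    return total;
}
```

With these placeholder contributions, sumParticles(3) returns 3, which is the value the question expected. Dividing the returned total by the particle count on the host would then give the average.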

It may well be that your algorithm has no use for shared memory. Could you give some more information about what you are trying to do?
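If shared memory turns out not to be needed, one possible alternative is a running total kept in global memory. The sketch below assumes a device that supports atomicAdd on float (compute capability 2.0 or later) and uses made-up names (d_total, particleSimAtomic, runParticleSimAtomic); the 1.0f contribution is again only a placeholder for the real per-particle result, and error checking is omitted. It also shows the kind of variable that cudaMemcpyToSymbol can address: one declared with __device__ rather than __shared__, since a shared variable only exists inside a running block.

```cuda
#include <cuda_runtime.h>

// Hypothetical running total in global memory, visible to every thread
// in every block and addressable from the host as a symbol.
__device__ float d_total;

// Hypothetical kernel: each thread adds its particle's contribution
// to the global total with an atomic add.
__global__ void particleSimAtomic(int numParticles)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < numParticles) {
        float interactions = 1.0f;  // placeholder for the real per-particle result
        atomicAdd(&d_total, interactions);
    }
}

// Hypothetical host helper: zeroes the total, runs the kernel,
// and reads the total back.
float runParticleSimAtomic(int numParticles)
{
    const int threadsPerBlock = 128;
    int numBlocks = (numParticles + threadsPerBlock - 1) / threadsPerBlock;

    float h_total = 0.0f;
    cudaMemcpyToSymbol(d_total, &h_total, sizeof(float), 0,
                       cudaMemcpyHostToDevice);

    particleSimAtomic<<<numBlocks, threadsPerBlock>>>(numParticles);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpyFromSymbol(&h_total, d_total, sizeof(float), 0,
                         cudaMemcpyDeviceToHost);
    return h_total;
}
```

Atomic updates to a single address are serialized, so this is likely to be slower than a shared-memory reduction when many threads contribute, but it is simpler when each thread only adds to the total occasionally.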