Sending 100000 Samples to a CUDA function

Hi, this is a really simple question, so hopefully someone can provide me with the answer.

I have the following C code and I’m looking to use CUDA’s parallelism to speed it up. How do I call the CUDA function correctly?

for (i = 0; i < nimage; i++) {
    (void) cuda_function(nsamps, do_s, do_r, tau, vi2,
                         dts, dtr, dsr, dss, sray, nf, d2p, fn, rdfs, sire, nsre,
                         din, out, out1, txs2, txr2, txs, txr, i);
}

cheers,

Chris

Okay, I didn’t get any success with my first post, so I thought I would try again.

dim3 threads(WHAT SHOULD THIS VALUE BE?);
dim3 grid(WHAT SHOULD THIS VALUE BE?);

sumimg_InCUDA<<<grid, threads>>>(nsamps, do_s, do_r, tau, vi2,
                                 dts, dtr, dsr, dss, sray, nf,
                                 d2p, fn, rdfs, sire, nsre, din,
                                 out, out1, txs2, txr2, txs, txr);

My block size is 16 and, as I said, there are about 1,000,000 images to be processed. The parallelism is the fact that each of these images can be processed in parallel.

Please feel free to ask any other questions. Any help would be great.

Cheers,

Chris

I’m not sure I understand how you are setting up your computation. If you had just one big image, the natural way of doing it is splitting it into subimages that are processed by blocks, where individual threads work on individual pixels. There are some examples in the SDK that work this way.

If possible (i.e. they fit in memory), you would load all those images and set up a similar configuration, i.e. treat all your images together as one big image, respecting boundaries to avoid reading wrong data. If they do not fit in memory, you can always process them in batches, as kernel launch overhead is minimal for such computations.