Simplest programming environment (editor) for CUDA?

Is there any reason not to just put everything in one .cu file?

For example, in my case, I have a simulator I wrote in regular ol' C. I'm curious to see how using the parallelism of the GPU can speed things up. All I really did was copy my C code into the .cu template, erasing all kernel calls and all "CUDA stuff". So at that point, it was just C code but saved as a .cu file. I just wanted to make sure it would compile and run, and sure enough, it did.

Now, I simply plan to add the kernel call and the kernel code to this file.
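For what it's worth, a single-file .cu layout like that usually ends up looking roughly like this (a minimal sketch; the kernel name and data are made up for illustration, not your simulator's actual code):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: squares each element of the array in place.
__global__ void squareKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the last block may be partial
        d_data[i] *= d_data[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Allocate device memory and copy the input up.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Execution configuration: round the grid size up so every
    // element gets a thread.
    int blockSize = 256;
    int nblocks = (n + blockSize - 1) / blockSize;
    squareKernel<<<nblocks, blockSize>>>(d_data, n);

    // Copy the result back and clean up.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("h_data[3] = %f\n", h_data[3]);
    return 0;
}
```

The rest of your existing C code can stay in the same file alongside main(); nvcc compiles the host portions as ordinary C/C++ and only the __global__/__device__ parts as device code.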

Is that a wrong way of going about it?

If this code is just for you and a single file is the easiest way to work with it, then a single file is fine. Larger projects are typically split into several files for ease of reuse and maintenance. Right and wrong in programming style is more of a philosophical debate.

Thanks for the answer on that.

Can anyone help me with the last two posts on page 1?

I asked several questions regarding the execution configuration of the kernel (about nblocks and blockSize).

Thanks.

The generally accepted optimal block size is 256 threads, because it divides evenly into both 512 (the maximum threads per block) and 768 (the maximum resident threads per multiprocessor on early hardware), which ensures good compatibility with both old and new hardware. You could also drop to 128, 64, 32, etc. if each thread needs more registers or whatever.
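Concretely, the usual pattern is to fix blockSize and derive nblocks from the dataset size, with a bounds check inside the kernel since the last block may be partial (kernel name and N are just placeholders here):

```cuda
#include <stdio.h>

// Hypothetical kernel: one thread per element, with a guard for
// the final, possibly partial, block.
__global__ void myKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] += 1.0f;
}

int main(void)
{
    int N = 1000000;       // example dataset size
    int blockSize = 256;   // multiple of the 32-thread warp; divides 512 and 768
    int nblocks = (N + blockSize - 1) / blockSize;   // round up: 3907 here

    printf("launching %d blocks of %d threads\n", nblocks, blockSize);
    // myKernel<<<nblocks, blockSize>>>(d_data, N);  // once d_data is allocated
    return 0;
}
```

The round-up division means the grid always covers N even when N is not a multiple of blockSize; the `if (i < n)` guard keeps the extra threads from running off the end of the array.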

If your dataset is very large (like the test you described), I would imagine that you would need to break the dataset up and run a large (max size, or close to it) grid on part 1, then another grid on part 2, and so on, until the data was reduced down a few levels. Using the reduction example:

If blocksize = 256x1x1 and gridsize = 32768x1x1, then each grid will process 8,388,608 elements (2^8 * 2^15 = 2^23). If you use the method where each thread loads two elements (one of the optimizations), then you will get another factor of 2 (2^8 * 2^15 * 2 = 2^24 = 16,777,216 ~ 16.7M).

If you wanted to do 2^25 elements, you could run the kernel once (with the listed parameters) on the first set of data to reduce its size by half, then run it once on the second set of data to reduce its size by half, then run the looped version on the two combined halves to reduce it the rest of the way.

If you want to do even larger datasets, you'd have to extend this process to run a reduction of grids (so to speak); once the dataset has been reduced enough that whatever is left fits in one grid, you can loop to reduce it the rest of the way. Obviously, you'd have to make some changes to the code in the SDK in order to combine the results from the grids, and then change the 'standard' reduction to read the results of the grid reduction (since they would have a different memory location/spacing than the standard kernel produces).
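That multi-pass idea can be sketched roughly like this, assuming a reduction kernel in the style of the SDK sample (each thread loads two elements, each block writes one partial sum); the names and bookkeeping are illustrative, not the actual SDK code:

```cuda
#include <cuda_runtime.h>

// Hypothetical SDK-style reduction kernel: each thread loads two
// elements, the block reduces them in shared memory, and thread 0
// writes one partial sum per block to d_out.
__global__ void reduceKernel(const float *d_in, float *d_out, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;

    float v = 0.0f;
    if (i < n)              v += d_in[i];
    if (i + blockDim.x < n) v += d_in[i + blockDim.x];
    s[tid] = v;
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) d_out[blockIdx.x] = s[0];
}

// Host-side driver: keep launching passes until one value remains.
// Note: ping-pongs between the two buffers, so both get overwritten.
// If a pass ever needs more blocks than one grid allows, you'd add
// an outer loop here that runs the kernel on one chunk at a time.
float reduceAll(float *d_in, float *d_out, int n)
{
    const int blockSize = 256;
    while (n > 1) {
        int blocks = (n + 2 * blockSize - 1) / (2 * blockSize);
        reduceKernel<<<blocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, n);

        n = blocks;                                    // one partial sum per block
        float *tmp = d_in; d_in = d_out; d_out = tmp;  // swap buffers for next pass
    }
    float result;
    cudaMemcpy(&result, d_in, sizeof(float), cudaMemcpyDeviceToHost);
    return result;
}
```

Each pass shrinks the dataset by a factor of 2 * blockSize (512 here), so even very large arrays collapse to a single value in a handful of launches; the buffer-swapping is what lets each pass read the previous pass's partial sums.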