I'm taking my first steps in CUDA, and as a test I want to make a kernel call using all the threads my graphics card (a 9600GT) supports.
To get the number of threads, I use the runtime API in the host code:
cudaDeviceProp Dispositivo; // defined as a global variable
cudaGetDeviceProperties(&Dispositivo, 0); // get the properties of device 0
So the field Dispositivo.maxThreadsPerBlock gives me the maximum number of threads I can use per block.
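Put together, the query I'm doing looks roughly like this (reading the properties of device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

cudaDeviceProp Dispositivo; // defined as a global variable

int main()
{
    cudaGetDeviceProperties(&Dispositivo, 0); // properties of device 0
    printf("maxThreadsPerBlock = %d\n", Dispositivo.maxThreadsPerBlock);
    return 0;
}
```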
Using the DeviceInfor tool from the CUDA SDK, I know this number is 512.
If I launch the kernel with that many threads per block, the code doesn't work.
If I fix the number of threads per block at 128, everything works fine.
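Here is a minimal sketch of the kind of launch I'm attempting (the kernel is just a placeholder, my real one does more work), including a check on the launch error, which is how I noticed the failure:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel: each thread writes its global index into an array
__global__ void testKernel(int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = idx;
}

int main()
{
    cudaDeviceProp Dispositivo;
    cudaGetDeviceProperties(&Dispositivo, 0);

    const int threads = Dispositivo.maxThreadsPerBlock; // 512 on my 9600GT
    const int blocks  = 4;                              // arbitrary for the test

    int *d_out;
    cudaMalloc((void**)&d_out, blocks * threads * sizeof(int));

    testKernel<<<blocks, threads>>>(d_out);

    // check whether the launch itself failed
    cudaError_t err = cudaGetLastError();
    printf("launch: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```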
What am I doing wrong?
Also, when I allocate memory on the device with cudaMalloc, where is it stored: in global memory or in shared memory?
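For reference, this is how I'm allocating the memory (the size is just an example):

```cuda
#include <cuda_runtime.h>

int main()
{
    float *d_data;                        // device pointer held on the host
    size_t bytes = 1024 * sizeof(float);  // example allocation size
    cudaMalloc((void**)&d_data, bytes);   // is this in global or shared memory?
    cudaFree(d_data);
    return 0;
}
```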