I finally managed to write my first program in CUDA :) (it finds the max element in an array).
I ran into a few questions while writing this example:
Whenever I have a multiplication or division by 2, should I use a shift instead of mul or div? Does this also apply to for statements?
e.g.: for(…; …; currentDepth = currentDepth / 2)
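To make the loop question concrete, here is a minimal host-side sketch in plain C of the two forms I am comparing. The function names are hypothetical and I am assuming currentDepth is a non-negative int; for non-negative values both forms visit exactly the same sequence, so the only open question is speed.

```c
#include <assert.h>

/* Hypothetical helper: count iterations when halving with division. */
int steps_with_div(int depth) {
    assert(depth >= 0);               /* assumption: depth is non-negative */
    int steps = 0;
    for (int currentDepth = depth; currentDepth > 0; currentDepth = currentDepth / 2) {
        steps++;
    }
    return steps;
}

/* Hypothetical helper: same loop, halving with a right shift instead. */
int steps_with_shift(int depth) {
    assert(depth >= 0);               /* shifts and division can differ for negative values */
    int steps = 0;
    for (int currentDepth = depth; currentDepth > 0; currentDepth >>= 1) {
        steps++;
    }
    return steps;
}
```

Calling steps_with_div(256) and steps_with_shift(256) gives the same count, since both walk 256, 128, ..., 1.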
If a graphics card has 256MB of RAM, how is this memory split between texture memory and global memory (global memory is usually called device memory, if I am right)?
What would be the fastest way to find the max element in a 256MB array? Would it be better to pass the array as a texture, as a cudaArray, or in some other way (and where would the array then be located: global, texture, device… memory)?
Is measuring time like this OK for the GPU and the CPU, or not:
findMaxKernel<<<dimGrid, dimBlock>>>(Ad, BLOCK_SIZE, BLOCK_SIZE, Cd);
printf("GPU: %ld\n", stopTime.tv_usec-startTime.tv_usec);
(In the kernel I used __syncthreads(); as (almost) the last command.)
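For reference, this is roughly the host-side timing pattern I mean, as a plain C sketch. I am assuming startTime and stopTime come from gettimeofday(); work_placeholder() is a hypothetical stand-in for the kernel launch, and elapsed_usec() and time_work_usec() are helper names I made up. One thing I noticed while writing it down: subtracting only tv_usec gives a wrong (possibly negative) value when the measurement crosses a second boundary, so the helper also uses tv_sec.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Elapsed microseconds between two gettimeofday() samples,
   taking both the seconds and the microseconds fields into account. */
long elapsed_usec(struct timeval start, struct timeval stop) {
    /* gettimeofday() returns tv_usec in the range [0, 1000000) */
    assert(start.tv_usec >= 0 && start.tv_usec < 1000000);
    assert(stop.tv_usec >= 0 && stop.tv_usec < 1000000);
    return (long)(stop.tv_sec - start.tv_sec) * 1000000L
         + (long)(stop.tv_usec - start.tv_usec);
}

/* Hypothetical stand-in for the work being timed (not my real kernel). */
static void work_placeholder(void) {
    volatile long sink = 0;
    for (long i = 0; i < 1000000; i++) {
        sink += i;
    }
}

/* Time one run of the placeholder work and return microseconds. */
long time_work_usec(void) {
    struct timeval startTime, stopTime;
    gettimeofday(&startTime, NULL);
    /* In the CUDA version the kernel launch goes here. Kernel launches
       return to the host before the GPU finishes, so the host would have
       to wait for completion before taking stopTime. */
    work_placeholder();
    gettimeofday(&stopTime, NULL);
    return elapsed_usec(startTime, stopTime);
}
```

Usage would be something like printf("GPU: %ld\n", time_work_usec()); with the placeholder replaced by the real launch.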
When I tried this example in emurelease mode I got wrong results. Is this probably due to the serial execution of threads? (But should that still be the case when I use __syncthreads()?)
When exactly do bank conflicts occur: only when writing, or also when reading?
Is there anything else I should pay special attention to?
Thanks for answers :)