Optimal number of cores for a kernel, available memory: a few questions from a beginner

Hi, I’ve just started playing with CUDA - I wrote and compiled my first program yesterday, and I have two questions.

  1. What is the optimal number of cores per kernel?

My program computes the average and variance of an array of integers using reduction. Basically, it does something like this:

int threads = 10;

int per_thread = 1000;

int N = 100000000;

int blocks = N / threads / per_thread;

compute_average<<<blocks, threads>>>(...);

In the actual code I do the computations a bit differently to handle the case when the array size (N) is not a multiple of threads*per_thread.
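For illustration, the rounding-up could be done with ceiling division on the host side. This is only a minimal sketch, not the actual code: the helper name `blocks_needed` is hypothetical, and it assumes each block covers exactly threads*per_thread elements.

```cpp
#include <cassert>

// Hypothetical helper (not the actual code from my program): returns the
// smallest number of blocks such that blocks * threads * per_thread >= n.
long long blocks_needed(long long n, int threads, int per_thread) {
    assert(threads > 0 && per_thread > 0);
    // Elements covered by one block; computed in 64 bits to avoid overflow.
    long long per_block = static_cast<long long>(threads) * per_thread;
    // Ceiling division: round up so a partial last block is still launched.
    return (n + per_block - 1) / per_block;
}
```

With threads = 10 and per_thread = 1000, N = 100000000 gives 10000 blocks and N = 100000001 gives 10001. Because the last block may then cover indices past the end of the array, the kernel has to check each index against N before reading.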

I was wondering how much data and how many threads I should assign to each block. Because my GPU has 4x48 cores (GTS450), I assumed the best value would be 48 threads. To my surprise, lowering the number to 8 significantly improved performance. Why does that happen? Is it somehow related to warps?

  2. How much device memory is available for data?

My GPU has 1024MB of RAM, but when I tried to copy that much data to it, the operation failed. I guess I can’t use all of the memory for data, but is there a way to find out how much memory is available on the device? What happens if there are multiple CUDA programs running, each allocating memory on the device?