Can all 1024 threads in a block work concurrently?

I ran an experiment to find out how many threads can work concurrently in one block.
I used only one block, increased the block size, and measured the execution time with the CUDA profiler.
The program is a simple vector add.
blocksize-----------------------------execution time

So can I conclude that all 1024 threads can work concurrently in one block?
I also cannot understand why they can work together. My GPU is a GM107, which has 128 SPs in one SM, so how can 1024 threads work concurrently?

My code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
	// Get our global thread ID
	int id = blockIdx.x * blockDim.x + threadIdx.x;

	// Make sure we do not go out of bounds; stride by the total number of
	// threads so that fewer than n threads still cover every element
	int k;
	while (id < n) {
		// Repeat the same add 18000 times
		for (k = 0; k < 18000; k++)
			c[id] = a[id] + b[id];
		id += gridDim.x * blockDim.x;
	}
}


int main(int argc, char* argv[])
{
	// Size of vectors
	int n = 1024;

	// Host input vectors
	double *h_a;
	double *h_b;
	// Host output vector
	double *h_c;

	// Device input vectors
	double *d_a;
	double *d_b;
	// Device output vector
	double *d_c;

	// Size, in bytes, of each vector
	size_t bytes = n * sizeof(double);

	// Allocate memory for each vector on host
	h_a = (double*)malloc(bytes);
	h_b = (double*)malloc(bytes);
	h_c = (double*)malloc(bytes);

	// Allocate memory for each vector on GPU
	cudaMalloc(&d_a, bytes);
	cudaMalloc(&d_b, bytes);
	cudaMalloc(&d_c, bytes);

	int i;
	// Initialize vectors on host
	for (i = 0; i < n; i++) {
		h_a[i] = sin(i) * sin(i);
		h_b[i] = cos(i) * cos(i);
	}

	// Copy host vectors to device
	cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
	cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

	int blockSize, gridSize;

	// Number of threads in each thread block
	blockSize = 1024;

	// Number of thread blocks in grid
	gridSize = 1;

	// Execute the kernel
	vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

	// Copy array back to host
	cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

	// Sum up vector c and print the result; each element should be 1,
	// so the sum should equal n within error
	double sum = 0;
	for (i = 0; i < n; i++)
		sum += h_c[i];
	printf("final result: %f\n", sum);

	// Release device memory
	cudaFree(d_a);
	cudaFree(d_b);
	cudaFree(d_c);

	// Release host memory
	free(h_a);
	free(h_b);
	free(h_c);

	return 0;
}


Yes, all 1024 threads in a block can work concurrently, to some degree. For example they can all have instructions or transactions “in flight”. In fact, within an SM (taking into account more than 1 resident threadblock) up to 2048 threads can be “active” in this fashion.
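To put rough numbers on this, here is a minimal host-side sketch (plain C++, which also compiles as CUDA host code) of how a block size maps to warps and how many such blocks fit under the 2048-thread figure. The function names are hypothetical, the warp size of 32 is an assumption that holds for current NVIDIA GPUs, and the bound considers the thread limit only. Registers, shared memory, and the per-SM block cap can lower it further.

```cpp
#include <cassert>

// Assumed constants for illustration: a warp size of 32 and the
// 2048 resident threads per SM mentioned in the reply.
const int kWarpSize = 32;
const int kMaxThreadsPerSM = 2048;

// Number of warps a block of `threads` threads occupies.
// A partially filled warp still occupies a whole warp slot.
int warps_per_block(int threads) {
    assert(threads > 0);
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Upper bound on resident blocks per SM, from the thread limit alone.
int max_resident_blocks(int threads_per_block) {
    int threads_occupied = warps_per_block(threads_per_block) * kWarpSize;
    return kMaxThreadsPerSM / threads_occupied;
}
```

Under these assumptions, `warps_per_block(1024)` is 32 and `max_resident_blocks(1024)` is 2, which is where the 2048 "active" threads come from.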

This is a fairly typical question for someone who is unfamiliar with how GPUs work. It has been answered many times, in many places, and in fact is part of typical lecture material on GPU computing for those who are taking classes. With a bit of searching on this topic you will find a lot of materials.

In a nutshell, the number of SP “cores” in a GPU SM is not directly related to how many threads can be active, nor is it an overall limit on what can be issued in a single cycle. A GPU SM is composed of many different types of functional units, and the specific instruction to be issued dictates the type of functional unit needed to schedule that instruction. The SP “cores” are used only when the instruction is arithmetic in nature, and even then only for certain types of arithmetic instructions. Other types of instructions use other types of functional units. Furthermore, not all 1024 threads have an instruction issued in every clock cycle. The warp schedulers select instructions from “available” warps (those that are active and not stalled) to issue, and there are usually at most 4 warp schedulers in an SM, so typically at most 4 warps are actually issuing in a given clock cycle.
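The arithmetic behind that paragraph can be sketched as follows. This is only a back-of-the-envelope model with hypothetical function names. It assumes a warp size of 32, that each scheduler issues one warp instruction per cycle (ignoring dual issue), that no warp is stalled, and that one SP lane serves one thread of a warp.

```cpp
#include <cassert>

// Minimum cycles needed to issue ONE instruction for every warp of a block,
// assuming each scheduler issues one warp instruction per cycle, no stalls.
int min_issue_cycles(int threads, int schedulers = 4, int warp_size = 32) {
    assert(threads > 0 && schedulers > 0 && warp_size > 0);
    int warps = (threads + warp_size - 1) / warp_size;  // round up to whole warps
    return (warps + schedulers - 1) / schedulers;       // round up to whole cycles
}

// Number of warps whose arithmetic the SP units could accept per cycle,
// assuming one SP lane per thread of a warp.
int warps_served_per_cycle(int sp_units = 128, int warp_size = 32) {
    assert(sp_units > 0 && warp_size > 0);
    return sp_units / warp_size;
}
```

With the numbers from this thread, `min_issue_cycles(1024)` is 8 and `warps_served_per_cycle(128)` is 4. So 1024 threads are all resident at once, but a single instruction for all of them is spread over at least 8 cycles rather than executed in one.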

This is a complicated topic and I don’t expect the above paragraph to do justice to it. Again, my suggestion would be to avail yourself of resources. Here is one such example, which I consider to be relatively high quality:

Thank you for your help. Your resource is very useful.

When a block runs on an SM, it has to do other things in addition to the calculations. These include reads and writes to global or shared memory, with their associated latencies, and the latencies associated with the use of registers.

While one warp waits for data to arrive from global memory, another warp can work. After a local variable is written, it cannot be read for about 24 cycles (the exact figure depends on the architecture). All of this waiting time is used efficiently by running other warps.
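A toy simulation makes the effect visible. This is only a sketch with a hypothetical function name, not a model of the real hardware. It assumes a single scheduler that issues at most one instruction per cycle, that every instruction makes its warp wait a fixed `latency` before it can issue again, and that the scheduler picks the lowest-numbered ready warp (real schedulers use other policies).

```cpp
#include <cassert>
#include <vector>

// Count the instructions one scheduler issues over `cycles` cycles when every
// instruction makes its warp wait `latency` cycles before it can issue again.
int issued_instructions(int num_warps, int latency, int cycles) {
    assert(num_warps > 0 && latency > 0 && cycles >= 0);
    std::vector<int> ready_at(num_warps, 0);  // cycle at which each warp may issue
    int issued = 0;
    for (int cycle = 0; cycle < cycles; ++cycle) {
        for (int w = 0; w < num_warps; ++w) {
            if (ready_at[w] <= cycle) {       // lowest-numbered ready warp wins
                ready_at[w] = cycle + latency;
                ++issued;
                break;                        // at most one issue per cycle
            }
        }
    }
    return issued;
}
```

With a latency of 24 cycles over 240 cycles, one warp issues only 10 instructions, 8 warps issue 80, and 24 or more warps keep the scheduler busy in every cycle (240 instructions). That is the sense in which a large block covers the waiting time.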