I am having trouble understanding threads, so I am posting the code I am working on. My card is a 9200M GS with compute capability 1.1, which I understand to mean a limit of 768 threads.
In the code I have 2D arrays of size NxN, which I copy from host to device and then back to host.
The problem is that when I set N > 768, there are unpredictable values in the last row of my results. Most importantly, my Gnome desktop crashes.
I am providing working code. With N = 768, blocks = 2, threads per block = 384, it works. With N = 768, blocks = 3, threads per block = 256, it also works.
But if I set N = 1024, blocks = 4, threads per block = 256, Gnome crashes and the last rows contain garbage.
I understand that I cannot spawn more than 768 threads on a compute capability 1.1 card, but is there another way to make the code work with N > 768? Do I have to change the kernel so that some threads do the work twice, so that if thread 1 operates on array[1] it also handles array[1+768]?
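To make the "some threads do the work twice" idea concrete, here is a minimal host-only C sketch of the loop I have in mind. It only emulates the indexing on the CPU and does not touch the GPU. The function name `rows_covered_once` and the variable `totalThreads` are my own names for illustration, and I am assuming the stride would be the total number of launched threads (blocks * threads per block).

```c
#include <stdlib.h>

/* Host-only emulation of the idea: each "thread" idx starts at row idx and
 * then jumps ahead by the total number of launched threads until it passes N.
 * Returns 1 if every row in [0, N) is visited exactly once, 0 otherwise,
 * or -1 if the bookkeeping array cannot be allocated. */
int rows_covered_once(int N, int blocks, int threadsPerBlock)
{
    int totalThreads = blocks * threadsPerBlock;
    int *visits = (int *)calloc((size_t)N, sizeof(int));
    int ok = 1;
    if (visits == NULL) return -1;

    for (int block = 0; block < blocks; block++) {
        for (int thread = 0; thread < threadsPerBlock; thread++) {
            /* same formula as idx in the kernel */
            int idx = block * threadsPerBlock + thread;
            /* in the kernel, each pass of this loop would compute one row */
            for (int row = idx; row < N; row += totalThreads)
                visits[row]++;
        }
    }
    for (int row = 0; row < N; row++)
        if (visits[row] != 1) ok = 0;
    free(visits);
    return ok;
}
```

For example, `rows_covered_once(1024, 3, 256)` should return 1: thread 0 would handle rows 0 and 768, thread 255 would handle rows 255 and 1023, and thread 256 would handle only row 256. Whether this is the right way to restructure the real kernel is exactly what I am asking.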
I have checked: it works with N = 800, but at N = 1000 it gives garbage, and now I can barely type because Gnome's display is badly corrupted (I am also using Compiz).
Q1.1. When I set threads per block = 500 and blocks = 2, what would you expect this statement in the kernel to return on a compute capability 1.1 card?
int idx = blockIdx.x * blockDim.x + threadIdx.x;
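To show what I expect from the arithmetic alone, here is a small host-only C sketch. The helper names `global_index` and `active_threads` are mine, made up for illustration. It only reproduces the formula on the CPU, so it says nothing about what the hardware actually does.

```c
/* Global index computed by one thread; mirrors
 * blockIdx.x * blockDim.x + threadIdx.x in the kernel. */
int global_index(int blockIdx_x, int blockDim_x, int threadIdx_x)
{
    return blockIdx_x * blockDim_x + threadIdx_x;
}

/* Number of launched threads whose index is below N, i.e. the threads
 * that would pass the `if (idx < N)` test in the kernel. */
int active_threads(int blocks, int threadsPerBlock, int N)
{
    int count = 0;
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < threadsPerBlock; t++)
            if (global_index(b, threadsPerBlock, t) < N)
                count++;
    return count;
}
```

By this arithmetic, block 0 would produce indices 0 to 499 and block 1 would produce 500 to 999, so with N = 768 exactly 768 of the 1000 threads would pass the `idx < N` check. My question is whether the card really behaves this way.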
I would appreciate any help.
Q2. Is the kernel<<<…>>>() call a blocking statement, or does my host code continue to run independently?
#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
__global__ void multi( int *M1, int *M2, int *M3, size_t p_M1,size_t p_M2, size_t p_M3, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int myrow = idx;   /* one thread per row of the result */
    int j= 0,i=0;
    //int point = idx % N;
    /*if (idx < N ){
        int* row_M3 = (int*)((char*)M3 + myrow * p_M3);
        for(i = 0; i< N; i++)
            row_M3[i] = (int) 20;
    }*/
    if ( idx < N ){
        int k = 0;
        /* rows of pitched allocations are addressed in bytes */
        int* row_M3 = (int*)((char*)M3 + myrow * p_M3);
        int* row_M1 = (int*)((char*)M1 + myrow * p_M1);
        for(j = 0; j <N; j++){
            row_M3[j] = (int) 0;
            for(k = 0; k < N; k++){
                int* row_M2 = (int*)((char*)M2 + k * p_M2);
                row_M3[j] += row_M1[k] * row_M2[j];
            }
        }
    }
}
int const N = 768;
int main(){
/* pointers to host memory */
int *Host_M1, *Host_M2, *Host_M3;
/* pointers to device memory */
int *GPU_M1, *GPU_M2, *GPU_M3;
size_t pitch_M1,pitch_M2,pitch_M3;
int i;
/* Allocate 2darrays on host*/
Host_M1 = (int*) malloc(N*N*sizeof(int));
Host_M2 = (int*) malloc(N*N*sizeof(int));
printf("OK mem 2d host\n ");
/* Allocate 2darrays on device*/
size_t width = N* sizeof(int);
size_t height = N;
cudaMallocPitch((void**)&GPU_M1, &pitch_M1,width,height);
cudaMallocPitch((void**)&GPU_M2, &pitch_M2,width,height);
cudaMallocPitch((void**)&GPU_M3, &pitch_M3,width,height);
printf("OK mem2d cuda\n ");
/* Initialize arrays a and b */
for (i=0; i<N*N; i++){
    Host_M1[i] = (int) 1;
    Host_M2[i] = (int) 1;
}
printf("OK initialize\n\n\n\n\n ");
/* Copy data from host memory to device memory */
cudaMemcpy2D(GPU_M1, pitch_M1,Host_M1,width, width,height, cudaMemcpyHostToDevice);
cudaMemcpy2D(GPU_M2, pitch_M2,Host_M2,width, width,height, cudaMemcpyHostToDevice);
printf("OK memcpy H to D\n ");
//cudaMemcpy(b_d, b, sizeof(int)*N, cudaMemcpyHostToDevice);
// Invoke kernel
// here the threads and blocks are stuctured in linear way
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1)/threadsPerBlock;
cudaError_t erro = cudaGetLastError();
if(erro != cudaSuccess)
    printf("ERROR PREP launch FAIL!\n");
printf("error is %s \n",cudaGetErrorString (erro) );
multi<<<blocksPerGrid, threadsPerBlock>>>(GPU_M1, GPU_M2, GPU_M3, pitch_M1, pitch_M2, pitch_M3, N);
printf("OK Kernel\n ");
Host_M3 = (int*) malloc(N*N*sizeof(int));
cudaMemcpy2D(Host_M3,width,GPU_M3,pitch_M3,width,height ,cudaMemcpyDeviceToHost);
printf("OK memcp D to H\n ");
printf("OK done\n");
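for(i = N*764; i < N*N; i++){
    printf("%d(%d) ",Host_M3[i],i%N);
    if(i%N == N-1)
        printf("\n");
}
// Time to free the memories
free(Host_M1); free(Host_M2); free(Host_M3);
printf("OK freeHost\n ");
cudaFree(GPU_M1); cudaFree(GPU_M2); cudaFree(GPU_M3);
printf("OK freeDevice\n ");
return 0;
}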
Please guide me.