Can't modify data on the GPU

Hi, I'm starting with CUDA and I have some simple code to understand how everything works here.

Code:

extern "C" void
CudaInitMemory(float* h_A ,float* h_B, float* h_C, unsigned int mem_size)
{

// allocate device memory
float* d_A;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size));
float* d_B;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_B, mem_size));

// copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, mem_size,
                          cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL(cudaMemcpy(d_B, h_B, mem_size,
                          cudaMemcpyHostToDevice) );
// allocate device memory for result
float* d_C;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size));                    
 
// setup execution parameters

// execute the kernel
SUMA <<< 1 , 100 >>> ( d_A, d_C);
   
CUDA_SAFE_CALL(cudaMemcpy(h_B, d_A, mem_size,
                          cudaMemcpyDeviceToHost) );
CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size,
                          cudaMemcpyDeviceToHost) );
                          
// clean up memory

CUDA_SAFE_CALL(cudaFree(d_A));
CUDA_SAFE_CALL(cudaFree(d_B));
CUDA_SAFE_CALL(cudaFree(d_C));                          

}

__global__ void SUMA (float* A , float* C)
{
float Aux = 0;
__shared__ float s_A[100];
__shared__ float s_C[100];

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int i = threadIdx.x;
C[i]=1; // I want to write "1" in each element to see if it makes anything!
__syncthreads();

}

If I use s_A as an intermediate buffer it shows the same…

What am I doing wrong??
h_A is copied properly: after this code runs, h_B contains the same values as h_A.
But h_C contains 0 after the code runs…

Any help??
Thanks!

Are you running in Debug mode? CUDA_SAFE_CALL won’t report execution failures otherwise.
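
As a sketch, you can also check the launch and the kernel execution by hand inside CudaInitMemory, without relying on CUDA_SAFE_CALL (cudaGetLastError and cudaThreadSynchronize are standard runtime calls):

SUMA <<< 1 , 100 >>> ( d_A, d_C);
// check for configuration/launch errors:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
// wait for the kernel to finish and catch execution errors:
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));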

Your problem is when you call:
SUMA <<< 1 , 100 >>> ( d_A, d_C); // you have only one block, with 100 threads.

__shared__ float s_A[100]; // shared memory is only visible within the same block, but you don't use s_A in your code anyway.

int ty = threadIdx.y; // this has no effect because your threads only vary along threadIdx.x. You could launch SUMA<<<1, dim3(5,20)>>> instead; then you'd have one block with threadIdx.x (0…4) and threadIdx.y (0…19), as sketched below.

And how is h_C defined? // float* h_C = (float*) malloc(mem_size); ??

P.S.: And how do you know that h_C contains 0? Be careful: you are using float, so printing a float with printf needs %f.
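
As a sketch of the 2D launch mentioned above, using dim3 (the CUDA type for block and grid dimensions), with the matching flat index inside the kernel:

dim3 block(5, 20);                 // threadIdx.x in 0…4, threadIdx.y in 0…19
SUMA <<< 1 , block >>> (d_A, d_C);

// inside the kernel, a flat index covering all 100 threads:
int i = threadIdx.y * blockDim.x + threadIdx.x;   // 0…99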

I used int ty = threadIdx.y in other thread and grid configurations; I've tried so many configs… 100 blocks with 1 thread, 1 block with 100 threads, 10 blocks with 10 threads…

I print the vectors in the main program this way:

for (a = 0; a < SIZE*SIZE; a++)
{
  cout << h_C[a] << "  ";
}

h_A is copied properly into h_B, but h_C is 000…0 after the GPU is used…

This is how h_C is defined: float* h_C = (float*) malloc(mem_size_C);

I don't know why I keep receiving 0 back from the GPU…

The problem isn’t arising due to threadIdx.y. threadIdx.y will be 0 in each of your threads, since you’re using a one-dimensional block. No big deal there.

Not sure if this is the problem, but you need a third configuration parameter. The configuration parameters are: <<<gridDim, blockDim, shared_mem_size>>> You aren’t indicating how much shared memory you are using. It should be 200*sizeof(float) since you are allocating two 100-element float arrays.

Try copy-pasting this. Ask me if you don't understand.

#include <cutil.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

__global__ void SUMA (float* C)
{
int i = blockIdx.x * blockDim.x * blockDim.y + blockDim.x * threadIdx.y + threadIdx.x;
C[i]=C[i]+1.0;

}

int main(int argc, char** argv)
{

unsigned int size =100;
unsigned int mem_size = sizeof(float) * size;

float* h_C = (float*) malloc(mem_size);

for (int i = 0; i < size; i++)
{
h_C[i] = (float)i;
}

float* d_C;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size));
CUDA_SAFE_CALL(cudaMemcpy(d_C, h_C, mem_size, cudaMemcpyHostToDevice) );
SUMA <<< 1 , size >>> (d_C);
CUDA_SAFE_CALL(cudaMemcpy(h_C,d_C, mem_size, cudaMemcpyDeviceToHost) );
for (int g = 0; g<size; g++)
printf("g: %d, value: %f\n", g, h_C[g]);
//printf("h_C[%d]=%f\n", atoi(argv[1]), h_C[atoi(argv[1])]);
CUDA_SAFE_CALL(cudaFree(d_C));
free(h_C);

}

Wow, I appreciate so much that you want to help me.

This code shows me h_C as [0 1 2 3 … 99].

I should see h_C = [1 2 3 4 … 100] if I understood the code properly…

Other info: I'm running Vista 64-bit, compiling for 32-bit, on a 9800GX2 with 180.48 WHQL drivers.

I tried the code under XP 32-bit and it did the same (my code).

Can you modify and then compile the NVIDIA examples?

The examples modify data properly on my GPU. I have modified them to print the matrix, and it contains the proper result; if I set h_C=1 in the kernel, it shows 1 properly afterwards.

There has to be some typical mistake here that always returns "0" in the data output.

I don't know what…

Have you paid any attention at all to my original response? If you weren't specifying how much shared memory to allocate, it's possible your kernel was failing when you tried declaring some of it inside your kernel, which would result in the kernel not being executed. This would leave the memory space undefined, which (very easily) could return an array of zeros.

The shared_mem_size parameter to the kernel launch is only for allocating additional shared memory in extern __shared__ arrays (see the programming guide). It is not needed when shared arrays are statically declared and sized at compile time.
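
For comparison, a minimal sketch of the extern __shared__ case, where the third launch parameter is what sizes the array:

__global__ void SUMA (float* C)
{
    extern __shared__ float s_C[];   // sized by the launch, not here
    int i = threadIdx.x;
    s_C[i] = C[i];
    __syncthreads();
    C[i] = s_C[i] + 1.0f;
}

// the third parameter gives the dynamic shared memory size in bytes:
SUMA <<< 1, 100, 100 * sizeof(float) >>> (d_C);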

My apologies. I suppose I've always allocated it as extern __shared__ arrays, which is why I was thinking this was necessary. Thanks for the clarification.

Why the **** DOES - THIS - CODE - NOT - MODIFY - ANYTHING - ON - MY - GPU ???

I think everything is OK!!!

I've also tried this:

__global__ void SUMA (float* C)
{
__shared__ float s_C[100];

int i = blockIdx.x * blockDim.x * blockDim.y + blockDim.x * threadIdx.y + threadIdx.x;

s_C[i] = C[i];
__syncthreads();
s_C[i] = 0;
C[i] = s_C[i];
}

Could any of you pass me code which creates, for example, a vector[100] containing 1 1 1 … 1, passes it to the GPU, and afterwards contains 2 2 2 … 2? If you can make this simple code for me and try it on your GPU, I will appreciate it so, so much… I'm close to abandoning my CUDA project… I must be a complete idiot not to understand this…
I need to know whether that simple code works on your GPU and whether or not it works on mine… It would help me get a better start at understanding CUDA…
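
Here is a minimal sketch of exactly that test, adapted from the working example earlier in the thread (the kernel name AddOne is just illustrative). If everything is set up correctly it should print 2.000000 one hundred times:

#include <cutil.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void AddOne (float* C)
{
    int i = threadIdx.x;
    C[i] = C[i] + 1.0f;           // 1 1 1 … 1 becomes 2 2 2 … 2
}

int main(int argc, char** argv)
{
    unsigned int size = 100;
    unsigned int mem_size = sizeof(float) * size;

    float* h_C = (float*) malloc(mem_size);
    for (unsigned int i = 0; i < size; i++)
        h_C[i] = 1.0f;            // fill the host vector with 1s

    float* d_C;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size));
    CUDA_SAFE_CALL(cudaMemcpy(d_C, h_C, mem_size, cudaMemcpyHostToDevice));
    AddOne <<< 1, size >>> (d_C);
    CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size, cudaMemcpyDeviceToHost));

    for (unsigned int g = 0; g < size; g++)
        printf("%f ", h_C[g]);    // expect 2.000000 everywhere
    printf("\n");

    CUDA_SAFE_CALL(cudaFree(d_C));
    free(h_C);
    return 0;
}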

There must be a problem with how you're compiling, or it might be a very silly mistake like misplacing the output file.

You said you can compile the SDK samples and modify them successfully. Take an SDK sample project and modify it step-by-step, to eventually look like the code posted here. See if that will work.

Try to backtrack to something that works, and change it little by little. There is something very simple you’re doing wrong, and you’ll figure it out eventually. Sorry about your frustration.

P.S. How exactly are you compiling this, anyway? Are you starting from the “template” SDK project?

It WORKS!!! It was my new project configuration. Can any of you explain to me how to configure the output file?

I don't have any idea of what is needed to configure a CUDA project; I've only added the CUDA directories to the project.

Thank you!

For simple one file projects, just compile on the command line:

nvcc -o test test.cu

For more complicated projects, I use CMake (www.cmake.org) with FindCUDA.cmake (search the web to find it).
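
For example, a minimal CMakeLists.txt sketch (assuming FindCUDA.cmake is on CMake's module path):

cmake_minimum_required(VERSION 2.6)
project(cuda_test)
find_package(CUDA REQUIRED)
cuda_add_executable(test test.cu)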