Weird malloc problem

Hi guys,

I am having an issue where the order in which I cudaMalloc my variables affects whether I can access them. Specifically, for code like this:

[codebox]
cudaMalloc((void**)&kxihlG,nElkxihl*sizeof(float));
cudaMalloc((void**)&kxihrG,nElkxihr*sizeof(float));
cudaMemcpy(kxihlG, kxihl, nElkxihl*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(kxihrG, kxihr, nElkxihr*sizeof(float), cudaMemcpyHostToDevice);
.
.
. lots of variables
cudaMalloc((void**)&TARGETVAR,nElTARGETVAR*sizeof(float));
cudaMemcpy(TARGETVARG, TARGETVAR, nElTARGETVAR*sizeof(float), cudaMemcpyHostToDevice);
test<<<1,1>>>(TARGETVARG);
Error = cudaThreadSynchronize();
fprintf(stderr,"@TEST1 Error = %d \n",Error);[/codebox]

If I run it like this, cudaThreadSynchronize returns a failure of ‘4’, but if I move the cudaMalloc for TARGETVAR to the top of the list, then the test<<<>>> kernel runs successfully:

THIS WORKS

[codebox]
cudaMalloc((void**)&TARGETVAR,nElTARGETVAR*sizeof(float));
cudaMalloc((void**)&kxihlG,nElkxihl*sizeof(float));
cudaMalloc((void**)&kxihrG,nElkxihr*sizeof(float));
cudaMemcpy(kxihlG, kxihl, nElkxihl*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(kxihrG, kxihr, nElkxihr*sizeof(float), cudaMemcpyHostToDevice);
.
.
. lots of variables
cudaMemcpy(TARGETVARG, TARGETVAR, nElTARGETVAR*sizeof(float), cudaMemcpyHostToDevice);
test<<<1,1>>>(TARGETVARG);
Error = cudaThreadSynchronize();
fprintf(stderr,"@TEST1 Error = %d \n",Error);[/codebox]

Can anyone explain this? I don't think the device memory is full. Do I have to put some delay in? Why does the order matter, as long as the malloc for a given variable comes before the memcpy for it?

How can this be fixed? The same problem is occurring elsewhere with other variables. Thank you for your time!

Have you checked whether something is going wrong with TARGETVAR?

Why don't you check the error returned by that statement?
Error = cudaMalloc((void**)&TARGETVAR,nElTARGETVAR*sizeof(float));
printf("CUDA Error: %s\n", cudaGetErrorString(Error));
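
To take that one step further, here is a rough sketch of checking every runtime call instead of only the final synchronize, so the first failing call is the one that gets reported. The CUDA_CHECK macro, the trivial test kernel body, and the element count are my own placeholders, not anything from your code. I also used cudaDeviceSynchronize, which replaces the deprecated cudaThreadSynchronize in newer toolkits.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): print a readable
// message and exit as soon as any CUDA runtime call returns an error.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,        \
                    __LINE__, #call, cudaGetErrorString(err_));        \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Placeholder kernel standing in for the poster's test<<<1,1>>> kernel.
__global__ void test(float *data) { data[0] += 1.0f; }

int main() {
    const size_t n = 1024;  // placeholder element count
    float *host = (float *)calloc(n, sizeof(float));
    float *dev = NULL;
    if (!host) return EXIT_FAILURE;

    // Each call is checked on its own, so a failed cudaMalloc is reported
    // here rather than showing up later as a kernel failure.
    CUDA_CHECK(cudaMalloc((void **)&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice));

    test<<<1, 1>>>(dev);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("host[0] = %f\n", host[0]);

    CUDA_CHECK(cudaFree(dev));
    free(host);
    return 0;
}
```

If the cudaMalloc or cudaMemcpy for TARGETVAR is the call that actually fails, this should point at that line directly.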

If nElTARGETVAR is especially large, this could be the classic problem of memory fragmentation: a large contiguous chunk is harder to allocate than several smaller ones, which fit into the “cracks” between existing allocations. This isn’t specific to the GPU; it happens all the time on the CPU too.

Classic answer: always allocate from large to small. Do this as a reflex in all your coding on GPU and CPU.
When this fails, the next strategy is usually to get fancier with your memory allocator (reusing old allocations, etc.) and/or to change your algorithm so that it needs fewer large contiguous blocks.
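
A minimal sketch of the large-to-small habit, assuming you can list all your buffers up front. The AllocRequest struct, the helper names, and the buffer sizes are hypothetical and only there for illustration. The requests are sorted by size in descending order before any cudaMalloc is issued, and the first failure is reported by name.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical bookkeeping record: a label, where to store the device
// pointer, and how many bytes to allocate.
struct AllocRequest {
    const char *name;
    float **devPtr;
    size_t bytes;
};

// Order the requests from largest to smallest (host-only, no GPU needed).
void sortLargestFirst(std::vector<AllocRequest> &reqs) {
    std::sort(reqs.begin(), reqs.end(),
              [](const AllocRequest &a, const AllocRequest &b) {
                  return a.bytes > b.bytes;
              });
}

// Allocate every request, largest first. Returns the first error seen,
// or cudaSuccess. Buffers allocated before a failure are left for the
// caller to free.
cudaError_t allocateAll(std::vector<AllocRequest> &reqs) {
    sortLargestFirst(reqs);
    for (const AllocRequest &r : reqs) {
        cudaError_t err = cudaMalloc((void **)r.devPtr, r.bytes);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc(%s, %zu bytes) failed: %s\n",
                    r.name, r.bytes, cudaGetErrorString(err));
            return err;
        }
    }
    return cudaSuccess;
}

int main() {
    float *smallA = NULL, *smallB = NULL, *big = NULL;
    std::vector<AllocRequest> reqs = {
        {"smallA", &smallA, 1024 * sizeof(float)},
        {"smallB", &smallB, 2048 * sizeof(float)},
        {"big", &big, (size_t)(1 << 20) * sizeof(float)},
    };

    cudaError_t err = allocateAll(reqs);
    if (err == cudaSuccess) {
        printf("first buffer allocated: %s\n", reqs[0].name);
    }

    // cudaFree(NULL) is a no-op, so this is safe after a partial failure.
    cudaFree(big);
    cudaFree(smallA);
    cudaFree(smallB);
    return err == cudaSuccess ? 0 : 1;
}
```

This only helps if fragmentation is really the cause, so it is worth confirming that first by checking the value cudaMalloc returns.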