If you mean 1024 threads (as opposed to 1024 threads per block) then this is a tiny little kernel. A key aspect of GPU acceleration is latency hiding. It would be quite rare for a kernel of 1024 threads to be able to do a good job of latency hiding, so I would expect latency (perhaps in several forms including arithmetic as well as memory latency) to be an issue for such a kernel. Subdividing such a kernel would not be the correct approach to improve performance.
As njuffa said, starting with a profiler is a good idea. There are good presentations on analysis-driven optimization that are specific to CUDA. For example, google “paulius analysis optimization gtc” and you will quickly find a few.
In a nutshell, you could start out with nvprof, and start to get a feel for some of the metrics. At a high level, you want to assess whether a kernel is compute-bound, memory-bound, or latency-bound. In particular, you could start with the gld_efficiency and sm_efficiency metrics as very high-level steering indicators to decide on the next steps.
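As a hypothetical starting point (assuming your executable is `./app` — substitute your own binary and arguments), both metrics can be collected in a single nvprof run:

```shell
# Collect the two high-level steering metrics for every kernel in the app.
#   gld_efficiency : ratio of requested global load throughput to the
#                    throughput actually required by the access pattern
#   sm_efficiency  : percentage of time at least one warp is active on a
#                    multiprocessor, averaged over all multiprocessors
nvprof --metrics gld_efficiency,sm_efficiency ./app
```

On newer CUDA toolkits nvprof has been superseded by Nsight Compute (`ncu`), which exposes roughly equivalent information under different metric names.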
The results of this analysis can/should suggest optimization strategies. For example, if the gld_efficiency metric is low (significantly less than 50%), you would want to analyze global memory access patterns for efficiency (e.g. coalescing). If you do your analysis on a modern (sm_52) architecture, you can even let the profiler trace efficiency down to a particular line of source code. But even without that, hopefully you have enough understanding of the code to figure out where the global load traffic is originating.
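To illustrate the kind of pattern the profiler is flagging, here is a pair of hypothetical kernels (not from your code) showing the difference between coalesced and strided global loads:

```cuda
// Coalesced: adjacent threads in a warp read adjacent floats, so each
// warp load is serviced by a minimal number of memory transactions
// (gld_efficiency near 100%).
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: adjacent threads read addresses `stride` floats apart, so a
// single warp load touches many memory segments, and gld_efficiency
// drops as the stride grows.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```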
Likewise, if the sm_efficiency metric is low (significantly less than 90%), then to some degree your performance is bound by latency. The usual recipe here is to try to “expose more parallelism,” possibly via increased ILP (instruction-level parallelism), more work per thread, higher occupancy, and/or simply a larger number of threads in your grid. As a simple example related to your description, you mention that each thread processes 1000 data sets. Can this be further parallelized? Is there any inherent sequential dependency in the processing of these 1000 data sets per thread? If not, try increasing your total thread count (in the grid) by a factor of 10 and have each thread process only 100 data sets instead of 1000.
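That refactoring might look like this (kernel names and the `process` helper are hypothetical stand-ins for your per-data-set work, assuming the data sets really are independent):

```cuda
// Stand-in for whatever work is done on one data set.
__device__ void process(int idx, const float *in, float *out);

// Before: 1024 threads total, each looping over 1000 data sets.
__global__ void kernel_1024(const float *in, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 1023
    for (int k = 0; k < 1000; ++k)
        process(t * 1000 + k, in, out);
}

// After: 10240 threads total, each looping over only 100 data sets.
// Same total work, but 10x the warps available to the scheduler for
// latency hiding.
__global__ void kernel_10240(const float *in, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 10239
    for (int k = 0; k < 100; ++k)
        process(t * 100 + k, in, out);
}
```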
If your kernel really is only 1024 threads, I’d be very surprised if it had a good sm_efficiency number on anything but the smallest of GPUs.
My rule of thumb, which I suggest to others, is that problem sizes of fewer than 10,000 threads will often be latency-bound, and that threshold only increases as you move to larger GPUs and newer architectures.
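A quick way to get a sense of the scale on your own device is this host-only sketch using the CUDA runtime API (the percentage printed for a 1024-thread grid will typically be tiny on any recent GPU):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    // Maximum number of threads the device can keep resident at once.
    int in_flight = p.multiProcessorCount * p.maxThreadsPerMultiProcessor;
    printf("resident-thread capacity: %d\n", in_flight);
    printf("a 1024-thread grid fills %.1f%% of it\n",
           100.0 * 1024 / in_flight);
    return 0;
}
```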
As an additional comment: if the reason you have limited your kernel to 1024 threads is so that you can launch it as a single block (something like `kernel<<<1, 1024>>>(...);`), then you are seriously underutilizing almost any CUDA GPU available. Single-block launches of that form are serious performance limiters in CUDA. If you ever sit down to write code like that where you care about performance, you should immediately stop and rethink things.
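To make the contrast concrete (hypothetical kernel and problem size):

```cuda
__global__ void mykernel(float *data, int n);   // hypothetical kernel

void launch(float *d_data, int n)
{
    // Single-block launch: only one SM ever has work; the rest of the
    // GPU idles. A serious performance limiter.
    mykernel<<<1, 1024>>>(d_data, n);

    // Grid-sized launch: enough blocks that every SM can stay busy.
    int block = 256;
    int grid  = (n + block - 1) / block;
    mykernel<<<grid, block>>>(d_data, n);
}
```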