Nsight Guided Profiling¶

This webpage is directly generated from the nsight-guided-profiling.md file of j3soon/hpc-samples. Please refer to the repository for the mentioned examples.

Prerequisites¶

Downloading the two Nsight GUIs are sufficient, as we have provide pre-profiled reports for the examples in the repository.

(Optional) Profiler and Container Setup¶

System configuration following docs:

cat /proc/sys/kernel/perf_event_paranoid
sudo sh -c 'echo 2 >/proc/sys/kernel/perf_event_paranoid'

Launch container with SYS_ADMIN caps:

cd src
docker run --rm -it --gpus all \
  --cap-add=SYS_ADMIN \
  -v $PWD:/app \
  j3soon/hpc-samples:nvhpc-25.7-devel-cuda12.9-ubuntu24.04
# in the container
nsys status -e

Nsight¶

Nsight Systems

See the user guide for more details.

Default nsys profile flags:

nsys profile --stats=false -t cuda,opengl,nvtx,osrt --cudabacktrace=none [executable] [executable options]

Nsight Compute

See the documentation for more details.

Parallel Reduce Sum¶

In the container:

cd /app/cpp/cuda/reduce_sum

and run all tests:

./test_all.sh

If you don't have an environment, download the reports from here.

01_atomic_add_gmem.cu (653.09 ms)
- Summary: Drain Stalls (Est. Speedup: 49.96%)
- Source: L17 atomicAdd Long Scoreboard and L19 Drain.

02_atomic_add_smem.cu (164.72 ms)
- Improved: Details > Memory Workload Analysis > Memory Chart > L2 Cache Writes
- Summary: Thread Divergence (Est. Speedup: 31.03%), Short Scoreboard Stalls (Est. Speedup: 15.31%), Barrier Stalls (Est. Speedup: 15.31%)
- Source: L22 atomicAdd Short Scoreboard and L26 Barrier.

03_interleaved_addressing.cu (27.00 ms)
- Improved: Shared Memory Bottleneck
- Summary: Uncoalesced Shared Accesses (Est. Speedup: 37.79%), Shared Load Bank Conflicts (Est. Speedup: 24.17%), Thread Divergence (Est. Speedup: 18.76%)

04_interleaved_addressing_non_divergent.cu (20.98 ms)
- Improved: Thread Divergence
- Summary: Uncoalesced Shared Accesses (Est. Speedup: 70.86%), Shared Load Bank Conflicts (Est. Speedup: 60.72%), Shared Store Bank Conflicts (Est. Speedup: 51.40%)

05_sequential_addressing.cu (17.95 ms)
- Improved: Shared Memory Bank Conflicts
- Summary: Thread Divergence (Est. Speedup: 36.69%)

06_first_add_during_load.cu (9.28 ms)
- Improved: Thread Divergence (due to half of the threads in the block are idle after loading to shared memory). Details > Source Counter > Branch Instructions.
- Summary: Thread Divergence (Est. Speedup: 34.89%)

07_unroll_last_warp.cu (5.05 ms)
- Improved: Reduced thread synchronization at previous L22 and current L36 Barrier. Details > Source Counter > Branch Instructions.
- Summary: Achieved Occupancy (Est. Speedup: 8.14%), Long Scoreboard Stalls (Est. Speedup: 8.14%)

08_complete_unroll.cu (4.88 ms)
- Improved: Details > Source Counter > Branch Instructions.
- Summary: Achieved Occupancy (Est. Speedup: 5.02%), Long Scoreboard Stalls (Est. Speedup: 5.02%)

09_warp_shuffle.cu (4.79 ms)
- Improved: Details > Memory Workload Analysis > Memory Chart > Shared Memory
- Summary: Achieved Occupancy (Est. Speedup: 3.18%), Long Scoreboard Stalls (Est. Speedup: 3.18%)

10_grid_stride_loop.cu (4.78 ms)
- Improved: Details > Instruction Statistics > Executed Instructions
- Summary: Achieved Occupancy (Est. Speedup: 2.99%), Long Scoreboard Stalls (Est. Speedup: 2.99%)

11_grid_size.cu (4.75 ms)
- Improved: Details > Occupancy > Achieved Occupancy
- Summary: Long Scoreboard Stalls (Est. Speedup: 2.24%)

The main performance bottleneck is due to Long Scoreboard Stalls. Further optimizations could explore advanced CUDA features such as LDGSTS and TMA instructions.

Nsight Guided Profiling¶

Prerequisites¶

(Optional) Profiler and Container Setup¶

Nsight¶

Parallel Reduce Sum¶

References¶