The key metric here is instructions per cycle (insns per cycle: IPC), which shows how many instructions were completed, on average, for each CPU clock cycle. The higher, the better (a simplification).
If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.
If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.
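As a rough illustration of the rule of thumb above, the sketch below computes IPC from two raw counter values and applies the 1.0 threshold. The function names `ipc` and `classify` and the counter values are hypothetical, chosen only for this example; the classification is a starting point for investigation, not a diagnosis.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Return instructions per cycle from two raw counter values."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def classify(ipc_value: float, threshold: float = 1.0) -> str:
    """Apply the rough IPC rule of thumb; the 1.0 threshold is a simplification."""
    if ipc_value < threshold:
        return "likely memory stalled"
    if ipc_value > threshold:
        return "likely instruction bound"
    return "inconclusive"


# Hypothetical counter values, for illustration only.
value = ipc(instructions=8_000_000_000, cycles=10_000_000_000)
print(f"IPC = {value:.2f}: {classify(value)}")
```

With these made-up numbers the IPC is 0.80, which falls on the "memory stalled" side of the threshold.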
task-clock: CPU time spent running the task (as opposed to wall-clock time).
context-switches: the number of times the processor switched from running one thread to another.
cpu-migrations: the number of times the process was moved from one core to another.
page-faults: a page fault occurs when a program touches a memory page that is not yet mapped into the process's virtual memory.
cycles: CPU clock cycles. They tick regularly at a given frequency unless the processor is deliberately set idle.
instructions: a raw count of the instructions executed.
branches: the number of branch instructions executed over the whole run.
branch-misses: the number of times branch prediction failed to guess the execution path correctly.
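The raw counters above are usually easier to read as ratios. The following is a minimal sketch, assuming the counter values have already been collected; the helper names and the sample numbers are hypothetical and do not come from a real run.

```python
def branch_miss_rate(branch_misses: int, branches: int) -> float:
    """Fraction of branches that were mispredicted."""
    if branches <= 0:
        raise ValueError("branches must be positive")
    return branch_misses / branches


def cpus_utilized(task_clock_ms: float, elapsed_ms: float) -> float:
    """CPU time divided by wall-clock time; above 1.0 means more than one CPU was busy."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return task_clock_ms / elapsed_ms


def effective_ghz(cycles: int, task_clock_ms: float) -> float:
    """Average clock rate while the task was on-CPU, in cycles per nanosecond (GHz)."""
    if task_clock_ms <= 0:
        raise ValueError("task_clock_ms must be positive")
    return cycles / (task_clock_ms * 1_000_000)


# Hypothetical sample values, for illustration only.
print(f"branch miss rate: {branch_miss_rate(25_000, 1_000_000):.2%}")
print(f"CPUs utilized:    {cpus_utilized(1000.0, 2000.0):.2f}")
print(f"clock rate:       {effective_ghz(3_000_000_000, 1000.0):.2f} GHz")
```

A high branch miss rate or a low CPUs-utilized figure is only a hint about where to look next; what counts as "high" or "low" depends on the workload.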