Modern Intel CPUs are split into Performance cores (P-cores) and Efficient cores (E-cores), so I wanted to investigate the best way to run MPI-based scientific computations. Would it be faster to use only the P-cores, or is it better to include E-cores for higher parallelism? While most scientists probably use supercomputers or workstations with Xeon CPUs and don’t worry about this, I decided to test it out myself.
System Specs
- MB: ASUS ROG STRIX B760-I GAMING WIFI
- CPU: Intel Core i7 14700K
- Memory: 64GB
- GPU: RTX4080 Super
Computational Code
Preparation
To identify which cores are P-cores, run lscpu --extended.
```
$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 5500.0000 800.0000  800.1480
  1    0      0    0 0:0:0:0          yes 5500.0000 800.0000  871.6810
  2    0      0    1 4:4:1:0          yes 5500.0000 800.0000  800.0000
  3    0      0    1 4:4:1:0          yes 5500.0000 800.0000  800.0000
  4    0      0    2 8:8:2:0          yes 5500.0000 800.0000  800.0000
  5    0      0    2 8:8:2:0          yes 5500.0000 800.0000  800.0000
  6    0      0    3 12:12:3:0        yes 5500.0000 800.0000  800.0000
  7    0      0    3 12:12:3:0        yes 5500.0000 800.0000  800.0000
  8    0      0    4 16:16:4:0        yes 5600.0000 800.0000  800.0000
  9    0      0    4 16:16:4:0        yes 5600.0000 800.0000  800.0000
 10    0      0    5 20:20:5:0        yes 5600.0000 800.0000  800.0000
 11    0      0    5 20:20:5:0        yes 5600.0000 800.0000  800.0000
 12    0      0    6 24:24:6:0        yes 5500.0000 800.0000  800.0000
 13    0      0    6 24:24:6:0        yes 5500.0000 800.0000  800.0000
 14    0      0    7 28:28:7:0        yes 5500.0000 800.0000  808.9900
 15    0      0    7 28:28:7:0        yes 5500.0000 800.0000  800.0000
 16    0      0    8 32:32:8:0        yes 4300.0000 800.0000  800.0000
 17    0      0    9 33:33:8:0        yes 4300.0000 800.0000 4300.0010
 18    0      0   10 34:34:8:0        yes 4300.0000 800.0000  800.0000
 19    0      0   11 35:35:8:0        yes 4300.0000 800.0000  800.0000
 20    0      0   12 36:36:9:0        yes 4300.0000 800.0000  800.0000
 21    0      0   13 37:37:9:0        yes 4300.0000 800.0000  800.0000
 22    0      0   14 38:38:9:0        yes 4300.0000 800.0000  800.0000
 23    0      0   15 39:39:9:0        yes 4300.0000 800.0000  800.0000
 24    0      0   16 40:40:10:0       yes 4300.0000 800.0000  800.0000
 25    0      0   17 41:41:10:0       yes 4300.0000 800.0000  800.0000
 26    0      0   18 42:42:10:0       yes 4300.0000 800.0000  800.0000
 27    0      0   19 43:43:10:0       yes 4300.0000 800.0000  800.0000
```

The cores with higher MAXMHZ are likely the P-cores. In my case, CPUs 0–15 (CORE 0–7) are the P-cores.
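If you'd rather not read that table by eye, the list of P-core CPU IDs can be extracted with a one-liner. This is a sketch that assumes P-cores are exactly the logical CPUs with MAXMHZ above 5000 on this machine; on a different hybrid CPU the threshold would need adjusting (recent kernels may also expose the same information directly in /sys/devices/cpu_core/cpus).

```
# Print the logical CPUs whose max frequency exceeds 5000 MHz (the P-cores here: 0-15)
# as a comma-separated list, handy for affinity settings later.
lscpu --extended=CPU,CORE,MAXMHZ | awk 'NR > 1 && $3 + 0 > 5000 { print $1 }' | paste -sd, -
```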
I confirmed that it’s possible to run simulations using only P-cores.
```
$ mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
[alpha-tauri:35406] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35406] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35406] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
hello
hello
hello
hello
hello
hello
hello
hello
```

I also confirmed that in some settings, both P and E-cores are used.
```
$ mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
[alpha-tauri:35826] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35826] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:35826] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:35826] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:35826] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:35826] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:35826] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:35826] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:35826] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:35826] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:35826] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:35826] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:35826] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]
[alpha-tauri:35826] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35826] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
```
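--report-bindings only shows what the launcher intended, so for longer runs it is reassuring to also check where the processes actually end up. A minimal sketch, assuming the program appears under a known process name (here lmp, the LAMMPS binary used later in this post; substitute whatever you are running):

```
# PSR is the logical CPU each thread last ran on; on this machine 0-15 are P-cores,
# 16-27 are E-cores. Sort by that column to spot any ranks sitting on E-cores.
ps -C lmp -L -o pid,tid,psr,comm | sort -n -k3
```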
(Appendix) Thread Allocation Test

I wanted to see which cores threads were being scheduled on, so I wrote a small code snippet to test this. It turned out to be not very useful, but I'm including it here anyway, since it seems a waste to just throw it out.
```
#define _GNU_SOURCE
#include <stdio.h>
#include <omp.h>
#include <sched.h>
#include <hwloc.h>

int main() {
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int cpu = sched_getcpu();  // logical CPU this thread is currently running on

        hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();
        hwloc_get_last_cpu_location(topology, cpuset, HWLOC_CPUBIND_THREAD);

        printf("Thread %d is running on CPU %d (bitmap:", tid, cpu);
        for (int i = 0; i <= hwloc_bitmap_last(cpuset); i++) {
            if (hwloc_bitmap_isset(cpuset, i)) {
                printf(" %d", i);
            }
        }
        printf(" )\n");

        hwloc_bitmap_free(cpuset);
    }

    hwloc_topology_destroy(topology);
    return 0;
}
```

Everything from compilation to execution is shown below.
```
$ sudo apt install hwloc libhwloc-dev
$ gcc -fopenmp main.c -lhwloc -o hello_omp
$ export OMP_NUM_THREADS=28
$ ./hello_omp
Thread 0 is running on CPU 15 (bitmap: 15 )
Thread 1 is running on CPU 17 (bitmap: 17 )
Thread 13 is running on CPU 0 (bitmap: 0 )
Thread 23 is running on CPU 7 (bitmap: 7 )
Thread 16 is running on CPU 6 (bitmap: 6 )
Thread 27 is running on CPU 16 (bitmap: 16 )
Thread 2 is running on CPU 23 (bitmap: 23 )
Thread 14 is running on CPU 4 (bitmap: 4 )
Thread 4 is running on CPU 20 (bitmap: 6 )
Thread 7 is running on CPU 22 (bitmap: 22 )
Thread 8 is running on CPU 24 (bitmap: 24 )
Thread 10 is running on CPU 26 (bitmap: 26 )
Thread 11 is running on CPU 27 (bitmap: 27 )
Thread 12 is running on CPU 2 (bitmap: 2 )
Thread 21 is running on CPU 3 (bitmap: 3 )
Thread 22 is running on CPU 5 (bitmap: 5 )
Thread 17 is running on CPU 10 (bitmap: 10 )
Thread 24 is running on CPU 9 (bitmap: 9 )
Thread 26 is running on CPU 13 (bitmap: 13 )
Thread 20 is running on CPU 1 (bitmap: 1 )
Thread 9 is running on CPU 25 (bitmap: 25 )
Thread 5 is running on CPU 19 (bitmap: 19 )
Thread 15 is running on CPU 8 (bitmap: 8 )
Thread 3 is running on CPU 18 (bitmap: 18 )
Thread 25 is running on CPU 11 (bitmap: 11 )
Thread 18 is running on CPU 12 (bitmap: 12 )
Thread 6 is running on CPU 21 (bitmap: 21 )
Thread 19 is running on CPU 14 (bitmap: 14 )
```
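As the output shows, the OpenMP runtime left to its own devices scatters threads across both core types. For completeness, here is a minimal sketch of pinning the threads to the P-cores only, using standard OpenMP environment variables; the explicit place list assumes the CPU numbering shown earlier (one place per physical P-core) and was not part of the original test.

```
export OMP_NUM_THREADS=8
export OMP_PLACES="{0},{2},{4},{6},{8},{10},{12},{14}"  # first hardware thread of each P-core
export OMP_PROC_BIND=close                              # keep thread i on place i
./hello_omp
```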
Running LAMMPS

in.melt
I used the melt example from the LAMMPS examples directory, with slight modifications (increased cell size and number of steps).
```
# 3d Lennard-Jones melt

units           lj
atom_style      atomic

lattice         fcc 0.8442
region          box block 0 100 0 100 0 100
create_box      1 box
create_atoms    1 box
mass            1 1.0

velocity        all create 3.0 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin
neigh_modify    every 20 delay 0 check no

fix             1 all nve

dump            id all atom 100 dump_*.melt

thermo          100
run             1000
```
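A quick sanity check on the problem size: an fcc lattice has 4 atoms per unit cell and the region above spans 100 x 100 x 100 lattice cells, so the box contains about 4 million atoms. That is also why the Matom-step/s column in the results below is roughly 4x the timesteps/s column.

```
# 4 atoms per fcc unit cell x 100^3 cells
echo $((4 * 100 * 100 * 100))   # 4000000 atoms
```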
Execution Commands

I tested the following five configurations. The aim was to compare not only P vs. P+E cores, but also thread-based and process-based parallelism.
MPI=1, OMP=28
```
export OMP_NUM_THREADS=28
lmp -sf gpu -pk gpu 1 -in in.melt
```

(P-cores only) MPI=1, OMP=1
```
export OMP_NUM_THREADS=1
mpirun -np 1 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```

Confirmed that only P-cores were allocated.
```
[alpha-tauri:35977] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
```

(P-cores only) MPI=8, OMP=1
```
export OMP_NUM_THREADS=1
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```

Confirmed that only P-cores were allocated.
```
[alpha-tauri:35997] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35997] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:35997] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35997] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
```

(P-cores only) MPI=8, OMP=2
```
export OMP_NUM_THREADS=2
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```

Confirmed that only P-cores were allocated.
```
[alpha-tauri:36060] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36060] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36060] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36060] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
```

(P+E-cores) MPI=20, OMP=1
```
export OMP_NUM_THREADS=1
mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```

Confirmed that both P and E-cores were allocated.
```
[alpha-tauri:36257] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36257] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36257] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36257] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:36257] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:36257] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:36257] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:36257] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:36257] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:36257] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:36257] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:36257] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:36257] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:36257] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:36257] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]
```

Results
Below is a summary of the output from log.lammps.
| MPI/OMP | Loop Time (s) | tau/day | timesteps/s | Matom-step/s | Total wall time |
|---|---|---|---|---|---|
| 1/28 | 48.7217 | 8866.691 | 20.525 | 82.099 | 0:00:51 |
| 1/1 | 65.5721 | 6588.166 | 15.250 | 61.002 | 0:01:07 |
| 8/1 | 25.7729 | 16761.781 | 38.800 | 155.202 | 0:00:26 |
| 8/2 | 25.5147 | 16931.387 | 39.193 | 156.772 | 0:00:26 |
| 20/1 | 26.3222 | 16411.973 | 37.991 | 151.963 | 0:00:28 |
The results show that process-based parallelism (e.g., 8/1 or 20/1) is significantly faster than thread-based (e.g., 1/28). According to ChatGPT, here are some possible reasons:
Advantages of MPI parallelism:
- Standard LAMMPS is optimized for MPI, so it benefits more from process-based parallelism.
- Better cache locality and less contention compared to threads.
- Tasks like force calculations can be efficiently and completely divided across processes.
Challenges with OpenMP (thread) parallelism:
- Unless explicitly optimized using packages like USER-OMP or KOKKOS, thread overhead can increase (see the sketch after this list).
- Speedup is limited due to partial parallelism (Amdahl’s Law).
- Threads share memory, making memory bandwidth and cache contention a bottleneck.
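For reference, here is roughly what that "explicitly optimized" threaded route would look like. This is a sketch only and was not benchmarked here; it assumes the lmp binary was built with the OPENMP package (formerly USER-OMP).

```
export OMP_NUM_THREADS=8
# -sf omp switches pair/fix styles to their /omp variants, -pk omp 8 sets the thread count
lmp -sf omp -pk omp 8 -in in.melt
```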
Comparing 1/28 with 1/1, the 28 threads did give a speedup, but only about 1.3x on loop time, far less than what the process-based runs achieved.
The 8-process configuration gave the fastest result. There was no significant difference between OMP=1 and OMP=2 in this case.
The 20-process configuration was slightly slower than the 8-process one, likely because the E-cores drag it down: the work is divided evenly across ranks, so the ranks bound to the slower E-cores finish their share last and the P-cores end up waiting at each synchronization point.
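One way to quantify how much the E-cores hold things back, which I did not try here, would be to time the same input on the 12 E-cores alone. A sketch, with the caveat that whether mpirun honors an externally set affinity mask depends on the Open MPI version, so keep --report-bindings on to verify:

```
export OMP_NUM_THREADS=1
# Logical CPUs 16-27 are the E-cores on this machine
taskset -c 16-27 mpirun -np 12 --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```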
Conclusion: It seems that using only the P-cores for parallel computing yields the fastest results.
That said, the ideal number of processes or threads may vary depending on the number of atoms, the potential used, and the specific LAMMPS commands. Some trial and error is required in each case.
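As one way to make that trial and error cheap to repeat, a small sweep over the rank count along these lines (a sketch reusing the same binary and input as above) collects the loop times in one go:

```
export OMP_NUM_THREADS=1
for np in 1 2 4 8 16 20; do
  # Write each run's log to its own file, then pull out the timing summary line
  mpirun -np "$np" --map-by ppr:1:core --bind-to core \
    lmp -sf gpu -pk gpu 1 -in in.melt -log log.np"$np"
  grep 'Loop time' log.np"$np"
done
```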