Which Is More Efficient for Scientific Computing: Using Only P-Cores or Both P and E-Cores?

Modern Intel CPUs are split into Performance cores (P-cores) and Efficient cores (E-cores), so I wanted to investigate the best way to run MPI-based scientific computations. Would it be faster to use only the P-cores, or is it better to include E-cores for higher parallelism? While most scientists probably use supercomputers or workstations with Xeon CPUs and don’t worry about this, I decided to test it out myself.

System Specs

Computational Code

LAMMPS

Preparation

To identify which cores are P-cores, run lscpu --extended.

Terminal window
$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 5500.0000 800.0000 800.1480
1 0 0 0 0:0:0:0 yes 5500.0000 800.0000 871.6810
2 0 0 1 4:4:1:0 yes 5500.0000 800.0000 800.0000
3 0 0 1 4:4:1:0 yes 5500.0000 800.0000 800.0000
4 0 0 2 8:8:2:0 yes 5500.0000 800.0000 800.0000
5 0 0 2 8:8:2:0 yes 5500.0000 800.0000 800.0000
6 0 0 3 12:12:3:0 yes 5500.0000 800.0000 800.0000
7 0 0 3 12:12:3:0 yes 5500.0000 800.0000 800.0000
8 0 0 4 16:16:4:0 yes 5600.0000 800.0000 800.0000
9 0 0 4 16:16:4:0 yes 5600.0000 800.0000 800.0000
10 0 0 5 20:20:5:0 yes 5600.0000 800.0000 800.0000
11 0 0 5 20:20:5:0 yes 5600.0000 800.0000 800.0000
12 0 0 6 24:24:6:0 yes 5500.0000 800.0000 800.0000
13 0 0 6 24:24:6:0 yes 5500.0000 800.0000 800.0000
14 0 0 7 28:28:7:0 yes 5500.0000 800.0000 808.9900
15 0 0 7 28:28:7:0 yes 5500.0000 800.0000 800.0000
16 0 0 8 32:32:8:0 yes 4300.0000 800.0000 800.0000
17 0 0 9 33:33:8:0 yes 4300.0000 800.0000 4300.0010
18 0 0 10 34:34:8:0 yes 4300.0000 800.0000 800.0000
19 0 0 11 35:35:8:0 yes 4300.0000 800.0000 800.0000
20 0 0 12 36:36:9:0 yes 4300.0000 800.0000 800.0000
21 0 0 13 37:37:9:0 yes 4300.0000 800.0000 800.0000
22 0 0 14 38:38:9:0 yes 4300.0000 800.0000 800.0000
23 0 0 15 39:39:9:0 yes 4300.0000 800.0000 800.0000
24 0 0 16 40:40:10:0 yes 4300.0000 800.0000 800.0000
25 0 0 17 41:41:10:0 yes 4300.0000 800.0000 800.0000
26 0 0 18 42:42:10:0 yes 4300.0000 800.0000 800.0000
27 0 0 19 43:43:10:0 yes 4300.0000 800.0000 800.0000

The cores with the higher MaxMHz values are the P-cores; they also expose two hardware threads each, while each E-core exposes only one. In my case, logical CPUs 0–15 (CORE 0–7) are the P-cores and CPUs 16–27 (CORE 8–19) are the E-cores.
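If you would rather not eyeball the lscpu table, a quick alternative (just a sketch, assuming the cpufreq driver is loaded and the usual sysfs paths exist) is to list every logical CPU with its advertised maximum frequency and sort by it:

Terminal window
# List logical CPUs with their maximum frequency; the group with the higher value are the P-core threads.
$ for c in /sys/devices/system/cpu/cpu[0-9]*; do echo "$(basename "$c") $(cat "$c/cpufreq/cpuinfo_max_freq")"; done | sort -k2,2 -nr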

I confirmed that MPI processes can be bound to the P-cores only.

Terminal window
$ mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
[alpha-tauri:35406] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35406] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35406] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
hello
hello
hello
hello
hello
hello
hello
hello

I also confirmed that when more ranks are requested than there are P-cores (here, 20 ranks for 8 P-cores), both P- and E-cores are used.

Terminal window
$ mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
[alpha-tauri:35826] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35826] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:35826] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:35826] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:35826] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:35826] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:35826] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:35826] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:35826] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:35826] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:35826] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:35826] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:35826] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]
[alpha-tauri:35826] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35826] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]

(Appendix) Thread Allocation Test

I wanted to see which cores threads were being scheduled on, so I wrote a small code snippet to test this. It turned out to be not very useful, but I’m including it here anyway — seems like a waste to just throw it out.

main.c
#define _GNU_SOURCE
#include <stdio.h>
#include <omp.h>
#include <sched.h>
#include <hwloc.h>
int main(void) {
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int cpu = sched_getcpu(); /* logical CPU this thread is currently running on */
        /* Ask hwloc where this thread last ran and print the resulting CPU bitmap. */
        hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();
        hwloc_get_last_cpu_location(topology, cpuset, HWLOC_CPUBIND_THREAD);
        printf("Thread %d is running on CPU %d (bitmap:", tid, cpu);
        for (int i = 0; i <= hwloc_bitmap_last(cpuset); i++) {
            if (hwloc_bitmap_isset(cpuset, i)) {
                printf(" %d", i);
            }
        }
        printf(" )\n");
        hwloc_bitmap_free(cpuset);
    }
    hwloc_topology_destroy(topology);
    return 0;
}

The following covers everything from installation and compilation to execution.

Terminal window
$ sudo apt install hwloc libhwloc-dev
$ gcc -fopenmp main.c -lhwloc -o hello_omp
$ export OMP_NUM_THREADS=28
$ ./hello_omp
Thread 0 is running on CPU 15 (bitmap: 15 )
Thread 1 is running on CPU 17 (bitmap: 17 )
Thread 13 is running on CPU 0 (bitmap: 0 )
Thread 23 is running on CPU 7 (bitmap: 7 )
Thread 16 is running on CPU 6 (bitmap: 6 )
Thread 27 is running on CPU 16 (bitmap: 16 )
Thread 2 is running on CPU 23 (bitmap: 23 )
Thread 14 is running on CPU 4 (bitmap: 4 )
Thread 4 is running on CPU 20 (bitmap: 6 )
Thread 7 is running on CPU 22 (bitmap: 22 )
Thread 8 is running on CPU 24 (bitmap: 24 )
Thread 10 is running on CPU 26 (bitmap: 26 )
Thread 11 is running on CPU 27 (bitmap: 27 )
Thread 12 is running on CPU 2 (bitmap: 2 )
Thread 21 is running on CPU 3 (bitmap: 3 )
Thread 22 is running on CPU 5 (bitmap: 5 )
Thread 17 is running on CPU 10 (bitmap: 10 )
Thread 24 is running on CPU 9 (bitmap: 9 )
Thread 26 is running on CPU 13 (bitmap: 13 )
Thread 20 is running on CPU 1 (bitmap: 1 )
Thread 9 is running on CPU 25 (bitmap: 25 )
Thread 5 is running on CPU 19 (bitmap: 19 )
Thread 15 is running on CPU 8 (bitmap: 8 )
Thread 3 is running on CPU 18 (bitmap: 18 )
Thread 25 is running on CPU 11 (bitmap: 11 )
Thread 18 is running on CPU 12 (bitmap: 12 )
Thread 6 is running on CPU 21 (bitmap: 21 )
Thread 19 is running on CPU 14 (bitmap: 14 )

Running LAMMPS

in.melt

I used the melt example from the LAMMPS examples directory, with slight modifications (increased cell size and number of steps).

in.melt
# 3d Lennard-Jones melt
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 100 0 100 0 100
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 3.0 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify every 20 delay 0 check no
fix 1 all nve
dump id all atom 100 dump_*.melt
thermo 100
run 1000

Execution Commands

I tested the following five configurations. The aim was to compare not only P vs. P+E cores, but also thread-based and process-based parallelism.

MPI=1, OMP=28

Terminal window
export OMP_NUM_THREADS=28
lmp -sf gpu -pk gpu 1 -in in.melt
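Note that in this configuration the OpenMP threads are not pinned, so the OS is free to schedule them on any logical CPU, P- or E-core alike (as the appendix above showed). I did not benchmark it, but a minimal sketch of a thread-based run restricted to the P-cores, using the standard OpenMP affinity variables and assuming logical CPUs 0–15 are the P-cores as identified earlier, would look like this:

Terminal window
# Hypothetical variant, not part of the benchmark: pin OpenMP threads to the P-core logical CPUs 0-15.
export OMP_NUM_THREADS=16
export OMP_PLACES="{0}:16:1"
export OMP_PROC_BIND=close
lmp -sf gpu -pk gpu 1 -in in.melt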

(P-cores only) MPI=1, OMP=1

Terminal window
export OMP_NUM_THREADS=1
mpirun -np 1 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

Confirmed that only P-cores were allocated.

Terminal window
[alpha-tauri:35977] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]

(P-cores only) MPI=8, OMP=1

Terminal window
export OMP_NUM_THREADS=1
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

Confirmed that only P-cores were allocated.

Terminal window
[alpha-tauri:35997] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35997] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:35997] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35997] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]

(P-cores only) MPI=8, OMP=2

Terminal window
export OMP_NUM_THREADS=2
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

Confirmed that only P-cores were allocated.

Terminal window
[alpha-tauri:36060] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36060] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36060] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36060] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]

(P+E-cores) MPI=20, OMP=1

Terminal window
export OMP_NUM_THREADS=1
mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

Confirmed that both P and E-cores were allocated.

Terminal window
[alpha-tauri:36257] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36257] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36257] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36257] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:36257] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:36257] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:36257] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:36257] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:36257] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:36257] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:36257] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:36257] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:36257] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:36257] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:36257] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]

Results

Below is a summary of the output from log.lammps.

MPI/OMP   Loop Time (s)   tau/day     timesteps/s   Matom-step/s   Total wall time
1/28      48.7217          8866.691   20.525         82.099        0:00:51
1/1       65.5721          6588.166   15.250         61.002        0:01:07
8/1       25.7729         16761.781   38.800        155.202        0:00:26
8/2       25.5147         16931.387   39.193        156.772        0:00:26
20/1      26.3222         16411.973   37.991        151.963        0:00:28

The results show that process-based parallelism (e.g., 8/1 or 20/1) is significantly faster than thread-based (e.g., 1/28). According to ChatGPT, here are some possible reasons:

Advantages of MPI parallelism:

  • Standard LAMMPS is optimized for MPI, so it benefits more from process-based parallelism.
  • Better cache locality and less contention compared to threads.
  • Tasks like force calculations can be divided cleanly and completely across processes.

Challenges with OpenMP (thread) parallelism:

  • Unless explicitly optimized using packages like USER-OMP or KOKKOS, thread overhead can increase.
  • Speedup is limited due to partial parallelism (Amdahl’s Law); see the rough estimate after this list.
  • Threads share memory, making memory bandwidth and cache contention a bottleneck.
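As a rough back-of-the-envelope check on the Amdahl’s Law point: with a parallel fraction p on n threads, the expected speedup is S(n) = 1 / ((1 - p) + p/n). Plugging in the measured S(28) ≈ 65.57 / 48.72 ≈ 1.35 from the table above gives p ≈ 0.27, so only about a quarter of the per-step work appears to benefit from threads in this build.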

Comparing 1/28 with 1/1, using 28 threads did give a speedup, but only a modest one (roughly 1.3x).

The 8-process configuration gave the fastest result. There was no significant difference between OMP=1 and OMP=2 in this case.

The 20-process configuration was slightly slower than the 8-process one, likely because the ranks placed on the slower E-cores hold the whole run back at every synchronization point.

Conclusion: It seems that using only the P-cores for parallel computing yields the fastest results.

That said, the ideal number of processes or threads may vary depending on the number of atoms, the potential used, and the specific LAMMPS commands. Some trial and error is required in each case.
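For that kind of trial and error, a small driver script helps. Here is a rough sketch (the log file names are arbitrary, and the lmp options should be adjusted to match your build, e.g. the -sf gpu flags used above) that sweeps a few rank counts bound to cores and pulls the loop time out of each run:

Terminal window
# Sketch: sweep MPI rank counts bound to cores and compare loop times from the LAMMPS logs.
export OMP_NUM_THREADS=1
for np in 1 2 4 8; do
  mpirun -np $np --map-by ppr:1:core --bind-to core lmp -in in.melt -log log.np$np
  grep "Loop time" log.np$np
done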

