Modern Intel CPUs are split into Performance cores (P-cores) and Efficient cores (E-cores), so I wanted to investigate the best way to run MPI-based scientific computations. Would it be faster to use only the P-cores, or is it better to include E-cores for higher parallelism? While most scientists probably use supercomputers or workstations with Xeon CPUs and don’t worry about this, I decided to test it out myself.
System Specs
- MB: ASUS ROG STRIX B760-I GAMING WIFI
- CPU: Intel Core i7 14700K
- Memory: 64GB
- GPU: RTX4080 Super
Computational Code
Preparation
To identify which cores are P-cores, run lscpu --extended.
```
$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 5500.0000 800.0000  800.1480
  1    0      0    0 0:0:0:0          yes 5500.0000 800.0000  871.6810
  2    0      0    1 4:4:1:0          yes 5500.0000 800.0000  800.0000
  3    0      0    1 4:4:1:0          yes 5500.0000 800.0000  800.0000
  4    0      0    2 8:8:2:0          yes 5500.0000 800.0000  800.0000
  5    0      0    2 8:8:2:0          yes 5500.0000 800.0000  800.0000
  6    0      0    3 12:12:3:0        yes 5500.0000 800.0000  800.0000
  7    0      0    3 12:12:3:0        yes 5500.0000 800.0000  800.0000
  8    0      0    4 16:16:4:0        yes 5600.0000 800.0000  800.0000
  9    0      0    4 16:16:4:0        yes 5600.0000 800.0000  800.0000
 10    0      0    5 20:20:5:0        yes 5600.0000 800.0000  800.0000
 11    0      0    5 20:20:5:0        yes 5600.0000 800.0000  800.0000
 12    0      0    6 24:24:6:0        yes 5500.0000 800.0000  800.0000
 13    0      0    6 24:24:6:0        yes 5500.0000 800.0000  800.0000
 14    0      0    7 28:28:7:0        yes 5500.0000 800.0000  808.9900
 15    0      0    7 28:28:7:0        yes 5500.0000 800.0000  800.0000
 16    0      0    8 32:32:8:0        yes 4300.0000 800.0000  800.0000
 17    0      0    9 33:33:8:0        yes 4300.0000 800.0000 4300.0010
 18    0      0   10 34:34:8:0        yes 4300.0000 800.0000  800.0000
 19    0      0   11 35:35:8:0        yes 4300.0000 800.0000  800.0000
 20    0      0   12 36:36:9:0        yes 4300.0000 800.0000  800.0000
 21    0      0   13 37:37:9:0        yes 4300.0000 800.0000  800.0000
 22    0      0   14 38:38:9:0        yes 4300.0000 800.0000  800.0000
 23    0      0   15 39:39:9:0        yes 4300.0000 800.0000  800.0000
 24    0      0   16 40:40:10:0       yes 4300.0000 800.0000  800.0000
 25    0      0   17 41:41:10:0       yes 4300.0000 800.0000  800.0000
 26    0      0   18 42:42:10:0       yes 4300.0000 800.0000  800.0000
 27    0      0   19 43:43:10:0       yes 4300.0000 800.0000  800.0000
```

The cores with higher MAXMHZ are likely the P-cores. In my case, CPUs 0–15 (CORE 0–7) are the P-cores.
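If you'd rather not read that table by eye, the list of P-core CPU IDs can be extracted with a one-liner. This is a sketch that assumes P-cores are exactly the logical CPUs with MAXMHZ above 5000 on this machine; on a different hybrid CPU the threshold would need adjusting (recent kernels may also expose the same information directly in /sys/devices/cpu_core/cpus).

```
# Print the logical CPUs whose max frequency exceeds 5000 MHz (the P-cores here: 0-15)
# as a comma-separated list, handy for affinity settings later.
lscpu --extended=CPU,CORE,MAXMHZ | awk 'NR > 1 && $3 + 0 > 5000 { print $1 }' | paste -sd, -
```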
I confirmed that it’s possible to run simulations using only P-cores.
```
$ mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
[alpha-tauri:35406] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35406] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35406] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
hello
hello
hello
hello
hello
hello
hello
hello
```

I also confirmed that in some settings, both P and E-cores are used.
```
$ mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
[alpha-tauri:35826] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35826] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:35826] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:35826] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:35826] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:35826] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:35826] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:35826] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:35826] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:35826] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:35826] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:35826] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:35826] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]
[alpha-tauri:35826] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35826] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
```
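--report-bindings only shows what the launcher intended, so for longer runs it is reassuring to also check where the processes actually end up. A minimal sketch, assuming the program appears under a known process name (here lmp, the LAMMPS binary used later in this post; substitute whatever you are running):

```
# PSR is the logical CPU each thread last ran on; on this machine 0-15 are P-cores,
# 16-27 are E-cores. Sort by that column to spot any ranks sitting on E-cores.
ps -C lmp -L -o pid,tid,psr,comm | sort -n -k3
```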
(Appendix) Thread Allocation Test

I wanted to see which cores threads were being scheduled on, so I wrote a small code snippet to test this. It turned out to be not very useful, but I'm including it here anyway, since it seems a waste to just throw it out.
```
#define _GNU_SOURCE
#include <stdio.h>
#include <omp.h>
#include <sched.h>
#include <hwloc.h>

int main() {
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int cpu = sched_getcpu();  // logical CPU this thread is currently running on

        hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();
        hwloc_get_last_cpu_location(topology, cpuset, HWLOC_CPUBIND_THREAD);

        printf("Thread %d is running on CPU %d (bitmap:", tid, cpu);
        for (int i = 0; i <= hwloc_bitmap_last(cpuset); i++) {
            if (hwloc_bitmap_isset(cpuset, i)) {
                printf(" %d", i);
            }
        }
        printf(" )\n");

        hwloc_bitmap_free(cpuset);
    }

    hwloc_topology_destroy(topology);
    return 0;
}
```

Everything from compilation to execution is shown below.
```
$ sudo apt install hwloc libhwloc-dev
$ gcc -fopenmp main.c -lhwloc -o hello_omp
$ export OMP_NUM_THREADS=28
$ ./hello_omp
Thread 0 is running on CPU 15 (bitmap: 15 )
Thread 1 is running on CPU 17 (bitmap: 17 )
Thread 13 is running on CPU 0 (bitmap: 0 )
Thread 23 is running on CPU 7 (bitmap: 7 )
Thread 16 is running on CPU 6 (bitmap: 6 )
Thread 27 is running on CPU 16 (bitmap: 16 )
Thread 2 is running on CPU 23 (bitmap: 23 )
Thread 14 is running on CPU 4 (bitmap: 4 )
Thread 4 is running on CPU 20 (bitmap: 6 )
Thread 7 is running on CPU 22 (bitmap: 22 )
Thread 8 is running on CPU 24 (bitmap: 24 )
Thread 10 is running on CPU 26 (bitmap: 26 )
Thread 11 is running on CPU 27 (bitmap: 27 )
Thread 12 is running on CPU 2 (bitmap: 2 )
Thread 21 is running on CPU 3 (bitmap: 3 )
Thread 22 is running on CPU 5 (bitmap: 5 )
Thread 17 is running on CPU 10 (bitmap: 10 )
Thread 24 is running on CPU 9 (bitmap: 9 )
Thread 26 is running on CPU 13 (bitmap: 13 )
Thread 20 is running on CPU 1 (bitmap: 1 )
Thread 9 is running on CPU 25 (bitmap: 25 )
Thread 5 is running on CPU 19 (bitmap: 19 )
Thread 15 is running on CPU 8 (bitmap: 8 )
Thread 3 is running on CPU 18 (bitmap: 18 )
Thread 25 is running on CPU 11 (bitmap: 11 )
Thread 18 is running on CPU 12 (bitmap: 12 )
Thread 6 is running on CPU 21 (bitmap: 21 )
Thread 19 is running on CPU 14 (bitmap: 14 )
```
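As the output shows, the OpenMP runtime left to its own devices scatters threads across both core types. For completeness, here is a minimal sketch of pinning the threads to the P-cores only, using standard OpenMP environment variables; the explicit place list assumes the CPU numbering shown earlier (one place per physical P-core) and was not part of the original test.

```
export OMP_NUM_THREADS=8
export OMP_PLACES="{0},{2},{4},{6},{8},{10},{12},{14}"  # first hardware thread of each P-core
export OMP_PROC_BIND=close                              # keep thread i on place i
./hello_omp
```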
Running LAMMPS

in.melt
I used the melt example from the LAMMPS examples directory, with slight modifications (increased cell size and number of steps).
```
# 3d Lennard-Jones melt

units           lj
atom_style      atomic

lattice         fcc 0.8442
region          box block 0 100 0 100 0 100
create_box      1 box
create_atoms    1 box
mass            1 1.0

velocity        all create 3.0 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin
neigh_modify    every 20 delay 0 check no

fix             1 all nve

dump            id all atom 100 dump_*.melt

thermo          100
run             1000
```
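A quick sanity check on the problem size: an fcc lattice has 4 atoms per unit cell and the region above spans 100 x 100 x 100 lattice cells, so the box contains about 4 million atoms. That is also why the Matom-step/s column in the results below is roughly 4x the timesteps/s column.

```
# 4 atoms per fcc unit cell x 100^3 cells
echo $((4 * 100 * 100 * 100))   # 4000000 atoms
```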
Execution Commands

I tested the following five configurations. The aim was to compare not only P vs. P+E cores, but also thread-based and process-based parallelism.
MPI=1, OMP=28
```
export OMP_NUM_THREADS=28
lmp -sf gpu -pk gpu 1 -in in.melt
```

(P-cores only) MPI=1, OMP=1
```
export OMP_NUM_THREADS=1
mpirun -np 1 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```

Confirmed that only P-cores were allocated.
```
[alpha-tauri:35977] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
```

(P-cores only) MPI=8, OMP=1
```
export OMP_NUM_THREADS=1
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```

Confirmed that only P-cores were allocated.
```
[alpha-tauri:35997] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35997] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:35997] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35997] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
```

(P-cores only) MPI=8, OMP=2
```
export OMP_NUM_THREADS=2
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```

Confirmed that only P-cores were allocated.
```
[alpha-tauri:36060] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36060] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36060] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36060] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
```

(P+E-cores) MPI=20, OMP=1
```
export OMP_NUM_THREADS=1
mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```

Confirmed that both P and E-cores were allocated.
```
[alpha-tauri:36257] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36257] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36257] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36257] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:36257] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:36257] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:36257] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:36257] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:36257] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:36257] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:36257] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:36257] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:36257] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:36257] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:36257] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]
```

Results
Below is a summary of the output from log.lammps.
| MPI/OMP | Loop Time (s) | tau/day | timesteps/s | Matom-step/s | Total wall time |
|---|---|---|---|---|---|
| 1/28 | 48.7217 | 8866.691 | 20.525 | 82.099 | 0:00:51 |
| 1/1 | 65.5721 | 6588.166 | 15.250 | 61.002 | 0:01:07 |
| 8/1 | 25.7729 | 16761.781 | 38.800 | 155.202 | 0:00:26 |
| 8/2 | 25.5147 | 16931.387 | 39.193 | 156.772 | 0:00:26 |
| 20/1 | 26.3222 | 16411.973 | 37.991 | 151.963 | 0:00:28 |
The results show that process-based parallelism (e.g., 8/1 or 20/1) is significantly faster than thread-based (e.g., 1/28). According to ChatGPT, here are some possible reasons:
Advantages of MPI parallelism:
- Standard LAMMPS is optimized for MPI, so it benefits more from process-based parallelism.
- Better cache locality and less contention compared to threads.
- Tasks like force calculations can be efficiently and completely divided across processes.
Challenges with OpenMP (thread) parallelism:
- Unless explicitly optimized using packages like USER-OMP or KOKKOS, thread overhead can increase (see the sketch after this list).
- Speedup is limited due to partial parallelism (Amdahl’s Law).
- Threads share memory, making memory bandwidth and cache contention a bottleneck.
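For reference, here is roughly what that "explicitly optimized" threaded route would look like. This is a sketch only and was not benchmarked here; it assumes the lmp binary was built with the OPENMP package (formerly USER-OMP).

```
export OMP_NUM_THREADS=8
# -sf omp switches pair/fix styles to their /omp variants, -pk omp 8 sets the thread count
lmp -sf omp -pk omp 8 -in in.melt
```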
Comparing 1/28 with 1/1, the 28 threads did give a speedup, but only about 1.3x on loop time, far less than what the process-based runs achieved.
The 8-process configuration gave the fastest result. There was no significant difference between OMP=1 and OMP=2 in this case.
The 20-process configuration was slightly slower than the 8-process one, likely because the E-cores drag it down: the work is divided evenly across ranks, so the ranks bound to the slower E-cores finish their share last and the P-cores end up waiting at each synchronization point.
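One way to quantify how much the E-cores hold things back, which I did not try here, would be to time the same input on the 12 E-cores alone. A sketch, with the caveat that whether mpirun honors an externally set affinity mask depends on the Open MPI version, so keep --report-bindings on to verify:

```
export OMP_NUM_THREADS=1
# Logical CPUs 16-27 are the E-cores on this machine
taskset -c 16-27 mpirun -np 12 --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt
```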
Conclusion: It seems that using only the P-cores for parallel computing yields the fastest results.
That said, the ideal number of processes or threads may vary depending on the number of atoms, the potential used, and the specific LAMMPS commands. Some trial and error is required in each case.
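As one way to make that trial and error cheap to repeat, a small sweep over the rank count along these lines (a sketch reusing the same binary and input as above) collects the loop times in one go:

```
export OMP_NUM_THREADS=1
for np in 1 2 4 8 16 20; do
  # Write each run's log to its own file, then pull out the timing summary line
  mpirun -np "$np" --map-by ppr:1:core --bind-to core \
    lmp -sf gpu -pk gpu 1 -in in.melt -log log.np"$np"
  grep 'Loop time' log.np"$np"
done
```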