Recent Intel CPUs are split into P-cores and E-cores, which makes me wonder how MPI jobs should be launched for scientific computing: is it faster to run on the P-cores only, or to use the E-cores as well and raise the number of ranks? People doing scientific computing usually work on supercomputers or Xeon-based workstations, so I may be the only one who cares about this, but let's try it.
Specs
- MB: ASUS ROG STRIX B760-I GAMING WIFI
- CPU: Intel Core i7-14700K
- Memory: 64 GB
- GPU: RTX 4080 Super
Calculation code

Preparation

To find out which logical CPUs are the P-cores, run lscpu --extended.
$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 5500.0000 800.0000 800.1480
1 0 0 0 0:0:0:0 yes 5500.0000 800.0000 871.6810
2 0 0 1 4:4:1:0 yes 5500.0000 800.0000 800.0000
3 0 0 1 4:4:1:0 yes 5500.0000 800.0000 800.0000
4 0 0 2 8:8:2:0 yes 5500.0000 800.0000 800.0000
5 0 0 2 8:8:2:0 yes 5500.0000 800.0000 800.0000
6 0 0 3 12:12:3:0 yes 5500.0000 800.0000 800.0000
7 0 0 3 12:12:3:0 yes 5500.0000 800.0000 800.0000
8 0 0 4 16:16:4:0 yes 5600.0000 800.0000 800.0000
9 0 0 4 16:16:4:0 yes 5600.0000 800.0000 800.0000
10 0 0 5 20:20:5:0 yes 5600.0000 800.0000 800.0000
11 0 0 5 20:20:5:0 yes 5600.0000 800.0000 800.0000
12 0 0 6 24:24:6:0 yes 5500.0000 800.0000 800.0000
13 0 0 6 24:24:6:0 yes 5500.0000 800.0000 800.0000
14 0 0 7 28:28:7:0 yes 5500.0000 800.0000 808.9900
15 0 0 7 28:28:7:0 yes 5500.0000 800.0000 800.0000
16 0 0 8 32:32:8:0 yes 4300.0000 800.0000 800.0000
17 0 0 9 33:33:8:0 yes 4300.0000 800.0000 4300.0010
18 0 0 10 34:34:8:0 yes 4300.0000 800.0000 800.0000
19 0 0 11 35:35:8:0 yes 4300.0000 800.0000 800.0000
20 0 0 12 36:36:9:0 yes 4300.0000 800.0000 800.0000
21 0 0 13 37:37:9:0 yes 4300.0000 800.0000 800.0000
22 0 0 14 38:38:9:0 yes 4300.0000 800.0000 800.0000
23 0 0 15 39:39:9:0 yes 4300.0000 800.0000 800.0000
24 0 0 16 40:40:10:0 yes 4300.0000 800.0000 800.0000
25 0 0 17 41:41:10:0 yes 4300.0000 800.0000 800.0000
26 0 0 18 42:42:10:0 yes 4300.0000 800.0000 800.0000
27 0 0 19 43:43:10:0 yes 4300.0000 800.0000 800.0000
The entries with the higher MAXMHZ should be the P-cores. Here that means CPU 0-15 (CORE 0-7).
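As a cross-check, recent kernels also expose the two core types directly under sysfs on hybrid Intel CPUs; a minimal sketch, assuming those sysfs nodes exist on your kernel:

# Cross-check of the lscpu reading (assumes a kernel with hybrid-CPU support):
# the kernel lists each core type's logical CPUs under sysfs.
cat /sys/devices/cpu_core/cpus   # P-core logical CPUs (expected here: 0-15)
cat /sys/devices/cpu_atom/cpus   # E-core logical CPUs (expected here: 16-27)

If those files are missing, the MAXMHZ column above is enough to tell the two apart.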
$ mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
[alpha-tauri:35406] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35406] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35406] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
hello
hello
hello
hello
hello
hello
hello
hello
This shows that the job can be made to run on the P-cores only.
$ mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
[alpha-tauri:35826] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35826] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:35826] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:35826] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:35826] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:35826] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:35826] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:35826] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:35826] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:35826] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:35826] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:35826] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:35826] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]
[alpha-tauri:35826] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35826] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
This shows that all of the P-cores and E-cores are being used.
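If you want to be explicit about which logical CPUs the ranks may use, rather than relying on the core numbering order, restricting the CPU set should also work. A minimal sketch, assuming logical CPUs 0-15 are the P-cores as found above (I did not benchmark this variant):

# Sketch: hand Open MPI only the P-core logical CPUs so ranks cannot land on E-cores
mpirun -np 8 --cpu-set 0-15 --bind-to core --report-bindings echo 'hello'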
(Bonus) Checking where OpenMP threads run

I wanted to see which cores the OpenMP threads end up on, so I wrote the code below, only to realize it is not all that meaningful. It felt like a waste to throw it away, so I am keeping it here anyway.
#define _GNU_SOURCE
#include <stdio.h>
#include <omp.h>
#include <sched.h>
#include <hwloc.h>

int main() {
    // Build the hardware topology once; it is shared by all threads
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int cpu = sched_getcpu();  // logical CPU this thread is currently running on

        // Ask hwloc where this thread last ran (may differ if the scheduler migrated it)
        hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();
        hwloc_get_last_cpu_location(topology, cpuset, HWLOC_CPUBIND_THREAD);

        printf("Thread %d is running on CPU %d (bitmap:", tid, cpu);
        for (int i = 0; i <= hwloc_bitmap_last(cpuset); i++) {
            if (hwloc_bitmap_isset(cpuset, i)) {
                printf(" %d", i);
            }
        }
        printf(" )\n");

        hwloc_bitmap_free(cpuset);
    }

    hwloc_topology_destroy(topology);
    return 0;
}

Compilation and execution are shown below.
$ sudo apt install hwloc libhwloc-dev
$ gcc -fopenmp main.c -lhwloc -o hello_omp
$ export OMP_NUM_THREADS=28
$ ./hello_omp
Thread 0 is running on CPU 15 (bitmap: 15 )
Thread 1 is running on CPU 17 (bitmap: 17 )
Thread 13 is running on CPU 0 (bitmap: 0 )
Thread 23 is running on CPU 7 (bitmap: 7 )
Thread 16 is running on CPU 6 (bitmap: 6 )
Thread 27 is running on CPU 16 (bitmap: 16 )
Thread 2 is running on CPU 23 (bitmap: 23 )
Thread 14 is running on CPU 4 (bitmap: 4 )
Thread 4 is running on CPU 20 (bitmap: 6 )
Thread 7 is running on CPU 22 (bitmap: 22 )
Thread 8 is running on CPU 24 (bitmap: 24 )
Thread 10 is running on CPU 26 (bitmap: 26 )
Thread 11 is running on CPU 27 (bitmap: 27 )
Thread 12 is running on CPU 2 (bitmap: 2 )
Thread 21 is running on CPU 3 (bitmap: 3 )
Thread 22 is running on CPU 5 (bitmap: 5 )
Thread 17 is running on CPU 10 (bitmap: 10 )
Thread 24 is running on CPU 9 (bitmap: 9 )
Thread 26 is running on CPU 13 (bitmap: 13 )
Thread 20 is running on CPU 1 (bitmap: 1 )
Thread 9 is running on CPU 25 (bitmap: 25 )
Thread 5 is running on CPU 19 (bitmap: 19 )
Thread 15 is running on CPU 8 (bitmap: 8 )
Thread 3 is running on CPU 18 (bitmap: 18 )
Thread 25 is running on CPU 11 (bitmap: 11 )
Thread 18 is running on CPU 12 (bitmap: 12 )
Thread 6 is running on CPU 21 (bitmap: 21 )
Thread 19 is running on CPU 14 (bitmap: 14 )
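The scattered placement above is simply the kernel scheduler putting unbound threads wherever it likes, E-cores included. If you want the OpenMP threads themselves confined to the P-cores, the standard OpenMP affinity variables can do it; a minimal sketch, assuming the logical-CPU numbering found earlier (CPUs 0-15 = P-cores):

# Sketch: confine OpenMP threads to the 16 P-core hardware threads (logical CPUs 0-15)
export OMP_NUM_THREADS=16
export OMP_PLACES="{0}:16"   # expands to the places {0},{1},...,{15}
export OMP_PROC_BIND=close   # bind threads to those places in order
./hello_omp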
Running LAMMPS

in.melt
I use the melt input from the LAMMPS examples directory, lightly modified (larger box and more timesteps).
# 3d Lennard-Jones melt
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 100 0 100 0 100
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 3.0 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify every 20 delay 0 check no
fix 1 all nve
dump id all atom 100 dump_*.melt
thermo 100
run 1000

Run commands
I tried the five combinations below. Besides comparing P-cores with E-cores, this also covers combinations with thread (OpenMP) parallelism.
MPI=1, OMP=28
export OMP_NUM_THREADS=28
lmp -sf gpu -pk gpu 1 -in in.melt

(P-cores only) MPI=1, OMP=1
export OMP_NUM_THREADS=1
mpirun -np 1 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

As shown below, a P-core is assigned.
[alpha-tauri:35977] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]

(P-cores only) MPI=8, OMP=1
export OMP_NUM_THREADS=1
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

As shown below, the P-cores are assigned.
[alpha-tauri:35997] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35997] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:35997] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35997] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]

(P-cores only) MPI=8, OMP=2
export OMP_NUM_THREADS=2
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

As shown below, the P-cores are assigned.
[alpha-tauri:36060] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36060] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36060] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36060] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]

(P and E cores) MPI=20, OMP=1
export OMP_NUM_THREADS=1
mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

As shown below, both P-cores and E-cores are assigned.
[alpha-tauri:36257] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36257] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36257] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36257] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:36257] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:36257] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:36257] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:36257] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:36257] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:36257] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:36257] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:36257] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:36257] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:36257] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:36257] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]

Results
The results written to log.lammps are summarized below (they come from the Loop time, Performance, and Total wall time lines; a grep sketch follows the table).
| MPI/OMP | Loop time (s) | tau/day | timesteps/s | Matom-step/s | Total wall time (h:mm:ss) |
|---|---|---|---|---|---|
| 1/28 | 48.7217 | 8866.691 | 20.525 | 82.099 | 0:00:51 |
| 1/1 | 65.5721 | 6588.166 | 15.250 | 61.002 | 0:01:07 |
| 8/1 | 25.7729 | 16761.781 | 38.800 | 155.202 | 0:00:26 |
| 8/2 | 25.5147 | 16931.387 | 39.193 | 156.772 | 0:00:26 |
| 20/1 | 26.3222 | 16411.973 | 37.991 | 151.963 | 0:00:28 |
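For reference, the numbers above can be pulled straight out of each run's log; a small sketch, assuming the standard LAMMPS log layout mentioned above:

# Sketch: extract the timing summary from a finished run's log
grep -E "Loop time|Performance|Total wall time" log.lammps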
Process parallelism (8/1 and 20/1) turned out to be much faster than thread parallelism (1/28), with roughly 1.9x shorter loop time. I don't know how trustworthy this is, but according to ChatGPT the reasons are along the following lines.
Advantages of MPI processes
- Standard LAMMPS is optimized for MPI, so MPI parallelism is efficient.
- Compared with OpenMP, data cache locality is better and there is less contention between threads.
- Work such as the force computation can be fully decomposed, so it parallelizes efficiently under MPI.
Issues with OpenMP (thread) parallelism
- Unless the run is explicitly optimized through LAMMPS's USER-OMP (OPENMP) or KOKKOS packages, the overhead between threads grows.
- Threading only parallelizes parts of the code, so the overall speedup is limited (Amdahl's law).
- OpenMP shares memory between threads, so memory bandwidth and cache contention easily become bottlenecks.
To be fair, comparing 1/28 with 1/1 shows that threading does help a little, but the gain is underwhelming.
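Note that the runs above enabled the GPU package (-sf gpu). If one instead wanted the CPU threads to carry the pair computation through the OPENMP (formerly USER-OMP) package mentioned above, the invocation would look roughly like this; a sketch, assuming the lmp binary was built with that package (not benchmarked here):

# Sketch (not benchmarked): route pair/neighbor work through the OPENMP accelerator package
export OMP_NUM_THREADS=2
mpirun -np 8 --map-by ppr:1:core --bind-to core lmp -sf omp -pk omp 2 -in in.melt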
Eight MPI processes was the fastest; in that configuration it made no difference whether OMP_NUM_THREADS was 1 or 2.
Twenty processes took slightly longer than eight, so the E-cores do seem to be dragging the run down.
In short, the parallel run finishes faster when it uses the P-cores only.
That said, the best process count and the places where thread parallelism pays off depend on the number of atoms, the potential, and the commands used, so some trial and error is needed each time.