Which is faster for scientific computing: using only the P-cores, or using both P- and E-cores?

Recent Intel CPUs are split into P-cores and E-cores, which makes me wonder how MPI jobs for scientific computing should be launched: is it faster to run on the P-cores only, or to use the E-cores as well and increase the number of ranks? People doing scientific computing normally use a supercomputer or a Xeon-based workstation, so I may be the only one who cares about this, but let's find out.

Specs

Calculation code

Lammps

Preparation

To find out which logical CPUs are the P-cores, run lscpu --extended.

Terminal window
$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 5500.0000 800.0000 800.1480
1 0 0 0 0:0:0:0 yes 5500.0000 800.0000 871.6810
2 0 0 1 4:4:1:0 yes 5500.0000 800.0000 800.0000
3 0 0 1 4:4:1:0 yes 5500.0000 800.0000 800.0000
4 0 0 2 8:8:2:0 yes 5500.0000 800.0000 800.0000
5 0 0 2 8:8:2:0 yes 5500.0000 800.0000 800.0000
6 0 0 3 12:12:3:0 yes 5500.0000 800.0000 800.0000
7 0 0 3 12:12:3:0 yes 5500.0000 800.0000 800.0000
8 0 0 4 16:16:4:0 yes 5600.0000 800.0000 800.0000
9 0 0 4 16:16:4:0 yes 5600.0000 800.0000 800.0000
10 0 0 5 20:20:5:0 yes 5600.0000 800.0000 800.0000
11 0 0 5 20:20:5:0 yes 5600.0000 800.0000 800.0000
12 0 0 6 24:24:6:0 yes 5500.0000 800.0000 800.0000
13 0 0 6 24:24:6:0 yes 5500.0000 800.0000 800.0000
14 0 0 7 28:28:7:0 yes 5500.0000 800.0000 808.9900
15 0 0 7 28:28:7:0 yes 5500.0000 800.0000 800.0000
16 0 0 8 32:32:8:0 yes 4300.0000 800.0000 800.0000
17 0 0 9 33:33:8:0 yes 4300.0000 800.0000 4300.0010
18 0 0 10 34:34:8:0 yes 4300.0000 800.0000 800.0000
19 0 0 11 35:35:8:0 yes 4300.0000 800.0000 800.0000
20 0 0 12 36:36:9:0 yes 4300.0000 800.0000 800.0000
21 0 0 13 37:37:9:0 yes 4300.0000 800.0000 800.0000
22 0 0 14 38:38:9:0 yes 4300.0000 800.0000 800.0000
23 0 0 15 39:39:9:0 yes 4300.0000 800.0000 800.0000
24 0 0 16 40:40:10:0 yes 4300.0000 800.0000 800.0000
25 0 0 17 41:41:10:0 yes 4300.0000 800.0000 800.0000
26 0 0 18 42:42:10:0 yes 4300.0000 800.0000 800.0000
27 0 0 19 43:43:10:0 yes 4300.0000 800.0000 800.0000

The CPUs with the higher MAXMHZ should be the P-cores: here, CPU 0-15 (CORE 0-7).
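
For reference, a one-liner like the following lists just those P-core logical CPUs (the 5000 MHz threshold is specific to this machine, so adjust it for your CPU); here it prints 0 through 15.

Terminal window
$ lscpu --extended | awk 'NR > 1 && $7 > 5000 {print $1}'

Next, let's check how mpirun binds 8 ranks when mapping one rank per core: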

Terminal window
$ mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
[alpha-tauri:35406] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35406] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35406] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35406] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
hello
hello
hello
hello
hello
hello
hello
hello

So the job can indeed be made to run on the P-cores only.

Terminal window
$ mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings echo 'hello'
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
[alpha-tauri:35826] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35826] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:35826] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:35826] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:35826] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:35826] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:35826] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:35826] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:35826] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:35826] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:35826] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:35826] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:35826] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]
[alpha-tauri:35826] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35826] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35826] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]

So with 20 ranks, all of the P- and E-cores are used.
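
If you want to be explicit rather than rely on the mapping order, Open MPI can also be restricted to the P-core logical CPUs directly. The sketch below is untested here and assumes your Open MPI build supports the --cpu-set option (CPUs 0-15 are the P-core hardware threads on this machine):

Terminal window
$ mpirun -np 8 --cpu-set 0-15 --bind-to core --report-bindings echo 'hello'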

(Bonus) Where OpenMP threads end up

I wanted to check which cores the OpenMP threads actually run on, so I wrote the code below, and then realized it is not very meaningful. It feels like a waste to throw it away, so I am keeping it here anyway.

main.c
#define _GNU_SOURCE
#include <stdio.h>
#include <omp.h>
#include <sched.h>
#include <hwloc.h>

int main() {
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int cpu = sched_getcpu(); // logical CPU this thread is currently running on
        hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();
        hwloc_get_last_cpu_location(topology, cpuset, HWLOC_CPUBIND_THREAD);
        printf("Thread %d is running on CPU %d (bitmap:", tid, cpu);
        for (int i = 0; i <= hwloc_bitmap_last(cpuset); i++) {
            if (hwloc_bitmap_isset(cpuset, i)) {
                printf(" %d", i);
            }
        }
        printf(" )\n");
        hwloc_bitmap_free(cpuset);
    }

    hwloc_topology_destroy(topology);
    return 0;
}

Compilation and execution:

Terminal window
$ sudo apt install hwloc libhwloc-dev
$ gcc -fopenmp main.c -lhwloc -o hello_omp
$ export OMP_NUM_THREADS=28
$ ./hello_omp
Thread 0 is running on CPU 15 (bitmap: 15 )
Thread 1 is running on CPU 17 (bitmap: 17 )
Thread 13 is running on CPU 0 (bitmap: 0 )
Thread 23 is running on CPU 7 (bitmap: 7 )
Thread 16 is running on CPU 6 (bitmap: 6 )
Thread 27 is running on CPU 16 (bitmap: 16 )
Thread 2 is running on CPU 23 (bitmap: 23 )
Thread 14 is running on CPU 4 (bitmap: 4 )
Thread 4 is running on CPU 20 (bitmap: 6 )
Thread 7 is running on CPU 22 (bitmap: 22 )
Thread 8 is running on CPU 24 (bitmap: 24 )
Thread 10 is running on CPU 26 (bitmap: 26 )
Thread 11 is running on CPU 27 (bitmap: 27 )
Thread 12 is running on CPU 2 (bitmap: 2 )
Thread 21 is running on CPU 3 (bitmap: 3 )
Thread 22 is running on CPU 5 (bitmap: 5 )
Thread 17 is running on CPU 10 (bitmap: 10 )
Thread 24 is running on CPU 9 (bitmap: 9 )
Thread 26 is running on CPU 13 (bitmap: 13 )
Thread 20 is running on CPU 1 (bitmap: 1 )
Thread 9 is running on CPU 25 (bitmap: 25 )
Thread 5 is running on CPU 19 (bitmap: 19 )
Thread 15 is running on CPU 8 (bitmap: 8 )
Thread 3 is running on CPU 18 (bitmap: 18 )
Thread 25 is running on CPU 11 (bitmap: 11 )
Thread 18 is running on CPU 12 (bitmap: 12 )
Thread 6 is running on CPU 21 (bitmap: 21 )
Thread 19 is running on CPU 14 (bitmap: 14 )
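
As the output shows, with no binding the threads end up scattered across both P- and E-cores. If you wanted to pin the OpenMP threads to the P-core hardware threads only, the standard OMP_PLACES / OMP_PROC_BIND environment variables should do it; a sketch, assuming CPUs 0-15 are the P-core threads as above:

Terminal window
$ export OMP_PLACES="{0,1}:8:2"   # 8 places of two hardware threads each: {0,1},{2,3},...,{14,15}
$ export OMP_PROC_BIND=close
$ export OMP_NUM_THREADS=8
$ ./hello_omp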

Running Lammps

in.melt

I use the melt input from the examples directory with small changes (a larger cell and more steps). With a 100x100x100 fcc lattice that comes to 4 x 100^3 = 4,000,000 atoms, run for 1,000 steps.

in.melt
# 3d Lennard-Jones melt
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 100 0 100 0 100
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 3.0 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify every 20 delay 0 check no
fix 1 all nve
dump id all atom 100 dump_*.melt
thermo 100
run 1000

Run commands

I tried the following five combinations, looking not only at P-cores vs. E-cores but also at how they mix with thread parallelism.

MPI=1, OMP=28

Terminal window
export OMP_NUM_THREADS=28
lmp -sf gpu -pk gpu 1 -in in.melt

(P-cores only) MPI=1, OMP=1

Terminal window
export OMP_NUM_THREADS=1
mpirun -np 1 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

As shown below, a P-core is assigned.

Terminal window
[alpha-tauri:35977] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]

(P-cores only) MPI=8, OMP=1

Terminal window
export OMP_NUM_THREADS=1
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

As shown below, P-cores are assigned.

Terminal window
[alpha-tauri:35997] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:35997] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:35997] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:35997] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:35997] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]

(P-cores only) MPI=8, OMP=2

Terminal window
export OMP_NUM_THREADS=2
mpirun -np 8 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

As shown below, P-cores are assigned.

Terminal window
[alpha-tauri:36060] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36060] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36060] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36060] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36060] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]

(P + E cores) MPI=20, OMP=1

Terminal window
export OMP_NUM_THREADS=1
mpirun -np 20 --map-by ppr:1:core --bind-to core --report-bindings lmp -sf gpu -pk gpu 1 -in in.melt

As shown below, both P-cores and E-cores are assigned.

Terminal window
[alpha-tauri:36257] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../.././././././././././././.]
[alpha-tauri:36257] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../.././././././././././././.]
[alpha-tauri:36257] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/.././././././././././././.]
[alpha-tauri:36257] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/./././././././././././.]
[alpha-tauri:36257] MCW rank 8 bound to socket 0[core 8[hwt 0]]: [../../../../../../../../B/././././././././././.]
[alpha-tauri:36257] MCW rank 9 bound to socket 0[core 9[hwt 0]]: [../../../../../../../.././B/./././././././././.]
[alpha-tauri:36257] MCW rank 10 bound to socket 0[core 10[hwt 0]]: [../../../../../../../../././B/././././././././.]
[alpha-tauri:36257] MCW rank 11 bound to socket 0[core 11[hwt 0]]: [../../../../../../../.././././B/./././././././.]
[alpha-tauri:36257] MCW rank 12 bound to socket 0[core 12[hwt 0]]: [../../../../../../../../././././B/././././././.]
[alpha-tauri:36257] MCW rank 13 bound to socket 0[core 13[hwt 0]]: [../../../../../../../.././././././B/./././././.]
[alpha-tauri:36257] MCW rank 14 bound to socket 0[core 14[hwt 0]]: [../../../../../../../../././././././B/././././.]
[alpha-tauri:36257] MCW rank 15 bound to socket 0[core 15[hwt 0]]: [../../../../../../../.././././././././B/./././.]
[alpha-tauri:36257] MCW rank 16 bound to socket 0[core 16[hwt 0]]: [../../../../../../../../././././././././B/././.]
[alpha-tauri:36257] MCW rank 17 bound to socket 0[core 17[hwt 0]]: [../../../../../../../.././././././././././B/./.]
[alpha-tauri:36257] MCW rank 18 bound to socket 0[core 18[hwt 0]]: [../../../../../../../../././././././././././B/.]
[alpha-tauri:36257] MCW rank 19 bound to socket 0[core 19[hwt 0]]: [../../../../../../../.././././././././././././B]

Results

The numbers reported in log.lammps are summarized below.

MPI/OMP   Loop Time   tau/day     timesteps/s   Matom-step/s   Total wall time
1/28      48.7217      8866.691   20.525         82.099        0:00:51
1/1       65.5721      6588.166   15.250         61.002        0:01:07
8/1       25.7729     16761.781   38.800        155.202        0:00:26
8/2       25.5147     16931.387   39.193        156.772        0:00:26
20/1      26.3222     16411.973   37.991        151.963        0:00:28
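
For reference, these numbers come from the timing summary at the end of each log (the Loop time, Performance, and Total wall time lines). Assuming each run was given its own log file, a grep like this pulls them out in one go:

Terminal window
$ grep -E "Loop time|Performance|Total wall time" log.*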

Process parallelism (8/1 or 20/1) turned out to be far faster than thread parallelism (1/28). I cannot vouch for the explanation, but according to ChatGPT the reasons are roughly as follows.

Advantages of MPI processes

  • Standard LAMMPS is optimized for MPI, so MPI parallelism is efficient.
  • Compared with OpenMP, cache locality of the data is better and there is less contention between threads.
  • Force calculations and the like can be fully decomposed, so they parallelize efficiently with MPI.

Challenges of OpenMP (thread) parallelism

  • Unless the code paths are explicitly optimized via LAMMPS's USER-OMP (OPENMP) or KOKKOS packages, inter-thread overhead grows (see the sketch after this list).
  • Thread parallelization covers only parts of the code, so the overall speedup is limited (Amdahl's law).
  • OpenMP threads share memory, so memory bandwidth and cache contention easily become the bottleneck.
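
For completeness, if the LAMMPS binary was built with the OPENMP (formerly USER-OMP) package, the thread-optimized styles can be enabled with the omp suffix, which might narrow the gap. A sketch, not tested here, with the omp suffix in place of the gpu suffix used above:

Terminal window
export OMP_NUM_THREADS=2
mpirun -np 8 --map-by ppr:1:core --bind-to core lmp -sf omp -pk omp 2 -in in.melt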

To be fair, comparing 1/28 with 1/1 shows that thread parallelism does give some speedup, but it is underwhelming.

The 8-process runs were the fastest. In that case it made no difference whether 1 or 2 OpenMP threads were used.

The 20-process run took slightly longer than the 8-process one. The E-cores are presumably dragging it down.

In short, running the parallel calculation on the P-cores only gets the job done faster.

That said, the best process count and where thread parallelism pays off depend on the number of atoms, the potential, and the commands used, so some trial and error is needed each time.

