HPL

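Background

HPL stands for High-Performance Linpack Benchmark. It is a benchmark designed to measure the floating-point performance of high-performance computing (HPC) systems and supercomputer clusters.

The core task of HPL is to solve a dense system of N linear equations in N unknowns using Gaussian elimination (LU factorization with partial pivoting), a workload representative of the large-scale computations common in science and engineering.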
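
For reference, the problem and the reported figure can be written as follows. The operation count is the one the reference HPL code uses to convert run time into Gflops; the symbol \(t\) (wall-clock time of the solve, in seconds) is introduced here for illustration.

\[
Ax = b,\quad A \in \mathbb{R}^{N \times N},\qquad PA = LU \ \text{(row partial pivoting)}
\]

\[
\text{Gflops} = \frac{\tfrac{2}{3}N^{3} + \tfrac{3}{2}N^{2}}{t \times 10^{9}}
\]

With the sample output later in this document (N = 190464, Time = 1118.00 s), this gives roughly 4.12e+03 Gflops, which agrees with the reported value.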

For details, see https://www.netlib.org/benchmark/hpl/index.html

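This task runs on the Paratera (并行科技) computing platform, on the H800 cluster. Runs are limited to a single node with at most 4 GPUs.

Tsinghua students: claim the 500 RMB voucher through the Tsinghua-exclusive Paratera offer, enter 并行智算云, choose “集群” (Cluster) to request cluster resources, and select H800. After logging in, click “SSH” to connect to the cluster.
Log in to the cluster, get the sif file from the cloud-drive link, and download it to your working directory on the cluster (see the cluster usage guide).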

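This document uses the NVIDIA HPC Benchmark Container as the runtime environment and runs the container with Singularity. A pre-converted sif file, hpc-benchmarks_25.02.sif, is provided at https://cloud.tsinghua.edu.cn/f/d1628b26fc924eaca407/

To download it directly into the run directory, run:
```
wget "https://cloud.tsinghua.edu.cn/f/d1628b26fc924eaca407/?dl=1" -O hpc-benchmarks_25.02.sif
```
The same file is used for HPCG. If it is already in your cluster directory, there is no need to download it again.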
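
Since the image may already be present from the HPCG task, the download can be guarded by a check. The following is a minimal sketch; the function name `fetch_if_missing` is hypothetical, and it only tests that a non-empty file exists, not that the file is intact.

```bash
# Download URL ($2) to FILE ($1) unless a non-empty FILE already exists.
fetch_if_missing() {
    if [ -s "$1" ]; then
        echo "$1 already present, skipping download"
    else
        wget "$2" -O "$1"
    fi
}
```

Usage: `fetch_if_missing hpc-benchmarks_25.02.sif "https://cloud.tsinghua.edu.cn/f/d1628b26fc924eaca407/?dl=1"`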

If Singularity needs to be loaded, run:
```
module load singularity
```
Prepare the input file hpl.dat in the current directory. One possible configuration is:
```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
92800 96304  Ns
1            # of NBs
1024         NBs
1            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2 8          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
3 2          BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1 0          DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
192          swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```
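
Each value line is preceded by a count line, and the reference HPL reader takes only that many values from the line, so with "# of problems sizes" set to 1 only the first entry of Ns should be used. A common way to choose N is to let the N x N double-precision matrix fill most of the GPU memory and round down to a multiple of NB. The sketch below does this arithmetic; the function name `pick_n`, the 80 GiB per GPU figure, and the 85% share are assumptions for illustration, so check the actual memory with `nvidia-smi` and tune the share yourself.

```bash
# Largest multiple of NB whose N x N matrix of doubles fits in the
# given share of total GPU memory.
#   $1  memory per GPU in GiB (assumed value, verify on the cluster)
#   $2  number of GPUs
#   $3  block size NB
#   $4  percentage of memory given to the matrix (hypothetical starting point)
pick_n() {
    awk -v gib="$1" -v gpus="$2" -v nb="$3" -v pct="$4" 'BEGIN {
        bytes = gib * gpus * pct / 100 * 1024 * 1024 * 1024
        n = sqrt(bytes / 8)               # 8 bytes per double
        printf "%d\n", int(n / nb) * nb   # round down to a multiple of NB
    }'
}

pick_n 80 1 1024 85   # prints 95232
pick_n 80 4 1024 85   # prints 190464
```

A share that is too large makes the run fail or slow down for lack of memory, so treat the result as an upper starting point rather than a recommended value.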
Next, create a script to run the benchmark. A possible example, run.sh:

```
#!/bin/bash
#----------------------------------------------------------
# SBATCH Directives
#----------------------------------------------------------
#SBATCH --job-name=test
#SBATCH --output=./logs/test_hpl_%j.out
#SBATCH --error=./logs/test_hpl_%j.err
#SBATCH --gpus=1
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

date

module load singularity
module load openmpi/4.1.5_ucx1.17.0_nvhpc24.9_cuda12.4

# Container image and bind mount (host directory:container path)
CONT="$(pwd)/hpc-benchmarks_25.02.sif"
MOUNT="$(pwd):/my-dat-files"

# -np must equal P x Q in hpl.dat
mpirun -np 1 --bind-to none \
        singularity run --nv \
        -B "${MOUNT}" "${CONT}" \
        /workspace/hpl.sh --dat /my-dat-files/hpl.dat
```
Note: the number of processes must equal P × Q in the input file.
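
One way to keep the two in sync is to derive the process count from hpl.dat instead of typing it twice. The following is a sketch under the assumption that the file lists a single process grid in the usual layout, with lines of the form `<P> Ps` and `<Q> Qs`; the function name `hpl_nprocs` is hypothetical.

```bash
# Print P * Q from an hpl.dat that lists a single process grid.
# Assumes the value comes first on the line and the label (Ps / Qs) second.
hpl_nprocs() {
    awk '$2 == "Ps" { p = $1 }
         $2 == "Qs" { q = $1 }
         END        { print p * q }' "$1"
}
```

In run.sh this could be used as `mpirun -np "$(hpl_nprocs hpl.dat)" ...`. The `--gpus` and `--ntasks-per-node` directives would presumably need to match the same number.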

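Submit the script with sbatch from the current directory (see the cluster documentation for sbatch and Slurm usage):
```
mkdir -p logs   # Slurm does not create the directory used by --output/--error
sbatch run.sh
```
Note: some compute nodes in the cluster may have environment problems that cause runs to fail. In that case, resubmit with sbatch's -x (--exclude) option to exclude the affected nodes.

After the job finishes, the result is in the test_hpl_xxx.out file under the logs folder in the run directory.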

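Tuning Tips

Directions worth exploring include, but are not limited to:
- Choice of MPI and other runtime environment components. Different environments can affect performance.
- MPI runtime options and process counts. The number of processes strongly affects efficiency, and combining it with different MPI communication strategies can give different results.
- Parameters of the run script. The official site and related files document the options of the hpl.sh script, which can be adjusted to improve performance.
- Input size and computation strategy. The TUNING file in the top-level HPL directory describes each input parameter in detail. Problem size, block size, and process grid shape, for example, all have a large effect on performance and can be tuned for the best result.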
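
For the process grid, the candidates for a given process count are few enough to list and try one by one. The sketch below prints every grid with P ≤ Q, following the general HPL tuning advice that P and Q be roughly equal with Q slightly larger; the function name `list_grids` is hypothetical.

```bash
# Print every "P Q" pair with P * Q == nprocs and P <= Q, one per line.
list_grids() {
    nprocs="$1"
    p=1
    while [ $((p * p)) -le "$nprocs" ]; do
        if [ $((nprocs % p)) -eq 0 ]; then
            echo "$p $((nprocs / p))"
        fi
        p=$((p + 1))
    done
}

list_grids 4   # prints "1 4" and "2 2"
```

Which shape is faster on a given node is something to measure rather than assume.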

Submission Requirements

Basic Submission

After tuning, run on the GPU and collect the output, which should look similar to:

```
=========================================================
================= NVIDIA HPC Benchmarks =================
=========================================================
NVIDIA Release 25.02
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

...
...

================================================================================
T/V                N    NB     P     Q         Time          Gflops (   per GPU)
--------------------------------------------------------------------------------
WC0           190464  1024     2     2      1118.00       4.120e+03 ( 1.030e+03)

HPL_pdgesv() start time Sat Oct 11 21:03:05 2025
HPL_pdgesv() end time   Sat Oct 11 21:21:43 2025

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   0.000553521720 ...... PASSED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . = 0.0000000163628932
||A||_oo . . . . . . . . . . . . . . . . . . . = 47891.7234336114634061
||x||_oo . . . . . . . . . . . . . . . . . . . = 29.1905095333805278
||b||_oo . . . . . . . . . . . . . . . . . . . = 0.4999915975420884
================================================================================
```

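Note that the *.out file may contain the result of only one problem size. This file must be uploaded to the evaluation site, where it is used to judge correctness and performance.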
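
Before uploading, the log can be checked locally for exactly one result line and one PASSED residual check. The sketch below assumes the output layout of the sample above (a result line starting with a `W...` tag followed by N, NB, P, Q, Time, Gflops); the function name `check_hpl_out` is hypothetical, and the evaluation site may apply different or stricter checks.

```bash
# Verify that an HPL log holds exactly one result line and one PASSED check,
# then print N, time and Gflops from the result line.
check_hpl_out() {
    out_file="$1"
    n_runs=$(grep -c -E '^W[RC][0-9A-Za-z]* +[0-9]+ +[0-9]+ ' "$out_file" || true)
    n_pass=$(grep -c 'PASSED' "$out_file" || true)
    if [ "$n_runs" -ne 1 ] || [ "$n_pass" -ne 1 ]; then
        echo "unexpected: $n_runs result line(s), $n_pass PASSED line(s)" >&2
        return 1
    fi
    awk '/^W[RC]/ && $2 ~ /^[0-9]+$/ {
        print "N=" $2, "time=" $6, "gflops=" $7
    }' "$out_file"
}
```

Usage: `check_hpl_out logs/test_hpl_xxx.out`. For the sample output above it prints `N=190464 time=1118.00 gflops=4.120e+03`.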

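Full Submission

- Optimization report: summarize the optimization process for this task, including problems found, optimization methods, and speedups. Charts and other visual aids are encouraged.
- Configuration files (input parameters, environment configuration, etc.)
- Run scripts (software environment, run parameters, etc.)
- The output log (*.out) from the run
- Any other modified files, with an explanation in the report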

Scoring

- Correct run: 70%
- Performance: 30%, computed as \(0.3 \times \frac{FLOPS_{team}}{FLOPS_{fastest}}\)
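
As a worked illustration with made-up numbers (neither figure is a real result), and assuming a correct run earns the full 70%, a team reaching 3090 Gflops when the fastest team reaches 4120 Gflops would score

\[
0.7 + 0.3 \times \frac{3090}{4120} = 0.7 + 0.3 \times 0.75 = 0.925
\]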

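Notes

- Runs are limited to a single machine with at most 4 GPUs. Runs using more than 4 GPUs receive no performance score.
- The number of processes must equal P × Q in the input file.
- Other versions of the NVIDIA HPC Benchmark Container may be used, but the full procedure must be documented and all related modifications submitted so that validity can be verified.
- The output log must not be modified and must contain the complete original standard output.
- The final submission must include every file required for the full submission above; otherwise it will not be scored.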