HPC Basics by SIGHPCSYSPROS

Topics

Introduction to HPC
Designing a cluster
Introduction to HPC Storage
Parallel Filesystems
Cluster Stack Basics
Provisioning
Configuration Management
Scheduling and Resource Management
Introduction to Slurm
Monitoring HPC systems and infrastructure components
HPC User support
High speed Networks
Account Management
Lmod
User software management
Node Health Check
Spack
Using IPMI for out-of-band (OOB) management of servers

Advanced Topics

Problems in Scalability
Process pinning
Benchmarking
Developing acceptance tests
Using compliance testing to verify environments
Stateless provisioning
Debugging tools and when to use them

About HPCSYSPROS

To meet the demands of high performance computing (HPC) researchers, large-scale computational and storage machines require many staff members to design, install, and maintain them. These HPC systems professionals include system engineers, system administrators, network administrators, storage administrators, and operations staff, all of whom face problems that are specific to high performance systems.

The ACM SIGHPC SYSPROS chapter intends to be a platform for discussing the unique challenges that come from supporting large-scale, high performance systems. We speak directly to the state of the practice of standing up and operating high performance systems with an emphasis on solutions that can be implemented by systems staff at other institutions.