HPC Systems Administrator

Overview

The Centre for Modelling & Simulation (CFMS) has an exciting opportunity for an HPC Systems Administrator (HPCSA) looking to further develop their career. CFMS is an independent, not-for-profit specialist in digital engineering capability. It provides a managed environment that combines High Performance Computing (HPC) facilities with expertise in Modelling and Simulation (M&S), offering a research-supportive setting that enables organisations to increase their competitive advantage. We seek an enthusiastic and self-motivated individual to join our team.

Job Purpose

CFMS’ Engineering Computing Services (ECS) team provides infrastructure and technical systems to internal stakeholders and external clients, making use of leading-edge scientific computing services including:

●      Two HPC clusters totalling ~5000 cores

●      OpenStack private cloud systems

●      High-performance Ethernet networking

●      General purpose business computing

The HPCSA role is to support and develop the technical running of the company’s HPC systems in alignment with its business objectives. The role will support the Head of ECS in maintaining and operating the company’s existing systems, and in developing strategies for running HPC applications on external systems and public cloud.

The HPCSA will be exposed to a wide variety of exciting and leading-edge technologies and systems, with many opportunities to learn new skills. The HPCSA may be given a task with limited input as to the specific technology that should be used, and should be comfortable researching new tools and working without direct supervision.

Key Responsibilities

The HPCSA will take a lead role in the day-to-day running of CFMS HPC systems, including:

●      Deploying new applications and tools to the clusters

●      Planning and implementing OS and software upgrades to the HPC clusters

●      Developing and implementing workflows and solutions for CFMS use

●      Working with users one-on-one to solve their access and application problems

●      Monitoring and tuning system performance

The HPCSA will also take a lead role in developing the CFMS cloud HPC strategy:

●      Developing and supporting deployment of CFMS applications and workloads onto external HPC systems

●      Developing strategies for deploying traditional and modern HPC workloads to public cloud environments

Key Relationships

The HPCSA will work closely with the Head of ECS and the ECS team, but also with the wider CFMS technical team and directly with client end users.

Profile

The successful candidate will be a confident and motivated individual who is passionate about all things HPC and Linux, keen to learn new technologies, and able to take a project from an abstract concept through to completion with minimal input.

Experience

●      A-levels, with minimum grade B (Science and/or maths essential)

●      A degree in a technical subject is desirable/beneficial

●      Minimum of three years’ HPC Systems Administrator experience

Skills, Knowledge and Competencies

Essential:

●      Minimum of three years’ Red Hat or CentOS Linux system administration in an HPC environment

●      Demonstrable experience of administering, managing and installing Linux-based HPC systems and technologies:

o  xCAT

o  IBM GPFS/Spectrum Scale

o  Slurm

o  InfiniBand

●      Knowledge of HPC software packaging and deployment tools, including building software from source:

o  Git

o  GCC/Intel Cluster Studio

o  Variety of build/install tools (e.g. CMake, Make, Autoconf)

●      Ansible

Desirable:

●      Lustre/BeeGFS

●      DevOps methodologies (including CI/CD)

●      Docker/Podman/Kubernetes

●      Automation/Orchestration of private/public cloud (Packer/Terraform)

●      IP/Ethernet networking

 

To apply for this position, send a curriculum vitae and covering letter to careers@cfms.org.uk.

Direct applications only - no agencies

For the full job description, please download the file below.

 

 

 
