Scientific Computing Competence Center – Infrastructure team supporting scientific research


HPC Systems Administrator


· Cluster and Systems Administration: Manage and administer production systems used by researchers.

· Maintain software environments: install and configure open-source and commercial scientific applications.

· Deploy and maintain hardware and/or cloud solutions for scientific research computing. This includes CPU- and GPU-based grid compute, high-speed networking and GPFS data storage.

· Leverage industry standard system monitoring and reporting tools to ensure the maintainability, scalability and availability of the infrastructure environment.

· Provide assistance to researchers in running applications (support/installation/configuration)

· Troubleshoot scheduler submission problems, manage space usage, and assist with user access and Linux command line help. Applications include R, Python, etc.

· Analyze and resolve customer and technical problems: tuning cluster scheduling parameters, memory/CPU contention, and scientific application compilation and run-time issues.

· Analyze the results of server monitoring and implement changes to improve performance, processing and utilization. Propose, maintain and enforce policies, practices and security procedures.

· Provide break/fix support, setup/installation support, escalation support, and solutions support.

· Develop and maintain system documentation as well as user-facing knowledge base articles and how-to guides.

· Maintain the inventory and tracking of HPC computer-related equipment.

· Perform other duties as required by the situation and circumstances.


Education Requirement(s):

· BS degree in Computer Science or MIS

· Or 5+ years of relevant work experience

Skill & Competency Requirements:

· Minimum of 5 years’ experience in managing/administering Linux server environments for scientific computing

· Strong technical skills, including knowledge of cluster management software and administrative processes. The ability to deeply understand application functionality and technical documentation is required.

· Strong verbal and written communication and interpersonal skills

· 3+ years of demonstrated experience using computational clusters, HPC and/or grid computing environments.

· Research experience and RHEL certifications are a plus.

Preferences:

· Must be capable of contributing within a team, exhibit a high level of initiative, and have an eagerness to learn new technologies.

· Demonstrated ability in providing systems administration, HPC cluster and scheduler troubleshooting to a community with diverse computing needs.

· Knowledge of PuppetLabs Puppet management software and cloud computing platforms is desirable but not required

· Candidate must possess advanced knowledge and understanding of Linux server configurations (RHEL, CentOS, or equivalent), including networking and systems scripting (Unix shells such as Bash, Python).

· Knowledge and understanding of security and monitoring software packages, including but not limited to OS, network, and application monitoring (Nagios, Argus, Ganglia) and intrusion detection.

· Understanding of common network protocols such as DHCP, DNS, SMTP and HTTP

· Ability to multitask and prioritize work requirements, keeping team and management informed.

· Excellent interpersonal skills to communicate effectively with cross-functional teams, including technical and non-technical staff at all levels of the organization

· Cloud experience