Scientific Computing Competence Center – Infrastructure team supporting scientific research
HPC Systems Administrator
POSITION DUTIES & RESPONSIBILITIES
· Cluster and Systems Administration: Manage and administer production systems used by researchers.
· Maintenance of software environments: effective installation and configuration of open-source and commercial scientific applications.
· Deploy and maintain hardware and/or cloud solutions for scientific research computing, including CPU- and GPU-based grid compute, high-speed networking, and GPFS data storage.
· Leverage industry standard system monitoring and reporting tools to ensure the maintainability, scalability and availability of the infrastructure environment.
· Provide assistance to researchers in running applications (support/installation/configuration)
· Troubleshoot scheduler submission problems, manage space usage, and assist with user access and Linux command line help. Applications include R, Python, etc.
· Analyze and resolve customer and technical problems, such as cluster scheduling parameter tuning, memory/CPU contention, and scientific application compilation and run-time issues.
· Analyze the results of server monitoring and implement changes to improve performance, processing and utilization. Propose, maintain and enforce policies, practices and security procedures.
· Provide break/fix support, setup/installation support, escalation support, and solutions support.
· Develop and maintain system documentation as well as user-facing knowledge base articles and how-to guides.
· Maintain the inventory and tracking of HPC computer-related equipment.
· Perform other duties as required by the situation and circumstances.
· BS degree in Computer Science or MIS
· Or 5+ years of relevant work experience
Skill & Competency Requirements:
· Minimum of 5 years’ experience in managing/administering Linux server environments for scientific computing
· Strong technical skills, including knowledge of cluster management software and administrative processes. Must be able to understand application functionality and technical documentation in depth.
· Strong verbal and written communication and interpersonal skills
· 3+ years of demonstrated experience using computational clusters, HPC and/or grid computing environments.
· Research experience and RHEL certifications are a plus.
Preferences:
· Must be capable of contributing within a team, exhibit a high level of initiative, and have an eagerness to learn new technologies.
· Demonstrated ability in providing systems administration, HPC cluster and scheduler troubleshooting to a community with diverse computing needs.
· Knowledge of PuppetLabs Puppet management software and cloud computing platforms is desirable but not required
· Candidate must possess advanced knowledge and understanding of Linux server configurations (RHEL, CentOS, or equivalent), including networking and systems scripting (Unix shells such as Bash, Python)
· Knowledge and understanding of security and monitoring software packages covering, but not limited to, OS, network, and application monitoring (Nagios, Argus, Ganglia) and intrusion detection.
· Understanding of common network protocols such as DHCP, DNS, SMTP, and HTTP
· Ability to multitask and prioritize work requirements, keeping team and management
· Excellent interpersonal skills to communicate effectively with cross-functional teams, including technical and non-technical staff at all levels of the organization
· Cloud Experience