Job Requirements
We are seeking an HPC Systems Engineer to maintain G42's state-of-the-art computational and data science infrastructure.
As a member of our HPC Team, you will participate in the deployment, management, and optimization of systems and processes. You will work with G42's community to identify and provide solutions and technical support that enable our cloud customers to deploy and develop their AI applications at scale.
Responsibilities:
• Provide tier-2 technical O&M support and administration of a 24x7x365 always-available production environment
• Configure, install, maintain and upgrade HPC clusters (compute, storage, and network) and applications in support of research computing environments
• Lead and collaborate on projects to maintain and enhance system functionality in areas such as systems monitoring, scheduling and resource management, configuration management, backups, HPC system management utilities/tools, and HPC cluster performance and resiliency
• Diagnose, isolate and resolve complex application and system technical problems (hardware, software, network)
• Develop scripts and automation to enhance operational services and service quality
• Perform system tuning based upon proactive performance analysis
• Build, install, and support scientific software (Commercial and Open Source)
• Develop and maintain technical documentation for customer use and contribute to the internal knowledge base
Work Experience
• Experience configuring, managing, and optimizing large Linux clusters and servers;
• Experience with workload management tools (e.g. PBS, SLURM, Moab, TORQUE);
• Comfortable with configuring, managing, and optimizing distributed and parallel file systems such as Lustre, GPFS, NFS, and Ceph, and storage protocols such as FC, iSCSI, NFS, and CIFS;
• Knowledge of networks, routers, switches, and firewalls, and familiarity with high-performance networks such as InfiniBand;
• Scripting/programming capabilities (e.g. Python, Bash, Perl);
• Knowledge of virtualization platforms (e.g. VMware, KVM, oVirt);
• Solid knowledge of Red Hat- or Debian-based distributions and experience with maintaining, upgrading, and tuning the Linux kernel;
• Experience with system configuration management tools such as Puppet, Ansible, Chef, or Cobbler;
• Experience with monitoring/alerting tools (e.g. Ganglia, Nagios, Zabbix, Grafana);
• Experience with tools for compiling and building packages (e.g. Spack, Conda, EasyBuild);
• Knowledge of and experience with containerized workflows based on Docker, Singularity, or Kubernetes;
• Comfortable with configuring, installing and troubleshooting MPI.
Desired skills:
• Experience with NVIDIA DGX servers and NVIDIA tools;
• Experience with Linux kernel development and the Linux development community;
• Experience with on-prem cloud technologies such as OpenStack;
• Knowledge of one or more programming languages such as C or C++.
https://www.naukrigulf.com/hpc-systems-engineer-jobs-in-abu-dhabi-uae-in-group-42-2-to-5-years-n-cd-10008188-jid-050221500118