Hippocratic AI is hiring a

Lead Systems Engineer - GPU Management (AI/HPC)

Job Overview

  • Posted 4 weeks ago
  • Full Time
  • Palo Alto, CA, USA
  • 75000

Roles & Responsibilities

GPU Cluster Management:

Run high-performance compute services, such as SageMaker, in public cloud environments (AWS, GCP, and Azure).
Knowledge of hardware components, such as GPUs (including high-end models like NVIDIA A100 and H100), and familiarity with the NVIDIA Container Toolkit.
Experience in managing GPU nodes in cloud environments, ensuring optimal performance and reliability.

Orchestration and Automation:

Proficiency in Kubernetes for container orchestration and Slurm for workload management to efficiently distribute tasks across the GPU cluster.
Experience in setting up and configuring these orchestration tools to ensure high availability and scalability of cluster resources.

Troubleshooting and Debugging:

Ability to provide in-depth technical support for complex issues, including debugging and troubleshooting high-end GPUs.
Familiarity with debugging tools and techniques specific to GPU hardware and software.

Performance Optimization:

Continuous monitoring of system performance to identify bottlenecks and implement solutions to optimize resource utilization and throughput.
Knowledge of performance tuning techniques for GPU clusters and the ability to apply them effectively.

Security and Compliance:

Ensure adherence to security best practices and compliance requirements for GPU cluster infrastructure.
Implementation and management of security protocols and disaster recovery strategies to safeguard cluster resources and data.

Collaboration and Support:

Work closely with other engineering, research, and applied science teams to understand and support their computational needs.
Offer guidance and expertise on utilizing the GPU cluster efficiently for various tasks and applications.
Participate in planning and executing future expansion or enhancement of cluster capabilities to meet evolving computational requirements.

Requirements:

Education:

Bachelor’s degree in Computer Science, Electrical Engineering, or a related field. Master’s degree preferred.

Experience:

At least 3 years of experience in managing and maintaining GPU clusters, preferably in the cloud, with hands-on experience with NVIDIA A100 and H100 GPUs or similar high-end models.

Technical Skills:

Proficiency in Kubernetes for container orchestration and management, with experience deploying, scaling, and managing containerized applications in Kubernetes clusters. Familiarity with AWS Kubernetes services for cloud deployment and management.
Experience with Slurm for workload management in GPU cluster environments.
Deep understanding of GPU hardware, including experience with debugging and troubleshooting GPU issues.
Strong background in Linux/Unix administration, scripting (e.g., Bash, Python), and automation tools, with expertise in Ansible for configuration management and automation tasks.
Familiarity with network configuration, storage systems, and security protocols relevant to GPU clusters.

Problem-Solving:

Exceptional analytical and problem-solving skills, with the ability to handle complex technical challenges effectively.

Communication:

Excellent communication and documentation skills, capable of collaborating effectively across diverse teams.

About Hippocratic AI

Hippocratic AI is dedicated to developing a safety-focused large language model (LLM) tailored for the healthcare sector. We firmly believe in the potential of generative AI to significantly enhance global healthcare accessibility, provided it is developed and tested responsibly. Mirroring the principles of the Hippocratic Oath that guides medical professionals, our model is designed with the ethos of “Do No Harm.”

Skills Required

  • Machine Learning
  • Python

Ankore © 2024 All rights reserved