The Company
We are currently recruiting for an innovative company at the forefront of the autonomous driving industry, focusing on creating cutting-edge solutions to revolutionize the future of transportation. The team is dedicated to building advanced technologies that enable smarter, safer, and more efficient mobility experiences. As the organization continues to expand, we are looking for talented engineers who share their passion for transforming the way the world moves.

The Role
We are hiring an engineer to solve challenging technical problems within the autonomous driving space. In this role, you’ll be responsible for building and optimizing a high-performance model inference platform for large-scale production of electric vehicles. This is an exciting opportunity to work with groundbreaking technologies in the autonomous driving field, collaborating with top experts to shape the future of mobility.
Key Responsibilities:
- Design, implement, and manage key components of their model inference platform, including quota management, job scheduling, and queuing systems.
- Optimize the allocation and performance of GPU resources to ensure efficiency at scale.
- Troubleshoot, monitor, and maintain system health to ensure continuous operation.
- Collaborate closely with Machine Learning Engineers to refine and evolve the platform for a variety of applications.
- Identify and address performance bottlenecks to ensure optimal system performance.
- Develop and maintain comprehensive documentation for system components and infrastructure.
Qualifications:
- Advanced degree, preferably a PhD, in Computer Science, Engineering, or a related field.
- 5+ years of experience in machine learning infrastructure or model inference.
- Proficiency in programming languages such as Python, Java, or C++.
- Strong understanding of distributed computing frameworks.
- Expertise in designing fault-tolerant and high-throughput systems.
- Experience with containerization tools such as Docker and Kubernetes.
- Familiarity with CI/CD tools like Jenkins and GitHub.
- Experience with monitoring systems like Prometheus and Grafana.
- Excellent problem-solving abilities and attention to detail.
- Strong communication and collaboration skills.
- Proven experience in building and maintaining large-scale distributed systems.
- Expertise in performance optimization and system scaling.
- Experience with job scheduling on diverse computational resources.
- Deep understanding of cloud platforms.
- Familiarity with observability practices and monitoring.
- Experience with CUDA and machine learning frameworks such as PyTorch or TensorFlow.