Machine Learning Engineer
London Office · Hybrid
Oriole is seeking talented Machine Learning Engineers to help co-optimize our AI/ML software stack with cutting-edge network hardware. You’ll be a key contributor to a high-impact, agile team focused on integrating middleware communication libraries and modelling the performance of large-scale AI/ML workloads.
Key Responsibilities:
- Design and optimize custom GPU communication kernels to enhance performance and scalability across multi-node environments.
- Develop and maintain distributed communication frameworks for large-scale deep learning models, ensuring efficient parallelization and optimal resource utilization.
- Profile, benchmark, and debug GPU applications to identify and resolve bottlenecks in communication and computation pipelines.
- Collaborate closely with hardware and software teams to integrate optimized kernels with Oriole’s next-generation network hardware and software stack.
- Contribute to system-level architecture decisions for large-scale GPU clusters, with a focus on communication efficiency, fault tolerance, and novel architectures for advanced optical network infrastructure.
Required Skills & Experience:
- Proficient in C++ and Python, with a strong track record in high-performance computing or machine learning projects.
- Expertise in GPU programming with CUDA, including deep knowledge of GPU memory hierarchies and kernel optimization.
- Hands-on experience debugging GPU kernels with tools such as cuda-gdb, cuda-memcheck, and Nsight Systems, including reading PTX and SASS assembly.
- Strong understanding of communication libraries and protocols such as NCCL, NVSHMEM, OpenMPI, UCX, or custom collective communication implementations.
- Familiarity with HPC networking protocols/libraries such as RoCE, InfiniBand, libibverbs, and libfabric.
- Experience with distributed deep learning/MoE frameworks such as PyTorch Distributed, vLLM, or DeepEP.
- Experience deploying and optimizing large-scale distributed deep learning workloads in production environments using Linux, Kubernetes, SLURM, OpenMPI, GPU drivers, Docker, and CI/CD automation.
About Oriole Networks
Accelerating AI in a Low Carbon World – Oriole Networks is a photonic networking company, developing disruptive technologies for AI/ML and HPC networking that will revolutionise data centres.