<h3 dir="ltr"><strong>About Us:</strong></h3><p dir="ltr">We have a cutting-edge AI chatbot powered by Llama 3.2 in production. As we scale, reducing infrastructure GPU costs is a critical priority. Our current setup runs on Nvidia A100 80GB via Runpod, but we’re actively exploring new hardware solutions, including <strong>Nvidia Decimals, Apple M4 with unified memory, Tenstorrent, and other emerging AI chips</strong>.</p><p dir="ltr">We’re looking for an <strong>AI Software & Hardware Optimization Engineer</strong> who can <strong>analyze, adapt, and optimize our existing CUDA-based AI models</strong> to run efficiently across different hardware architectures. This is a unique opportunity to work at the intersection of AI software, performance engineering, and next-generation AI hardware.</p><p dir="ltr"></p><h3 dir="ltr"><strong>Key Responsibilities:</strong></h3><p dir="ltr">✅ <strong>Optimize AI Model Performance Across Different Hardware</strong></p><ul><li dir="ltr"><p dir="ltr">Adapt and optimize <strong>CUDA-dependent AI models</strong> for alternative architectures such as Apple M4 (Metal), Tenstorrent, and other non-Nvidia accelerators.</p></li><li dir="ltr"><p dir="ltr">Implement <strong>low-level performance optimizations</strong> for AI inference across different memory architectures (GDDR6, unified memory, LPDDR5X, etc.).</p></li><li dir="ltr"><p dir="ltr">Convert and optimize models for various inference runtimes (e.g., <strong>TensorRT, ONNX, Metal Performance Shaders, Triton Inference Server, vLLM</strong>).</p></li></ul><p dir="ltr">✅ <strong>AI Hardware Benchmarking & Cost Reduction</strong></p><ul><li dir="ltr"><p dir="ltr">Conduct rigorous benchmarking of AI workloads on <strong>Nvidia CUDA GPUs, Apple Silicon, AMD ROCm, and specialized AI chips</strong>.</p></li><li dir="ltr"><p dir="ltr">Compare <strong>memory bandwidth, latency, power efficiency, and inference throughput</strong> across different architectures.</p></li><li dir="ltr"><p dir="ltr">Identify cost-effective alternatives to <strong>high-cost cloud GPUs</strong> without sacrificing performance.</p></li></ul><p dir="ltr">✅ <strong>Model Optimization for Efficient Deployment</strong></p><ul><li dir="ltr"><p dir="ltr">Implement <strong>quantization (INT8, FP16, BF16)</strong> and <strong>model distillation</strong> to enhance efficiency.</p></li><li dir="ltr"><p dir="ltr">Develop <strong>custom AI kernels</strong> optimized for different hardware types.</p></li><li dir="ltr"><p dir="ltr">Improve multi-threading, batching, and caching strategies to reduce inference latency.</p></li></ul><p dir="ltr">✅ <strong>Infrastructure & Deployment</strong></p><ul><li dir="ltr"><p dir="ltr">Deploy AI models efficiently using <strong>Docker, Kubernetes, and serverless AI inference</strong> platforms.</p></li><li dir="ltr"><p dir="ltr">Implement <strong>compilation pipelines</strong> (TVM, XLA, MLIR) to target diverse hardware backends.</p></li><li dir="ltr"><p dir="ltr">Work closely with DevOps to integrate inference optimization techniques into production workflows.</p></li></ul><p></p><h3 dir="ltr"><strong>Required Skills & Experience:</strong></h3><p dir="ltr">🔹 <strong>Deep AI Model Optimization Experience</strong></p><ul><li dir="ltr"><p dir="ltr">Strong expertise in <strong>PyTorch, TensorFlow, and JAX</strong> with deep understanding of model transformation for different backends.</p></li><li dir="ltr"><p dir="ltr">Experience optimizing AI models with <strong>CUDA, Metal, ROCm, and other accelerator-specific 
<p dir="ltr">✅ <strong>Infrastructure & Deployment</strong></p><ul><li dir="ltr"><p dir="ltr">Deploy AI models efficiently using <strong>Docker, Kubernetes, and serverless AI inference</strong> platforms.</p></li><li dir="ltr"><p dir="ltr">Implement <strong>compilation pipelines</strong> (TVM, XLA, MLIR) to target diverse hardware backends.</p></li><li dir="ltr"><p dir="ltr">Work closely with DevOps to integrate inference optimization techniques into production workflows.</p></li></ul><h3 dir="ltr"><strong>Required Skills & Experience:</strong></h3><p dir="ltr">🔹 <strong>Deep AI Model Optimization Experience</strong></p><ul><li dir="ltr"><p dir="ltr">Strong expertise in <strong>PyTorch, TensorFlow, and JAX</strong>, with a deep understanding of model transformation for different backends.</p></li><li dir="ltr"><p dir="ltr">Experience optimizing AI models with <strong>CUDA, Metal, ROCm, and other accelerator-specific libraries</strong>.</p></li></ul><p dir="ltr">🔹 <strong>Hardware & System-Level Knowledge</strong></p><ul><li dir="ltr"><p dir="ltr"><strong>Expert understanding of GPU architectures, unified memory models, tensor cores, and AI-specific accelerators</strong>.</p></li><li dir="ltr"><p dir="ltr">Experience working with <strong>alternative AI hardware</strong>, such as <strong>Apple Silicon, Tenstorrent, Graphcore, or Groq</strong>.</p></li><li dir="ltr"><p dir="ltr">Deep knowledge of <strong>memory architectures (GDDR6, LPDDR5X, HBM, unified memory) and their impact on AI workloads</strong>.</p></li></ul><p dir="ltr">🔹 <strong>Inference Optimization & Acceleration</strong></p><ul><li dir="ltr"><p dir="ltr">Hands-on experience with <strong>TensorRT, ONNX Runtime, Triton Inference Server, vLLM, and Hugging Face Optimum</strong>.</p></li><li dir="ltr"><p dir="ltr">Knowledge of <strong>low-level parallelism (SIMD, VLIW, MIMD) and AI chip architectures</strong>.</p></li></ul><p dir="ltr">🔹 <strong>Benchmarking & Profiling</strong></p><ul><li dir="ltr"><p dir="ltr">Experience with <strong>AI performance profiling tools (Nsight, ROCm SMI, Metal Profiler, perf)</strong>.</p></li><li dir="ltr"><p dir="ltr">Ability to analyze <strong>power efficiency, latency, memory bandwidth, and FLOPS utilization</strong> across different chips (a minimal measurement sketch follows this list).</p></li></ul>
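<p dir="ltr">As a rough sketch of the kind of measurement involved, the following PyTorch snippet times a hypothetical stand-in workload on whichever backend is available (CUDA, Apple MPS, or CPU); real benchmarks would sweep batch sizes, sequence lengths, and precision settings:</p><pre><code>import time

import torch
from torch import nn

def pick_device():
    # prefer a CUDA GPU, then Apple-Silicon MPS, then fall back to CPU
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def sync(device):
    # GPU kernels launch asynchronously; wait for them before reading the clock
    if device.type == "cuda":
        torch.cuda.synchronize()
    elif device.type == "mps":
        torch.mps.synchronize()

device = pick_device()
model = nn.Linear(4096, 4096).to(device).eval()  # hypothetical stand-in workload
x = torch.randn(32, 4096, device=device)         # batch of 32 requests

with torch.inference_mode():
    for _ in range(10):  # warm-up: allocator, kernel caches
        model(x)
    sync(device)
    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    sync(device)
    elapsed = time.perf_counter() - start

print(f"{device}: {elapsed / iters * 1e3:.2f} ms/iter, "
      f"{iters * x.shape[0] / elapsed:.0f} samples/s")
</code></pre><p dir="ltr">Synchronizing before each timestamp matters: without it, the loop mostly measures kernel-launch overhead rather than actual compute time.</p>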
<h3 dir="ltr"><strong>Nice-to-Have Skills:</strong></h3><ul><li dir="ltr"><p dir="ltr">Experience with <strong>LLM-specific optimizations</strong>, such as <strong>speculative decoding, paged attention, and tensor parallelism</strong>.</p></li><li dir="ltr"><p dir="ltr">Knowledge of <strong>compiler optimization techniques (MLIR, XLA, TVM, Glow)</strong> for AI workloads.</p></li><li dir="ltr"><p dir="ltr">Familiarity with <strong>emerging AI accelerators</strong> beyond the mainstream options.</p></li></ul><h3 dir="ltr"><strong>Why Join Us?</strong></h3><p dir="ltr">🚀 Work on cutting-edge AI infrastructure & next-gen hardware.<br>🌍 Fully remote, flexible work environment.<br>💰 Competitive salary with potential bonuses for cost reductions.<br>🎯 Opportunity to shape the future of AI model deployment.</p><p dir="ltr">Join an innovative team in the AI industry! Our client is seeking an AI Software & Hardware Optimization Engineer to contribute to their dynamic organization. Further information will be disclosed as you advance in the recruitment process. If you’re passionate about <strong>optimizing AI workloads across diverse hardware ecosystems</strong> and want to push the limits of AI performance, we’d love to hear from you!</p>
Last updated on Feb 4, 2025