
Part Time AI Software & Hardware Optimization Engineer (Remote/Flexible)

livit · 30+ days ago
Remote
Negotiable
Full-time
<h3 dir="ltr"><strong>About Us:</strong></h3><p dir="ltr">We have a cutting-edge AI chatbot powered by Llama 3.2 in production. As we scale, reducing GPU infrastructure costs is a critical priority. Our current setup runs on Nvidia A100 80GB via Runpod, but we’re actively exploring new hardware solutions, including <strong>Nvidia DIGITS, Apple M4 with unified memory, Tenstorrent, and other emerging AI chips</strong>.</p><p dir="ltr">We’re looking for an <strong>AI Software &amp; Hardware Optimization Engineer</strong> who can <strong>analyze, adapt, and optimize our existing CUDA-based AI models</strong> to run efficiently across different hardware architectures. This is a unique opportunity to work at the intersection of AI software, performance engineering, and next-generation AI hardware.</p><h3 dir="ltr"><strong>Key Responsibilities:</strong></h3><p dir="ltr">✅ <strong>Optimize AI Model Performance Across Different Hardware</strong></p><ul><li dir="ltr"><p dir="ltr">Adapt and optimize <strong>CUDA-dependent AI models</strong> for alternative architectures such as Apple M4 (Metal), Tenstorrent, and other non-Nvidia accelerators.</p></li><li dir="ltr"><p dir="ltr">Implement <strong>low-level performance optimizations</strong> for AI inference across different memory architectures (GDDR6, unified memory, LPDDR5X, etc.).</p></li><li dir="ltr"><p dir="ltr">Convert and optimize models for various inference runtimes (e.g., <strong>TensorRT, ONNX, Metal Performance Shaders, Triton Inference Server, vLLM</strong>).</p></li></ul><p dir="ltr">✅ <strong>AI Hardware Benchmarking &amp; Cost Reduction</strong></p><ul><li dir="ltr"><p dir="ltr">Conduct rigorous benchmarking of AI workloads on <strong>Nvidia CUDA GPUs, Apple Silicon, AMD ROCm, and specialized AI chips</strong>.</p></li><li dir="ltr"><p dir="ltr">Compare <strong>memory bandwidth, latency, power efficiency, and inference throughput</strong> across different architectures.</p></li><li 
dir="ltr"><p dir="ltr">Identify cost-effective alternatives to <strong>high-cost cloud GPUs</strong> without sacrificing performance.</p></li></ul><p dir="ltr">✅ <strong>Model Optimization for Efficient Deployment</strong></p><ul><li dir="ltr"><p dir="ltr">Implement <strong>quantization (INT8, FP16, BF16)</strong> and <strong>model distillation</strong> to enhance efficiency.</p></li><li dir="ltr"><p dir="ltr">Develop <strong>custom AI kernels</strong> optimized for different hardware types.</p></li><li dir="ltr"><p dir="ltr">Improve multi-threading, batching, and caching strategies to reduce inference latency.</p></li></ul><p dir="ltr">✅ <strong>Infrastructure &amp; Deployment</strong></p><ul><li dir="ltr"><p dir="ltr">Deploy AI models efficiently using <strong>Docker, Kubernetes, and serverless AI inference</strong> platforms.</p></li><li dir="ltr"><p dir="ltr">Implement <strong>compilation pipelines</strong> (TVM, XLA, MLIR) to target diverse hardware backends.</p></li><li dir="ltr"><p dir="ltr">Work closely with DevOps to integrate inference optimization techniques into production workflows.</p></li></ul><p></p><h3 dir="ltr"><strong>Required Skills &amp; Experience:</strong></h3><p dir="ltr">🔹 <strong>Deep AI Model Optimization Experience</strong></p><ul><li dir="ltr"><p dir="ltr">Strong expertise in <strong>PyTorch, TensorFlow, and JAX</strong> with deep understanding of model transformation for different backends.</p></li><li dir="ltr"><p dir="ltr">Experience optimizing AI models with <strong>CUDA, Metal, ROCm, and other accelerator-specific libraries</strong>.</p></li></ul><p dir="ltr">🔹 <strong>Hardware &amp; System-Level Knowledge</strong></p><ul><li dir="ltr"><p dir="ltr"><strong>Expert understanding of GPU architectures, unified memory models, tensor cores, and AI-specific accelerators</strong>.</p></li><li dir="ltr"><p dir="ltr">Experience working with <strong>alternative AI hardware</strong>, such as <strong>Apple Silicon, Tenstorrent, Graphcore, or 
Groq</strong>.</p></li><li dir="ltr"><p dir="ltr">Deep knowledge of <strong>memory architectures (GDDR6, LPDDR5X, HBM, Unified Memory) and their impact on AI workloads</strong>.</p></li></ul><p dir="ltr">🔹 <strong>Inference Optimization &amp; Acceleration</strong></p><ul><li dir="ltr"><p dir="ltr">Hands-on experience with <strong>TensorRT, ONNX Runtime, Triton Inference Server, vLLM, and Hugging Face Optimum</strong>.</p></li><li dir="ltr"><p dir="ltr">Knowledge of <strong>low-level parallelism (SIMD, VLIW, MIMD) and AI chip architectures</strong>.</p></li></ul><p dir="ltr">🔹 <strong>Benchmarking &amp; Profiling</strong></p><ul><li dir="ltr"><p dir="ltr">Experience with <strong>AI performance profiling tools (Nsight, ROCm SMI, Metal Profiler, perf)</strong>.</p></li><li dir="ltr"><p dir="ltr">Ability to analyze <strong>power efficiency, latency, memory bandwidth, and FLOPS utilization</strong> across different chips.</p></li></ul><p></p><h3 dir="ltr"><strong>Nice-to-Have Skills:</strong></h3><ul><li dir="ltr"><p dir="ltr">Experience with <strong>LLM-specific optimizations</strong>, such as <strong>speculative decoding, paged attention, and tensor parallelism</strong>.</p></li><li dir="ltr"><p dir="ltr">Knowledge of <strong>compiler optimization techniques (MLIR, XLA, TVM, Glow)</strong> for AI workloads.</p></li><li dir="ltr"><p dir="ltr">Familiarity with <strong>emerging AI accelerators</strong> beyond mainstream options.</p></li></ul><p><br></p><h3 dir="ltr"><strong>Why Join Us?</strong></h3><p dir="ltr">🚀 Work on cutting-edge AI infrastructure &amp; next-gen hardware.<br>🌍 Fully remote, flexible work environment.<br>💰 Competitive salary with potential bonuses for cost reductions.<br>🎯 Opportunity to shape the future of AI model deployment.</p><p dir="ltr">Join an innovative team in the AI industry! Our client is seeking an AI Software &amp; Hardware Optimization Engineer to contribute to their dynamic organization. 
Further information will be disclosed as you advance in the recruitment process. If you’re passionate about <strong>optimizing AI workloads across diverse hardware ecosystems</strong> and want to push the limits of AI performance, we’d love to hear from you!</p>
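The benchmarking work described above — comparing latency and throughput across inference backends — can be sketched with a minimal, framework-agnostic harness. This is an illustrative sketch, not part of the posting: the `benchmark` helper and the CPU stand-in workload are hypothetical, and in practice `infer` would wrap a call into TensorRT, ONNX Runtime, vLLM, or a Metal-backed runtime.

```python
import statistics
import time


def benchmark(infer, payload, warmup=5, iters=50):
    """Measure single-request latency of an inference callable.

    `infer` is any callable taking one payload. Warmup runs are excluded
    so cache/JIT effects don't skew the measured samples.
    """
    for _ in range(warmup):
        infer(payload)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        infer(payload)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples) * 1e3,
        "p95_ms": samples[int(0.95 * (len(samples) - 1))] * 1e3,
        "throughput_rps": 1.0 / statistics.fmean(samples),
    }


# Stand-in CPU workload; swap in a real model call to compare backends.
stats = benchmark(lambda n: sum(i * i for i in range(n)), 50_000)
print({k: round(v, 3) for k, v in stats.items()})
```

Running the same harness against each candidate backend (CUDA, Metal, ROCm) with identical payloads gives directly comparable p50/p95 latency and requests-per-second figures, which is the basis for the cost-per-token comparisons the role calls for.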

Last updated on Feb 4, 2025
