Site Reliability Engineer (SRE) - LLM and Machine Learning

techruiter · 30+ days ago
Negotiable
Full-time
We are a pioneering technology company specialising in cutting-edge Large Language Models (LLMs) and Machine Learning solutions. We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our LLM and Machine Learning infrastructure.

As an SRE, you will play a critical role in maintaining the stability and efficiency of our LLM and Machine Learning platforms. You will work closely with cross-functional teams to design, implement, and optimize infrastructure, monitor system health, and respond to incidents, enabling our researchers and engineers to focus on innovation.

Responsibilities

  • Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
  • Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
  • Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues and keep system performance optimal.
  • Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
  • Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
  • Security and Compliance: Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
  • Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimisation.
  • Documentation: Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.

Requirements

  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
  • Proven experience as a Site Reliability Engineer or in a related role with a focus on LLM and Machine Learning infrastructure.
  • Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
  • Experience with configuration management tools (e.g., Ansible, Terraform) and CI/CD pipelines.
  • Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
  • Scripting and automation skills (e.g., Python, Bash).
  • Excellent problem-solving and troubleshooting skills.
  • Strong communication and collaboration skills.

Last updated on Dec 19, 2023
