Site Reliability Engineer (US)

walt-labs · 30+ days ago
Negotiable
Full-time

At WALT Labs, we are committed to empowering businesses to leverage the transformative power of cloud technology, facilitating innovation and operational efficiency. Specializing in managed services across Google Cloud Platform (GCP) and Amazon Web Services (AWS), we seek a dedicated local Site Reliability Engineer (SRE) who is passionate about technology, excels in problem-solving, and is dedicated to providing unparalleled customer service. You will become the subject-matter expert (SME) on the scale, resiliency, and uptime of our own environments and the customer environments we support.

Role Summary

As a critical member of our team, the SRE will provide technical support and expertise to our managed services clients. This role involves diagnosing and resolving complex issues across diverse cloud environments and technologies, ensuring high performance and reliability. The ideal candidate is a tech enthusiast, eager to expand their knowledge and skills daily, committed to problem-solving and delivering customer-focused solutions within defined Service Level Agreement (SLA) guidelines.

Key Responsibilities:

  • Ensure high availability and reliability of software systems and infrastructure by building out SLOs and SLAs and continuously improving system reliability.
  • Design, implement, and maintain monitoring and alerting systems to detect and address issues proactively, primarily using Datadog, GCP Cloud Monitoring, and PagerDuty/incident.io.
  • Debug and troubleshoot production issues across various customer environments, technology stacks, and cloud providers, primarily focusing on GCP and AWS.
  • Participate in an on-call rotation to respond to and resolve production incidents, and conduct root cause analyses (RCAs) and postmortems to identify and address underlying issues.
  • Develop and maintain runbooks and playbooks for incident response and troubleshooting.
  • Proactively identify bottlenecks and areas for improvement, and optimize systems and application environments accordingly.
  • Conduct load testing and capacity planning to ensure systems can handle expected traffic and growth.
  • Develop and maintain infrastructure as code (Terraform) and configuration management (for example, Ansible and Helm).
  • Work closely with development teams to understand system architecture, identify potential reliability risks, and implement solutions.
  • Collaborate with operations teams to ensure smooth deployment and operation of software systems.
  • Master a broad range of technologies, including but not limited to VMs, container orchestration, networking, security, databases, data warehouses, serverless technologies, and storage solutions.
  • Proficiently deploy applications into Kubernetes using Helm, and manage Kubernetes administration and troubleshooting.
  • Provide direct support to clients during production outages, offering expert assistance to swiftly rectify issues, adhering to SLA expectations.
  • Diligently document solutions and processes, constantly seeking to improve knowledge, skills, and operational efficiency.

Requirements

  • Candidates located in the Houston, TX area are preferred, but we also accept fully remote candidates within the United States.
  • 3+ years of experience in an SRE role.
  • A deep understanding of how important SLOs, SLIs, and KPIs are to the systems you support, with observability as your daily grounding point.
  • Extensive knowledge of all major services in GCP (Cloud Run, BigQuery, GKE, etc.).
  • In-depth knowledge of all major services in AWS
  • Experience in setting up and managing monitoring solutions like Datadog, Google Cloud Operations Suite, CloudWatch, Nagios, and Zabbix.
  • Familiarity with various CI/CD systems (Jenkins, Codefresh, GitLab CI, GitHub Actions, Argo CD).
  • Exceptional problem-solving capabilities, the ability to work under pressure, and strong critical thinking skills.
  • Ability to act as the voice and commander of incidents, both those managed internally and those involving customers.
  • A passion for technology and an unquenchable thirst for learning new skills.
  • A customer-focused mindset, dedicated to delivering the highest level of service.

Benefits

  • We cover 100% of your base medical plan!
  • Dental, vision, disability, and life insurance available
  • Generous PTO policy that increases with longevity
  • 401k
  • Professional development and advancement opportunities
  • Bonus incentives

Last updated on Jun 21, 2024
