Site Reliability Engineering Leader

dLocal · 30+ days ago
Negotiable
Full-time
Why should you join dLocal?
dLocal enables the biggest companies in the world to collect payments in 40 countries in emerging markets. Global brands rely on us to increase conversion rates and simplify payment expansion effortlessly. As both a payments processor and a merchant of record where we operate, we make it possible for our merchants to make inroads into the world’s fastest-growing, emerging markets. 

By joining us you will be a part of an amazing global team that makes it all happen, in a flexible, remote-first dynamic culture with travel, health and learning benefits, among others. Being a part of dLocal means working with 1000+ teammates from 30+ different nationalities and developing an international career that impacts millions of people’s daily lives. We are builders, we never run from a challenge, we are customer-centric, and if this sounds like you, we know you will thrive in our team.

What's the opportunity?

We are looking for a Site Reliability Engineering (SRE) Leader to join our team! In this role, you will lead a team responsible for designing, implementing, and maintaining our centralized observability platform, which uses OpenTelemetry (OTEL) as its backend. You will be part of a talented team that works on mission-critical applications with big customers like Netflix, Amazon, Nike, Facebook & more!

As an SRE Leader, you will work closely with engineering, security, and incident management teams to embed reliability principles across the organization, empowering teams to leverage observability through automation rather than relying on ticket-based support.

What will you do?

  • Be a servant leader: Above all else, you're expected to lead by example and make sure that your team of 4 to 6 people has the tools and capabilities it needs to meet its challenges.
  • Prioritize accordingly: Manage the team's priorities, rituals, and performance for internal and external stakeholders.
  • Own OpenTelemetry Pipelines: Design, implement, and maintain observability pipelines across the three main signals—logs, metrics, and traces—ensuring standardized, scalable, and efficient data ingestion. Optimize ingestion strategies to balance cost, performance, and usability.
  • Empower Engineering Teams: Build self-service automation and tooling that enables development teams to instrument and leverage observability without requiring manual intervention from the SRE team. Drive adoption of best practices while ensuring teams own their telemetry.
  • Support Incident Management: Be the Engineering side of our Incident Management Team, designing the processes, playbooks, checklists, and automations for them and other engineers to follow during an incident.
  • Collaborate Across Teams: Interact with members from almost all teams across the business to understand their monitoring, alerting and SLO / SLA requirements and design systems and processes that ensure we meet or exceed these requirements. Influence architectural decisions during initial design stages to ensure resiliency and scale at the outset of software development.
  • Automate Observability Infrastructure: Leverage Infrastructure-as-Code (IaC) to provision and manage monitoring tools, alerting rules, and our observability configurations across OTEL Pipelines.
  • Define Baseline Observability Standards: Design base level requirements for new and existing services to ensure that all dLocal infrastructure and code are monitored consistently and accurately at a basic level.
  • Own Technical and Security Health: Take full ownership of dLocal’s infrastructure reliability, ensuring adherence to key availability and security KPIs.
  • Optimize Alerting Systems: Continuously refine alerting signals to minimize noise and ensure they are always actionable, reducing fatigue and improving response efficiency.

Which skills do you need?

  • 6+ years of experience in SRE, Platform Engineering, or a similar role with a focus on observability and reliability.
  • 2+ years of experience in leadership, including mentoring engineers and defining technical strategies.
  • Expertise in Kubernetes, including its core components, deployment methodologies, and monitoring best practices.
  • Some understanding of OpenTelemetry, including setting up OTEL collectors, instrumentation, and pipeline optimization.
  • Proficiency with monitoring and logging tools such as Grafana, Prometheus, Loki, New Relic, or Datadog.
  • Hands-on experience with IaC tools (Terraform) and GitOps CI/CD solutions (ArgoCD, GitHub Actions, or similar).
  • Experience integrating incident management platforms (PagerDuty, Jira) with automated alerting workflows.
  • Strong scripting abilities (Python, Go, or similar) for automating observability tasks.
  • A problem-solving mindset, with the ability to collaborate across multi-functional teams to drive reliability improvements.

You’ll Stand Out If You Have:
  • Cloud experience, especially AWS and ECS-based workloads.
  • Experience managing observability pipelines at scale in high-throughput environments.
  • Familiarity with Configuration-as-Code (Ansible, Chef, or SaltStack) for managing configurations across legacy instances.
  • Database performance monitoring experience, particularly in large-scale distributed environments.
  • A track record of building self-service observability platforms, reducing reliance on SRE as a ticket-driven team.
What do we offer?
Besides the tailored benefits we have for each country, dLocal will help you thrive and go that extra mile by offering you:
- Remote work: work from anywhere or one of our offices around the globe!*
- Flexibility: we have flexible schedules and we are driven by performance.
- Fintech industry: work in a dynamic and ever-evolving environment, with plenty to build and boost your creativity.
- Referral bonus program: our internal talents are the best recruiters - refer someone ideal for a role and get rewarded.
- Learning & development: get access to a Premium Coursera subscription.
- Language classes: we provide free English, Spanish, or Portuguese classes.
- Social budget: you'll get a monthly budget to chill out with your team (in person or remotely) and deepen your connections!
- dLocal Houses: want to rent a house to spend one week anywhere in the world coworking with your team? We’ve got your back!

*For people based in Montevideo (Uruguay) applying to non-IT roles, 55% monthly attendance to the office is required.

What happens after you apply?
Our Talent Acquisition team is invested in creating the best candidate experience possible, so don’t worry, you will definitely hear from us. We will review your CV and keep you posted by email at every step of the process!

Also, you can check out our webpage, LinkedIn, Instagram, and YouTube for more about dLocal!

Last updated on Dec 28, 2024
