Lead Site Reliability Engineer

Toronto, Ontario, Canada - Permanent

Job Description

We are looking for a Lead SRE to take ownership of all things infrastructure and deliver a highly scalable, performant, and available platform that our portfolio of applications can rely on.

- Own SLOs/SLIs across all services and applications to provide metrics to the development teams and facilitate continuous improvement
- Work together with the engineering team to improve the CI/CD pipeline, with a focus on successful deployment of services and applications
- Drive improvements in the infrastructure and ensure that all of it can be consistently reproduced with Terraform
- Maintain Incident Playbooks and ensure that a consistent process is followed to guarantee a rapid response
- Enforce regular Infrastructure Security Audits and drive automation where appropriate
- Continuously improve user experience as it relates to deployment and delivery
- Optimize Production and lower environments and infrastructure through monitoring and automation
- Drive platform management and capacity planning discussions
- Assist with setup and deployment of new services as needed
- Relentlessly eliminate false positive alerts
- Perform application load and scalability testing
- Participate in an on-call rotation to provide rapid response to critical issues in production

Must Have Skills:

- 2+ years in a lead role
- 4-7 years in an SRE or related role
- Intellectual curiosity and a strong desire to learn
- Problem-solving skills, including the ability to break down complex problems and incrementally implement solutions
- Great communication skills for leading post-incident reviews and writing client-facing communications
- A passion for efficiently supporting always-available applications
- Able to multitask, prioritize, and manage time efficiently
- Ability to write and review application code: Python/TypeScript/JavaScript
- Experience with the Django web framework
- Experience with configuration management and infrastructure deployment using Terraform
- Experience with monitoring and visualization tools like Prometheus and Grafana
- Experience deploying, logging, monitoring, and securing services on cloud providers (GCP, AWS)
- Experience with containerization and deployment automation tools: Docker, Kubernetes
- Experience writing, maintaining, and optimizing CI/CD pipelines
- Experience with databases
- Experience with Linux
- Experience using Git

Nice to Have Skills:

- Experience setting up an Application Performance Monitoring (APM) tool (New Relic, Datadog, Splunk, Dynatrace, etc.)
- Ability to write and review application code: Elixir


Starting: ASAP
Dress Code: Casual
