Site Reliability Engineer
Montreal, Quebec, Canada
This job allows you to work remotely
Our client is one of the first networks devoted to helping B2B enterprises. Their first-of-its-kind Sales Analytics platform combines a proprietary, self-learning network with applications that are ready to use, data-backed, and built on predictive analyses.
SRE ensures that our internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.
SRE is also a mindset and a set of engineering approaches to running better production systems—we build our own creative engineering solutions to operations problems. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation.
Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running.
What your day-to-day will look like:
•Obsess over graphing of metrics: must know how to read and create meaningful graphs
•Mastery of delivering HTTP applications: app platforms, load balancers, proxy servers, protocols used in web delivery, security features, APIs
•Work with engineering teams to support applications, QA and deployments as needed
•Write scripts in various languages, Puppet modules, metrics collectors, etc. (for example, calling a REST API to manipulate systems or working with SNMP)
•Complete deep investigations to surface solutions to problems in complex, layered systems
•CentOS Linux: automated builds, design and maintenance of operating systems
•Configuration management: Puppet
•Monitor systems: Nagios, Icinga, etc.
•Maintain and troubleshoot the virtualization environment
•Own security and implement server, web and compliance best practices
•Troubleshoot email delivery issues
•Design new or improved systems and work with the team to execute on them
•Maintain services once they are live by measuring and monitoring availability, latency and overall system health
•Strong sense of ownership and passion for engineering great products with stellar user experiences
•Take part in the on-call rotation as a first line of defense during production issues
Must Have Skills:
•Degree in Computer Science or a related technical field involving coding, or equivalent practical experience
•Strong desire to learn more about the applications and systems
•Knowledge of Linux/UNIX fundamentals and OS tuning
•Solid understanding of databases: how they work and how different database technologies differ. We run a data-heavy application.
•Experience coding in higher-level languages (e.g. Ruby, Go, Python, C++, or Java)
•Experience with Puppet, Ansible, Chef or another configuration management suite
•Experience delivering 24/7 applications to the Internet
•Experience in configuration and maintenance of web servers, load balancers, relational databases, storage systems and messaging systems
•Experience working in an environment that maintains compliance (e.g. PCI, SOC)
•Clear-thinking, action-oriented, good communicator, team player
Nice to Have Skills:
•Expertise in designing, analyzing and troubleshooting complex systems
•Systematic problem-solving approach, coupled with strong communication skills
•Demonstrated knowledge and understanding of infrastructure as code
•Ability to quickly learn, understand, and work with new and emerging technologies and methodologies
•Experience with Big Data related technologies (at least two): Apache Hadoop, Spark, Kafka, Cassandra, MongoDB Sharded Cluster, Elasticsearch Cluster