Sr. Cloud Engineer – Production Support
As a Sr. Cloud Engineer (Site Reliability Engineer), you will help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems. Much of our support and software development focuses on optimizing existing systems, building infrastructure, and reducing work through automation. You will join a team of curious problem solvers with a diverse set of perspectives who are thinking big and taking risks. In this environment, you will take the lead on relevant projects, supported by an organization that provides the support and mentorship you need to learn and grow. As an SRE, you will be focused on running better production applications and systems.
• Develop, test and debug automated tasks (Apps, Systems, Infrastructure)
• Troubleshoot priority incidents, facilitate blameless post-mortems
• Work with development teams throughout the software life cycle, ensuring sustainable software releases
• Perform analytics on previous incidents and usage patterns to better predict issues and take proactive action
• Build and drive adoption for greater self-healing and resiliency patterns
• Lead and participate in performance tests; identify bottlenecks, opportunities for optimization, and
• Assist with the design and building of reliable, fault-tolerant cloud infrastructure following industry best practices
• Perform hands-on management of cloud infrastructure
• Document cloud infrastructure and policies
• Mentor and train other members of the MAXIMUS IT
• Understand on-premises policies, solutions, and technologies and integrate them with the
cloud infrastructure where applicable.
• Understand cloud security best practices and work with Security teams to design
and implement a security infrastructure
• Serve as a technical point of contact for cloud services and help
communicate cloud services to projects
• Participate in the 24x7 support coverage as needed
• Amazon Web Services (AWS) Solutions Architect – Associate, Professional (preferred)
• 3 years of solid AWS experience
• Mastery of at least two programming languages (e.g., Python, Java) with respect to designing,
coding, testing, and software delivery
• Adept in the development of automated tools (e.g. Ansible, Chef, etc.), systems, and services in multiple
• Advanced knowledge of infrastructure components (e.g. networking, cloud services, orchestration tools,
containerization, compute, and storage systems)
• Proficiency in service-level changes to a system and troubleshooting components
• Experience with Splunk or other monitoring tools
• Experience in engineering solutions for metrics gathering/publishing and event collection/correlation
across distributed architectures, automation, monitoring, intelligent alerting, random fault injections (Chaos
Engineering), and self-healing
• Experience in a production support environment
• 8+ years of overall IT experience
• 5 years of scripting/automation experience
• Excellent interpersonal skills to interact with customers, senior-level personnel and
• Ability to work well both independently and in teams
• Ability to multi-task and to prioritize rapidly-changing task assignments
• Experience working in a fast-paced and deadline-oriented environment
• Excellent organization and communication skills, both written and verbal
• Bachelor's Degree from an accredited college or university in Computer Science,
Information Technology, or a related field. Equivalent experience considered in lieu of a degree
• Ability to sit for up to 80% of the time
• Frequent use of computer, telephone, and office equipment (copier, fax, scanner)
Essential Duties and Responsibilities:
- Responsible for the computer systems analysis, requirements analysis, modeling, configuration, monitoring and support for mission-critical applications supporting enrollment and call center operations.
- Proactively ensure the highest levels of systems and infrastructure availability. Perform daily system monitoring, verifying the integrity and availability of systems and key processes, reviewing system and application logs, and verifying completion of scheduled jobs.
- Provide 3rd level support for operations to resolve critical issues quickly. This may include occasional off-hours and weekend work and periodic on-call support.
- Provide training and mentoring to others to help them be successful in a system support role.
- Lead the monitoring and testing of application performance for potential bottlenecks, identify possible solutions, and work with developers to implement those fixes.
- Create and maintain documentation, including standard operating procedures, work instructions, and system diagrams.
- Work closely with business and systems analysts to define system requirements to meet both internal and external customer needs.
- Drive system improvements that optimize the quality, performance, and growth metrics for the enrollment and call center operations.
- Facilitate and oversee meetings with the testing and QA teams to formulate test strategies, test plans, and test cases, particularly for load/stress testing scenarios.
- Provide status updates to the manager and project manager on project progress, challenges, issues, and end-user satisfaction.
- Research and recommend innovative and automated approaches for system administration tasks.
- Write and maintain custom scripts to increase efficiency and reduce effort for maintenance, monitoring, and troubleshooting tasks.
- Perform other duties as assigned by management.
- Typically requires a minimum of 8 years of related experience with a Bachelor's degree; or 6 years and a Master's degree; or a PhD with 3 years of experience; or equivalent experience.
- Works on complex issues where analysis of situations or data requires an in-depth evaluation of variable factors.
- Exercises judgement in selecting methods, techniques and evaluation criteria for obtaining results.
- Networks with key contacts outside own area of expertise.
- Determines methods and procedures on new assignments and may coordinate activities of other personnel as a team lead.