The Internet Archive has over 30 PB of unique digital information... all running across an integrated cluster of over 700 VMs on over 600 "bare-metal" hosts in 2 data centers. We are looking for a "hands-on" operations manager with proven experience managing and participating in a high-performance team of system administrators and technical operations staff. The ideal candidate will be looking to take on a "player-coach" role and will have demonstrated experience improving and maintaining the reliability, performance, and security of both internal and publicly facing infrastructure.
Responsibilities & Duties
- Manage, contribute to, and mentor the technical team responsible for monitoring, maintaining, and restoring the health of all Internet Archive networks and online services. This includes all publicly facing services, the storage and compute cluster, and key internal services related to crawling, indexing, and access to archived web content.
- Ensure service continuity and performance. Maintain and expand monitoring and reporting systems to communicate current and historical activity for multiple publicly facing services.
- Analyze, implement, and manage effective improvements in the maintenance and operations processes and infrastructure.
- Plan and coordinate the transition of new software systems and service applications from development into production. This includes establishing procedures and policies that will ensure sustainable deployment, monitoring, upgrade, and expansion of services.
- Assign, support, recruit, hire, schedule, and fire staff as needed to sustain operational objectives and efficiency.
- Recommend the purchase of equipment needed to sustain responsive services and cost-effective operations.
Qualifications & Skills
- Experience managing large server cluster infrastructure
- Experience as lead manager and mentor of a technical operations team
- Ability to document and share critical knowledge with others
- "Customer Service" mentality - advocate for the end user experience of web-delivered services
- Passion for automation, data-driven decision making, and information reporting
- Fluency in Linux system administration
- Creative problem solver
- Passion for staying current with industry trends
- Experience with Kubernetes
- Experience with Elastic stack
- Experience deploying and maintaining Hadoop
- Excellent oral/written communication and documentation skills
- Flexibility and a sense of humor
Reporting Structure: The Manager of Site Reliability and Infrastructure reports to the Director of Engineering and works closely with the Head Librarian and Founder.
Location: Inner Richmond, San Francisco, CA and City of Richmond, CA; ON-SITE PRESENCE IN SF/RICHMOND IS REQUIRED! Remote employment is not available for this position.
Benefits & Perks
The Internet Archive provides a comprehensive benefits package including PTO, paid holidays, medical, dental, vision, FSA, commuter, STD, LTD, 403(b)/Roth accounts, and Friday lunches at IA HQ.
Internet Archive is an Equal Opportunity Employer M/F/D/V/L/G/B/T and will consider qualified applicants with criminal histories for employment in a manner consistent with the requirements of the Fair Chance Ordinance.