Researchers update supercomputing infrastructure that powers research at Notre Dame
Twice each year, the High Performance Computing (HPC) team at the Center for Research Computing (CRC), a ten-person team, works 150–200 hours to ensure that Notre Dame’s research computing is stable, up-to-date, and as fast as possible.
No downtime
The CRC schedules two maintenance windows each year to make important updates while minimizing the impact of the outages on students and staff. The dates include the first working weekend in January and the Commencement weekend in May. Together, the Operations and User Support teams begin planning for these outages six to eight weeks in advance.
The Operations team focuses on firmware, hardware, and system software while the User Support team manages user applications and performance testing. User-facing login nodes receive weekly system software and security updates, but our compute nodes are updated only during maintenance weekends, with rare exceptions for critical security patches. As a result, operating system (OS) and firmware updates across our 1,400+ servers require an outage. Even a minor OS version update requires both teams to test dozens of different hardware and software combinations to look for application incompatibilities, missing features, or noticeable performance degradation. Upgrading major OS versions, such as moving from RHEL 8 to RHEL 9, can sometimes result in retiring entire servers if key hardware support has been dropped.
In the weeks leading up to the outage, both teams converge on a set of updates and then begin working on the timeline for the maintenance weekend. Carefully mapping out tasks, planning meticulously, and creating an ordered checklist minimizes the odds of having to extend the outage due to unforeseen circumstances.
Down to business
Outages begin around 6:00 a.m. on Friday. Systems are cleared of users and running tasks, and only essential staff are allowed access for the remainder of the outage. The first task involves shutting down all servers and non-essential infrastructure to apply new firmware and security updates. Even with automation and a team of 5–7 people on hand, this step alone can take several hours.
Once most of the hardware updates are finished, the system software updates or OS rebuilds can begin. Each server rebuild can take 10–25 minutes to complete and often requires multiple reboots. Fortunately, the onsite team can rebuild 50–100 servers at once. The challenge begins, however, with servers that do not accept the firmware updates, will not rebuild with the new OS, or, in some cases, do not turn back on. With almost 1,500 servers containing thousands of hard drives and memory DIMMs that can fail, network cables that can work loose, fan bearings that can burn out, and power supplies that can overheat, there are many things to check and many things that can go wrong.
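To give a sense of scale, the rebuild figures above can be turned into a rough time estimate. The short Python sketch below is illustrative only and is not the CRC's tooling: the function name is hypothetical, and it assumes that batches run strictly one after another and that every rebuild succeeds on the first try, which the paragraph above makes clear is not always the case.

```python
import math


def estimate_rebuild_minutes(servers: int, batch_size: int, minutes_per_rebuild: int) -> int:
    """Estimate wall-clock minutes to rebuild every server.

    Assumption (not from the article): batches run sequentially and
    no server needs a second attempt.
    """
    batches = math.ceil(servers / batch_size)
    return batches * minutes_per_rebuild


# Figures from the article: 1,400+ servers, 50-100 rebuilt at once,
# 10-25 minutes per rebuild.
best_case = estimate_rebuild_minutes(1400, 100, 10)
worst_case = estimate_rebuild_minutes(1400, 50, 25)

print(f"Roughly {best_case / 60:.1f} to {worst_case / 60:.1f} hours of rebuild time")
```

Under those assumptions the estimate spans roughly 2.3 to 11.7 hours, before accounting for any server that fails to come back.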
The Operations team commonly works late into Friday evening to have as many systems as possible online and ready for User Support to begin their work in the morning.
The User Support team takes over the bulk of the work on Saturday as they begin updating user applications and tools. Even for minor OS updates, this involves going through more than 100 different software modules to check for consistency and expected behavior. During major OS upgrades or default compiler updates, dozens of software packages may need to be recompiled from scratch.
Once the software stack has been verified, an intensive validation and verification process begins across all active servers. Suites of benchmark tests are run and compared to previous results to look for performance anomalies. This ensures that the system or software changes do not adversely affect user results. At this stage, it is common to find servers without obvious hardware or software failures that need extra attention. In addition, the Operations team continues working on various tasks and updates backend infrastructure hardware such as networking routers, license servers, and facility power and cooling.
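The comparison against previous benchmark results might look something like the following Python sketch. It is a simplified illustration, not the CRC's actual validation suite: the benchmark names, the timings, and the 5% tolerance are all made up for the example.

```python
def find_slowdowns(baseline, current, tolerance=0.05):
    """Return (benchmark, percent slower) pairs that exceed the tolerance.

    baseline and current map a benchmark name to its runtime in seconds.
    The default 5% tolerance is an assumed threshold, not a CRC figure.
    """
    slow = []
    for name, old_seconds in baseline.items():
        new_seconds = current.get(name)
        if new_seconds is None:
            continue  # benchmark was not rerun after the update
        change = (new_seconds - old_seconds) / old_seconds
        if change > tolerance:
            slow.append((name, round(change * 100, 1)))
    return slow


# Hypothetical runtimes (seconds) before and after a maintenance weekend.
baseline = {"matrix_multiply": 120.0, "file_io": 45.0, "mpi_allreduce": 30.0}
current = {"matrix_multiply": 121.0, "file_io": 54.0, "mpi_allreduce": 29.5}

for name, percent in find_slowdowns(baseline, current):
    print(f"{name} is {percent}% slower than before the update")
```

In this made-up data only the file I/O test is flagged, which is the kind of result that would send a server back for extra attention even though nothing on it has visibly failed.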
Access granted
Assuming the process goes according to plan, Sunday mornings are reserved for last-minute updates or work on problematic servers. The team will also conduct any additional work required on login nodes and backend infrastructure systems. Once everything appears to be working as expected, the team reopens access to users and shares a follow-up with a summary of tasks completed over the weekend.
Originally published by crc.nd.edu on April 16, 2024.