University of Notre Dame and IBM Research build tools for AI governance

Main Building (Photo by Matt Cashore/University of Notre Dame)

Expanding into virtually all aspects of modern society, AI systems are transforming everything from education to healthcare, but how trustworthy are the vast data landscapes that are fueling them?

The BenchmarkCards framework, a collection of datasets, benchmarks, and mitigations that serves as a guide for developers to build safe and transparent AI systems, was recently incorporated into IBM’s Risk Atlas Nexus, the company’s open-source AI toolkit for governance of foundation models.

With support from the Notre Dame-IBM Technology Ethics Lab, researchers at the University of Notre Dame's Lucy Family Institute for Data & Society and IBM Research jointly developed the framework, which targets the entire community of researchers and developers and provides practical guidelines for evaluating and mitigating potential risks when developing AI models.

The development of Large Language Models (LLMs) and the assessment of their capabilities are guided by their performance on benchmarks – combinations of datasets, evaluation metrics, and associated processing steps. Although benchmarks serve this critical role, when misused they can provide a false sense of safety or performance, with serious ethical and practical implications.

In education, for example, the heavy emphasis on popular benchmarks such as Massive Multitask Language Understanding (MMLU) and Grade School Math 8K (GSM8K) for evaluating large language models like ChatGPT has contributed to the development of AI tutors and test proctoring systems that, while innovative, may be limited in promoting deep conceptual understanding. These systems sometimes yield inconsistent results and raise important questions about the use of personal biometric data and informed consent.

To address these concerns, the BenchmarkCards framework is designed with a standardized documentation system that records essential benchmark metadata, including factual accuracy and bias detection. This enables researchers and developers to make more informed decisions about which benchmarks best suit their specific needs for AI system design — a recognized need in the AI community that was identified during a user study of BenchmarkCards.
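To give a sense of how standardized benchmark metadata can support this kind of decision, the sketch below models a benchmark card as a small data structure and filters benchmarks by coverage and documented risks. The field names and risk labels are illustrative assumptions for this article, not the BenchmarkCards framework's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the kind of metadata a benchmark card might record.
# Field names are illustrative assumptions, not the framework's real schema.
@dataclass
class BenchmarkCard:
    name: str
    intended_use: str                                 # what the benchmark is designed to measure
    domains: list = field(default_factory=list)       # e.g. ["math", "reasoning"]
    known_risks: list = field(default_factory=list)   # documented limitations

def suitable_for(cards, domain, avoid_risk):
    """Select benchmarks that cover a domain and lack a flagged risk."""
    return [c.name for c in cards
            if domain in c.domains and avoid_risk not in c.known_risks]

cards = [
    BenchmarkCard("GSM8K", "grade-school math word problems",
                  domains=["math"],
                  known_risks=["training-data contamination"]),
    BenchmarkCard("MMLU", "broad multitask knowledge",
                  domains=["knowledge", "reasoning"],
                  known_risks=[]),
]

print(suitable_for(cards, "reasoning", "training-data contamination"))  # ['MMLU']
```

Even this toy version shows the payoff the article describes: once intended use and known risks are recorded uniformly, benchmark selection becomes a query rather than an ad hoc judgment.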

“There is a growing discussion about large language models and concern about how these tools behave when compared to certain benchmarks,” said Nuno Moniz, associate research professor at the Lucy Family Institute for Data & Society and former director of the Notre Dame-IBM Technology Ethics Lab. “Benchmarks are designed with specific uses in mind and, in addition to not being exempt from risks, we observe a growing practice of assessing LLM capabilities outside of such intended uses. The BenchmarkCards framework is a significant contribution to the community, allowing developers to be more intentional and better guided when assessing the capabilities of these tools,” he added.

The project is led by Moniz; Michael Hind, Distinguished Research Staff Member at IBM Research; and Elizabeth Daly, Research Scientist and Lead of the Interactive AI Group at the IBM Research Laboratory in Dublin, Ireland; together with Anna Sokol, a doctoral student in the Department of Computer Science and Engineering and a Lucy Graduate Scholar.

“Current documentation of benchmarks is ad hoc and often incomplete,” said Hind. “By having more standardized documentation, researchers and developers will be able to choose the most appropriate benchmark for their use case, resulting in better evaluations and ultimately more accurate and safe AI systems.”

Collaborators included Notre Dame faculty Nitesh Chawla, founding director of the Lucy Family Institute for Data & Society; Xiangliang Zhang, the Leonard C. Bettex Collegiate Professor of Computer Science in the Department of Computer Science and Engineering; and David Piorkowski, Staff Research Scientist at IBM.

“The work we did with our partners at the University of Notre Dame is helping to set a standard that we hope the community adopts to improve transparency and documentation around these benchmarks,” said Daly. “At IBM Research, it is now an integral part of our ontology as part of Risk Atlas Nexus.”

Since 2023, the University of Notre Dame has also been a member of the AI Alliance—a global consortium led by IBM and Meta—dedicated to advancing AI research, education, and governance, with a focus on open innovation and the development of safe and trustworthy AI systems.

“The future of AI lies not just in advancing algorithms, but in aligning them with human values—ensuring innovation fosters societal benefit,” said Chawla, who is also the Frank M. Freimann Professor of Computer Science & Engineering. “To actualize this alignment, we are deepening industry and academia partnerships by working together to design tools that promote transparency and empower researchers and developers to build more responsible AI models."

To learn more about other AI projects and activities within the Lucy Family Institute, please visit the Lucy Family Institute website.

Contact:

Christine Grashorn, Program Director, Engagement and Strategic Storytelling
Lucy Family Institute for Data & Society / University of Notre Dame
cgrashor@nd.edu / 574.631.4856
lucyinstitute.nd.edu / @lucy_institute

About the Lucy Family Institute for Data & Society

Guided by Notre Dame’s Mission, the Lucy Family Institute adventurously collaborates on advancing data-driven and artificial intelligence (AI) convergence research, translational solutions, and education to ethically address society’s vexing problems. As an innovative nexus of academia, industry, and the public, the Institute also fosters data science and AI access to strengthen diverse and inclusive capacity building within communities.

About the Notre Dame-IBM Technology Ethics Lab

The Notre Dame–IBM Technology Ethics Lab, a critical component of the Institute for Ethics and the Common Good and the Notre Dame Ethics Initiative, advances ethical, human-centered approaches to the design, development, and use of artificial intelligence and emerging technologies. Through applied, interdisciplinary research and broad stakeholder engagement, the Lab fosters dialogue, builds collaborative communities, and shapes policies and practices for responsible innovation and governance at scale.
