
References

Table of contents
  1. References
    1. Trust and Safety Frameworks, Principles and Tools
      1. ACM Europe Technology Policy Committee
      2. Adobe
      3. Alignment Forum
      4. Berryville Institute of Machine Learning
      5. Coalition for Secure AI
      6. EleutherAI
      7. European Union
      8. Google
      9. Hugging Face
      10. IBM
      11. Infosys Responsible AI Toolkit
      12. International AI Safety Report 2025
      13. Meta
      14. Mitre
      15. MLCommons
      16. Mozilla Foundation
      17. Organization for Economic Co-operation and Development (OECD)
      18. Pacific Northwest Laboratory
      19. ServiceNow
      20. Stanford University
        1. Center for Research on Foundation Models (CRFM)
        2. Human-centered Artificial Intelligence (HAI)
      21. United States Government, Department of Commerce, National Institute of Standards and Technology (NIST)
      22. United States Government, Executive Branch
      23. United States Government, Department of State, Bureau of Arms Control, Deterrence, and Stability
      24. University of California, Berkeley
      25. University of Illinois at Chicago (UIUC) Secure Learning Lab
      26. University of Notre Dame, et al.
    2. Other Resources
      1. AI Leaderboards Are No Longer Useful
      2. ClairBot from the Responsible AI Team at Ekimetrics
      3. Foundational Challenges in Assuring Alignment and Safety of Large Language Models
      4. OODA loop
      5. Prompt Engineering
      6. Your AI Product Needs Evals

Trust and Safety Frameworks, Principles and Tools

An alphabetical list of links to AI trust and safety resources from governments, corporations, universities, and non-profit institutions, organized by organization. Some of these references are discussed more fully under Exploring AI Trust and Safety.

ACM Europe Technology Policy Committee

Comments in Response to European Commission Call for Evidence Survey on “Artificial Intelligence - Implementing Regulation Establishing a Scientific Panel of Independent Experts” (PDF) was prepared by the ACM Europe Technology Policy Committee (November 15, 2024). It is one of many ACM Public Policy Products.

Adobe

The AI inflection point provides Adobe’s recommendations for responsible AI in organizations (published December 2024).

Alignment Forum

The Alignment Forum brings together researchers to discuss all facets of AI model and system alignment.

Berryville Institute of Machine Learning

Berryville Institute of Machine Learning (BIML) is a group of cybersecurity experts exploring the security implications for ML/AI.

BIML resources include the following:

Coalition for Secure AI

The Coalition for Secure AI (CoSAI) is a relatively new initiative of the OASIS Open Projects. CoSAI is an open ecosystem of AI and security experts from industry-leading organizations dedicated to sharing best practices for secure AI deployment and to collaborating on AI security research and product development.

Specific work groups are focused on these areas:

  • Software supply chain security for AI systems
  • Preparing defenders for a changing security landscape
  • AI security risk governance
  • Secure design patterns for agentic systems

Resources will be published by the work groups as they become available. See, for example, their CoSAI Principles for Secure-by-Design Agentic Systems.

EleutherAI

In What We Mean by Trust and Safety, we discussed the definition of alignment published by EleutherAI.

lm-evaluation-harness is their popular, de facto standard open-source framework for running evaluations, including, but not limited to, safety evaluations.
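
A minimal sketch of invoking it from Python, assuming a recent version of the `lm-eval` package and a locally available Hugging Face model; the model identifier, task name, and example limit below are illustrative choices, not recommendations:

```python
# Hypothetical quick smoke test with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any causal LM identifier works here
    tasks=["hellaswag"],                             # substitute safety-oriented tasks as needed
    limit=10,                                        # score only a few examples for speed
)
print(results["results"])                            # per-task metric scores
```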

European Union

The EU AI Act is the first comprehensive regulation of AI in the EU. It takes a risk-based approach, including a unique set of rules for more powerful generative AI models. Like the GDPR for data, the EU AI Act is expected to affect AI practices far beyond the EU’s borders.

Google

Google has many resources for trust and safety. Here are some examples:

Hugging Face

The evaluate framework from Hugging Face is another popular tool for executing evaluations.
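
For illustration, here is a minimal sketch of computing a metric with `evaluate`; the metric name and toy data are arbitrary, and safety-oriented measurements (such as toxicity scorers) can be loaded the same way:

```python
# Minimal example: load a metric by name and score toy predictions.
import evaluate

accuracy = evaluate.load("accuracy")   # fetches the metric implementation by name
results = accuracy.compute(
    predictions=[1, 0, 1, 1],          # model outputs (toy data)
    references=[1, 0, 0, 1],           # ground-truth labels (toy data)
)
print(results)                         # {'accuracy': 0.75}
```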

IBM

IBM offers many resources for AI trust and safety:

  • Responsible AI: IBM’s description of responsible AI, as informed by IBM product offerings and services. In particular, see the Responsible Use Guide (PDF).
  • What is retrieval-augmented generation?: One of many introductions to the popular RAG Pattern.
  • Granite Guardian: IBM models that provide a robust suite of safeguards designed to detect risks in both prompts and responses, ensuring safe and responsible use with any large language model while promoting responsible AI development.
  • Unitxt: A library for portable Evaluation definitions. Integrated with many safety projects and tools, including lm-evaluation-harness (discussed above).
  • Risk Atlas Nexus: Aimed at relatively novice users of trust and safety tools, it provides a query prompt for finding the risk categories most relevant to the user’s needs. The user can then browse for more details on each category. It is hosted on the IBM Hugging Face Community.
  • BlueBench Leaderboard: An easy-to-use suite of benchmarks for different domains. It is hosted on the IBM Research Hugging Face Community.
  • Safety BAT Leaderboard: A benchmark that uses BenchBench to rate benchmarks according to their agreement with a defined Aggregate Benchmark, an enhanced representation of many available benchmarks. Since benchmarks can be expensive to run yourself, it is useful for selecting a representative set of benchmarks that cover the areas of concern without overlapping each other too much. It is hosted on the AI Alliance Hugging Face Community.
  • Kepler: Sustainability benchmarks, e.g., for estimating carbon consumption. An example of an Evaluation that isn’t focused on safety.

Infosys Responsible AI Toolkit

The Responsible AI Toolkit from Infosys incorporates features for safety, security, explainability, fairness, bias, and hallucination detection to help ensure AI solutions are trustworthy and transparent.

International AI Safety Report 2025

The International AI Safety Report 2025 is a report on the state of advanced AI capabilities and risks. Written by 100 AI experts, including representatives nominated by 33 countries and intergovernmental organizations, it is the latest update in a series that has been published annually for several years. The 2025 report was published January 29, 2025, just before the Artificial Intelligence Action Summit, held February 10 and 11, 2025, in Paris.

Meta

The Responsible Use Guide from Meta is their comprehensive guide for responsible use of AI in applications.

Meta Trust and Safety is their set of tools for ensuring trust and safety, reflecting the best practices in Meta’s Responsible Use Guide. It was released in conjunction with the Meta Llama 3 family of open models.

Open Source AI Can Help America Lead in AI and Strengthen Global Security is a recent statement from Meta on allowing US government agencies working on national security to use Llama models, and on the importance of open models for security and for retaining US leadership in AI.

Mitre

MITRE Enterprise ATT&CK is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations.

Common Weakness Enumeration (CWE) is the industry-standard catalog of known software and hardware weakness types of all kinds, not just those related to AI.

MLCommons

MLCommons AI Safety is the working group at MLCommons that defined the influential taxonomy of harms and benchmarks we discussed here.

Mozilla Foundation

Accelerating Progress Toward Trustworthy AI is the guide to AI trust and safety from the Mozilla Foundation (https://foundation.mozilla.org). It argues that open innovation for AI is the best way to ensure safety and wide accessibility.

Organization for Economic Co-operation and Development (OECD)

The OECD has published many resources on AI, including the following:

Pacific Northwest Laboratory

The laboratory’s Interactive OODA Processes for Operational Joint Human-Machine Decision Making, by Blaha and Leslie, explores machine vs. human approaches to OODA (Observe, Orient, Decide, Act).

ServiceNow

Responsible AI Guidelines: A Practical Handbook for Human-Centered AI is ServiceNow’s guide to responsible AI, reflecting their experiences building and providing AI technologies.

Stanford University

Several important projects are ongoing at Stanford University.

Center for Research on Foundation Models (CRFM)

The Holistic Evaluation of Language Models (HELM) project is an early and influential platform and tool set for general evaluation of AI models and systems, including work targeting domains like healthcare.

Recently, HELM released AIR-Bench 2024. From their website:

We introduce AIR-Bench 2024, the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in our AI Risks study, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-Bench 2024, uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-Bench 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.

Human-centered Artificial Intelligence (HAI)

The AI Index Report 2024: Measuring trends in AI, from the Artificial Intelligence Index, describes a wide range of trends in AI. In particular, it discusses how there are currently no standards for responsible AI; model and system builders all use different evaluations. Note that MLCommons is one organization attempting to fix this problem.

United States Government, Department of Commerce, National Institute of Standards and Technology (NIST)

Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations offers a taxonomy and general guidance on the unique security challenges of AI Systems.

We discussed the NIST guidance Artificial Intelligence Risk Management Framework (AI RMF 1.0) here. It provides NIST’s recommendations for assessing and managing AI risk.

The following were published during the Biden administration, but they have largely been superseded by the current administration:

NIST’s Responsibilities Under the October 30, 2023 Executive Order: NIST’s clarification of its roles and responsibilities under the executive order (see the next reference), including a Request for Information (RFI) to which the AI Alliance responded.

United States Government, Executive Branch

Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence expressed the Biden administration’s view on AI safety; it has since been superseded by the current administration.

United States Government, Department of State, Bureau of Arms Control, Deterrence, and Stability

Political Declaration on Responsible Military Use of Artificial Intelligence and Autonomy is a one-page State Department statement on the responsible use of AI by governments, including their military forces.

University of California, Berkeley

AI Risk-Management Standards Profile for General-Purpose AI Systems (GPAIS) and Foundation Models offers guidance on risk assessment and management.

Chatbot Arena is a very popular, crowd-sourced platform for gauging the performance of chatbots. See also the related LMArena project.

University of Illinois at Chicago (UIUC) Secure Learning Lab

DecodingTrust, from the AI Secure lab, is a comprehensive assessment of trustworthiness in GPT models.

University of Notre Dame, et al.

Trusted AI (TAI) Frameworks Project is a consortium of universities and United States Department of Defense (DoD) agencies researching the requirements for trustworthy AI, which we discussed here.

Other Resources

AI Leaderboards Are No Longer Useful

AI Leaderboards Are No Longer Useful is an informative and influential blog post about the difficulties of relying on leaderboards to choose the best performing models or systems: leaderboards often ignore total cost, rely on benchmarks with limited scope, and have other limitations.

ClairBot from the Responsible AI Team at Ekimetrics

ClairBot from the Responsible AI Team at Ekimetrics is a research project that compares the responses of several models to domain-specific questions, where each model has been tuned for a particular domain, in this case ad serving, laws and regulations, and social sciences and ethics.

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Foundational Challenges in Assuring Alignment and Safety of Large Language Models is a comprehensive survey of current challenges for LLMs.

OODA loop

The OODA loop is a decision cycle of four steps performed continuously: Observe, Orient, Decide, Act. Originally developed by United States Air Force Colonel John Boyd for combat operations, it has since been applied in other areas, such as industrial applications and project assessment.
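
As a concrete, hypothetical illustration, here is a toy sketch of an OODA-style loop; the "environment" and the logic in each stage are invented placeholders, not part of Boyd's formulation:

```python
# Toy OODA loop: step toward a target position until close enough.
def observe(env):
    return {"distance": env["target"] - env["position"]}            # Observe: gather raw signals

def orient(observation):
    return "close" if abs(observation["distance"]) <= 1 else "far"  # Orient: interpret them in context

def decide(situation):
    return "stop" if situation == "close" else "advance"            # Decide: choose a course of action

def act(env, action):
    if action == "advance":
        env["position"] += 1                                         # Act: change the environment

env = {"position": 0, "target": 5}
while True:
    action = decide(orient(observe(env)))
    if action == "stop":
        break
    act(env, action)
print(env)  # {'position': 4, 'target': 5}
```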

Prompt Engineering

The Wikipedia page on prompt engineering provides one of many overviews of techniques used to craft Prompts that elicit more desirable responses, when used by good actors, or less desirable ones, when used by bad actors to undermine an AI system.
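
As a small, hypothetical illustration of the benign side of these techniques, compare a bare prompt with one that adds role instructions and a few-shot example (the wording is invented for illustration):

```python
# Contrast a bare prompt with a lightly engineered one (role instructions + few-shot example).
review = "The battery died after a day."

bare_prompt = f"Is this review positive or negative? '{review}'"

engineered_prompt = (
    "You are a careful assistant. Answer with exactly one word: Positive or Negative.\n"
    "Review: 'I loved the crisp display.' -> Positive\n"   # few-shot example
    f"Review: '{review}' -> "                              # the actual query
)

print(bare_prompt)
print(engineered_prompt)
```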

Your AI Product Needs Evals

Your AI Product Needs Evals is an engineer’s guide to various techniques for ensuring the alignment of AI systems.