STREAM

A Standard for Transparently Reporting Evaluations in AI Model Reports


Introduction

Leading AI companies regularly publish the results of their dangerous capability evaluations in model cards, and these results are often cited to support important claims about a model’s level of risk. However, model cards often lack key details about the evaluations conducted, which undermines the credibility of safety claims and impedes third-party scrutiny.

To address this, we propose STREAM, a reporting standard outlining the information needed for dangerous capability evaluations to be properly understood and scrutinised by third parties. We present this standard as “version 1”—a starting point designed to evolve with the science of evaluations.

STREAM

The STREAM standard consists of 28 reporting criteria organized into six high-level categories: Threat Relevance; Test Construction, Grading, and Scoring; Model Elicitation; Model Performance; Baseline Performance; and Results Interpretation. A high-level overview is presented below, and the complete checklist of all criteria can be found in the paper.

Summary Checklist of STREAM
1. Threat relevance
(i) Does the report describe the capabilities that the evaluation measures, and which threat models they are relevant to?
(ii) Does the report state what evaluation results would "rule in" or "rule out" capabilities of concern, if any?
(iii) Does the report provide an example evaluation item and response?
2. Test construction, grading & scoring
(i) Does the report state the number of evaluation items?
(ii) Does the report describe the item type (multiple choice, multiple response, short answer, etc.) and scoring method?
(iii) Does the report describe how the grading criteria were created, and describe quality control measures?
(iv) If human/expert graded:
(iv-a) Does the report describe the grader sample?
(iv-b) Does the report describe the grading process?
(iv-c) Does the report state the level of agreement between human graders?
(v) If auto-graded by a model:
(v-a) Does the report describe the base model used for grading, and any modifications made to it?
(v-b) Does the report describe the automated grading process?
(v-c) Does the report state whether the autograder was compared to human graders/other auto-graders?
3. Model elicitation
(i) Does the report specify the exact model version(s) tested?
(ii) Does the report specify the safety mitigations active during testing, and any adaptations to elicitation?
(iii) Does the report describe the elicitation techniques for the test in sufficient detail?
4. Model performance
(i) Does the report give representative performance statistics (e.g. mean, maximum)?
(ii) Does the report give uncertainty measures, and specify the number of evaluation runs conducted?
(iii) Does the report provide results from ablations/alternative testing conditions?
5. Baseline performance
(i) If a human baseline was used:
(i-a) Does the report describe the human baseline sample and recruitment?
(i-b) Does the report give human performance statistics, and describe differences with the AI test?
(i-c) Does the report describe how human performance was elicited?
(ii) If no human baseline was used:
(ii-a) Does the report explain why a human baseline would not be appropriate/feasible?
(ii-b) Does the report provide an alternative comparison point, and explain it?
6. Results interpretation [Can apply once across evaluations]
(i) Does the report state overall conclusions about the model's capabilities/risk level, and connect with evaluation evidence?
(ii) Does the report give 'falsification' conditions for its conclusions, and state whether pre-registered?
(iii) Does the report include predictions about near-term future performance?
(iv) Does the report state the length of time allowed for interpreting results before deployment?
(v) Does the report describe any notable disagreements over results interpretation?
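
Criteria 2(iv-c) and 2(v-c) above ask whether a report quantifies agreement between graders. As a purely illustrative sketch (not part of the standard), one common agreement measure is Cohen's kappa; the function name and the pass/fail labels below are our own assumptions, and reports may reasonably use other measures such as percentage agreement or Krippendorff's alpha.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two graders' labels on the same set of items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items the two graders scored identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each grader's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical example: two graders each scoring the same ten responses.
grader_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
grader_2 = ["pass", "fail", "fail", "pass", "fail", "fail", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(grader_1, grader_2), 2))  # 0.58
```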
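
Criteria 4(i) and 4(ii) above ask for representative performance statistics, uncertainty measures, and the number of evaluation runs. The sketch below shows one illustrative way to report a mean, a maximum, and a bootstrap 95% confidence interval from per-run scores; the function name, the example scores, and the choice of a percentile bootstrap are assumptions on our part rather than requirements of the standard.

```python
import random
import statistics

def summarize_runs(scores, n_boot=10_000, seed=0):
    """Summarize per-run evaluation scores: mean, max, and a bootstrap 95% CI."""
    rng = random.Random(seed)
    # Percentile bootstrap of the mean across runs (resampling runs with replacement).
    boot_means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    )
    return {
        "n_runs": len(scores),
        "mean": statistics.mean(scores),
        "max": max(scores),
        "ci95": (boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]),
    }

# Hypothetical example: five independent runs of the same benchmark (fraction of items passed).
print(summarize_runs([0.62, 0.58, 0.65, 0.61, 0.59]))
```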

FAQ

Q: How was the STREAM standard developed?

A: STREAM was developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies. It builds upon a previous checklist for analyzing CBRN evaluation reports and was refined through multiple drafts and external feedback. We discuss our methodology in greater detail in the paper.

Q: Does following the STREAM standard mean an AI model is safe?

A: No. Following the STREAM standard only indicates that the model’s evaluations have been reported in sufficient detail for third parties to scrutinise them and reach well-founded judgements.

Q: What if an evaluation contains sensitive information that can't be shared publicly?

A: The standard is designed to avoid forcing the disclosure of sensitive or hazardous information. In cases where reporting certain details could create a security risk, we suggest omitting the information from public reports. Instead, developers should provide those details to a credible, independent third party (such as a government AI Safety Institute) and include a statement or attestation from that party in their public report.

Q: Is STREAM only relevant for ChemBio evals?

A: This initial version of the standard is specifically targeted at chemical and biological (ChemBio) evaluations. While many of its principles may carry over to other domains, such as cyber-offence capability evaluations, the standard is not designed for use in those areas.

Q: Does STREAM cover all types of ChemBio evals?

A: STREAM focuses specifically on benchmark evaluations and is not designed to be sufficient for other types of evaluation (e.g. human uplift studies, red-teaming). However, we do offer preliminary guidance for reporting on human uplift studies in Appendix B of the paper.

Q: How should I cite this work?

A: McCaslin et al. (2025) STREAM: A Standard for Transparently Reporting Evaluations in AI Model Reports (arXiv:2508.09853). https://doi.org/10.48550/arXiv.2508.09853

Q: How can I provide feedback on the standard?

A: We welcome all feedback! We have presented this standard as “version 1” because the science of capability evaluations is still developing, and we intend for STREAM to be an evolving standard that improves with the science over time. We are particularly interested in suggestions regarding criteria that may be missing or those that are overly burdensome to implement. Please send any feedback to feedback@streamevals.com.

Feedback

Our mission with STREAM is to improve public transparency in how dangerous AI capabilities are reported. We see this standard as a first step in a community-wide effort to bring more rigor to evaluation reporting, and we encourage others to build on this work, whether by iterating on STREAM, expanding it to new domains, or developing new standards altogether.

More such efforts are needed to move the field toward scientific norms that can strengthen the credibility of evaluation results. This is a shared mission, and we invite anyone interested in collaborating to contact us at: feedback@streamevals.com.

Authors

Note: The views and opinions expressed in this paper are those of the authors and do not necessarily reflect the official policy or position of their employers.

Tegan McCaslin (Independent)
Jide Alaga (METR)
Samira Nedungadi (SecureBio)
Seth Donoughe (SecureBio)
Tom Reed (GovAI)
Chris Painter (METR)
Rishi Bommasani (HAI)
Luca Righetti (GovAI)

For email correspondence, contact Luca Righetti (luca.righetti@governance.ai)
