Leading AI companies regularly publish the results of their dangerous capability evaluations in “model reports” (also called “model cards” or “system cards”). These results are often cited to support important claims about a model’s level of risk. However, there is little consistency across these reports in the evaluation details they provide. In particular, many model reports lack sufficient information on how evaluations were conducted, what they found, and how the results informed risk assessments. This lack of detail undermines the credibility of safety claims and impedes third-party replication efforts.
To address this problem, we propose STREAM-CB, a reporting standard that outlines the key information that should be disclosed in order for third parties to understand, interpret, and scrutinise the results of dangerous capability evaluations. We designed the standard to serve as both a practical resource and an assessment tool: companies can use it as a checklist to improve the transparency of their model reports, and third parties can use it to assess those reports.
Given that the science of dangerous capability evaluations is still developing, we view STREAM-CB as a starting point, and created it with the expectation that it will require updates as the field matures. We thus refer to the standard in this paper as “version 1”, and we invite researchers, practitioners, and regulators to use and improve upon STREAM.
Our reporting standard comprises 28 criteria organized into six high-level categories: Threat Relevance; Test Construction, Grading and Scoring; Model Elicitation; Model Performance; Baseline Performance; and Results Interpretation.
We structured our standard to include two tiers of information. Each criterion specifies both a “minimum” requirement, detailing the information that must be included in a given evaluation summary to signify partial compliance with our standard, and a “full compliance” requirement, which is met only when all recommended details are provided.
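To make this two-tier structure concrete, here is a minimal illustrative sketch (in Python) of how a third party might record and tally compliance levels across criteria. The class names, category strings, and tallying scheme are illustrative assumptions for this sketch only, not definitions taken from the standard itself.

```python
from dataclasses import dataclass
from enum import Enum


class Compliance(Enum):
    # Illustrative tiers only; the paper defines the actual scoring scheme.
    NOT_MET = 0
    MINIMUM = 1   # partial compliance: the "minimum" information is reported
    FULL = 2      # full compliance: all recommended details are reported


@dataclass
class Criterion:
    category: str      # e.g. "Threat Relevance"
    name: str          # e.g. "(i) Capability being measured and relevance to threat model"
    level: Compliance  # level assigned by a reviewer for a given model report


def summarise(criteria: list[Criterion]) -> dict[str, dict[str, int]]:
    """Tally, per category, how many criteria meet at least the minimum tier and how many meet the full tier."""
    summary: dict[str, dict[str, int]] = {}
    for c in criteria:
        bucket = summary.setdefault(c.category, {"minimum_or_better": 0, "full": 0, "total": 0})
        bucket["total"] += 1
        if c.level is not Compliance.NOT_MET:
            bucket["minimum_or_better"] += 1
        if c.level is Compliance.FULL:
            bucket["full"] += 1
    return summary


if __name__ == "__main__":
    example = [
        Criterion("Threat Relevance", "(i) Capability being measured", Compliance.FULL),
        Criterion("Threat Relevance", "(ii) Strength & nature of evidence", Compliance.MINIMUM),
        Criterion("Model Performance", "(i) Main scores", Compliance.NOT_MET),
    ]
    print(summarise(example))
```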
**Threat Relevance**

(i) Capability being measured and relevance to threat model

| Minimum Requirements | Full Compliance |
|---|---|

(ii) Strength & nature of evidence provided by evaluation and performance threshold

| Minimum Requirements | Full Compliance |
|---|---|

(iii) Example item & answer

| Minimum Requirements | Full Compliance |
|---|---|
* This can be stated just once in the model card for a suite of evaluations. Note that, very frequently, a model card will simply identify the core threat model(s) once in an introductory section to the CBRN evaluations as a whole; this is fine and should still be awarded points.
§ This is not necessary if evaluators explicitly disclose in the report that the evaluation is not a significant contributor to the model safety assessment; in that case, the “minimum” is sufficient for full credit. Additionally, these full-credit requirements can automatically count as met if, instead of labelling individual evals as rule-in or rule-out, both of the following conditions are met:
**Test Construction, Grading and Scoring**

(i) Number of items tested & relation to full test set

| Minimum Requirements | Full Compliance |
|---|---|

(ii) Answer format & scoring

| Minimum Requirements | Full Compliance |
|---|---|

(iii) Answer-key / rubric creation and quality assurance

| Minimum Requirements | Full Compliance |
|---|---|

(iv-a) Human-graded: grader sample and recruiting details

| Minimum Requirements | Full Compliance |
|---|---|

(iv-b) Human-graded: details of grading process and grade aggregation

| Minimum Requirements | Full Compliance |
|---|---|

(iv-c) Human-graded: level of grader agreement

| Minimum Requirements | Full Compliance |
|---|---|

(v-a) Auto-graded: grader model used and modifications applied to it

| Minimum Requirements | Full Compliance |
|---|---|

(v-b) Auto-graded: details of auto-grading process and rubric used

| Minimum Requirements | Full Compliance |
|---|---|

(v-c) Auto-graded: validation of auto-grader

| Minimum Requirements | Full Compliance |
|---|---|
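The grader-agreement and auto-grader validation criteria above concern how closely grades from different graders (human or automated) track one another. As a hedged illustration of the kind of analysis such reporting might summarise, the sketch below computes two common agreement statistics, raw percent agreement and Cohen's kappa, between hypothetical human grades and auto-grader grades on a validation set; the grade labels and data are illustrative assumptions, not material from any model report.

```python
from collections import Counter


def percent_agreement(human: list[str], auto: list[str]) -> float:
    """Fraction of items where the auto-grader matches the human grade."""
    assert len(human) == len(auto) and human, "grade lists must be non-empty and aligned"
    return sum(h == a for h, a in zip(human, auto)) / len(human)


def cohens_kappa(human: list[str], auto: list[str]) -> float:
    """Chance-corrected agreement (Cohen's kappa) between two graders."""
    n = len(human)
    p_o = percent_agreement(human, auto)
    human_freq, auto_freq = Counter(human), Counter(auto)
    labels = set(human_freq) | set(auto_freq)
    # Expected agreement by chance, from each grader's marginal label frequencies.
    p_e = sum((human_freq[label] / n) * (auto_freq[label] / n) for label in labels)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0


if __name__ == "__main__":
    # Hypothetical per-item grades ("pass"/"fail") on a held-out validation set.
    human_grades = ["pass", "fail", "pass", "pass", "fail", "fail", "pass", "fail"]
    auto_grades = ["pass", "fail", "pass", "fail", "fail", "fail", "pass", "pass"]
    print(f"agreement = {percent_agreement(human_grades, auto_grades):.2f}")
    print(f"kappa     = {cohens_kappa(human_grades, auto_grades):.2f}")
```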
**Model Elicitation**

(i) Model version(s) tested and relationship to launch model

| Minimum Requirements | Full Compliance |
|---|---|

(ii) Safeguards & bypassing techniques

| Minimum Requirements | Full Compliance |
|---|---|

(iii) Elicitation methods used

| Minimum Requirements | Full Compliance |
|---|---|
** The approach to testing safeguarded models (including, e.g., the active safeguards and the bypassing strategies) can be reported just once in the model card, so long as the reader can clearly distinguish which evaluations this applies to (if this is not made clear, the point is not awarded).
† These criteria can be met if the model card makes clear elsewhere that, unless otherwise specified, the models tested for each evaluation were of a particular type, and specifies the relation of that type to the deployed model. Be sure to search the entire model card comprehensively to check whether such statements exist and whether they apply to the particular eval you are scoring.
**Model Performance**

(i) Main scores

| Minimum Requirements | Full Compliance |
|---|---|

(ii) Uncertainty & number of benchmark runs

| Minimum Requirements | Full Compliance |
|---|---|

(iii) Ablations / alternative conditions

| Minimum Requirements | Full Compliance |
|---|---|
*** This can also be met if the model card makes clear elsewhere that each evaluation is run a given number of times (X) or, alternatively, specifies a different method for establishing confidence intervals (such as a “bootstrap procedure”).
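As an illustrative sketch of the kind of confidence-interval procedure this footnote refers to, the code below computes a percentile bootstrap interval over per-item scores from a single benchmark run. The data and the choice of the mean as the statistic are assumptions made for illustration only, not a prescription of the standard.

```python
import random
import statistics


def bootstrap_ci(item_scores: list[float], n_resamples: int = 10_000,
                 confidence: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for a benchmark's mean score.

    `item_scores` holds per-item results, e.g. 1.0 for a correct answer and 0.0 otherwise.
    """
    rng = random.Random(seed)
    n = len(item_scores)
    # Resample items with replacement and recompute the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(item_scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(int((1.0 - alpha) * n_resamples), n_resamples - 1)]
    return low, high


if __name__ == "__main__":
    # Hypothetical per-item correctness for one benchmark run (62/100 correct).
    scores = [1.0] * 62 + [0.0] * 38
    low, high = bootstrap_ci(scores)
    print(f"mean = {statistics.fmean(scores):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```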
**Baseline Performance**

(i-a) Human baseline: sample details

| Minimum Requirements | Full Compliance |
|---|---|

(i-b) Human baseline: full scores and uncertainty metrics

| Minimum Requirements | Full Compliance |
|---|---|

(i-c) Human baseline: details of elicitation, resources available and incentives provided

| Minimum Requirements | Full Compliance |
|---|---|

(ii-a) No human baseline: justification of absence

| Minimum Requirements | Full Compliance |
|---|---|

(ii-b) No human baseline: alternative to human baseline

| Minimum Requirements | Full Compliance |
|---|---|
**Results Interpretation**

(i) Interpretation of test outcomes, capability conclusion and relation to decision-making

| Minimum Requirements | Full Compliance |
|---|---|

(ii) Details of test results that would have overturned the above capability conclusion and details of preregistration

| Minimum Requirements | Full Compliance |
|---|---|

(iii) Predictions of future performance gains from post-training and elicitation

| Minimum Requirements | Full Compliance |
|---|---|

(iv) Time available to relevant teams for interpretation of results

| Minimum Requirements | Full Compliance |
|---|---|

(v) Presence of internal disagreements

| Minimum Requirements | Full Compliance |
|---|---|
†† Criteria in this section can be reported once per report, rather than once per evaluation.
**** Note that one way to meet this criterion is to clearly mark every CBRN evaluation as a “rule-in” or “rule-out” evaluation with a clear and comprehensible performance threshold. If this is done, it is clear that falsifying results would have been inferred from rule-in or rule-out evaluations falling on the other side of that performance threshold. In that case, however, the model card should ideally explain what the developer would do in the rare case that a rule-in and a rule-out threshold explicitly contradicted one another.
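As a hedged sketch of the rule-in/rule-out logic described in this footnote (the evaluation names, thresholds, and decision rules below are illustrative assumptions, not prescriptions of the standard), the snippet shows how a single evaluation's result can map to a capability conclusion and how a contradiction between rule-in and rule-out evaluations could be flagged.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class Evaluation:
    """Illustrative record of one evaluation's role in a capability assessment."""
    name: str
    role: Literal["rule-in", "rule-out"]
    threshold: float  # pre-specified performance threshold
    score: float      # observed model performance


def capability_signal(ev: Evaluation) -> Optional[str]:
    """Return the conclusion this single evaluation supports, if any.

    A rule-in evaluation at or above its threshold supports concluding the capability
    is present; a rule-out evaluation below its threshold supports concluding it is
    absent. Otherwise the evaluation is uninformative on its own.
    """
    if ev.role == "rule-in" and ev.score >= ev.threshold:
        return "capability present"
    if ev.role == "rule-out" and ev.score < ev.threshold:
        return "capability absent"
    return None


def check_for_contradiction(evals: list[Evaluation]) -> bool:
    """Flag the rare case where rule-in and rule-out evaluations disagree."""
    signals = {s for s in map(capability_signal, evals) if s is not None}
    return {"capability present", "capability absent"} <= signals


if __name__ == "__main__":
    evals = [
        Evaluation("bio-benchmark-A", "rule-in", threshold=0.80, score=0.85),
        Evaluation("bio-benchmark-B", "rule-out", threshold=0.30, score=0.25),
    ]
    for ev in evals:
        print(ev.name, "->", capability_signal(ev))
    print("contradiction:", check_for_contradiction(evals))
```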
Q: How was the STREAM standard developed?
A: STREAM was developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies. It builds upon a previous checklist for analyzing CBRN evaluation reports and was refined through multiple drafts and external feedback. We discuss our methodology in greater detail in the paper.
Q: Does following the STREAM standard mean an AI model is safe?
A: No. Following the STREAM standard only indicates that the model’s evaluations have been reported in sufficient detail for strong third-party judgements to be made.
Q: What if an evaluation contains sensitive information that can’t be shared publicly?
A: The standard is designed to avoid forcing the disclosure of sensitive or hazardous information. In cases where reporting certain details could create a security risk, we suggest omitting the information from public reports. Instead, developers should provide those details to a credible, independent third party (such as a government AI Safety Institute) and include a statement or attestation from that party in their public report.
Q: Is STREAM only relevant for ChemBio evals?
A: This initial version of the standard is specifically targeted at chemical and biological (ChemBio) evaluations. This means that, while many of its principles may be applicable to other domains, such as cyber offence capability evaluations, it is not designed for use in those areas.
Q: Does STREAM cover all types of ChemBio evals?
A: STREAM focuses specifically on benchmark evaluations and is not designed to be sufficient for other types of evaluation (e.g. human uplift studies, red-teaming). However, we do offer preliminary guidance for reporting on human uplift studies in Appendix B of the paper.
Q: How should I cite this work?
A: McCaslin et al. (2025) STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports (arXiv:2508.09853). https://doi.org/10.48550/arXiv.2508.09853
Our mission with STREAM-CB is to improve public transparency in how dangerous AI capabilities are reported. We see this standard as a first step in a vital, community-wide effort to bring more rigor to evaluation reporting.
To help initiate this effort, we invited experts and organizations in AI safety and governance to affirm their support for the core principles of our work. The following statement reflects that shared commitment. It is intended to encourage the adoption of rigorous, transparent reporting practices and to signal broad agreement on the value of frameworks like STREAM in advancing responsible AI governance. The individuals and organizations listed below have added their names in support of this goal.
Transparency in AI safety is fundamental to responsible innovation, helping to build public trust and advance our scientific understanding. Model reports should clearly disclose their results and methodologies to independent parties. We hope that frameworks like STREAM and other criteria can better structure both internal and external model reports—and that future work iterates and expands such work to strengthen ‘peer review’.
[Name, Organization]
| Name | Affiliation |
|---|---|
| Tegan McCaslin | Independent |
| Jide Alaga | METR |
| Samira Nedungadi | SecureBio |
| Seth Donoughe | SecureBio |
| Tom Reed | GovAI |
| Chris Painter | METR |
| Rishi Bommasani | HAI |
| Luca Righetti | GovAI |
For email correspondence, contact Luca Righetti (luca.righetti@governance.ai)