STREAM (ChemBio)

A Standard for Transparently Reporting Evaluations in AI Model Reports


Introduction

Leading AI companies regularly publish the results of their dangerous capability evaluations in “model reports” (also called “model cards” or “system cards”). These results are often cited to support important claims about a model’s level of risk. However, there is little consistency across these reports in the evaluation details they provide. In particular, many model reports lack sufficient information on how evaluations were conducted, what they found, and how the results informed risk assessments. This lack of detail undermines the credibility of safety claims and impedes third-party replication efforts.

To address this problem, we propose STREAM-CB - a reporting standard that outlines the key information that should be disclosed in order for third parties to understand, interpret, and scrutinise the results of dangerous capability evaluations. We designed the standard to serve as both a practical resource and an assessment tool: companies can use it as a checklist to improve the transparency of their model reports, and third parties can use it to assess those reports.

Given that the science of dangerous capability evaluations is still developing, we view STREAM-CB as a starting point, and created it with the expectation that it will require updates as the field matures. We thus refer to the standard in this paper as “version 1”, and we invite researchers, practitioners, and regulators to use and improve upon STREAM.

STREAM-CB

Our reporting standard comprises 28 criteria organized into six high-level categories: Threat Relevance; Test Construction, Grading and Scoring; Model Elicitation; Results Reporting; Baseline Results; and Safety Interpretation.

We structured our standard to include two tiers of information. Each criterion specifies a “minimum” set of details to include in a given evaluation summary (which signifies partial compliance with our standard) and a “full compliance” set (which signifies meeting our standard in full, with all recommended details provided).
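To make this two-tier structure concrete, here is a minimal, purely illustrative sketch (in Python) of how a criterion and its partial/full scoring could be represented in a compliance checklist. The class, field names, and example items are our own shorthand for this page, not part of the standard itself.

```python
# Illustrative only: a toy representation of a STREAM-CB criterion with its
# two tiers. Field names and the example items below are placeholders.
from dataclasses import dataclass


@dataclass
class Criterion:
    category: str        # e.g. "Threat Relevance"
    name: str            # e.g. "(iii) Example item & answer"
    minimum: list[str]   # details required for partial compliance
    full: list[str]      # additional details required for full compliance

    def score(self, reported: set[str]) -> str:
        """Return 'full', 'partial', or 'none' for one evaluation summary."""
        if not all(detail in reported for detail in self.minimum):
            return "none"
        if all(detail in reported for detail in self.full):
            return "full"
        return "partial"


example = Criterion(
    category="Threat Relevance",
    name="(iii) Example item & answer",
    minimum=["example test item", "sample answer"],
    full=["representativeness statement", "limitations if unrepresentative"],
)
print(example.score({"example test item", "sample answer"}))  # -> "partial"
```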

 

Threat Relevance

(i) Capability being measured and relevance to threat model
Minimum Requirements Full Compliance
  • Describe the capability the evaluation measures
  • Describe the threat model the evaluation informs*
  • State the type of actor in that threat model*
  • State the misuse vector in that threat model*
  • State the enabling AI capability of concern
  • Give a brief justification of why the evaluation is a suitable measure (this could be an explanation of why the skills measured by the evaluation matter)§
  • Where applicable, note any major limitations readers should know§
(ii) Strength & nature of evidence provided by evaluation and performance threshold
Minimum Requirements Full Compliance
  • Say whether a score on this evaluation could be taken as strong evidence that the model lacks or possesses the capability of concern, OR state that the evaluation is not core to the safety assessment. (Note that a rule-in or rule-out threshold counts as a judgement of the strength of evidence an eval provides).
  • Specify whether the eval has a quantitative threshold that rules out or rules in a capability
  • State what those quantitative thresholds are (eg a particular x% score on the eval or exceeding a human baseline)§
  • If a threshold is set, give a brief justification for those numbers§
  • State whether these thresholds were defined before or after the evaluation was run.§
  • Where applicable, if an evaluation was created by an external party and the report's interpretation of performance thresholds differs from that of the original evaluation designer, explicitly report that difference.§
(iii) Example item & answer
Minimum Requirements Full Compliance
  • Provide at least one example test item (it may be redacted, as long as it still contains enough detail to illustrate task complexity)
  • Provide a sample answer to that item (as above wrt redactions)
  • State whether the example is representative of test difficulty
  • If (and only if) the example is not representative of test difficulty, explain the limitations of the example.

* This can be stated just once in the model card for a suite of evaluations – note that, very frequently, a model card will simply identify the core threat model(s) once in an introductory section to the CBRN evaluations as a whole. This is fine and should still be awarded points.

§ This is not necessary if evaluators explicitly disclose in the report that the evaluation is not a significant contributor to the model safety assessment. In that case, the “minimum” is sufficient for full credit. Additionally, these full-credit requirements automatically count as met if, instead of labelling individual evals as rule-in or rule-out, both of the following conditions are met:

  1. the evaluation report elsewhere explains that no single evaluation is capable of ruling a capability in or out*; and
  2. the evaluation report at some point explains the overall configuration of evidence that would either “rule in” or “rule out” capabilities, as required by 5.6(i) and 5.6(ii).

Test Construction

(i) Number of items tested & relation to full test set
Minimum Requirements Full Compliance
  • State the number of items the model was tested on (or if it is a long-form task, the number of independently evaluated stages of the task in the test)
  • If only a subset was used, then state the full test-set size
  • If only a subset of the test was used, describe how the subset was chosen
(ii) Answer format & scoring
Minimum Requirements Full Compliance
  • Describe required answer format(s), such as whether the eval consisted of multiple-choice, short-answer, or open-ended generative tasks
  • Where applicable, flag any important details of scoring that would not be obvious to readers, and would be needed for replication.
  • If the evaluation was designed by a third party AND if (and only if) any changes were made to the designer's recommended scoring or testing methodology, reports must explicitly acknowledge such differences and provide a brief justification for them.
(iii) Answer-key / rubric creation and quality assurance
Minimum Requirements Full Compliance
  • Describe how the answer keys or grading rubrics / criteria were developed (specifically, how they were created - not how the grading itself was done)
  • If the answer keys were developed by experts, report their qualifications
  • If the answer keys were developed by experts, report the number of experts involved
  • Describe how validation or quality control of the answer key was performed, or explicitly state that this was not done.
  • Where applicable, explain how ambiguous items were handled.
(iv-a) Human-graded: grader sample and recruiting details
Minimum Requirements Full Compliance
  • Give the graders' domain qualifications
  • State the number of graders
  • Describe how graders were recruited
  • State any training provided, or state that there was no such training
(iv-b) Human-graded: details of grading process and grade aggregation
Minimum Requirements Full Compliance
  • Describe grading instructions
  • State whether graders saw both LLM and human answers
  • State whether graders were blinded
  • Explain how grader disagreements were resolved (simple average, majority vote, intervention of senior experts, etc)
  • State the number of independent grades per item
  • Give typical time spent per question
(iv-c) Human-graded: level of grader agreement
Minimum Requirements Full Compliance
  • State if grader agreement was high or low (or some other qualitative assessment of grader agreement)
  • Provide an agreement statistic (eg, Cohen's kappa; an illustrative calculation is sketched after this list)
  • Flag disagreements that affect safety conclusions OR explicitly state that there were no such disagreements.
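For illustration, the sketch below computes Cohen's kappa, one common agreement statistic, for two graders assigning pass/fail grades to the same items. The grades are invented placeholder data; the standard does not prescribe any particular statistic or implementation.

```python
# Illustrative only: Cohen's kappa for two graders, computed from scratch.
# The grade lists are invented placeholder data.
from collections import Counter


def cohens_kappa(grader_a: list[str], grader_b: list[str]) -> float:
    """Agreement beyond chance for two graders labelling the same items."""
    assert len(grader_a) == len(grader_b) and grader_a
    n = len(grader_a)
    observed = sum(a == b for a, b in zip(grader_a, grader_b)) / n
    counts_a, counts_b = Counter(grader_a), Counter(grader_b)
    labels = set(grader_a) | set(grader_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


grades_a = ["pass", "pass", "fail", "pass", "fail", "fail"]
grades_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"Cohen's kappa: {cohens_kappa(grades_a, grades_b):.2f}")  # -> 0.33
```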
(v-a) Auto-graded: grader model used and modifications applied to it
Minimum Requirements Full Compliance
  • Name the base model used as autograder
  • State whether the model was fine-tuned or otherwise modified
  • Where applicable, if and only if the model was modified, briefly describe how this was done.
(v-b) Auto-graded: details of auto-grading process and rubric used
Minimum Requirements Full Compliance
  • Outline the process by which the autograder produces scores (eg, does it reward similarity with "gold standard" answer examples)
  • Share a brief description of any grading rubrics
  • Share a brief description of grading instructions
  • Provide one grading prompt example (redacted if needed)
  • State whether multiple auto-grader samples were used to generate final scores
  • If multiple samples were used, state the rule for aggregating them into a final score
(v-c) Auto-graded: validation of auto-grader
Minimum Requirements Full Compliance
  • State whether autograder was validated against human experts, another autograder, or not at all.
  • If and only if validated against humans, describe the expert sample (number and qualifications)
  • If validation was done, provide an agreement statistic
  • If validation was done, state whether validation covered full set or subset
  • Where applicable, if and only if validation was *not* done, provide a justification for this.

Model Elicitation

(i) Model version(s) tested and relationship to launch model
Minimum Requirements Full Compliance
  • Label every model version evaluated†
  • Specify whether any tested models are identical to the deployed model†
  • Say if safeguards were on or off
  • If only earlier or alternative snapshots were tested, provide some estimate of the capability gap (through a brief qualitative description, a comparison of model performances on a public benchmark, or another suitable technique).†
(ii) Safeguards & bypassing techniques
Minimum Requirements Full Compliance
  • If evaluations involve models with safeguards applied, list the active mitigations during testing**
  • If evaluations involve models with safeguards applied, state whether jailbreak or bypass attempts were made**
  • If evaluations involve models with safeguards applied, describe the rigor of bypass attempts (eg, in time spent) with relation to the threat model considered OR justify that no bypassing efforts were necessary.**
(iii) Elicitation methods used
Minimum Requirements Full Compliance
  • List scaffolding or tool integrations (requirement waived if scaffolding or tool integrations are clearly unnecessary, eg for multiple-choice or short-form answers)
  • List sampling / generation strategies or lack thereof
  • List resource ceilings (tokens, time) or lack thereof
  • List any fine-tuning datasets applied, or lack thereof (note that if the model card makes clear that the final launch model was tested, it follows that no fine-tuning was applied)
  • List prompting techniques employed (or lack thereof)
  • Give details sufficient for third-party replication OR confirm that details have been shared and reviewed by an independent organisation. For example, if fine-tuning is used, include a description of the dataset; if particular prompting techniques were used, include example prompts or the design process. (An illustrative sketch of such details follows this list.)
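As a hypothetical illustration of the level of detail that supports third-party replication, the elicitation section of a report might disclose something like the following. Every value here is an invented placeholder, not a recommended or real configuration.

```python
# Hypothetical example of elicitation details disclosed for one evaluation.
# All values are invented placeholders.
elicitation_report = {
    "scaffolding": "agentic scaffold with web-search and code-execution tools",
    "sampling": {"temperature": 1.0, "samples_per_item": 5, "aggregation": "best-of-5"},
    "resource_ceilings": {"max_output_tokens": 16_000, "wall_clock_minutes": 30},
    "fine_tuning": None,  # final launch model used as-is, so no fine-tuning applied
    "prompting": "zero-shot, fixed system prompt (example prompt shared with reviewers)",
}

for key, value in elicitation_report.items():
    print(f"{key}: {value}")
```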

** The approach to testing safeguarded models (including eg the active safeguards and the bypassing strategies) can be reported just once in the model card, so long as the reader can clearly distinguish which evaluations this applies to (if this is not made clear, the point is not awarded).

† These criteria can be met if the model card makes clear elsewhere that, unless otherwise specified, the models tested for each evaluation were of a particular type, and specifies the relation of that type to the deployed model. Be sure to search the entire model card comprehensively to check whether such statements exist and whether they apply to the particular eval you are scoring.

Results Reporting

(i) Main scores
Minimum Requirements Full Compliance
  • Report the most relevant summary statistic (eg, mean or max score)
  • Report those numbers either in the main text, in a table, or in a figure with clear text labelling.
(ii) Uncertainty & number of benchmark runs
Minimum Requirements Full Compliance
  • Provide an uncertainty metric for each key statistic (eg, confidence intervals or standard error; if a confidence interval is given, the confidence level – eg, 95% – must also be reported). This can be provided in-text or in a figure.
  • State the number of full-benchmark runs used per evaluation.***
(iii) Ablations / alt. conditions
Minimum Requirements Full Compliance
  • State whether supplementary runs with major variations on the baseline evaluation conditions (eg different elicitation, resource ceilings, or test versions) were conducted.
  • Where applicable, if and only if such major variations were run, provide a clear breakdown of the results for each major testing condition.
  • Provide all stats in text, in a table, or in a graph with clear text labelling.

*** This can also be met if the model card makes clear elsewhere that each evaluation is run a given number of times or, alternatively, specifies a different method for establishing confidence intervals (such as a “bootstrap procedure”).
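For readers unfamiliar with the term, a bootstrap procedure resamples per-item scores with replacement to estimate the uncertainty in a summary statistic. The sketch below is a generic illustration using invented data; it is not a method the standard prescribes.

```python
# Illustrative only: a 95% bootstrap confidence interval for a benchmark mean,
# computed from invented per-item 0/1 grades.
import numpy as np

rng = np.random.default_rng(seed=0)
item_scores = rng.integers(0, 2, size=200)  # placeholder pass/fail grades per item

boot_means = [
    rng.choice(item_scores, size=item_scores.size, replace=True).mean()
    for _ in range(10_000)
]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {item_scores.mean():.3f}, 95% bootstrap CI = [{low:.3f}, {high:.3f}]")
```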

Baseline Results

(i-a) Human baseline: sample details
Minimum Requirements Full Compliance
  • State sample size of baseline
  • State key qualifications of sample (including not just "expert", but specific domain of expertise, education level and/or relevant experience).
  • Briefly describe how the sample was recruited
  • Note sampling biases
(i-b) Human baseline: full scores and uncertainty metrics
Minimum Requirements Full Compliance
  • Give human performance statistic(s) similar to those provided under 5.4(i).
  • Provide an uncertainty metric equivalent to that provided under 5.4(ii)
  • Where applicable, note any task differences vs. the AI test
  • Explicitly report whether baseline samples were tested on the full test or a subset of it.
(i-c) Human baseline: details of elicitation, resources available and incentives provided
Minimum Requirements Full Compliance
  • State time allowed for the task
  • List tools or resources humans could use (eg, internet access, biological design tools, etc).
  • Describe incentives offered
  • Report actual time spent on typical questions
  • Note any issues with testing environment or performance (such as compliance issues) or absence thereof
(ii-a) No human baseline: justification of absence
Minimum Requirements Full Compliance
  • If no human baseline is provided, give a brief justification for omitting humans (eg, too expensive, uninformative, or already exceeded by previous models)
  • State whether any consultations were done or evidence supporting that decision was considered
(ii-b) No human baseline: alternative to human baseline
Minimum Requirements Full Compliance
  • Provide another comparison point (e.g. earlier model checkpoints, scores from other publicly released models, or surveys of SMEs)
  • Explain why that baseline is meaningful
  • Discuss major uncertainties with the alternative baseline

Safety Interpretation

(i) Interpretation of test outcomes, capability conclusion and relation to decision-making
Minimum Requirements Full Compliance
  • State the overall capability conclusion drawn from the evaluation results****
  • Briefly describe how the conclusion affects developer decisions (eg whether the model crossed any capability thresholds in their Safety Framework or equivalent document)
  • Qualitatively or quantitatively explain degree to which specific evaluations contributed to this conclusion (note that this is often done by referring to the capabilities specific evaluations measured, rather than by referring to evaluations by name - this is acceptable).
  • Disclose other important evidence streams, such as those performed by external parties or holistic red-teaming.
(ii) Details of test results that would have overturned the above capability conclusion and details of preregistration
Minimum Requirements Full Compliance
  • State the configurations of evaluation results that would have overturned the overall capability conclusion reached
  • State whether these criteria (for "falsification" of the overall capability conclusion) were pre-registered with a credible third party.
(iii) Predictions of future performance gains from post-training and elicitation
Minimum Requirements Full Compliance
  • Provide some quantitative or qualitative statement about how future model improvements might raise performance and what this implies for risk levels, accounting for both post-training improvements and near-future model releases.
  • Directly tie those projected improvements to decision points (such as capability thresholds) of the company's Frontier AI Safety documentation and predict a crude time frame for reaching them
(iv) Time available to relevant teams for interpretation of results
Minimum Requirements Full Compliance
  • Provide some statement about how long relevant teams had to form and communicate interpretations of results before deployment
  • Provide a rough quantified estimate of this time
(v) Presence of internal disagreements
Minimum Requirements Full Compliance
  • Explicitly state whether any team members involved with testing disagreed with the above capability conclusion, or agreed with important caveats (if there was no disagreement, this must also be reported).
  • If such disagreements existed, summarise the nature of those disagreements
  • Explain how disagreements were handled OR explain how they would have been handled, had they occurred.

†† Criteria in this section can be reported once per report, rather than once per evaluation.

**** Note that one way you can meet this criterion is to have clearly marked every single CBRN evaluation as a “rule-in” or “rule-out” evaluation with a clear and comprehensible performance threshold; if this is done, then it is clear that falsifying results would have been inferred from rule-in or rule-out evaluations falling on the other side of that performance threshold. In that case, the model card should ideally also explain what the developer would do in the rare case that a rule-in and a rule-out threshold explicitly contradicted one another.
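The sketch below illustrates one possible way (our own simplification, not a procedure any particular developer is known to use) of combining per-evaluation rule-in and rule-out thresholds into an overall conclusion, including the contradictory case noted above. Evaluation names, scores, and thresholds are invented.

```python
# Illustrative only: combining rule-in / rule-out thresholds into an overall
# capability conclusion. All names, scores, and thresholds are invented.
from typing import Literal, NamedTuple


class EvalResult(NamedTuple):
    name: str
    score: float
    threshold: float
    kind: Literal["rule_in", "rule_out"]  # which side of the threshold matters


def overall_conclusion(results: list[EvalResult]) -> str:
    ruled_in = any(r.kind == "rule_in" and r.score >= r.threshold for r in results)
    ruled_out = any(r.kind == "rule_out" and r.score <= r.threshold for r in results)
    if ruled_in and ruled_out:
        return "contradiction: escalate per developer policy"
    if ruled_in:
        return "capability ruled in"
    if ruled_out:
        return "capability ruled out"
    return "inconclusive: weigh other evidence"


print(overall_conclusion([
    EvalResult("protocol troubleshooting QA", score=0.41, threshold=0.80, kind="rule_in"),
    EvalResult("acquisition knowledge MCQ", score=0.22, threshold=0.30, kind="rule_out"),
]))  # -> "capability ruled out"
```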

FAQ

Q: How was the STREAM standard developed?

A: STREAM was developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies. It builds upon a previous checklist for analyzing CBRN evaluation reports and was refined through multiple drafts and external feedback. We discuss our methodology in greater detail in the paper.

Q: Does following the STREAM standard mean an AI model is safe?

A: No. Following the STREAM standard only indicates that the model’s evaluation has been reported with sufficient detail for strong third-party judgements to be made.

Q: What if an evaluation contains sensitive information that can’t be shared publicly?

A: The standard is designed to avoid forcing the disclosure of sensitive or hazardous information. In cases where reporting certain details could create a security risk, we suggest omitting the information from public reports. Instead, developers should provide those details to a credible, independent third party (such as a government AI Safety Institute) and include a statement or attestation from that party in their public report.

Q: Is STREAM only relevant for ChemBio evals?

A: This initial version of the standard is specifically targeted at chemical and biological (ChemBio) evaluations. This means that while many of its principles may be applicable to other domains, such as cyber-offence capability evaluations, it is not designed for use in those areas.

Q: Does STREAM cover all types of ChemBio evals?

A: STREAM focuses specifically on benchmark evaluations and is not designed to be sufficient for other types of evaluation (e.g. human uplift studies, red-teaming). However, we do offer preliminary guidance for reporting on human uplift studies in Appendix B of the paper.

Q: How should I cite this work?

A: McCaslin et al. (2025) STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports (arXiv:2508.09853). https://doi.org/10.48550/arXiv.2508.09853

Statements of Support

Our mission with STREAM-CB is to improve public transparency in how dangerous AI capabilities are reported. We see this standard as a first step in a vital, community-wide effort to bring more rigor to evaluation reporting.

To help initiate this effort, we invited experts and organizations in AI safety and governance to affirm their support for the core principles of our work. The following statement reflects that shared commitment. It is intended to encourage the adoption of rigorous, transparent reporting practices and to signal broad agreement on the value of frameworks like STREAM in advancing responsible AI governance. The individuals and organizations listed below have added their names in support of this goal.

Transparency in AI safety is fundamental to responsible innovation, helping to build public trust and advance our scientific understanding. Model reports should clearly disclose their results and methodologies to independent parties. We hope that frameworks like STREAM and other criteria can better structure both internal and external model reports—and that future work iterates and expands such work to strengthen ‘peer review’.

Signatories

[Name, Organization]

[Name, Organization]

Authors

Tegan McCaslin, Independent
Jide Alaga, METR
Samira Nedungadi, SecureBio
Seth Donoughe, SecureBio
Tom Reed, GovAI
Chris Painter, METR
Rishi Bommasani, HAI
Luca Righetti, GovAI

For email correspondence, contact Luca Righetti (luca.righetti@governance.ai)
