Recommendations on compiling test datasets for evaluating artificial intelligence solutions in pathology

Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations on compiling test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help pathologists and regulatory agencies verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.


Introduction
The application of artificial intelligence (AI) techniques to digital tissue images has shown great promise for improving pathological diagnosis [1][2][3]. These techniques can not only automate time-consuming diagnostic tasks and make analyses more sensitive and reproducible, but also extract new digital biomarkers from tissue morphology for precision medicine [4].
Pathology involves a large number of diagnostic tasks, each being a potential application for AI. Many of these involve the characterization of tissue morphology. Such tissue classification approaches have been developed for identifying tumors in a variety of tissues, including lung [5,6], colon [7], breast [8,9], and prostate [9], but also in non-tumor pathology, e.g., kidney transplants [10]. Further applications include predicting outcomes [11,12] or gene mutations [5,13,14] directly from tissue images. Similar approaches are also employed to detect and classify cell nuclei, e.g., to quantify the positivity of immunohistochemistry markers like Ki67, ER/PR, Her2, and PD-L1 [15,16].
Testing AI solutions is an important step to ensure that they work reliably and robustly on routine laboratory cases. AI algorithms run the risk of exploiting feature associations that are specific to their training data [17]. Such "overfitted" models tend to perform poorly on previously unseen data. To obtain a realistic estimate of the prediction performance on real-world data, it is common practice to apply AI solutions to a test dataset. The results are then compared with reference results in terms of task-specific performance metrics, e.g., sensitivity, specificity, or ROC-AUC.
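As a simple illustration of such task-specific metrics, the following sketch computes sensitivity, specificity, and ROC-AUC for a hypothetical binary tumor classifier with scikit-learn; the labels, scores, and the 0.5 decision threshold are placeholder assumptions, not values from any particular study.

```python
# Illustrative sketch: common task-specific metrics for a binary tumor
# classifier. Labels, scores, and the threshold are hypothetical placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # reference labels (tumor = 1)
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])   # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                           # assumed decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, y_score)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, ROC-AUC={auc:.2f}")
```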
Test datasets may only be used once to evaluate the performance of a finalized AI solution [17]. They may not be considered during development. This can be considered a consequence of Goodhart's law stating that measures cease to be meaningful when used as targets [18]: if AI solutions are optimized for test datasets, they cannot provide realistic performance estimates for real-world data. Test datasets are also referred to as "hold-out datasets" or "(external) validation datasets." The term "validation," however, is not used consistently in the machine learning community and can also refer to model selection during development [17].
Besides overfitting, AI methods are prone to "shortcut learning" [19]. Many datasets used in the development of AI methods contain confounding variables (e.g., slide origin, scanner type, patient age) that are spuriously correlated with the target variable (e.g., tumor type) [20]. AI methods often exploit features that are discriminative for such confounding variables and not for the target variable [21]. Despite working well for smaller datasets containing similar correlations, such methods fail in more challenging real-world scenarios in ways humans never would [22]. To minimize the likelihood of spurious correlations between confounding variables and the target variable, test datasets must be large and diversified [20]. At the same time, test datasets must be small enough to be acquired with realistic effort and cost. Finding a good balance between these requirements is a major challenge for AI developers.
Comparatively little attention has been paid to compiling test datasets for AI solutions in pathology. Datasets for training, on the other hand, have been considered frequently [9,[23][24][25][26][27][28]. Training datasets are collected with a different goal than test datasets: while training datasets should produce the best possible AI models, test datasets should provide the most realistic performance assessment for routine use, which presents unique challenges.
Some publications address individual problems in compiling test datasets in pathology, e.g., how to avoid bias in the performance evaluation caused by site-specific image features in test datasets [29]. Other publications provide general recommendations for evaluating AI methods for medical applications without considering the specific challenges of pathology [30][31][32][33][34].
Appropriate test datasets are critical to demonstrate the utility of AI solutions as well as to obtain regulatory approval. However, the lack of guidance on how to compile test datasets is a major barrier to the adoption of AI solutions in laboratory practice.
This article gives recommendations for test datasets in pathology. It summarizes the results of extensive literature reviews and discussions by a committee of various stakeholders, including commercial AI developers, pathologists, and researchers. This committee was established as part of the EMPAIA project (Ecosystem for Pathology Diagnostics with AI Assistance), aiming to facilitate the adoption of AI in pathology [35].

Results
The next sections discuss and provide recommendations on various aspects that must be considered when creating test datasets. For meaningful performance estimates, test datasets must be both diverse enough to cover the variability of data in routine diagnostics and large enough to allow statistically meaningful analyses. Relevant subgroups must be covered, and test datasets should be unbiased. Moreover, test datasets must be sufficiently independent of datasets used in the development of AI solutions. Comprehensive information about test datasets must be reported and regulatory requirements must be met when evaluating the clinical applicability of AI solutions.

Target population of images
All images an AI solution may encounter in its intended use constitute its "target population of images." A test dataset must be an adequate sample of this target population to provide a reasonable estimate of the prediction performance of the AI solution. For all applications in pathology, the target population is distributed across multiple dimensions of variability, see Table 1.
Biological variability. The visual appearance of tissue varies between normal and diseased states. This is what AI solutions are designed to detect and characterize. But even tissue of the same category can look very different (see Figure 1). The appearance is influenced by many factors (e.g., genetic, transcriptional, epigenetic, proteomic, and metabolomic) that differ between patients as well as between demographic and ethnic groups [42]. These factors often vary spatially (e.g., different parts of organs are differently affected) and temporally (e.g., the pathological alterations differ based on disease stage) within a single patient [44].
Technical variability. Processing and digitization of tissue sections consists of several steps (e.g., tissue fixation, processing, cutting, staining, and digitization), all of which can contribute to image variability [36]. Differences in section thickness and staining solutions can lead to variable staining appearances [39]. Artifacts frequently occur during tissue processing, including elastic deformations, inclusion of foreign objects, and cover glass scratches [38]. Differences in illumination, resolution, and encoding algorithms of slide scanner models also affect the appearance of tissue images [36].
Observer variability. Images in test datasets are commonly associated with a reference label like a disease category or score determined by a human observer. It is well known that the assessment of tissue images is subject to intra- and inter-observer variability [45][46][47][48][49][50][51]. This variability results from subjective biases (e.g., caused by training, specialization, and experience) but also from inherent ambiguities in the images [52,53].
Routine laboratory work occasionally produces images that are unsuitable for the intended use of an AI solution, e.g., because they are ambiguous or of insufficient quality. Most AI solutions require prior quality assurance steps to ensure that solutions are only applied to suitable images [54,55]. The boundary between suitable and unsuitable images is usually fuzzy (see Figure 2) and there are difficult images that cannot be clearly assigned to either category (see Figure 3).

Defining the target population is challenging and presumes a clear definition of the intended use by the AI developer. The target population of images must be defined before test datasets are collected. It must be clearly stated which subsets of images fall under the intended use. Such subsets may consist of specific disease variants, demographic characteristics, ethnicities, staining characteristics, artifacts, or scanner types. These subsets typically overlap, e.g., the subset of images of one scanner type contains images from different patient age groups. A particular challenge is to define where the target population ends. Examples of images within and outside the intended use can help human observers sort out unsuitable images as objectively as possible.

Data collection
Test datasets must be representative of the entire target population of images, i.e., sufficiently diverse and unbiased. To minimize spurious correlations between confounding variables and the target variable and to uncover shortcut learning in AI methods, all dimensions of biological and technical variability must be adequately covered for the classes considered [20,28], also reflecting the variability of negative cases without visible pathology [28,58].
All images encountered in the normal laboratory workflow must be considered. One way to achieve this is to collect all cases that occurred over a given time period [58] long enough for a sufficient number of cases to be collected (e.g., one year [9]). Data should be collected from multiple international laboratories, since they differ in their spectra of patients and diseases, technical equipment, and operating procedures. To avoid selection bias, artifacts or atypical morphologies must not be excluded if they are part of the intended use of the product [9,58,59]. Data should be collected at the point in the workflow where the AI solution would be applied, taking into account possible prior quality assurance steps in the workflow.
All data in a test dataset must be collected according to a consistent acquisition protocol (see "Reporting"). The best way to ensure this is to prospectively collect test datasets according to this protocol. Retrospective datasets were typically collected for a different purpose and are thus likely to be subject to selection bias that is difficult to adjust for [60]. If retrospective data are used in a test dataset, a comprehensive description of the acquisition protocol must be available so that potential issues can be identified [61].

Annotation
Test datasets for AI solutions contain not only images, but also annotations representing the expected analysis result, e.g., slide-level labels or delineations of tissue regions. In most cases, such reference annotations must be prepared by human observers with sufficient experience in the diagnostic use case. Since humans are prone to intra- and inter-observer variability, annotations in test datasets should be created by multiple observers from different hospitals or laboratories. For unequivocal results, it can be helpful to organize consensus conferences and to use standardized electronic reporting formats [45]. Any remaining disagreement should be documented with justification (e.g., suboptimal sample quality) and considered when evaluating AI solutions.
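Where slide-level labels from several observers are available, agreement can be quantified before discrepant cases are sent to a consensus conference. The following sketch uses Cohen's kappa from scikit-learn for a pair of observers; the labels are hypothetical placeholders, and for more than two observers a measure such as Fleiss' kappa would be the analogous choice.

```python
# Sketch: quantifying inter-observer agreement on slide-level labels with
# pairwise Cohen's kappa. Labels are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

observer_a = ["benign", "malignant", "malignant", "benign", "malignant"]
observer_b = ["benign", "malignant", "benign", "benign", "malignant"]

kappa = cohen_kappa_score(observer_a, observer_b)
print(f"Cohen's kappa between observer A and B: {kappa:.2f}")
# Cases with disagreement should be documented and, where possible, resolved
# in a consensus conference before they enter the test dataset.
```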
Semi-automatic annotation methods can help reduce the effort required for manual annotation [62,63]. However, they can introduce biases themselves and should therefore be monitored by human observers.

Curation
Unsuitable data that does not fit the intended use of an AI solution should not be included in a test dataset. The boundary between the target population of images and unsuitable images that do not fall under the intended use is fuzzy (see "Target population of images"). Such data usually must be detected by human observers, e.g., in a dedicated data curation step or during the generation of reference annotations. However, there are automated tools to support this process [64]. Some approaches identify unsuitable data based on basic image features such as brightness, predominant colors, and sharpness [65,66] or by detecting typical artifacts like tissue folds and air bubbles [37,67]. Other methods analyze domain shifts [68][69][70] or use dedicated neural networks trained for outlier detection [71]. There are also approaches for detecting outliers depending on the tested AI solution [68,[72][73][74][75]. Although these approaches can help exclude unsuitable images from test datasets, they do not yet appear to be mature enough to be used entirely without human supervision.
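As a minimal sketch of such a rule-based pre-screening step, the function below flags tiles with unusual brightness or low sharpness for human review. The thresholds and the Laplacian-variance sharpness measure are illustrative assumptions rather than validated criteria, and flagged images should still be confirmed by an observer.

```python
# Sketch: flag tiles with unusual brightness or low sharpness for human review.
# Thresholds are illustrative assumptions, not validated values.
import numpy as np

def flag_unsuitable(tile: np.ndarray,
                    brightness_range=(40, 230),
                    min_sharpness=5.0) -> bool:
    """tile: 2D grayscale image as a uint8 or float array."""
    gray = tile.astype(float)
    brightness = gray.mean()
    # Variance of a discrete Laplacian as a crude sharpness measure.
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0) +
           np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4 * gray)
    sharpness = lap.var()
    too_dark_or_bright = not (brightness_range[0] <= brightness <= brightness_range[1])
    too_blurry = sharpness < min_sharpness
    return too_dark_or_bright or too_blurry
```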

Synthetic data
There are a variety of techniques for extending datasets with synthetic data. Some techniques alter existing images in a generic (e.g., rotation, mirroring) or histology-specific way (e.g., stain transformations [26] or emulation of image artifacts [38,[76][77][78][79][80][81]). Other techniques create fully synthetic images from scratch [82][83][84][85][86]. These techniques are useful for data augmentation [1,2,87], i.e., enriching development data in order to avoid overfitting and increase robustness. However, they cannot replace original real-world data for test datasets. Because all of these techniques are based on simplified models of real-world variability, they are likely to introduce biases into a test dataset and make meaningful performance measurement impossible.

Sample size
Any test dataset is a sample from the target population of images, thus any performance metric computed on a test dataset is subject to sampling error. In order to draw reliable conclusions from evaluation results, the sampling error must be sufficiently small. Larger samples generally result in lower sampling error, but are also more expensive to produce. Therefore, the minimum sample size required to achieve a maximum allowable sampling error should be determined prior to data collection.
Many different methods have been proposed for sample size determination. Most of them refer to statistical significance tests which are used to test a prespecified hypothesis about a population parameter (e.g., sensitivity, specificity, ROC-AUC) on the basis of an observed data sample [88][89][90]. Such sample size determination methods are commonly used in clinical trial planning and available in many statistical software packages [75].
When evaluating AI solutions in pathology, the goal is more often to estimate a performance metric with a sufficient degree of precision than to test a previously defined hypothesis. Confidence intervals (CIs) are a natural way to express the precision of an estimated metric and should be reported instead of or in addition to test results [91]. A CI is an interval around the sample statistic that is likely to cover the true population value at some confidence level, usually 95% [92]. The sample statistic can either be the performance metric itself or a difference between the performance metrics of two methods, e.g., when comparing performance to an established solution.
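A common, distribution-free way to obtain such a CI is the nonparametric bootstrap, resampling cases with replacement and reading off percentiles of the resampled metric. The sketch below applies this to sensitivity with simulated placeholder data; the number of bootstrap replicates and the percentile method are conventional choices, not requirements.

```python
# Sketch: nonparametric bootstrap 95% CI for sensitivity. Data are simulated
# placeholders standing in for reference labels and model predictions.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                            # reference labels
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)    # ~85% accurate predictions

def sensitivity(t, p):
    pos = t == 1
    return (p[pos] == 1).mean()

boot = []
for _ in range(2000):                                  # bootstrap replicates
    idx = rng.integers(0, len(y_true), len(y_true))    # resample cases with replacement
    boot.append(sensitivity(y_true[idx], y_pred[idx]))

ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"sensitivity={sensitivity(y_true, y_pred):.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```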
When using CIs, the sample size calculation can be based on the targeted width of the CI, which is inversely proportional to the precision of the performance estimation [91]. Several approaches have been proposed for that matter [93][94][95][96][97]. To determine a minimum sample size, assumptions regarding the sample statistic, its variability, and usually also its distributional form must be made. The open-source software "presize" implements several of these methods and provides a simple web-based user interface to perform CI-based sample size calculations for common performance metrics [98].
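As a minimal illustration of a CI-width-based calculation, the sketch below uses the normal approximation n = z²·p(1 − p)/d² to estimate how many positive cases are needed to report sensitivity with a desired CI half-width d. The assumed sensitivity of 0.90 and half-width of 0.03 are arbitrary examples; dedicated tools such as "presize" implement more refined methods.

```python
# Sketch: minimum number of positive cases needed to estimate sensitivity with
# a given 95% CI half-width, using the normal approximation for a proportion.
from math import ceil
from scipy.stats import norm

def n_for_proportion_ci(expected_p: float, half_width: float, conf: float = 0.95) -> int:
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil(z**2 * expected_p * (1 - expected_p) / half_width**2)

# e.g., expected sensitivity 0.90, desired CI of +/- 0.03:
print(n_for_proportion_ci(0.90, 0.03))   # about 385 positive cases
```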

Subsets
AI solutions that are very accurate on average often perform much worse on certain subsets of their target population of images [99], a phenomenon known as "hidden stratification." Such differences in performance can exceed 20% [22]. Hidden stratification occurs particularly in low-prevalence subgroups, but may also occur in subgroups with poor label quality or subtle distinguishing characteristics [22]. There are substantial differences in cancer incidence, e.g., by gender, socioeconomic status, and geographic region [100]. Hence, hidden stratification may result in disproportionate harm to patients in less common demographic groups and jeopardize the clinical applicability of AI solutions [22]. Common performance measures computed on the entire test dataset can be dominated by larger subsets and do not indicate whether there are subsets for which an AI solution underperforms [101].
To detect hidden stratification, AI solutions must be evaluated independently on relevant subsets of the target population of images (e.g., certain medical characteristics, patient demographics, ethnicities, scanning equipment) [22,99]. This means in particular that the metadata for identifying the subsets must be available [30]. Performance evaluation on subsets is an important requirement to obtain clinical approval by the FDA (see "Regulatory requirements"). Accordingly, such subsets should be specifically delineated within test datasets. Each subset needs to be sufficiently large to allow statistically meaningful results (see "Sample size"). It is important to provide information on why and how subsets were collected so that any issues AI solutions may have with specific subsets can be specifically tracked (see "Reporting"). Identifying subsets at risk of hidden stratification is a major challenge and requires extensive knowledge of the use case and the distribution of possible input images [22]. As an aid, potentially relevant subsets can also be detected automatically using unsupervised clustering approaches such as k-means [22]. If a detected cluster underperforms compared to the entire dataset, this may indicate the presence of hidden stratification that needs further examination.
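A minimal sketch of both strategies is shown below: accuracy is reported per annotated subset (here, a hypothetical scanner attribute), and k-means clusters computed on image embeddings are screened for underperforming groups. All inputs are simulated placeholders, and the embedding source (e.g., a generic image encoder) is an assumption.

```python
# Sketch: screening for hidden stratification by (a) per-subset accuracy and
# (b) per-cluster accuracy after k-means on image embeddings. All inputs are
# simulated placeholders.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "correct": np.random.rand(1000) > 0.1,                 # per-case correctness (simulated)
    "scanner": np.random.choice(["A", "B", "C"], 1000),    # known subset attribute
})
print(df.groupby("scanner")["correct"].mean())             # (a) accuracy per known subset

embeddings = np.random.rand(1000, 64)                      # (b) placeholder image embeddings
df["cluster"] = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)
print(df.groupby("cluster")["correct"].mean().sort_values().head())  # low clusters warrant review
```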

Bias detection
Biases can make test datasets unsuitable for evaluating the performance of AI algorithms. Therefore, it is important to identify potential biases and to mitigate them early during data acquisition [28]. Bias, in this context, refers to sampling bias, i.e., the test dataset is not a randomly drawn sample from the target population of images. Subsets to be evaluated independently may be biased by construction with respect to particular features (e.g., patient age). Here, it is important to ensure that the subgroups do not contain unexpected biases with respect to other features. For example, the prevalence of slide scanners should be independent of patient age, whereas the prevalence of diagnoses may vary by age group.
For features represented as metadata (e.g., patient age, slide scanner, or diagnosis), bias can be detected by comparing the feature distributions in the test dataset and the target population using summary statistics (e.g., via mean and standard deviation) or dedicated fairness metrics [102,103]. Detection of bias in an entire test dataset requires a good estimate of the feature distribution of the target population of images. Bias in subgroups can be detected by comparing the subset distribution to the entire dataset. Several toolkits for measuring bias based on metadata have been proposed [104,105] and evaluated [106].
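As a small illustration, the sketch below compares the age distribution of a test dataset against assumed target-population estimates using means, standard deviations, and a standardized mean difference (SMD). The numbers and the SMD threshold mentioned in the comment are illustrative conventions, not binding criteria.

```python
# Sketch: comparing a metadata feature (patient age) between the test dataset
# and assumed target-population estimates. All values are illustrative.
import numpy as np

test_ages = np.random.normal(62, 12, size=500)     # ages in the test dataset (simulated)
target_mean, target_sd = 58.0, 14.0                # assumed target-population estimates

smd = (test_ages.mean() - target_mean) / np.sqrt((test_ages.std(ddof=1)**2 + target_sd**2) / 2)
print(f"test mean={test_ages.mean():.1f}, SD={test_ages.std(ddof=1):.1f}, SMD vs. target={smd:.2f}")
# As a common rule of thumb, |SMD| > 0.1 is often taken as a signal of
# meaningful imbalance that merits closer inspection.
```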
Detecting bias in the image data itself is more challenging. Numerous features can be extracted from image data and it is difficult to determine the distribution of these features in the target population of images. Similar to automatic detection of unsuitable data, there are automatic methods to reveal bias in image data. Domain shifts [68] can be detected either by comparing the distributions of basic image features (e.g., contrast) or by more complex image representations learned through specific neural network models [68,71,107]. Another approach is to train trivial machine learning models with modified images from which obvious predictive information has been removed (e.g., tumor regions): if such models perform better than chance, this indicates bias in the dataset [108,109].
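The following sketch illustrates such a "trivial model" probe under simplified assumptions: a logistic regression is trained on crude color statistics of tiles whose diagnostically relevant regions have been masked out, and a cross-validated ROC-AUC clearly above 0.5 would hint at confounding signal. The masking step, feature choice, and data are placeholders.

```python
# Sketch of a "trivial model" bias probe: a simple classifier on crude color
# statistics of tiles with the lesion masked out. Clearly above-chance
# performance suggests confounding features (e.g., stain or site) predict the
# label. All inputs are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def crude_features(masked_tile: np.ndarray) -> np.ndarray:
    """Per-channel mean and std of an RGB tile with the lesion masked out."""
    return np.concatenate([masked_tile.mean(axis=(0, 1)), masked_tile.std(axis=(0, 1))])

masked_tiles = np.random.rand(300, 128, 128, 3)     # placeholder masked tiles
labels = np.random.randint(0, 2, 300)               # placeholder target labels

X = np.stack([crude_features(t) for t in masked_tiles])
auc = cross_val_score(LogisticRegression(max_iter=1000), X, labels,
                      cv=5, scoring="roc_auc").mean()
print(f"cross-validated ROC-AUC of the trivial model: {auc:.2f}  (about 0.5 expected if unbiased)")
```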

Independence
In the development of AI solutions, it is common practice to split a given dataset into two sets, one for development (e.g., a training and a validation set for model selection) and one for testing [17]. AI methods are prone to exploit spurious correlations in datasets as shortcut opportunities [19]. In this case, the methods perform well on data with similar correlations, but not on the target population. If both development and test datasets are drawn from the same original dataset, they are likely to share spurious correlations, and the performance on the test dataset may overestimate the performance on the target population. Therefore, datasets used for development and testing need to be sufficiently independent.
As explained below, it is not sufficient for test datasets to merely contain different images than development datasets [17,19].
To account for memory constraints, histologic whole-slide images (WSIs) are usually divided into small subimages called "tiles." AI methods are then applied to each tile individually, and the result for the entire WSI is obtained by aggregating the results of the individual tiles. If tiles are randomly assigned, tiles from the same WSI can end up in both the development and the test datasets, possibly inflating performance results. A substantial number of published research studies are affected by this problem [110]. Therefore, to avoid any risk of bias, none of the tiles in a test dataset may originate from the same WSI as the tiles in the development set [110].
Datasets can contain site-specific feature distributions [29]. If these site-specific features are correlated with the outcome of interest, AI methods might use these features for classification rather than the relevant biological features (e.g., tissue morphology) and be unable to generalize to new datasets. A comprehensive evaluation based on multi-site datasets from TCGA showed that including data from one site in development and test datasets often leads to overoptimistic estimates of model accuracy [29]. This study also found that commonly used color normalization and augmentation methods did not prevent models from learning site-specific features, although stain differences between laboratories appeared to be a primary source of site-specific features. Therefore, the images in development and test datasets must originate not only from different subjects, but should also come from different clinical sites [31,111,112].
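One way to enforce both constraints when partitioning data is a group-aware split in which every group (a contributing site, and by extension all slides and tiles from it) is assigned entirely to either the development or the test side. The sketch below uses scikit-learn's GroupShuffleSplit with hypothetical site identifiers; splitting additionally by slide within each site follows the same pattern.

```python
# Sketch: a leakage-safe split in which all tiles from the same contributing
# site end up on the same side of the development/test divide.
# `tiles`, `labels`, and `site_ids` are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_tiles = 10_000
tiles = np.arange(n_tiles)                              # stand-in for tile references
labels = np.random.randint(0, 2, n_tiles)
site_ids = np.random.randint(0, 20, n_tiles)            # one ID per contributing laboratory

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
dev_idx, test_idx = next(splitter.split(tiles, labels, groups=site_ids))
assert set(site_ids[dev_idx]).isdisjoint(site_ids[test_idx])   # no site appears in both sets
```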
As described in the Introduction section, a given AI solution should only be evaluated once against a given test dataset [17]. Datasets published in the context of challenges or studies (many of which are based on TCGA [4] and have regional biases [113]) should generally not be used as test datasets: it cannot be ruled out that they were taken into account in some form during development, e.g., inadvertently or as part of pretraining. Ideally, test datasets should not be published at all and the evaluation should be conducted by an independent body with no conflicts of interest [30].

Reporting
Adequate reporting of test datasets is essential to determine whether a particular dataset is appropriate for a particular AI solution. Detailed metadata on the coverage of various dimensions of variability is required for detecting bias and identifying relevant subsets. Data provenance must be tracked to ensure that test data are sufficiently disjoint from development data [28,29]. Requirements for the test data [114] and acquisition protocols [115] should also be reported so that further data can be collected later. Accurate reporting of test datasets is important in order to submit evaluation results traceable to the test data for regulatory approval [116].
Various guidelines for reporting clinical research and trials, including diagnostic models, have been published [117]. Some of these have been adapted specifically for machine learning approaches [118,119] or such adaptation is under development [120][121][122][123]. However, only very few guidelines elaborate on data reporting [124], and there is not yet consensus on structured reporting of test datasets, particularly for computational pathology.
Data acquisition protocols should comprehensively describe how and where the test dataset was acquired, handled, processed, and stored [114,115]. This documentation should include precise details of the hardware and software versions used and also cover the creation of reference annotations. Moreover, quality criteria for rejecting data and procedures for handling missing data [124] should be reported, i.e., aspects of what is not in the dataset. Protocols should be defined prior to data acquisition when prospectively collecting test data. Completeness and clarity of the protocols should be verified during data acquisition.
Reported information should characterize the acquired dataset in a useful way. For example, summary statistics allow an initial assessment of whether a given dataset is an adequate sample of the target population. Relevant subsets and biases identified in the dataset should be reported as well. Generally, one should collect and report as much information as feasible with the available resources, since retrospectively obtaining missing metadata is hard or impossible. If there will be multiple versions of a dataset, e.g., due to iterative data acquisition or review of reference annotations, versioning is needed. Suitable hashing can guarantee the integrity of the entire dataset as well as its individual samples, and identify datasets without disclosing contents.
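A minimal sketch of such a hashing scheme is shown below: a SHA-256 digest is recorded for every file, and a combined digest identifies the dataset version as a whole. The directory layout and file paths are placeholders; in practice the manifest would be stored and versioned alongside the dataset documentation.

```python
# Sketch: an integrity manifest with a SHA-256 hash per file and a combined
# hash for the whole dataset version. File paths are placeholders.
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dataset_manifest(root: Path) -> dict:
    files = sorted(p for p in root.rglob("*") if p.is_file())
    manifest = {str(p.relative_to(root)): file_sha256(p) for p in files}
    combined = hashlib.sha256("".join(f"{k}:{v}" for k, v in manifest.items()).encode()).hexdigest()
    return {"files": manifest, "dataset_sha256": combined}

# e.g., dataset_manifest(Path("/data/test_set_v1")) could be stored with each release.
```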

Regulatory requirements
AI solutions in pathology are in vitro diagnostic medical devices (IVDMDs) because they evaluate tissue images for diagnostic purposes outside the human body. Therefore, regulatory approval is required for sale and use in a clinical setting [125]. The U.S. Food and Drug Administration (FDA) and the European Union (EU) impose similar requirements to obtain regulatory approval. This includes compliance with certain quality management and documentation standards, a risk analysis, and a comprehensive performance evaluation. The performance evaluation must demonstrate that the method provides accurate and reliable results compared to a gold standard (analytical performance) and that the method provides real benefit in a clinical context (clinical performance). Good test datasets are an essential prerequisite for a meaningful evaluation of analytical performance.

EU + UK
In the EU and UK, IVDMDs are regulated by the In Vitro Diagnostic Device Regulation (IVDR, formally "Regulation 2017/746") [126]. After a transition period, compliance with the IVDR will be mandatory for novel routine pathology diagnostics as of May 26, 2022. The IVDR does not impose specific requirements on test datasets used in the analytical performance evaluation. However, the EU has put forward a proposal for an EU-wide regulation on harmonized rules for the assessment of AI [127].
The EU proposal [127] considers AI-based IVDMDs as "high-risk AI systems" (preamble (30)). For test datasets used in the evaluation of such systems, the proposal imposes certain quality criteria: test datasets must be "relevant, representative, free of errors and complete" and "have the appropriate statistical properties" (Article 10.3). Likewise, it requires test datasets to be subject to "appropriate data governance and management practices" (preamble (44)) with regard to design choices, suitability assessment, data collection, and identification of shortcomings.

USA
In the US, IVDMDs are regulated in the Code of Federal Regulations (CFR) Part 809 [128]. Just like the IVDR, the CFR does not impose specific requirements on test datasets used in the analytical performance evaluation. However, the CFR states that products should be accompanied by labeling stating specific performance characteristics (e.g., accuracy, precision, specificity, and sensitivity) related to normal and abnormal populations of biological specimens.
In 2021, the FDA approved the first AI software for pathology [129]. In this context, the FDA has established a definition and requirements for approval of generic AI software for pathology, formally referred to as "software algorithm devices to assist users in digital pathology" [130].
Test datasets used in analytical performance studies are expected to contain an "appropriate" number of images. To be "representative of the entire spectrum of challenging cases" (3.ii.A. and B. of source [130]) that can occur when the product is used as intended, test datasets should cover multiple operators, slide scanners, and clinical sites and contain "clinical specimens with defined, clinically relevant, and challenging characteristics" (3.ii.B. of source [130]). In particular, test datasets should be stratified into relevant subsets (e.g., by medical characteristics, patient demographics, scanning equipment) to allow separate determination of performance for each subset. Case cohorts considered in clinical performance studies (e.g., evaluating unassisted and software-assisted evaluation of pathology slides with intended users) are expected to adhere to similar specifications.
Product labeling according to CFR 809 was also defined in more detail. In addition to the general characteristics of the dataset (e.g., origin of images, annotation procedures, subsets, ...), limitations of the dataset (e.g., poor image quality or insufficient sampling of certain subsets) that may cause the software to fail or operate unexpectedly should be specified.
In summary, there are much more specific requirements for test datasets in the US than in the EU. However, none of the regulations clearly specify how the respective requirements can be achieved or verified.

Discussion
Our recommendations for compiling test datasets are summarized in Figure 4. They are intended to help AI developers demonstrate the robustness and practicality of their solutions to regulatory agencies and end users. Likewise, the advice can be used to check whether test datasets used in the evaluation of AI solutions were appropriate and reported performance measures are meaningful. Much of the advice can also be transferred both to image analysis solutions without AI and to similar domains where solutions are applied to medical images, such as radiology or ophthalmology.
A key finding of the work is that it remains challenging to collect test datasets and that there are still many unanswered questions. The current regulatory requirements remain vague and do not specify in detail important aspects such as the required diversity of test datasets or the required confidence in measured performance metrics. The main challenge is that the target population of images is elusive, i.e., it cannot be formally specified but only roughly described. This makes it difficult to determine whether a dataset is representative, i.e., whether the many dimensions of variability are covered sufficiently, and whether the sample distribution corresponds to real-world data. Without a clear measure of representativity, it is also impossible to determine whether a test dataset is large enough to enable assessment of performance metrics with a maximum sampling error.
For regulatory approval, a plausible justification is needed as to why the test dataset used was good enough. Besides following the advice in this paper, it can also be helpful to refer to published studies in which AI solutions have been comprehensively evaluated. Additional guidance can be found in the summary documents of approved AI solutions published by the FDA, which include information on their evaluation [111]. It turns out that many of the AI devices approved by the FDA were evaluated only at a small number of sites [111] with limited geographic diversity [131]. Test sets used in current studies typically involved thousands of slides, hundreds of patients, fewer than 5 sites, and fewer than 5 scanner types [54,58,132,133].
Today, AI solutions in pathology may not be used for primary diagnosis, but only in conjunction with a standard evaluation by the pathologist [130]. Therefore, compared to a fully automated usage scenario, requirements for robustness are considerably lower. This also applies to the expected confidence in the performance measurement and the scope of the test dataset used. In a supervised usage scenario, the accuracy of an AI solution determines how often the user needs to intervene to correct results, and thus its practical usefulness. End users are interested in the most meaningful evaluation of the accuracy of AI solutions to assess their practical utility. Therefore, a comprehensive evaluation of the real-world performance of a product, taking into account the advice given in this paper, can be an important marketing tool.

Limitations and outlook
Some aspects of compiling test datasets were not considered in this article. One aspect is how to collaborate with data donors, i.e., how to incentivize or compensate them for donating data. Other aspects include the choice of software tools and data formats for the collection and storage of datasets or how the use of test datasets should be regulated. These aspects must be clarified individually for each use case and the AI solution to be tested. Furthermore, we do not elaborate on legal aspects of collecting test datasets, e.g., obtaining consent from patients, privacy regulations, licensing, and liability. For more details on these topics, we refer to other works [134]. This paper focuses exclusively on the compilation of test datasets. For advice on other issues related to validating AI solutions in pathology, such as how to select an appropriate performance metric, how to make algorithmic results interpretable, or how to conduct a clinical performance evaluation with end users, we also refer to other works [30,31,33,34,135,136].
For AI solutions to operate with less user intervention and to better support diagnostic workflows, real-world performance must be assessed more accurately than is currently possible. The key to accurate performance measures is the representativeness of the test dataset. Therefore, future work should focus on better characterizing the target population of images and how to collect more representative samples. Empirical studies should be conducted on how different levels of coverage of the variability dimensions (e.g., laboratories, scanner types) affect the quality of performance evaluation for common use cases in computational pathology.
In addition, clear criteria should be developed to delineate the target population from unsuitable data. Currently, the assessment of the suitability of data is typically done by humans, which might introduce subjective bias. Automated methods can help to make the assessment of suitability more objective (see "Curation") and should therefore be further explored. However, such automated methods must be validated on dedicated test datasets themselves.
Another open challenge is how to deal with changes in the target population of images. Since the intended use for a particular product is fixed, in theory the requirements for the test datasets should also be fixed. However, the target distribution of images is influenced by several factors that change over time. These include technological advances in specimen and image acquisition, distribution of scanner systems used, and shifting patient populations [135,137]. As part of post-market surveillance, AI solutions must be continuously monitored during their entire lifecycle [116]. Clear processes are required for identifying changes in the target population of images and adapting performance estimates accordingly.

Conclusions
Appropriate test datasets are essential for a meaningful evaluation of the performance of AI solutions. The recommendations provided in this article are intended to help demonstrate the utility of AI solutions in pathology and to assess the validity of performance studies.
The key remaining challenge is the vast variability of images in computational pathology. Further research is needed on how to formalize criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.

Figure 1: Examples of tissue variability within and between biopsies (H&E-stained breast tissue of female patients with invasive carcinomas of no special type, 40× objective magnification). First and second column from the left: 41-year-old patients, grade 2; third and fourth column: 42-year-old patients, grade 3.

Figure 3: Examples of different severity levels of artifacts on a prostate section. The top row shows simulated foreign objects, the bottom row shows simulated focal blur. The original image on the left is clearly within the intended use of algorithms for Gleason grading in prostate cancer diagnostics, while the rightmost images are clearly unsuitable. The tissue image is adapted from another source [56] (CC0-licensed [57]).

❏ Define target population of images for the intended use
❏ Identify relevant and irrelevant dimensions of variability for the intended use
❏ Identify relevant subsets
❏ Estimate required sample size based on confidence interval
❏ Define procedures for handling missing data
❏ Check up-to-date regulatory requirements and guidance
❏ Ensure that test data is independent of development data
❏ Keep test data undisclosed

Data acquisition
❏ Adhere to the acquisition protocol
❏ Cover all relevant dimensions of variability
❏ Include all images during routine lab workflow without selection
❏ Include data from multiple international laboratories
❏ Include annotations from multiple observers
❏ Reject images outside the intended use
❏ Consider semi-automatic tools for annotating and rejecting images; verify their results
❏ Report all aspects of how dataset was acquired, handled, processed, and stored, including reference annotations, rejected data, and missing data
❏ Avoid synthetic data

Monitoring data acquisition
❏ Verify clarity and completeness of acquisition protocol
❏ Check for additional subsets at risk of hidden stratification
❏ Identify potential biases in the dataset and in subsets
❏ Check data for plausibility
❏ Mitigate issues early during acquisition
❏ Keep track of different versions of dataset, if applicable
❏ Report summarizing information about contents of test dataset

Figure 4: Overview of recommendations to be considered during different phases of collecting test datasets.