Machine Learning Algorithm Validation

From Essentials to Advanced Applications and Implications for Regulatory Certification and Deployment
  • Farhad Maleki (1)
    Augmented Intelligence & Precision Health Laboratory (AIPHL), Department of Radiology & Research Institute of the McGill University Health Centre, 5252 Boulevard de Maisonneuve Ouest, Montreal, Quebec H4A 3S5, Canada
  • Nikesh Muthukrishnan (1)
    Augmented Intelligence & Precision Health Laboratory (AIPHL), Department of Radiology & Research Institute of the McGill University Health Centre, 5252 Boulevard de Maisonneuve Ouest, Montreal, Quebec H4A 3S5, Canada
  • Katie Ovens
    Department of Computer Science, University of Saskatchewan, 176 Thorvaldson Bldg, 110 Science Place, Saskatoon S7N 5C9, Canada
  • Caroline Reinhold
    Augmented Intelligence & Precision Health Laboratory (AIPHL), Department of Radiology & Research Institute of the McGill University Health Centre, 5252 Boulevard de Maisonneuve Ouest, Montreal, Quebec H4A 3S5, Canada
    Department of Radiology, McGill University, 1650 Cedar Avenue, Montreal, Quebec H3G 1A4, Canada
  • Reza Forghani (corresponding author: Room C02.5821, 1001 Decarie Boulevard, Montreal, Quebec H4A 3J1, Canada)
    Augmented Intelligence & Precision Health Laboratory (AIPHL), Department of Radiology & Research Institute of the McGill University Health Centre, 5252 Boulevard de Maisonneuve Ouest, Montreal, Quebec H4A 3S5, Canada
    Department of Radiology, McGill University, 1650 Cedar Avenue, Montreal, Quebec H3G 1A4, Canada
    Segal Cancer Centre, Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Cote Ste-Catherine Road, Montreal, Quebec H3T 1E2, Canada
    Gerald Bronfman Department of Oncology, McGill University, Suite 720, 5100 Maisonneuve Boulevard West, Montreal, Quebec H4A 3T2, Canada
    Department of Otolaryngology - Head and Neck Surgery, Royal Victoria Hospital, McGill University Health Centre, 1001 Decarie Boulevard, Montreal, Quebec H3A 3J1, Canada

  1 F. Maleki and N. Muthukrishnan contributed equally to this article.

      Key points

      • Understanding and following the best practices for evaluating machine learning (ML) models is essential for developing reproducible and generalizable ML applications.
      • The reliability and robustness of a ML application will depend on multiple factors, including dataset size and variety as well as a well-conceived design for ML algorithm development and evaluation.
      • A rigorously designed ML model development and evaluation process using large and representative training, validation, and test datasets will increase the likelihood of developing a reliable and generalizable ML application and will also facilitate future regulatory certification.
      • Scalable, auditable, and transparent platforms for building and sharing multi-institutional datasets will be a crucial step in developing generalizable solutions in the health care domain.

      Introduction

      With growing interest in machine learning (ML), it is essential to understand the methodologies used for evaluating ML models to achieve reproducible solutions that can be successfully deployed in real-world settings.
      • Beam A.L.
      • Manrai A.K.
      • Ghassemi M.
      Challenges to the reproducibility of machine learning models in health care.
      ,

      McDermott MB, Wang S, Marinsek N, et al. Reproducibility in machine learning for health. Paper presented at: 2019 Reproducibility in Machine Learning, RML@ ICLR 2019 Workshop. New Orleans, May 6, 2019.

      Ability to generalize, meaning that conclusions and algorithms generated from the specific population studied can be extended to the population at large, is essential for successful translation, deployment, and adoption of ML in the clinical setting. ML is the scientific discipline concerned with developing computational models that improve their performance with new experience. A quantitatively defined performance measure drives the model building and evaluation process.
      • Forghani R.
      • Savadjiev P.
      • Chatterjee A.
      • et al.
      Radiomics and artificial intelligence for biomarker and prediction model development in oncology.
      Developing a ML model requires 3 major components: representation, evaluation, and optimization.
      • Domingos P.M.
      A few useful things to know about machine learning.
      The representation component involves deciding on a type of model or algorithm that is used to represent the association between the input data and the outcomes of interest. Examples of such models are support vector machines, random forests, and neural networks.
      • Friedman J.
      • Hastie T.
      • Tibshirani R.
      The evaluation component concerns defining and calculating quantitative performance measures that show the goodness of a representation; that is, the capability of a given model to represent the association between inputs and outputs. Among performance measures that are commonly used for model evaluations are accuracy, precision, recall, mean squared error, and the Jaccard index.

      Bertels J, Eelbode T, Berman M, et al. Optimizing the Dice score and Jaccard index for medical image segmentation: Theory and practice. Paper presented at: International Conference on Medical Image Computing and Computer-Assisted Intervention. Shenzhen (China), October 13-17, 2019.

      ,
      • Tharwat A.
      Classification assessment methods.
      The aim of the optimization component is to update the parameters of a given representation (ie, model) so as to improve the performance measures of interest. Examples of approaches used for optimization are gradient descent–based methods and Newton's method.
      • Goodfellow I.
      • Bengio Y.
      • Courville A.
      Deep learning.
      In developing ML models, the available data are often partitioned into 3 disjoint sets commonly referred to as training, validation, and test sets. The data from the training set are used to train the model. A model is often trained through an iterative process. In each iteration, a performance measure reflecting the error made by the model when applied to the data in the training set is calculated. This measure is used to update the model parameters in order to reduce the model error on the training set. The model parameters are a set of variables associated with the model, and their values are learned during the training process. Besides model parameters, there might be other variables associated with a model whose values are not updated during training. These variables are referred to as hyperparameters. The optimal or near-optimal values for hyperparameters are determined using data in the validation set. This process is often referred to as hyperparameter tuning. After training and fine-tuning the model, data from the test set are used to evaluate the model's ability to generalize (ie, its performance on unseen data).
      This review article first describes the fundamental concepts required for understanding the model evaluation process. It then explains the main challenges that might affect the ability of ML models to generalize. Next, it highlights common workflows for evaluation of ML models. In addition, it discusses the implications and importance of a robust experimental design for facilitating future certification and the strategies required for deployment of ML models in clinical settings.

      Estimating error in model evaluation

      In ML applications, available data are often partitioned into training, validation, and test sets. A performance measure is used to reflect the model error when applied to data in these sets. The error made by a model when applied to the data in the training set is referred to as training error, and the error made by a model when applied to data in a test set is referred to as test error. The test error is used as an estimate for the generalization error (ie, the error of the model when applied to unseen data). Therefore, it is essential that data in the test set are not used during training and fine-tuning of the model. Irreducible error, also referred to as Bayes error, is another type of error resulting from the inherent noise in the data. Irreducible error is the lowest possible error achievable for a given task using the available data. This error is independent of the model being used and often cannot be mathematically calculated. It is often estimated by the error made by a group of humans with the domain expertise for the task at hand. The resulting estimate is considered as an upper bound for irreducible error. Understanding these error types is important for developing and evaluating ML models.
      Underfitting and overfitting are defined based on the error types described earlier. An underfitted model achieves a training error that is much higher than the irreducible error, and an overfitted model achieves a training error that is much lower than the test error. These concepts are associated with the model complexity; that is, the capacity of a model to represent associations between model inputs and outputs. The complexity of different models can be compared by their number of parameters and the way these parameters interact in the model (eg, linear, nonlinear). Models with high complexity often tend to be too sensitive to the dataset used for training. Often the predictions of a model when trained using different datasets, all sampled from the same population, have a high variance, introducing error. Models with high complexity and consequently high error variance tend to overfit. In contrast, low-complexity models may be biased to learning simpler associations between inputs and outputs that might not be sufficient for representing true associations. For example, a linear model cannot represent an exponential association between inputs and outputs. Low-complexity models tend to underfit. Developing an optimal model requires a trade-off between bias and variance by controlling model complexity. Also, techniques such as bagging and boosting can be used to control the bias and variance of a model.

      Quinlan JR. Bagging, boosting, and C4. 5. Paper presented at: AAAI/IAAI, Vol. 1. Portland (Oregon), August 4–8, 1996.
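      As an illustration of how bagging and boosting address variance and bias, the following minimal Python sketch (not from the original article; it uses scikit-learn and a synthetic dataset) compares a bagged ensemble of high-variance trees with a boosted ensemble of high-bias stumps.

        # Minimal sketch (synthetic data, scikit-learn): bagging as a variance-reduction
        # strategy and boosting as a bias-reduction strategy.
        from sklearn.datasets import make_classification
        from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=500, n_features=20, random_state=0)

        # Bagging: average many deep (high-variance) trees to reduce variance.
        bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

        # Boosting: sequentially combine shallow (high-bias) trees to reduce bias.
        boosted = GradientBoostingClassifier(max_depth=1, n_estimators=100, random_state=0)

        for name, model in [("bagging", bagged), ("boosting", boosted)]:
            scores = cross_val_score(model, X, y, cv=5)
            print(f"{name}: mean accuracy = {scores.mean():.3f}")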

      As mentioned, the performance on the test set provides an estimation of the generalization error, which indicates the performance of the algorithm on external data. The algorithm’s error can be decomposed into 3 terms: irreducible error, estimation bias error, and estimation variance error.
      • Friedman J.
      • Hastie T.
      • Tibshirani R.
      The irreducible error, as the name suggests, is irreducible and is caused by noise that usually exists in any test set. The estimation bias and variance are errors caused by the model complexity. Low-complexity models have a high bias and low variance and tend to underfit the training data. Underfitted models typically have a low training performance and low validation and testing performances. In contrast, high-complexity models have a low bias but a high variance and tend to overfit to the training data. An overfitted model typically has a low training error and high validation and testing errors. During model development (ie, before deploying a model on unseen data or a test set), the best method to monitor the performance of a model and its complexity is to observe the errors on the training and validation sets and compare them with the estimate of the irreducible error.
      • Obermeyer Z.
      • Emanuel E.J.
      Predicting the future—big data, machine learning, and clinical medicine.
      Another factor that should be considered when evaluating a ML model is the choice of performance measure. For example, consider an application of ML in glioblastoma (GBM) for detecting the pixels corresponding to GBM in a medical image. As shown in Fig. 1, the pixels representing the GBM make up less than 10% of all pixels in the image. Thus, in this example, if an evaluation metric such as global pixel accuracy is used, a learning algorithm may tend toward a naive solution that classifies all pixels as non-GBM and still achieves a pixel accuracy of more than 90%, defeating the original goal of detecting GBM. Suppose a more appropriate metric, such as the Dice score or the Jaccard index, is used instead. In this case, the aforementioned naive solution achieves a score of zero, which discourages such a solution and encourages the model to find solutions that are aligned with the task at hand.
      Fig. 1 GBM segmentation example. The pixels in the segmented area inside the yellow contour, corresponding to the heterogeneously enhancing mass, are less than 10% of the pixels in the image; therefore, care is needed in choosing a performance measure that truly measures performance for GBM pixel recognition rather than an indirect measure (such as global pixel accuracy) that favors a naive solution based on classifying all pixels as non-GBM.
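      To make the effect of metric choice concrete, the following minimal Python sketch (using an assumed synthetic mask rather than the data in Fig. 1) scores the naive all-background prediction with global pixel accuracy, the Dice score, and the Jaccard index.

        # Minimal sketch (assumed masks, not the article's data): why global pixel
        # accuracy is misleading when the structure of interest is a small fraction
        # of the image.
        import numpy as np

        truth = np.zeros((256, 256), dtype=bool)
        truth[100:140, 100:140] = True            # ~2% of pixels are "tumor"
        naive_pred = np.zeros_like(truth)         # naive model: everything is background

        accuracy = (naive_pred == truth).mean()   # high, despite detecting nothing

        intersection = np.logical_and(naive_pred, truth).sum()
        dice = 2 * intersection / (naive_pred.sum() + truth.sum() + 1e-8)         # 0.0
        jaccard = intersection / (np.logical_or(naive_pred, truth).sum() + 1e-8)  # 0.0

        print(f"global accuracy = {accuracy:.3f}, Dice = {dice:.3f}, Jaccard = {jaccard:.3f}")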

      Evaluation of machine learning models: terminology

      This article uses the following terminology. Training data refers to the data used for learning the model parameters during the training process. The validation data refers to the data used for searching the optimal set of hyperparameters for a model. The test set refers to the data not being used during the model building (ie, data not being used for model training and fine-tuning).
      In ML literature, the terms validation and test have been used interchangeably.
      • Erickson B.J.
      • Korfiatis P.
      • Akkus Z.
      • et al.
      Machine learning for medical imaging.
      • Steyerberg E.W.
      • Bleeker S.E.
      • Moll H.A.
      • et al.
      Internal and external validation of predictive models: a simulation study of bias and precision in small samples.
      • Steyerberg E.W.
      • Harrell F.E.
      Prediction models need appropriate internal, internal–external, and external validation.
      Furthermore, in medical ML, models can be trained on data from a single or several institutions and evaluated on data from another institution. Although this process is similar to the testing stage in the traditional ML workflow, this form of evaluation has been referred to as external validation,
      • Kann B.H.
      • Hicks D.F.
      • Payabvash S.
      • et al.
      Multi-institutional validation of deep learning for pretreatment identification of extranodal extension in head and neck squamous cell carcinoma.
      • Welch M.L.
      • McIntosh C.
      • Traverso A.
      • et al.
      External validation and transfer learning of convolutional neural networks for computed tomography dental artifact classification.
      • Datema F.R.
      • Ferrier M.B.
      • Vergouwe Y.
      • et al.
      Update and external validation of a head and neck cancer prognostic model.
      and the validation of the model using the local data has been referred to as internal validation.
      • König I.R.
      • Malley J.
      • Weimar C.
      • et al.
      Practical experiences on the necessity of external validation.
      ,
      • Kocak B.
      • Yardimci A.H.
      • Bektas C.T.
      • et al.
      Textural differences between renal cell carcinoma subtypes: machine learning-based quantitative computed tomography texture analysis with independent external validation.
      This loose terminology causes confusion among researchers. To avoid such confusion, the authors suggest using the terminology in Table 1.
      Table 1. Suggested terminology for machine learning evaluation
      Validation data: Data used for learning hyperparameters or model selection.
      Test data: Data used for providing an unbiased estimate of the generalization error. Test data should not be used for learning parameters or hyperparameters of the model.
      Validation error: The error of a model on the validation set. This is not an unbiased estimate of the generalization error; it is used for model selection and fine-tuning of hyperparameters.
      Test error: An unbiased estimate of the generalization error of a model, calculated as the model error when applied to the test data.
      Model validation: The process of calculating the validation error for a model.
      Model evaluation: The process of calculating the test error for a model.
      Internal evaluation: Model evaluation using the local test data.
      External evaluation: Model evaluation using external test data.

      Model evaluation workflow

      Different approaches for ML model validation and evaluation are reviewed here.

       Holdout Validation

      Holdout validation is the most common approach for evaluating ML models (Fig. 2). In this approach, the available data are partitioned into training, validation, and test sets.
      • Friedman J.
      • Hastie T.
      • Tibshirani R.
      The proportion of the available data used for each set depends on the number of available data points, the data variability, and the characteristics of the model being used. In general, the proportion of data assigned to the validation set relative to the training set needs to be larger when working with small datasets.
      • Guyon I.
      A scaling law for the validation-set training-set size ratio.
      Typically, 70% of the available data are used for training, 15% for validation, and the remaining 15% for testing the model, although the percentage allocations can vary.
      • Forghani R.
      • Chatterjee A.
      • Reinhold C.
      • et al.
      Head and neck squamous cell carcinoma: prediction of cervical lymph node metastasis by dual-energy CT texture analysis with machine learning.
      As datasets grow, the ratio between the validation and training sets can be smaller because the validation sets are intrinsically large and better reflect the data.
      • Guyon I.
      • Makhoul J.
      • Schwartz R.
      • et al.
      What size test set gives good error rate estimates?.
      The training and validation sets are used for model building. The data in the training set are used to learn the model parameters. The validation data are used to determine the model hyperparameters during a process called hyperparameter tuning. Although often computationally demanding, hyperparameter tuning can be accomplished in a parallel and automated manner.
      • Hutter F.
      • Kotthoff L.
      • Vanschoren J.
      Automated machine learning: methods, systems, challenges.
      Fig. 2The holdout method. The blue and green samples represent different samples from 2 different classes. In the holdout method, samples are randomly assigned to the training (purple box), validation (yellow box), or test (orange box) sets. When the dataset used for training and evaluation of the ML model is small, the performance measures using validation and test sets are sensitive to the composition of these sets, and the resulting performance measures often are not reliable.
      Distinguishing between model parameters and model hyperparameters is essential. Model parameters are the model properties (variables) whose values are learned during algorithm training using the training set. For example, weights and bias values in a neural network or the coefficients in a linear regression model are considered parameters. In contrast, hyperparameters are not learned using the training set; the data in the validation set are used to guide the hyperparameter selection. For example, the number of layers in a neural network and the regularization values in a ridge regression model are considered as hyperparameters. Also, the choice of a model, among several different models, is considered a hyperparameter.
      After the model is trained and fine-tuned, it is evaluated on the test set to provide an estimate of the model generalization error (ie, the error of the resulting model when applied to unseen data). Therefore, it is essential that the test data are not used during training and fine-tuning the models; otherwise, the estimate for the generalization error would be overoptimistic and unreliable.
      • Russell S.
      • Norvig P.
      Artificial intelligence: a modern approach.
      The holdout validation approach is commonly used when training deep learning models with large-scale datasets because it is computationally less demanding. However, for small datasets, this approach is criticized for not using the whole dataset. A small test set might not provide a reliable estimate of model performance, and the resulting performance measures might be sensitive to the choice of the test set. For small datasets, selecting a test set large enough to be representative of the underlying data is often impossible. Further, using a larger test set means that fewer samples are available to be used for training the model, which negatively affects the performance of the resulting model. Also, when fine-tuning a model using this approach, the resulting model may be sensitive to the choice of the validation set, resulting in models with low ability to generalize.
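      As a concrete illustration of the holdout split described above, the following minimal Python sketch (synthetic data, scikit-learn) builds an approximately 70/15/15 partition with two successive calls to train_test_split.

        # Minimal sketch (synthetic data): a 70/15/15 holdout split built from two
        # successive calls to scikit-learn's train_test_split.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

        # First split off the 15% test set and lock it away.
        X_trainval, X_test, y_trainval, y_test = train_test_split(
            X, y, test_size=0.15, stratify=y, random_state=0)

        # Then carve out the validation set (15% of the total, ie, ~17.6% of the remainder).
        X_train, X_val, y_train, y_val = train_test_split(
            X_trainval, y_trainval, test_size=0.15 / 0.85, stratify=y_trainval, random_state=0)

        print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150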

       Cross-Validation

      Cross-validation is a resampling approach used for the evaluation of ML models. The aim of cross-validation is to provide an unbiased estimate of model performance. Compared with holdout validation, this approach tends to provide a more accurate estimate of generalization error when dealing with small datasets. Various cross-validation techniques for the evaluation of ML models are reviewed next.

       K-fold cross-validation

      In k-fold cross-validation (KFCV), data points are randomly assigned to k disjoint groups (Fig. 3). In an iterative process, one of these k groups is selected as the validation set each time, and the remaining k − 1 groups are combined and used as the training set. This process is iterated k times so that each group is selected once as the validation set. The average of the performance measures across the k iterations is used as the estimate of the validation error. Compared with holdout validation, this approach is computationally more demanding because it requires training and evaluating the model k times. However, because the model evaluation is performed k times, the variance of the performance measure is reduced and the resulting estimate is more reliable.
      Fig. 3Three-fold cross-validation (red bounding box). In practice, a portion of samples is locked away for calculating an unbiased estimate of the generalization error. The cross-validation method takes the remaining data as input and randomly assigns them to k disjoint groups (k = 3 in this example). In an iterative process, each time one of these k groups is selected as the validation set (yellow box) and the remaining k − 1 groups are combined and used as the training set (purple boxes). This process is iterated k times so that each group is selected as a validation set once. The average of the model error on the validation sets then can be used as an estimate of the validation error. Note that, in practice, training and validation data in each iteration are used for learning model parameters as well as selecting hyperparameters for the model. Therefore, the resulting estimate is considered an estimate for the validation error, not for the test error, because the validation data are used for both learning the model parameters and hyperparameters; therefore, it might provide an overoptimistic estimate of generalization error, which is the reason why a test set is locked away before conducting cross-validation. In practice, only in rare cases, such as developing a simple linear regression model that includes a fixed set of variables where the model has no hyperparameters to tune, is a test set not locked away. In such cases, the average of model error (performance measure) on the k validation sets can be used as an unbiased estimate of the generalization error (performance measure).
      The value of k is often chosen such that each of the resulting k groups is a representative sample of the dataset. Another factor that plays a role in determining the value of k is the availability of computational resources. Also, KFCV can be run in parallel to speed up the model evaluation process, which can be accomplished because each iteration of KFCV is independent of the other iterations. The 10-fold and 5-fold cross-validations are the most widely used KFCVs for evaluating ML models.
      • Friedman J.
      • Hastie T.
      • Tibshirani R.
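      The following minimal Python sketch (synthetic data, scikit-learn) shows a 5-fold cross-validation in which the mean score across folds serves as the estimate of the validation error.

        # Minimal sketch (synthetic data): 5-fold cross-validation with scikit-learn.
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import KFold, cross_val_score

        X, y = make_classification(n_samples=300, n_features=20, random_state=0)

        cv = KFold(n_splits=5, shuffle=True, random_state=0)
        scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
        print(scores, scores.mean())  # per-fold scores and their average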

       Stratified k-fold cross-validation

      Class imbalance is a common phenomenon in ML. Class imbalance occurs when there is a substantial difference between the number of samples for the majority class and the minority class, where the majority class is defined as the class with the highest number of samples and the minority class is defined as the class with the lowest number of samples. In such a setting, KFCV might lead to unstable performance measures. There might be zero or very few samples from the minority class in 1 or a few of the data folds, which would substantially affect the evaluation metrics for such folds. In the stratified KFCV (SKFCV), each of the k groups of data points are sampled so that the distribution of the classes in each fold closely mirrors the distribution of classes in the whole dataset.
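      The following minimal Python sketch (synthetic imbalanced data, scikit-learn) illustrates how StratifiedKFold preserves the class proportions of the whole dataset in every fold.

        # Minimal sketch (synthetic imbalanced data): stratified k-fold keeps the
        # minority-class fraction roughly constant across folds.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import StratifiedKFold

        X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
            print(f"fold {fold}: minority fraction in validation = {y[val_idx].mean():.2f}")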

       Leave-one-out cross-validation

      Although KFCV provides more reliable estimates for generalization error, the resulting model only uses k − 1 groups for training and validation. Leave-one-out cross-validation (LOOCV) uses k = n, where n is the number of samples in the dataset; therefore, all but 1 sample are used for model training. LOOCV is computationally more demanding because it requires training n models. Therefore, it cannot be used when the dataset is very large or the training process for a single model is computationally expensive. LOOCV has been recommended for small or imbalanced datasets.
      • Wong T.-T.
      Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation.
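      A minimal Python sketch of LOOCV (synthetic data, scikit-learn) is shown below; n models are trained, each validated on a single held-out sample.

        # Minimal sketch (synthetic data): leave-one-out cross-validation.
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import LeaveOneOut, cross_val_score

        X, y = make_classification(n_samples=60, n_features=10, random_state=0)

        scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
        print(len(scores), scores.mean())  # 60 single-sample evaluations and their mean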

       Leave-p-out cross-validation

      Leave-p-out cross-validation (LPOCV) is an extended form of LOOCV, where validation sets can have p elements instead of 1. It is an exhaustive approach designed to use all of the possible validation sets of size p for the evaluation of ML models. For a dataset of n distinct data points, the number of distinct validation sets of size p, where p = n/k, is as follows:
      C(n, p) = \frac{(n - p + 1) \times (n - p + 2) \times \cdots \times n}{1 \times 2 \times \cdots \times p}


      Even for moderately large datasets with p > 1, this value grows exponentially, and LPOCV quickly becomes impractical. For small datasets, a value of p = 2, known as leave-pair-out cross-validation, is often used to achieve a robust estimate of the model performance.
      • Airola A.
      • Pahikkala T.
      • Waegeman W.
      • et al.
      An experimental comparison of cross-validation techniques for estimating the area under the ROC curve.
      Note that for p = 1, this approach is equivalent to LOOCV.
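      The following minimal Python sketch (standard library plus scikit-learn) illustrates how quickly the number of leave-p-out validation sets grows and shows the leave-pair-out case.

        # Minimal sketch: the number of leave-p-out validation sets, C(n, p), grows
        # very quickly, which is why p = 2 (leave-pair-out) is the usual practical choice.
        from math import comb
        from sklearn.model_selection import LeavePOut

        for n in (20, 50, 100):
            print(n, comb(n, 2), comb(n, 5))   # eg, C(100, 5) is already ~75 million

        lpo = LeavePOut(p=2)
        print(lpo.get_n_splits(list(range(20))))  # 190 = C(20, 2)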

       Leave-one-group-out cross-validation

      In some applications, there might be samples in the dataset that are not independent of each other and are somehow related. In such scenarios, the knowledge of one sample from a group might reveal information about the status of other samples in the same group. For example, different pathology slides for the same patients or different MRI scans of a patient during the course of treatment might reveal information about the patient’s disease. Having samples from these groups scattered in training, validation, and test sets results in overoptimistic performance evaluations and leads to a lack of ability to generalize. Leave-one-group-out cross-validation (LOGOCV) is similar to LOOCV but, instead of leaving 1 data point out, it leaves 1 group of samples out, which requires that, for each sample in the dataset, a group identifier be provided. These group identifiers can represent domain-specific stratification of the samples. For example, when developing a model for classifying MRI scans into cancerous and noncancerous, all scans for a patient during an unsuccessful course of treatment should have the same group identifier.
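      The following minimal Python sketch (synthetic data, scikit-learn) illustrates LOGOCV, using hypothetical patient identifiers as the group labels so that all samples from a patient stay in the same split.

        # Minimal sketch (synthetic data): leave-one-group-out keeps all samples that
        # share a group identifier (eg, the same patient) in the same split.
        import numpy as np
        from sklearn.model_selection import LeaveOneGroupOut

        X = np.random.default_rng(0).normal(size=(12, 5))
        y = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1])
        groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])  # hypothetical patient IDs

        for train_idx, val_idx in LeaveOneGroupOut().split(X, y, groups):
            print("held-out patient:", set(groups[val_idx]))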

       Nested cross-validation

      Most ML models rely on several hyperparameters. Tuning these hyperparameters is a common practice in building ML solutions. Often, hyperparameter values that lead to the best performance are experimentally sought. In a traditional cross-validation, where data are split into training and validation sets, experimenting with several models and searching for their optimal hyperparameter values often makes the resulting validation error an overoptimistic performance measure if used for estimating generalization error. Therefore, a test set should be locked away and not be used for model training and hyperparameter tuning. The model performance on this test set can be used as a reliable estimate of generalization error. Selecting a single subset of data as the test set for small datasets leads to estimates for generalization error that have high variance and are sensitive to the composition of the test set. Nested cross-validation (NCV) is used to address this challenge (Fig. 4).
      Fig. 4A 4-fold outer and 3-fold inner NCV. First, the samples in the dataset are randomly shuffled. Then the outer loop uses different train (purple box), validation (yellow box), and test (orange box) splits. The outer folds 1, 2, 3, and 4 are depicted in the top-left, top-right, bottom-left, and bottom-right corners, respectively. For each outer fold, a 3-fold cross-validation highlighted in a red box is used. The model with different hyperparameters is trained using the training set, and the optimal hyperparameters are chosen based on the average performance of the trained models on the validation sets. In the outer loop, generalization error is estimated by averaging test error over the 4 test sets.
      NCV consists of an outer cross-validation loop and an inner cross-validation loop. The outer loop uses different train, validation, and test splits. The inner loop takes a train and validation set chosen by the outer loop, then the model with different hyperparameters is trained using the training set, and the best hyperparameters are chosen based on the performance of the trained models on the validation set. In the outer loop, generalization error is estimated by averaging test error over the test sets in the outer loop. Fig. 4 shows a 4-fold outer with 3-fold inner NCV.
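      The following minimal Python sketch (synthetic data, scikit-learn) mirrors Fig. 4: a 3-fold inner loop tunes the hyperparameters and a 4-fold outer loop estimates the test error.

        # Minimal sketch (synthetic data): nested cross-validation. The inner loop
        # (GridSearchCV) tunes hyperparameters; the outer loop estimates test error.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
        from sklearn.svm import SVC

        X, y = make_classification(n_samples=200, n_features=20, random_state=0)

        inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
        outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

        # Inner loop: choose C and gamma using only the training/validation folds.
        tuned_model = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=inner_cv)

        # Outer loop: each outer test fold is never seen during tuning.
        outer_scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
        print(outer_scores.mean())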

      Data used for model evaluation

      Different approaches for model development and evaluation and the impact on algorithm performance were discussed earlier. Here, this article discusses the important attributes of the datasets used for developing reliable ML algorithms.

       Data: Size Matters

      In some disciplines, developing large datasets for building and evaluating ML models might not be practical. For example, in the medical domain, developing large-scale datasets is often not an option because of the rarity of the phenotype under study, limited financial resources, limited expertise required for data preparation or annotation, patients’ privacy concerns, or other ethical or legal concerns and barriers. For example, in the medical imaging domain, experienced physicians are required to manually annotate medical images reliably. Furthermore, because of the patients’ privacy concerns and specific legal and regulatory requirements in different jurisdictions, developing a large-scale multi-institutional dataset can be challenging. Therefore, most research is conducted using small datasets. Splitting such small datasets into train, validation, and test sets further shrinks the dataset used for model evaluation, which leads to unreliable estimates of performance measures. Consequently, the resulting models suffer from a lack of ability to generalize and lack of reproducibility.
      • Beam A.L.
      • Manrai A.K.
      • Ghassemi M.
      Challenges to the reproducibility of machine learning models in health care.
      When the dataset used for model building and evaluation is small, the LOOCV or LOGOCV approach is recommended for model evaluation.
      If possible, public datasets can be added to the local dataset; however, depending on the structure of a public dataset, there could be an inherent selection bias. Clinicians must be aware of this issue in order to evaluate its impact on the ability to generalize. One example is a mucosal head and neck cancer set consisting mostly of a subset of the disease; for example, human papilloma virus (HPV)–positive oropharyngeal head and neck squamous cell carcinomas (HNSCCs) treated with radiation and chemotherapy. Models trained on such a dataset (eg, for predicting treatment response and outcome) may not be generalizable to HNSCCs of the oral cavity, which are typically HPV negative and treated surgically, even though they are still pathologically mucosal HNSCC. The quality of labeling and annotations can also affect model performance. The variability between the public dataset annotations and the annotations in the training data may also lead to models with low ability to generalize.

      Cohen JP, Hashir M, Brooks R, et al. On the limits of cross-domain generalization in automated X-ray prediction. arXiv preprint arXiv:200202497. 2020.

      Therefore, data aggregation does not always lead to improving model performance and generalization.
      • Saha A.
      • Harowicz M.R.
      • Mazurowski M.A.
      Breast cancer MRI radiomics: an overview of algorithmic features and impact of inter-reader variability in annotating tumors.
      Crowdsourced annotations have also been used to address the challenge of annotating medical datasets.
      • Albarqouni S.
      • Baur C.
      • Achilles F.
      • et al.
      Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images.
      • McKenna M.T.
      • Wang S.
      • Nguyen T.B.
      • et al.
      Strategies for improved interpretation of computer-aided detections for CT colonography utilizing distributed human intelligence.
      • Nguyen T.B.
      • Wang S.
      • Anugu V.
      • et al.
      Distributed human intelligence for colonic polyp classification in computer-aided detection for CT colonography.
      Several publications have explored the differences between expert contours and crowdsourced nonexpert contours, which are generally considered as noisy annotations.
      • Albarqouni S.
      • Baur C.
      • Achilles F.
      • et al.
      Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images.
      • McKenna M.T.
      • Wang S.
      • Nguyen T.B.
      • et al.
      Strategies for improved interpretation of computer-aided detections for CT colonography utilizing distributed human intelligence.
      • Nguyen T.B.
      • Wang S.
      • Anugu V.
      • et al.
      Distributed human intelligence for colonic polyp classification in computer-aided detection for CT colonography.
      This research suggests that crowdsourced annotations can translate to improving model performance only with carefully crafted strategies.
      • Albarqouni S.
      • Baur C.
      • Achilles F.
      • et al.
      Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images.
      • McKenna M.T.
      • Wang S.
      • Nguyen T.B.
      • et al.
      Strategies for improved interpretation of computer-aided detections for CT colonography utilizing distributed human intelligence.
      • Nguyen T.B.
      • Wang S.
      • Anugu V.
      • et al.
      Distributed human intelligence for colonic polyp classification in computer-aided detection for CT colonography.
      Another approach commonly used in medical imaging is increasing the number of samples using patch-based approaches.
      • Greenspan H.
      • Van Ginneken B.
      • Summers R.M.
      Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique.
      In these approaches, two-dimensional or three-dimensional (3D) patches are extracted from medical images. These patches are then used for model training and evaluation. Because several patches are often extracted from a single image, these approaches increase the number of available data points for developing and evaluating ML models. For example, instead of treating the GBM example in Fig. 1 as a single training image, it can be split into several small patches of GBM samples. In such scenarios, when the dataset is small, using an LOGOCV approach for model evaluation is recommended to achieve a reliable evaluation of the model performance; otherwise, the resulting performance measures might be unreliable and overoptimistic.
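      The following minimal Python sketch (assumed array shapes, not the article's data) extracts 2D patches from each image and records a group identifier per source image so that a group-aware split (eg, LOGOCV) can keep all patches from one patient together.

        # Minimal sketch (assumed shapes): patch extraction with per-image group IDs.
        import numpy as np

        def extract_patches(image, patch_size=32, stride=32):
            """Return strided square patches from a 2D image."""
            patches = []
            h, w = image.shape
            for r in range(0, h - patch_size + 1, stride):
                for c in range(0, w - patch_size + 1, stride):
                    patches.append(image[r:r + patch_size, c:c + patch_size])
            return np.stack(patches)

        images = [np.random.default_rng(i).normal(size=(128, 128)) for i in range(3)]
        all_patches, groups = [], []
        for patient_id, img in enumerate(images):
            p = extract_patches(img)
            all_patches.append(p)
            groups.extend([patient_id] * len(p))   # one group identifier per source image
        all_patches = np.concatenate(all_patches)
        print(all_patches.shape, len(groups))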
      Using data augmentation is another alternative for increasing the number of samples used for training and evaluation of ML models. Among examples of commonly used data augmentation techniques are geometric affine transformations such as translation, scaling, and rotation. Data augmentation has been widely used in building ML models for image data, and sophisticated software packages are available for this task.
      • Buslaev A.
      • Iglovikov V.I.
      • Khvedchenya E.
      • et al.
      Albumentations: fast and flexible image augmentations.
      ,

      Cubuk ED, Zoph B, Mane D, et al. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:180509501. 2018.

      However, most of these tools have been designed for regular RGB data (images with three color channels: red, green, and blue) and do not support 3D medical images. Therefore, for 3D images, simple augmentations such as flipping and rotation have been commonly used. Synthetic image generation using generative adversarial networks (GANs) has also been used for data augmentation.
      • Zhao H.
      • Li H.
      • Maurer-Stroh S.
      • et al.
      Synthesizing retinal and neuronal images with generative adversarial nets.
      • Salehinejad H.
      • Colak E.
      • Dowdell T.
      • et al.
      Synthesizing chest x-ray pathology for training deep convolutional neural networks.

      Han C, Kitamura Y, Kudo A, et al. Synthesizing diverse lung nodules wherever massively: 3D multi-conditional GAN-based CT image augmentation for object detection. Paper presented at: 2019 International Conference on 3D Vision (3DV). Québec (Canada), September 16-19, 2019.

      • Frid-Adar M.
      • Diamant I.
      • Klang E.
      • et al.
      GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification.

      Beers A, Brown J, Chang K, et al. High-resolution medical image synthesis using progressively grown generative adversarial networks. arXiv preprint arXiv:180503144. 2018.

      Although data augmentation is an important tool in developing ML models, proper and impactful application has to be evaluated on a case-by-case basis. When using data augmentation, clinicians must ensure that it is not simply being used to amplify or overrepresent information within the training data, which could result in overfitting.
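      The following minimal Python sketch (an assumed 3D volume, NumPy only) applies the kind of simple flips and 90-degree rotations commonly used to augment 3D medical images; any labels or masks would need to be transformed identically.

        # Minimal sketch (assumed 3D volume): simple geometric augmentations.
        import numpy as np

        def augment_volume(volume, rng):
            """Randomly flip and rotate a 3D volume."""
            if rng.random() < 0.5:
                volume = np.flip(volume, axis=0)           # flip along one axis
            k = rng.integers(0, 4)
            volume = np.rot90(volume, k=k, axes=(1, 2))    # in-plane 90-degree rotation
            return volume

        rng = np.random.default_rng(0)
        volume = rng.normal(size=(32, 64, 64))             # depth x height x width
        augmented = augment_volume(volume, rng)
        print(augmented.shape)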

       Data: Variety Matters

      Alongside the size of the datasets used for developing and evaluating ML models, the variety in the dataset is a crucial element to consider. Datasets are typically gathered under a variety of circumstances. For example, in medical imaging, scans from different institutions may vary substantially because of factors such as different scanner settings, differences in disease prevalence at a specific institution because of population demographics, and the use of different protocols. Even within an institution, there are frequently different scanner types, resulting in technical variations, among other potential sources of technical variation and noise. With these factors affecting data variability, the training, validation, and test sets should represent these variations in order to create a generalizable model.
      Furthermore, because models are selected based on the performance of the validation set, it is crucial that the distribution of the validation set follows the distribution of the test set.
      • Goodfellow I.
      • Bengio Y.
      • Courville A.
      Deep learning.
      If the data distribution of the validation set is different from that of the test set (eg, validation and test sets are from different institutions or different scanners), the model performance based on the validation set may not translate to a clear picture of model performance for the test set. This situation often manifests as a decline in the model performance measures from the validation set to the test set.
      In some medical imaging research, data collected from selected institutions have been used for model building, and the performance of the models has been evaluated using data from a different institution.
      • Kann B.H.
      • Hicks D.F.
      • Payabvash S.
      • et al.
      Multi-institutional validation of deep learning for pretreatment identification of extranodal extension in head and neck squamous cell carcinoma.
      • Welch M.L.
      • McIntosh C.
      • Traverso A.
      • et al.
      External validation and transfer learning of convolutional neural networks for computed tomography dental artifact classification.
      • Datema F.R.
      • Ferrier M.B.
      • Vergouwe Y.
      • et al.
      Update and external validation of a head and neck cancer prognostic model.
      For such models, performance on the test set may fall short of what is achievable if there is a considerable difference between the distribution of the data used for training and that of the data used for evaluating the models.

      Storkey A. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning; 2009. p. 3-28.

      If factors that might substantially affect the data distribution can be controlled, using data from 2 institutions can improve performance measures. In such scenarios, data from the first institution can be used for model development and data from the second institution for model evaluation. Because no test set needs to be held out from the first institution's data, a larger dataset can be used for model building. In addition, this approach does not require sharing the original dataset between institutions, because the model trained at the first institution can be shared with the second institution for evaluation.
      Another important factor that needs to be considered when using data from 1 institution for model development and data from another institution for model evaluation is data representation. Although this approach is considered by many as the gold standard for developing a model and evaluating its performance, clinicians need to be aware of its inherent potential pitfalls. This approach can only be successful if the unique characteristics of the evaluation data (eg, from the second institution) are reflected or represented in the training data (eg, from the first institution). If this is not the case, the performance may not be optimal and the generalization error could be overestimated. To make a practical comparison, if an institution is deploying new image analysis software developed based on data from other institutions, the out-of-the-box algorithm will not perform optimally unless the major characteristics of the scans at the deploying institution are compatible with those of the data used for developing the image analysis software. Using this logic, it also follows that, when deploying an algorithm in a new environment, it may be worthwhile to either evaluate for representation through analyses such as outlier analysis (discussed later) or first perform additional training and validation in the new environment for optimization before the algorithm is deployed for use in that environment.
      In addition, certain data samples may be poorly represented in a given dataset. It is important to identify these samples and determine whether they should be excluded or whether more of such samples are necessary in the dataset. Consider a dataset where a small subset of the images are degraded with severe artifacts. Artifacts are common in clinical practice and can be caused by noise, beam hardening from normal anatomic structures, or metal implants. To deploy a generalized model in practice, a model should be able to predict and properly process the image if significant artifact is present; therefore, it is essential that the model is exposed to artifacts in the training and evaluation phases. If images with severe artifact are not well represented, the model may lack the ability to process these cases correctly. To address this issue, a possible approach is the use of preprocessing techniques for artifact reduction as a first step before feeding images to the trained model.
      • Philipsen R.H.H.M.
      • Maduskar P.
      • Hogeweg L.
      • et al.
      Localized energy-based normalization of medical images: application to chest radiography.
      In this way, the trained model is treated as a specialized model that is only able to make a prediction or classification in the absence of severe artifact.
      To determine whether a sample is poorly represented, techniques that measure the similarity across images in a dataset can be used to identify outliers.
      • Zhang M.
      • Leung K.H.
      • Ma Z.
      • et al.
      A generalized approach to determine confident samples for deep neural networks on unseen data.

      Salimans T, Goodfellow I, Zaremba W, et al. Improved techniques for training gans. Paper presented at: Advances in neural information processing systems. Barcelona (Spain), December 5-10, 2016.

      Heusel M, Ramsauer H, Unterthiner T, et al. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Paper presented at: Advances in Neural Information Processing Systems. Long Beach (CA), December 4-9, 2017.

      These techniques use pretrained models to compute feature vectors for each sample in the dataset. Then, similarity scores between all samples are calculated to detect outliers (ie, samples that are different from the rest of the samples in the dataset). Such samples tend to have characteristics that are poorly represented in the dataset. These techniques are also useful to consider when using data from different institutions.

      Storkey A. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning; 2009. p. 3-28.

      ,

      Glocker B, Robinson R, Castro DC, et al. Machine learning with multi-site imaging data: An empirical study on the impact of scanner effects. arXiv preprint arXiv:191004597. 2019.

      By identifying outliers or underrepresented samples, samples can be removed from the validation sets to fine-tune performance to a more specific application, or samples similar to the outlier samples can be introduced to the training data to increase confidence in these samples. If a model is designed for working in a specific scenario (eg, only for data with no severe artifact), the limitations of the resulting model must be clearly communicated to avoid using the model in the wrong context. In a deployment setting, such approaches may even be used to flag an image set or scan that may not be well represented and consequently has a high likelihood of not being reliably evaluated by an algorithm. In such cases, the expert radiologist making the final interpretation would be made aware of the potential pitfall, taking this into account for the final interpretation.
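      The following minimal Python sketch (assumed precomputed feature vectors rather than an actual pretrained network) flags potentially poorly represented samples as those with the lowest average cosine similarity to the rest of the dataset.

        # Minimal sketch (assumed features): flag poorly represented samples by low
        # average similarity. In practice the feature vectors would come from a
        # pretrained network applied to each image.
        import numpy as np
        from sklearn.metrics.pairwise import cosine_similarity

        rng = np.random.default_rng(0)
        features = rng.normal(size=(100, 512))   # one feature vector per image (assumed)
        features[:3] += 8.0                      # a few artificial outliers

        sim = cosine_similarity(features)
        np.fill_diagonal(sim, np.nan)
        mean_sim = np.nanmean(sim, axis=1)       # average similarity to all other samples

        threshold = np.percentile(mean_sim, 5)   # flag the least typical 5%
        outliers = np.where(mean_sim < threshold)[0]
        print("potential outliers:", outliers)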

      Implication of robust experimental design for regulatory approval and certification

      Although a comprehensive discussion of regulatory approval and certification is beyond the scope of this article, optimal algorithm development and evaluation is paramount to successful certification and deployment for patient care in the clinical setting. This article therefore concludes with a brief discussion of this topic. By learning from other industries, such as the pharmaceutical industry, and by establishing well-conceived guidelines and ensuring rigorous experimental design from the outset, there is an opportunity to accelerate the future translation of ML algorithms for clinical deployment and use for patient care. The adoption of robust industry-grade platforms that are auditable is also likely to facilitate future certification and clinical deployment.
      As interest in ML continues to grow, the widespread deployment of ML models in clinical settings is highly anticipated. In very selective and controlled settings, ML models in medical imaging have been shown to achieve performance superior to or comparable with that of human experts.
      • Miller D.D.
      • Brown E.W.
      Artificial intelligence in medical practice: the question to the answer?.
      In the future, ML models are expected to provide predictions for clinical outcomes of interest and assist clinicians in providing a diagnosis and treatment plan in a timely and accurate manner, enabling more precise and personalized patient therapy and management.
      • Jiang F.
      • Jiang Y.
      • Zhi H.
      • et al.
      Artificial intelligence in healthcare: past, present and future.
      Before deploying a model in a clinical setting, its performance needs to be thoroughly validated and its generalization error needs to be understood. Moreover, if a model is expected to be generalizable, it must have exposure to a variety of samples. Intuitively, data from different institutions may be considered as coming from different distributions because of many factors, including different scanner settings, differences in disease prevalence at a specific institution caused by population demographics, and the use of different protocols. Generalizability is achieved by carefully considering these factors and their implications, as discussed in this article, and incorporating them into the experimental design.
      To properly train generalizable models, large-scale multi-institutional datasets would also be beneficial. Having more data for ML models increases the confidence in predictions and allows robust validation and testing, providing better estimations on the generalization error.
      • Miller D.D.
      • Brown E.W.
      Artificial intelligence in medical practice: the question to the answer?.
      Acquiring large-scale datasets presents its own challenges. Current regulations and infrastructure limitations make data sharing between institutions a tedious and time-consuming process. Furthermore, medical image sets can be large in volume, ranging from several hundred megabytes to several gigabytes, which highlights the need for specialized infrastructure for developing large-scale multi-institutional datasets. Secure cloud platforms that facilitate distributed data access would be ideal for collaboration and for building such datasets. The implementation of scalable and streamlined platforms for data preparation and curation will be a key factor in facilitating the development of reliable ML algorithms in the future.
      Before deployment of a ML model in a specific institution, the model needs to be validated within that institution to verify that the model can meet the performance requirements using the local data. Models can also be fine-tuned to the local data to achieve better localized performance, as discussed earlier. However, any changes to a deployed algorithm or its performance need to be extensively examined. Alongside such local validation, ML models generally need to be validated over time as well. For example, as the prevalence of diseases changes, the deployed models might need to adapt as well. Although ML algorithms can learn from exposure or “experience,” the implementation of an actively changing or mutating algorithm for patient care in the clinical setting would be a very complex process that would require robust feedback loops and quality monitoring, ensuring reliable performance and stability, and is unlikely to be the model for implementation in the foreseeable future. Instead, the current model is to develop and evaluate an algorithm using large and varied datasets. The algorithm that is deployed will not be actively changing based on its use after deployment. As such, alternative mechanisms, such as quality monitoring and periodic updates by the vendors based on additional training and evaluation, could be a potential model for optimizing performance in the clinical setting.
      For the reasons mentioned earlier, algorithms are expected to adapt and will be required to be accessible, transparent, and auditable. Transparent platforms that allow for continuous evaluation of the performance of an ML model are required to approve the implementation of new ML models or software updates in clinical settings.
      • He J.
      • Baxter S.L.
      • Xu J.
      • et al.
      The practical implementation of artificial intelligence technologies in medicine.
      These platforms should be transparent and auditable such that clinicians can investigate any underlying biases in the datasets or models.
      • He J.
      • Baxter S.L.
      • Xu J.
      • et al.
      The practical implementation of artificial intelligence technologies in medicine.
      Explainability, to the extent feasible, will also facilitate deployment and adoption. Software pilot programs such as the Precertification Program outlined in the US Food and Drug Administration's Digital Health Innovation Action Plan (https://www.fda.gov/medical-devices/digital-health/digital-health-software-precertification-pre-cert-program) are models that will help the future development of a regulatory framework for streamlined and efficient regulatory oversight of applications developed by manufacturers with a demonstrated culture of quality and organizational excellence. This framework could represent a mechanism through which would-be trusted vendors could deploy artificial intelligence (AI)–based software in an efficient and streamlined manner, including deployment of software iterations and changes, under appropriate controls and oversight. In addition, current regulatory frameworks consider AI algorithms as software as a medical device, which is expected to be locked and not evolving.
      • Shah P.
      • Kendall F.
      • Khozin S.
      • et al.
      Artificial intelligence and machine learning in clinical development: a translational perspective.
      As experience and comfort with ML applications increases, new regulatory frameworks will have to be developed to allow model adaptations that enable optimal performance while ensuring reliability and patient safety.

      Summary

      With the surge in popularity of ML and deep learning solutions and increasing investments in such approaches, ML solutions that fail to generalize when applied to external data may gain public attention that could hinder the slow but steady adoption of ML in the health care domain. Following the best practices for the development and evaluation of ML models is a necessity for developing generalizable solutions that can be deployed in clinical settings. This requirement is even more important for deep learning models, which have high capacities and can easily overfit to the available data if a proper methodology for model evaluation is not followed.
Evaluating ML models in the health care domain is often challenging because large-scale datasets are difficult to build, whether because of limited resources or ethical constraints. Given the small datasets typically used for the development and evaluation of ML models, applications that do not follow a rigorous and sound evaluation procedure are prone to overfitting to the available data. Lack of familiarity with best practices for model evaluation leads to poor generalizability of published research. In addition, the unavailability of code and data makes evaluating and reproducing such models difficult.
Rigorous experimental design, transparent platforms for building and sharing multi-institutional datasets, and adherence to best practices for model evaluation will be crucial steps in developing generalizable solutions in the health care domain. Such platforms could also serve as a medium for reproducible research, which would increase the likelihood of successful deployment of ML models in the health care domain, with the potential to streamline health care processes, increase efficiency and quality, and improve patient care through precision medicine.

      References

      1. Beam AL, Manrai AK, Ghassemi M. Challenges to the reproducibility of machine learning models in health care. JAMA 2020;323:305-306.
      2. McDermott MB, Wang S, Marinsek N, et al. Reproducibility in machine learning for health. Paper presented at: 2019 Reproducibility in Machine Learning, RML@ICLR 2019 Workshop. New Orleans, May 6, 2019.
      3. Forghani R, Savadjiev P, Chatterjee A, et al. Radiomics and artificial intelligence for biomarker and prediction model development in oncology. Comput Struct Biotechnol J 2019;17:995.
      4. Domingos PM. A few useful things to know about machine learning. Commun ACM 2012;55:78-87.
      5. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. Vol. 1. New York: Springer Series in Statistics; 2001.
      6. Bertels J, Eelbode T, Berman M, et al. Optimizing the Dice score and Jaccard index for medical image segmentation: theory and practice. Paper presented at: International Conference on Medical Image Computing and Computer-Assisted Intervention. Shenzhen (China), October 13-17, 2019.
      7. Tharwat A. Classification assessment methods. Appl Comput Inform 2020. https://doi.org/10.1016/j.aci.2018.08.003.
      8. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge (MA): MIT Press; 2016.
      9. Quinlan JR. Bagging, boosting, and C4.5. Paper presented at: AAAI/IAAI, Vol. 1. Portland (Oregon), August 4-8, 1996.
      10. Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med 2016;375:1216.
      11. Erickson BJ, Korfiatis P, Akkus Z, et al. Machine learning for medical imaging. Radiographics 2017;37:505-515.
      12. Steyerberg EW, Bleeker SE, Moll HA, et al. Internal and external validation of predictive models: a simulation study of bias and precision in small samples. J Clin Epidemiol 2003;56:441-447.
      13. Steyerberg EW, Harrell FE. Prediction models need appropriate internal, internal–external, and external validation. J Clin Epidemiol 2016;69:245-247.
      14. Kann BH, Hicks DF, Payabvash S, et al. Multi-institutional validation of deep learning for pretreatment identification of extranodal extension in head and neck squamous cell carcinoma. J Clin Oncol 2020;38:1304-1311.
      15. Welch ML, McIntosh C, Traverso A, et al. External validation and transfer learning of convolutional neural networks for computed tomography dental artifact classification. Phys Med Biol 2020;65:035017.
      16. Datema FR, Ferrier MB, Vergouwe Y, et al. Update and external validation of a head and neck cancer prognostic model. Head Neck 2013;35:1232-1237.
      17. König IR, Malley J, Weimar C, et al. Practical experiences on the necessity of external validation. Stat Med 2007;26:5499-5511.
      18. Kocak B, Yardimci AH, Bektas CT, et al. Textural differences between renal cell carcinoma subtypes: machine learning-based quantitative computed tomography texture analysis with independent external validation. Eur J Radiol 2018;107:149-157.
      19. Guyon I. A scaling law for the validation-set training-set size ratio. Berkeley (CA): AT&T Bell Laboratories; 1997. p. 1-11.
      20. Forghani R, Chatterjee A, Reinhold C, et al. Head and neck squamous cell carcinoma: prediction of cervical lymph node metastasis by dual-energy CT texture analysis with machine learning. Eur Radiol 2019;29:6172-6181.
      21. Guyon I, Makhoul J, Schwartz R, et al. What size test set gives good error rate estimates? IEEE Trans Pattern Anal Mach Intell 1998;20:52-64.
      22. Hutter F, Kotthoff L, Vanschoren J. Automated machine learning: methods, systems, challenges. Berkeley (CA): Springer Nature; 2019.
      23. Russell S, Norvig P. Artificial intelligence: a modern approach. 3rd edition. Upper Saddle River (NJ): Pearson; 2009.
      24. Wong TT. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition 2015;48:2839-2846.
      25. Airola A, Pahikkala T, Waegeman W, et al. An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Comput Stat Data Anal 2011;55:1828-1844.
      26. Cohen JP, Hashir M, Brooks R, et al. On the limits of cross-domain generalization in automated X-ray prediction. arXiv preprint arXiv:2002.02497. 2020.
      27. Saha A, Harowicz MR, Mazurowski MA. Breast cancer MRI radiomics: an overview of algorithmic features and impact of inter-reader variability in annotating tumors. Med Phys 2018;45:3076-3085.
      28. Albarqouni S, Baur C, Achilles F, et al. AggNet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans Med Imaging 2016;35:1313-1321.
      29. McKenna MT, Wang S, Nguyen TB, et al. Strategies for improved interpretation of computer-aided detections for CT colonography utilizing distributed human intelligence. Med Image Anal 2012;16:1280-1292.
      30. Nguyen TB, Wang S, Anugu V, et al. Distributed human intelligence for colonic polyp classification in computer-aided detection for CT colonography. Radiology 2012;262:824-833.
      31. Greenspan H, Van Ginneken B, Summers RM. Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE Trans Med Imaging 2016;35:1153-1159.
      32. Buslaev A, Iglovikov VI, Khvedchenya E, et al. Albumentations: fast and flexible image augmentations. Information 2020;11:125.
      33. Cubuk ED, Zoph B, Mane D, et al. AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. 2018.
      34. Zhao H, Li H, Maurer-Stroh S, et al. Synthesizing retinal and neuronal images with generative adversarial nets. Med Image Anal 2018;49:14-26.
      35. Salehinejad H, Colak E, Dowdell T, et al. Synthesizing chest x-ray pathology for training deep convolutional neural networks. IEEE Trans Med Imaging 2018;38:1197-1206.
      36. Han C, Kitamura Y, Kudo A, et al. Synthesizing diverse lung nodules wherever massively: 3D multi-conditional GAN-based CT image augmentation for object detection. Paper presented at: 2019 International Conference on 3D Vision (3DV). Québec (Canada), September 16-19, 2019.
      37. Frid-Adar M, Diamant I, Klang E, et al. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 2018;321:321-331.
      38. Beers A, Brown J, Chang K, et al. High-resolution medical image synthesis using progressively grown generative adversarial networks. arXiv preprint arXiv:1805.03144. 2018.
      39. Storkey A. When training and test sets are different: characterizing learning transfer. In: Dataset shift in machine learning; 2009. p. 3-28.
      40. Philipsen RHHM, Maduskar P, Hogeweg L, et al. Localized energy-based normalization of medical images: application to chest radiography. IEEE Trans Med Imaging 2015;34:1965-1975.
      41. Zhang M, Leung KH, Ma Z, et al. A generalized approach to determine confident samples for deep neural networks on unseen data. In: Greenspan H, Tanno R, Erdt M, editors. Uncertainty for safe utilization of machine learning in medical imaging and clinical image-based procedures. Springer; 2019. p. 65-74.
      42. Salimans T, Goodfellow I, Zaremba W, et al. Improved techniques for training GANs. Paper presented at: Advances in Neural Information Processing Systems. Barcelona (Spain), December 5-10, 2016.
      43. Heusel M, Ramsauer H, Unterthiner T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Paper presented at: Advances in Neural Information Processing Systems. Long Beach (CA), December 4-9, 2017.
      44. Glocker B, Robinson R, Castro DC, et al. Machine learning with multi-site imaging data: an empirical study on the impact of scanner effects. arXiv preprint arXiv:1910.04597. 2019.
      45. Miller DD, Brown EW. Artificial intelligence in medical practice: the question to the answer? Am J Med 2018;131:129-133.
      46. Jiang F, Jiang Y, Zhi H, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol 2017;2:230-243.
      47. He J, Baxter SL, Xu J, et al. The practical implementation of artificial intelligence technologies in medicine. Nat Med 2019;25:30.
      48. Shah P, Kendall F, Khozin S, et al. Artificial intelligence and machine learning in clinical development: a translational perspective. NPJ Digit Med 2019;2:1-5.