Machine Learning Techniques for the Diagnosis of Attention-Deficit/Hyperactivity Disorder from Magnetic Resonance Imaging: A Concise Review
Source of Support: None, Conflict of Interest: None. DOI: 10.4103/0028-3886.333520
Keywords: Attention-deficit/hyperactivity disorder, functional brain connectivity networks, machine learning, magnetic resonance imaging, morphometry
Attention-deficit/hyperactivity disorder (ADHD) is a neurodevelopmental disorder and one of the most prevalent psychological abnormalities seen in children. According to epidemiological statistics from the United States Centers for Disease Control and Prevention, approximately 9.4% of children aged 2–17 years exhibit symptoms of ADHD. Epidemiological studies among primary school children in India report an even higher prevalence of 11.32%.
ADHD is characterized by deficits in attention and by impulsivity, both motor and non-motor; moreover, it is strongly associated with morbidity and disability. The diagnosis of ADHD remains essentially clinical, based on history and examination. It can be supported by neuropsychological assessments; however, because patients with ADHD present varied cognitive profiles, these assessments play an assistive rather than a definitive role. The precise etiology and pathogenesis of ADHD remain unclear. ADHD can be either hereditary or acquired. People with ADHD usually exhibit hyperactive and impulsive behavior, and children with ADHD may have difficulty engaging and sustaining attention. ADHD adversely affects studies, discipline, and social interaction. Hence, early intervention is necessary to ensure the well-being and future of affected children. According to the American Psychiatric Association, ADHD can be categorized into three subtypes based on the number, type, and severity of symptoms: primarily hyperactive (ADHD-H), primarily inattentive (ADHD-I), and combined (ADHD-C). Partial maturation or development of the fronto-basal regions of both frontal lobes and of the superior and middle temporal gyri is an indication of ADHD. At the brain-circuitry level, the functionality of cortico-limbic areas and the Superior Longitudinal Fasciculus (SLF) is impaired in ADHD. No individual test or highly specific method is available for the diagnosis of ADHD; instead, diagnosis proceeds through extensive interviews, behavioral studies, third-party observations, and a comprehensive personal history. ADHD is usually diagnosed from behavioral information based on the guidelines in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), published by the American Psychiatric Association.
For children, six or more of the symptoms of impulsivity, inattention, and hyperactivity listed in the DSM-5 diagnostic criteria for ADHD should have persisted for at least six months, to a degree that is inconsistent with the developmental stage and that adversely impacts occupational/academic and social affairs. Diagnosis of ADHD via subjective assessment of behavioral data may produce inconsistent results. It is well established that ADHD causes regional atrophy in the brain and alters the pattern of functional brain connectivity networks. Hence, automated/computerized methods based on Magnetic Resonance Imaging (MRI) can replace subjective methods that diagnose ADHD from behavioral data. Sridhar et al. recently reviewed objective methods for the diagnosis of ADHD broadly; however, computerized methods for the diagnosis of ADHD from MRI have not yet been studied comprehensively. Moreover, the current methods of diagnosis are not only time-consuming but also susceptible to human decision-making bias. Thus, integrating advanced, time-efficient computerized methods will assist human decision-making and enhance the success rate of diagnosis.
This paper is a review of machine learning methods for the diagnosis of ADHD from MRI, covering the period 2013–2019. The inferences drawn from this review are helpful for the design and development of machine learning frameworks for the automated diagnosis of ADHD, and can act as roadmaps for the selection of the feature extraction, feature selection, and classification techniques employed in such frameworks. Machine learning frameworks applied to MRI for the diagnosis of ADHD are discussed in section 2 of this paper, which focuses on the techniques for feature extraction, dimensionality reduction, feature selection, and classification employed in computerized methods for the diagnosis of ADHD from MRI. Computerized methods for the diagnosis of ADHD are compared in terms of diagnostic accuracy in section 3. Further, the pros and cons of the feature extraction, feature selection, and classification techniques employed in the frameworks are carefully analyzed.
Most of the computerized methods for the objective diagnosis of ADHD from MRI make use of machine learning techniques. Ghiassian et al. used a Support Vector Machine (SVM) classifier with a radial basis function (RBF) kernel to distinguish ADHD from controls. Histogram of Oriented Gradients (HOG) features of structural MRI and behavioral data were the feature inputs to the SVM. HOG is a textural descriptor used for the detection of objects of interest in images; it counts the number of occurrences of gradient orientations in local regions of the image. HOG is similar to descriptors such as shape contexts, edge orientation histograms, and the scale-invariant feature transform; it is estimated from a dense grid of equally sized blocks and utilizes overlapping local contrast normalization to accomplish enhanced accuracy. SVM is a discriminative classifier formally defined by a separating hyperplane. In a two-dimensional feature space, the hyperplane is a line that divides the plane into two parts, with the samples of each class lying on either side. The SVM algorithm outputs an optimal hyperplane computed from labeled training data via supervised learning, which can then categorize samples from the test data.
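As an illustrative sketch only, not the pipeline of Ghiassian et al., the snippet below trains an RBF-kernel SVM on synthetic vectors that stand in for HOG descriptors; in practice, the descriptors would be computed from structural MRI with a HOG implementation, and the group labels and mean shift here are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for 36-D HOG descriptor vectors; two groups with a
# modest mean shift play the roles of controls and ADHD subjects.
X_controls = rng.normal(0.0, 1.0, size=(80, 36))
X_adhd = rng.normal(0.6, 1.0, size=(80, 36))
X = np.vstack([X_controls, X_adhd])
y = np.array([0] * 80 + [1] * 80)

# RBF-kernel SVM: the kernel implicitly lifts the features so the
# separating hyperplane can be non-linear in the original space.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(svm, X, y, cv=5)  # 5-fold cross-validated accuracy
```

Cross-validation is used here because, with a separating hyperplane fit to training data, only held-out accuracy is meaningful.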
Structural change, especially regional atrophy in brain regions, is one of the indications of ADHD. Brain morphometry refers to the quantitative measurement of brain structures and of volumetric changes resulting from disease, aging, and development. Tensor-based morphometry (TBM), voxel-based morphometry (VBM), and deformation-based morphometry (DBM) are three popular techniques used to measure structural changes in brain MRI. Gehricke et al. used concentrations of white matter (WM) and grey matter (GM), mean diffusivity (MD), fractional anisotropy (FA), and radial diffusivity (RD), together with the outcome of TBM, as feature inputs to a linear discriminant analysis (LDA) classifier, after dimensionality reduction with principal component analysis (PCA). LDA is a generalization of Fisher's linear discriminant. It is used in statistics to identify the linear combination of features that can segregate a set of samples into two or more classes; the resulting linear combination can be employed for dimensionality reduction or as a linear classifier. PCA is an analytical technique that employs an orthogonal transformation to map a larger set of observations of possibly correlated variables into a smaller set of values of linearly uncorrelated variables termed principal components.
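A minimal sketch of this PCA-then-LDA stage, on synthetic data rather than the morphometric features of Gehricke et al.: PCA first compresses a wide feature vector into a few uncorrelated components, and LDA then separates the two groups in the reduced space. The sample counts and the simulated group difference are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_subjects, n_features = 60, 200  # few subjects, many features: typical neuroimaging regime
X = rng.normal(size=(n_subjects, n_features))
y = rng.integers(0, 2, size=n_subjects)
X[y == 1, :5] += 2.0  # a group difference confined to a handful of features

# PCA reduces 200 measurements to 10 orthogonal components, then LDA
# finds the linear combination that best separates the two groups.
clf = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
clf.fit(X, y)
train_acc = clf.score(X, y)
```

Because PCA is unsupervised, the pipeline relies on the group difference contributing enough variance to survive the projection, which is exactly the limitation noted later in this review.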
Like morphometric features, features of functional brain connectivity networks are also extensively used in the computerized analysis of functional MRI (fMRI). Every cognitive activity involves a network or group of different brain regions rather than a single isolated region. Brain regions simultaneously involved in a particular cognitive activity are said to be 'functionally connected.' The term 'large-scale brain networks' refers to a group of brain regions that appear functionally connected or inter-linked during cognitive activities. Such groups are usually mapped via statistical analysis of the fMRI Blood-Oxygen-Level-Dependent (BOLD) signal, the magnetoencephalogram (MEG), or the electroencephalogram (EEG); the functional connectivity of brain regions may be inferred from the synchronization of these signals. Statistical tools like spatial independent component analysis are helpful for identifying synchronized brain regions from such signals. The pattern of large-scale networks changes with cognitive function. For example, cerebral networks identified with the help of fMRI in healthy controls are shown in [Figure 1] and [Figure 2]; it is to be noted that both figures were taken at CHU Sart Tilman Hospital, Liège, Belgium. Deshpande et al. used a Fully Connected Cascade (FCC) Artificial Neural Network (ANN), with features of functional connectivity networks as input, to categorize healthy subjects and ADHD.
Zou et al. used Voxel-Mirrored Homotopic Connectivity (VMHC), fractional Amplitude of Low-Frequency Fluctuations (fALFF), and Regional Homogeneity (ReHo) features extracted from fMRI, together with VBM features such as the GM, WM, and Cerebro-Spinal Fluid (CSF) probabilities of each voxel extracted from structural MRI, as input to a three-dimensional Convolutional Neural Network (3D CNN) classifier, to differentiate ADHD from controls. A CNN is a class of deep, feed-forward neural network architecture.
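The core CNN operation, a shared set of kernel weights sliding over every spatial position followed by a non-linearity, can be illustrated in plain NumPy on a toy edge image (not MRI data); the Sobel kernel below is an assumption chosen only to make the edge response visible.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution: the SAME kernel weights are applied at every
    spatial position (weight sharing, the core idea of a CNN layer)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.zeros((8, 8))
img[:, 4] = 1.0  # a vertical edge in an 8x8 "image"
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
# convolution followed by ReLU: one (untrained) CNN layer
fmap = np.maximum(conv2d(img, sobel_x), 0.0)
```

In a real 3D CNN the kernels are three-dimensional, learned from data, and stacked in many layers, but each layer is still this convolve-then-activate pattern.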
Peng et al. used morphometric features of the cerebral cortex extracted from structural MRI, such as cortical thickness, as feature inputs to an Extreme Learning Machine (ELM) classifier. ELM is a feed-forward neural network with one or more layers of hidden nodes. A peculiarity of ELM is that the parameters of the hidden nodes are usually assigned randomly and are not updated during learning; generally, only the output weights of the hidden nodes are computed, in a single step. This is equivalent to learning a linear model.
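A minimal NumPy sketch of the ELM idea, on synthetic blobs rather than cortical-thickness features: the hidden layer is random and fixed, and only the output weights are obtained, in one least-squares step.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=50):
    """Extreme Learning Machine: random fixed hidden layer,
    output weights solved in a single least-squares step."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never updated)
    b = rng.normal(size=n_hidden)                # random biases (never updated)
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None) # one-step linear solve for output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# toy binary problem: two well-separated Gaussian blobs
X0 = rng.normal(-1.0, 0.5, size=(100, 4))
X1 = rng.normal(+1.0, 0.5, size=(100, 4))
X = np.vstack([X0, X1])
y = np.array([0.0] * 100 + [1.0] * 100)

W, b, beta = elm_fit(X, y)
pred = (elm_predict(X, W, b, beta) > 0.5).astype(float)
acc = (pred == y).mean()
```

The single `lstsq` call is what gives ELM its fast training; it is also why the result depends on the random draw of `W` and `b`, a drawback discussed later in this review.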
Due et al. used the gSpan algorithm to identify discriminative functional brain networks and applied graph-kernel PCA to acquire nonlinear features from these discriminative sub-networks. These nonlinear features were fed as input to an SVM classifier. The method proposed by Riaz et al. is similar to that proposed by Due et al. In the primary stage of this framework, functional connectivity networks were mapped on the fMRI image with the help of Affinity Propagation (AP) clustering. An Elastic Net (EN) was employed to select the most discriminant features of the functional brain networks, and these features were combined with behavioral data. This composite feature set was used as input to an SVM. Affinity Propagation, used in this framework, is a clustering algorithm that works on the principle of 'message passing' among data points. Unlike clustering techniques such as k-medoids and k-means, the AP algorithm does not require initialization of the number of clusters. Like k-medoids, affinity propagation identifies 'exemplars', members of the input set which are representatives of clusters. The Elastic Net is a combination of the Least Absolute Shrinkage and Selection Operator (LASSO) and ridge regression; it is a regularized regression whose penalty is a linear combination of the L1 and L2 penalties of LASSO and ridge regression.
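The no-preset-cluster-count property of AP can be demonstrated on synthetic point clouds standing in for region-wise signals; the `preference` value below is an assumption that steers how many exemplars emerge (lower values yield fewer clusters).

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# three synthetic groups standing in for functionally connected networks
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=300, centers=centers,
                  cluster_std=0.5, random_state=0)

# AP exchanges "responsibility" and "availability" messages between points;
# no cluster count is specified up front.
ap = AffinityPropagation(preference=-50, random_state=0).fit(X)
n_clusters = len(ap.cluster_centers_indices_)  # exemplars found by AP
```

Note that every point exchanges messages with every other point, which is the source of the quadratic time complexity discussed later.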
Sen et al. used textural and functional connectivity features extracted from 3D structural MRI and 4D resting-state fMRI as feature inputs to an SVM classifier. Qureshi et al. used features of an anatomical atlas and of cortical parcellations, extracted from functional connectivity maps over the cerebral cortex generated from resting-state fMRI and structural MRI respectively, as inputs to an ELM classifier, after screening with hierarchical sparse feature elimination. Hierarchical sparse feature elimination is a combination of Recursive Feature Elimination (RFE) and LASSO. LASSO is a regression-analysis method that performs both regularization and variable selection so as to maximize the prediction accuracy and the interpretability of the resulting model.
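The LASSO half of such a screening step can be sketched on synthetic data (the penalty value and feature counts are assumptions): the L1 penalty drives the coefficients of irrelevant features exactly to zero, so the surviving indices act as the selected subset.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 100, 50
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [2.0, -1.5, 1.0]  # only the first three features matter
y = X @ true_coef + rng.normal(0, 0.1, n_samples)

# L1 regularization zeroes out irrelevant coefficients; the nonzero
# indices constitute the selected feature subset.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
```

In a hierarchical scheme, a selection like this would then be refined further, for example by RFE, before the features reach the classifier.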
Tan et al. used regional volumes computed from fMRI, termed 'functional volumes', together with behavioral data, as feature inputs to an SVM classifier, after dimensionality reduction with the RFE algorithm, for discriminating ADHD from controls. Based on observations of the outcome of RFE, it was suggested that regional functional volumes from the occipital, posterior cerebellar, parietal, frontal, and temporal lobes are discriminative. RFE is a feature selection approach that recursively removes statistically insignificant features and builds a model with the remaining features, using the accuracy of the model as a target function.
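RFE's recursive-elimination loop can be sketched with a linear-kernel SVM as the ranking model; the synthetic features and the two user-set parameters below (features to retain, fraction eliminated per iteration) are assumptions, mirroring the parameters the review discusses later.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features = 120, 30
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)
X[y == 1, :4] += 1.5  # only the first four features carry group information

# A linear kernel exposes per-feature weights, which RFE uses to drop
# the weakest 20% of the remaining features on every iteration.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=4, step=0.2).fit(X, y)
kept = np.flatnonzero(rfe.support_)  # indices of the retained features
```
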
Almost all frameworks meant for the computer-aided diagnosis of ADHD contain intermediate steps such as feature extraction, selection of determinant (statistically significant) features or dimensionality reduction of feature vectors (elimination of redundant features), and classification. An intermediate step for feature selection or dimensionality reduction is absent in a few of the frameworks; instead, the features are used directly as input to the classifier, without regard to their statistical significance. However, the blind use of features, without screening out statistically insignificant or redundant ones, increases the computational burden of the framework. Segregation of ADHD and controls is a binary classification problem. In a binary classification problem, the extent to which the feature values of the two classes differ from each other can be objectively assessed with the help of well-established statistical tests like the Kolmogorov-Smirnov test. However, the separability of the features was not studied in any of the frameworks available in the literature.
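The separability check argued for here could, for instance, apply the two-sample Kolmogorov-Smirnov test to each candidate feature, comparing its empirical distributions in the two groups; the feature values below are synthetic stand-ins.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# hypothetical values of two candidate features in controls vs. ADHD
controls = rng.normal(0.0, 1.0, size=200)
adhd_separable = rng.normal(1.5, 1.0, size=200)  # clearly shifted feature
adhd_overlap = rng.normal(0.1, 1.0, size=200)    # heavily overlapping feature

# The KS statistic is the maximum gap between the two empirical CDFs;
# a small p-value indicates the two class distributions differ.
stat_sep, p_sep = ks_2samp(controls, adhd_separable)
stat_ovl, p_ovl = ks_2samp(controls, adhd_overlap)
```

Features whose p-value fails a chosen significance threshold could then be discarded before classification, cutting the computational burden noted above.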
A concise summary of the imaging modalities used and the techniques for feature extraction, feature selection/dimensionality reduction, and classification employed in the frameworks for the computerized diagnosis of ADHD is furnished in [Table 1]. From [Table 1], it can be noted that ELM, with features of the functional connectivity networks of the brain selected with the help of hierarchical sparse feature elimination as input, exhibits the highest accuracy (92.85%). The accuracy exhibited by SVM, with features of the functional connectivity networks after dimensionality reduction with PCA as input, is on par with this (92%). ELM performs well for morphometric feature inputs as well. An ANN with a Fully Connected Cascade architecture, with features of the functional connectivity networks of the brain as input, also performs appreciably. Augmentation with behavioral features does not contribute to improved prediction. The level of accuracy offered by the frameworks for the computerized diagnosis of ADHD available in the literature does not yet justify their application prospects and feasibility in clinical practice.
The textural feature descriptor HOG is estimated from a compact grid of equally sized blocks and utilizes overlapping local contrast normalization to accomplish enhanced accuracy. This has the advantage that at larger scales the HOG features provide more global information, while at smaller scales they provide more fine-grained detail. The disadvantage is that as the final descriptor vector grows larger, the time taken to extract features and to train the classifier also increases. HOG is sensitive to skew and rotation. Significant advantages of the most popular classifier, SVM, are that it is free from the over-fitting problem and does not get stuck in local minima. The most significant disadvantage of SVM is that selecting hyperparameter values that allow adequate generalization performance is nontrivial. Similarly, choosing the appropriate kernel function is also tricky. SVM takes a long time to train on large data sets.
For a low sample size and a higher number of features, LDA is one of the most preferred classifiers. It is expected to work well if the conditional class densities are approximately Gaussian. LDA does not work well if the number of samples in the various classes is unbalanced. LDA is prone to over-fitting, and validation of LDA models is nontrivial. LDA also performs poorly on non-linear problems. The dimensionality reduction technique PCA is an orthogonal transformation that is optimal in retaining the sub-space with the largest variance. PCA is an unsupervised method, because no prior information about the classes is required. One of the demerits of PCA is that the principal components are usually linear combinations of all input features, regardless of whether those features are statistically significant or not. PCA reduces the dimensionality of feature vectors but does not exclude statistically insignificant features.
The most popular deep learning architecture, CNN, is a shared-weights multilayer perceptron with translation-invariance characteristics, particularly meant for reducing the need for pre-processing. Hyperparameter tuning in CNN is nontrivial, and CNN needs a large dataset for training. The scale of the network's weights and weight updates critically influences the performance of a CNN when the features are heterogeneous; in such contexts, the input needs to be standardized, as the weights and updates will otherwise be on different scales.
ELM has the advantages of fast learning speed and excellent generalization performance. Since the first layer is fixed in ELM, the number of parameters that need to be trained is comparatively small. Since ELM constitutes a linear classifier, its parameters can be regularized easily. Random initialization of the weights in the first layer may not be useful if the number of labeled samples is large and the function to be learned is complex. Random initialization causes uncertainty between the high confidence of the ELM estimator and small generalization or approximation errors, both in learning and in approximation; consequently, it is difficult to tell whether a single trial of ELM is effective or not. An inappropriate choice of activation function in ELM degrades its generalization performance.
Affinity Propagation (AP) clustering, used to identify functional connectivity among brain regions, has comparatively good accuracy and efficiency, and initialization of the number of clusters is not necessary. However, while the time complexity of Lloyd's k-means algorithm is proportional to the number of samples, the time complexity of AP is proportional to the square of the number of samples; hence, AP is not suitable for large-scale data clustering. Like LASSO regularization, the Elastic Net also yields sparse solutions. EN performs comparatively well when the features are highly correlated. The flexibility of the estimator is one advantage of EN worth mentioning; however, increased flexibility brings the risk of over-fitting.
Compared to conventional subset selection techniques, LASSO has two major merits. The procedure adopted in LASSO for feature selection is based on continuous trajectories of the regression coefficients as functions of the penalty level; consequently, LASSO offers more stability than subset selection. It also exhibits appreciable computational feasibility on multi-dimensional feature sets. However, ridge regression has comparatively better predictive power than LASSO. In the case of two highly correlated features, LASSO selects one of them arbitrarily. Ridge regression is not directly applicable to feature selection, and the model interpretability it offers is low. The Elastic Net, a combination of the LASSO and ridge regressions, is more advantageous for feature selection than either algorithm individually. Another feature selection method, the RFE algorithm, also has serious limitations. Two parameters in RFE need to be specified by the user: the number of features to be retained in the final model and the percentage of features to be eliminated in each iteration. Improper selection of these parameters may adversely affect the reliability of the features selected by RFE.
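The contrast between LASSO and the Elastic Net on correlated features can be made concrete: with two identical columns, LASSO concentrates the weight on one of them, while the L2 term of the Elastic Net spreads it across both. The data and penalty values below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# columns 0 and 1 are identical (perfectly correlated); column 2 is independent
X = np.column_stack([x, x, rng.normal(size=n)])
y = 3.0 * x + 0.5 * X[:, 2] + rng.normal(0, 0.1, n)

lasso = Lasso(alpha=0.1).fit(X, y)                    # pure L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mixed L1/L2 penalty
# lasso.coef_: essentially all weight on one of the twin columns
# enet.coef_: weight shared between the twin columns
```

This spreading behavior is why the Elastic Net is the safer choice when, as with neighboring brain regions, features are expected to be highly correlated.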
Deep learning-based techniques for the diagnosis of ADHD from MRI are computationally heavy and need a considerably large dataset for training, which is often not available. Brain iron concentration, especially in areas like the putamen, globus pallidus, thalamus, and caudate nucleus, is a robust non-invasive biomarker of ADHD. It can be easily computed from the Magnetic Field Correlation (MFC) and Magnetic Resonance (MR) relaxation rates. However, such highly specific biomarkers have not yet been incorporated into the computerized methods for the diagnosis of ADHD.
Machine learning frameworks used for the diagnosis of ADHD from Magnetic Resonance Images were reviewed in this paper. Almost all machine learning frameworks meant for the computerized diagnosis of ADHD comprise intermediate steps such as feature extraction, selection of determinant features or dimensionality reduction of feature vectors, and classification. The classifiers used in these frameworks are Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Fully Connected Cascade (FCC) Artificial Neural Network (ANN), Convolutional Neural Networks (CNN), Extreme Learning Machine (ELM), etc. The feature selection and dimensionality reduction techniques involved are Elastic Nets (EN), hierarchical sparse feature elimination, Recursive Feature Elimination (RFE), and Principal Component Analysis (PCA). The input features were mainly behavioral data, textural features of structural MRI (sMRI), morphometric features computed from sMRI, and functional brain connectivity features extracted from functional MRI (fMRI).
Machine learning frameworks for the diagnosis of ADHD available in the literature were compared in terms of diagnostic accuracy. ELM, with features of the functional brain connectivity networks selected with the help of hierarchical sparse feature elimination as input, exhibited the highest accuracy (92.85%). The accuracy exhibited by SVM, with features of the functional brain connectivity networks after dimensionality reduction with PCA as input, was on par with this (92%). Augmentation with behavioral features did not contribute much to improving accuracy. The level of accuracy offered by the frameworks for the computer-aided diagnosis of ADHD available in the literature does not yet justify their feasibility in clinical practice. Computerized methods using highly specific biomarkers of ADHD, such as brain iron concentration in the globus pallidus, putamen, caudate nucleus, and thalamus, as features were not available.
Financial support and sponsorship
Department of Instrumentation and Control, National Institute of Technology, Tiruchirappalli.
Conflicts of interest
There are no conflicts of interest.