Fig. 8. FROC curves averaged over three-fold CV for the abdominal (left) and mediastinal (right) lymph nodes, using different CNN models.
TABLE II
COMPARISON OF MEDIASTINAL AND ABDOMINAL LN DETECTION RESULTS USING VARIOUS CNN MODELS. NUMBERS IN BOLD INDICATE THE BEST PERFORMANCE VALUES ON CLASSIFICATION ACCURACY.

Region              Mediastinum         Abdomen
Method              AUC     TPR/3FP     AUC     TPR/3FP
[41]                -       0.63        -       0.70
[22]                0.92    0.70        0.94    0.83
[36]                -       0.78        -       0.78
CifarNet            0.91    0.70        0.81    0.44
AlexNet-ImNet       0.89    0.63        0.80    0.41
AlexNet-RI-H        0.94    0.79        0.92    0.67
AlexNet-TL-H        0.94    0.81        0.92    0.69
GoogLeNet-RI-H      0.85    0.61        0.80    0.48
GoogLeNet-TL-H      0.94    0.81        0.92    0.70
AlexNet-RI-L        0.94    0.77        0.88    0.61
GoogLeNet-RI-L      0.95    0.85        0.91    0.69
Combined            0.95    0.85        0.93    0.70
accommodate AlexNet-RI-L, AlexNet-TL-H, GoogLeNet-RI-H and GoogLeNet-TL-H. For the modified AlexNet-RI-L at 64 × 64 pixel resolution, we reduce the number of first-layer convolution filters from 96 to 64 and the stride from 4 to 2. For the modified GoogLeNet-RI (64 × 64), we decrease the number of first-layer convolution filters from 64 to 32, the pad size from 3 to 2, the kernel size from 7 to 5, the stride from 2 to 1, and the stride of the subsequent pooling layer from 2 to 1. We slightly reduce the number of convolutional filters to accommodate the smaller input image sizes of the target medical image datasets [22], [37], while preventing over-fitting; this ultimately improves patch-based classification performance. CifarNet is used in [22] to detect LN samples from 32 × 32 × 3 images. For consistency, we down-sample our 64 × 64 × 3 LN sample images to 32 × 32 × 3.
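The down-sampling step itself is straightforward; a minimal sketch, assuming the patches are handled as PIL images (the function name is ours):

```python
from PIL import Image

def downsample_for_cifarnet(patch: Image.Image) -> Image.Image:
    """Down-sample a 64x64 RGB LN sample image to CifarNet's 32x32 input size."""
    return patch.resize((32, 32), Image.BILINEAR)

# patch = Image.open('ln_sample.png')   # hypothetical 64x64x3 LN sample image
# small = downsample_for_cifarnet(patch)
```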
Results for lymph node detection in the mediastinum and
abdomen are reported in Table II. FROC curves are illustrated
in Figure 8. The area-under-the-FROC-curve (AUC) and true
positive rate (TPR, recall or sensitivity) at three false positives
per patient (TPR/3FP) are used as performance metrics. Of
the nine investigated CNN models, CifarNet, AlexNet-ImNet
and GoogLeNet-RI-H generally yielded the least competitive
detection accuracy results. Our LN datasets are significantly more complex (i.e., they display much larger within-class appearance variations), especially due to the extracted fields-of-view (FOVs) of 35mm-128mm, compared to 30mm-45mm in [22], where CifarNet is also employed. In this experiment,
CifarNet is under-trained with respect to our enhanced LN
datasets, due to its limited input resolution and parameter com-
plexity. The inferior performance of AlexNet-ImNet implies
that using the pre-trained ImageNet CNNs alone as “off-the-
shelf” deep image feature extractors may not be optimal or
adequate for mediastinal and abdominal LN detection tasks.
To complement “off-the-shelf” CNN features, [10], [9], [12]
all add and integrate various other hand-crafted image features
as hybrid inputs for the final CADe classification.
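As a concrete illustration of the TPR/3FP operating point used throughout this section, the following is a minimal sketch (per-candidate scores and binary labels grouped by patient are assumed; all function names are ours):

```python
import numpy as np

def froc_point(scores_by_patient, labels_by_patient, threshold):
    """One FROC operating point: (mean FPs per patient, TPR) at a score threshold."""
    fps, tp, pos = [], 0, 0
    for scores, labels in zip(scores_by_patient, labels_by_patient):
        pred = scores >= threshold
        fps.append(np.sum(pred & (labels == 0)))
        tp += np.sum(pred & (labels == 1))
        pos += np.sum(labels == 1)
    return np.mean(fps), tp / pos

def tpr_at_3fp(scores_by_patient, labels_by_patient, target_fp=3.0):
    """Highest TPR attainable while averaging <= target_fp false positives/patient."""
    best = 0.0
    for t in np.unique(np.concatenate(scores_by_patient)):
        fp_rate, tpr = froc_point(scores_by_patient, labels_by_patient, t)
        if fp_rate <= target_fp:
            best = max(best, tpr)
    return best
```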
GoogLeNet-RI-H performs poorly, as it is susceptible to over-fitting: there are not enough data samples to train GoogLeNet-RI-H from random initialization. Indeed, given GoogLeNet-RI-H's complexity and 22-layer depth, datasets of millions of images may be required to properly train
this model. However, GoogLeNet-TL-H significantly improves
upon GoogLeNet-RI-H (0.81 versus 0.61 TPR/3FP in me-
diastinum; 0.70 versus 0.48 TPR/3FP in abdomen). This
indicates that transfer learning offers a much better initial-
ization of CNN parameters than random initialization. Like-
wise, AlexNet-TL-H consistently outperforms AlexNet-RI-H,
though by smaller margins (0.81 versus 0.79 TPR/3FP in
mediastinum; 0.69 versus 0.67 TPR/3FP in abdomen). This
is also consistent with the findings reported for ILD detection
in Table III and Figure 11.
GoogLeNet-TL-H yields results similar to AlexNet-TL-H's for mediastinal LN detection, and slightly outperforms AlexNet-TL-H for abdominal LN detection. AlexNet-RI-H exhibits less severe over-fitting than GoogLeNet-RI-H. We also evaluate a simple ensemble that averages the probability scores from five CNNs: AlexNet-RI-H, AlexNet-TL-H, AlexNet-RI-L, GoogLeNet-TL-H and GoogLeNet-RI-L. This combined ensemble yields classification accuracies matching or slightly exceeding those of the best-performing individual CNN models on the mediastinal and abdominal LN detection tasks, respectively.
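A minimal sketch of this score-averaging ensemble (the per-model probability arrays are hypothetical placeholders):

```python
import numpy as np

def average_ensemble(per_model_probs):
    """Average per-candidate probability scores over an ensemble of CNNs."""
    return np.mean(np.stack(per_model_probs, axis=0), axis=0)

# combined = average_ensemble([p_alexnet_ri_h, p_alexnet_tl_h, p_alexnet_ri_l,
#                              p_googlenet_tl_h, p_googlenet_ri_l])
```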
Many of our CNN models achieve notably better results (FROC AUC and TPR/3FP) than the previous state of the art for mediastinal LN detection [36]: GoogLeNet-RI-L obtains an AUC of 0.95 and 0.85 TPR/3FP, versus an AUC of 0.92 with 0.70 TPR/3FP in [22], and 0.78 TPR/3FP in [36], which uses stacked shallow learning. This difference stems from the fact that [36] requires annotated lymph node segmentation masks to learn a mid-level semantic boundary detector, whereas CNN approaches only need LN locations for training [22].
In abdominal LN detection, [22] obtains the best trade-off between its CNN model complexity and sampled data configuration. Our best performing CNN model there is GoogLeNet-TL-H (256 × 256), which obtains an AUC of 0.92 and 0.70 TPR/3FP.
The main difference between our dataset preparation pro-
tocol and that from [22] is a more aggressive extraction of
random views within a much larger range of FOVs. The
usage of larger FOVs to capture more image spatial context is
inspired by deep zoom-out features [44] that improve semantic
segmentation. This image sampling scheme contributes to our
best reported performance results in both mediastinal LN
detection (in this paper) and automated pancreas segmentation
[45]. As shown in Figure 1, abdominal LNs are surrounded by many other similar-looking objects, whereas mediastinal LNs are more easily distinguishable, due to the images' larger spatial contexts. Finally, from the perspective of the
data-model trade-off: “Do We Need More Training Data or
Better Models?” [51], more abdomen CT scans from distinct
patient populations need to be acquired and annotated, in
order to take full advantage of deep CNN models of high
capacity. Nevertheless, deeper and wider CNN models (e.g., GoogLeNet-RI-L and GoogLeNet-TL-H versus CifarNet [22]) have shown improved results in mediastinal LN detection.
Figure 9 provides examples of misclassified lymph nodes (in axial view), both false negatives (left) and false positives (right), from the abdominal and mediastinal datasets. The overall reported LN detection results are clinically significant, as indicated in [63].
B. Interstitial Lung Disease Classification
The CNN models evaluated in this experiment are: 1) AlexNet-RI (trained from scratch on the ILD dataset with random initialization); 2) AlexNet-TL (with transfer learning from [4]); 3) AlexNet-ImNet: the pre-trained ImageNet CNN model [4] with only the last cost-function layer retrained from random initialization, according to the six ILD classes (similar to [9], but without using additional hand-crafted non-deep feature descriptors, such as GIST and BoVW); 4) GoogLeNet-RI (random initialization); and 5) GoogLeNet-TL (GoogLeNet with transfer learning from [33]).
Fig. 9. Examples of misclassified lymph nodes (in axial view) of both false
negatives (Left) and false positives (Right). Mediastinal LN examples are
shown in the upper row, and abdominal LN examples in the bottom row.
TABLE IV
COMPARISON OF INTERSTITIAL LUNG DISEASE CLASSIFICATION RESULTS USING F-SCORES: NM, EM, GG, FB, MN AND CD.

                   NM     EM     GG     FB     MN     CD
Patch-LOO [38]     0.84   0.75   0.78   0.84   0.86   -
Patch-LOO [39]     0.88   0.77   0.80   0.87   0.89   -
Patch-CV10 [54]    0.84   0.55   0.72   0.76   0.91   -
Patch-CV5          0.64   0.81   0.74   0.78   0.82   0.64
Slice-Test [40]    0.40   1.00   0.75   0.80   0.56   0.50
Slice-CV5          0.22   0.35   0.56   0.75   0.71   0.16
Slice-Random       0.90   0.86   0.85   0.94   0.98   0.83
All ILD images (patches of 64 × 64 and CT axial slices of 512 × 512) are re-sampled to a fixed dimension of 256 × 256 pixels.
We evaluate the ILD classification task with five-fold CV using a patient-level split, as this is more indicative of real clinical performance than LOO. The classification accuracy rates for interstitial lung disease detection are shown in Table III. Two sub-tasks, ILD patch classification and slice classification, are conducted. In general, patch-level ILD classification is less challenging than slice-level classification, as far more data samples can be drawn from the manually annotated ROIs (up to 100 image patches per ROI) available from [37]. From Table III, all five evaluated deep models obtain comparable results, with classification accuracy rates within the range [0.74, 0.76]. Their averaged model achieves a slightly better accuracy of 0.79. F1-scores [38], [39], [54] and the confusion matrix (Table V) for patch-level ILD classification using GoogLeNet-TL under five-fold cross-validation (denoted Patch-CV5) are also computed.
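A patient-level split of this kind can be produced, for example, with scikit-learn's GroupKFold; a minimal sketch with synthetic stand-in arrays (shapes are illustrative only):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-ins: one row per image patch, tagged with its patient ID,
# so that no patient contributes to both the training and test folds.
X = np.random.rand(1000, 4096)              # hypothetical per-patch features
y = np.random.randint(0, 6, size=1000)      # six ILD classes
patients = np.random.randint(0, 120, size=1000)

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=patients):
    assert set(patients[train_idx]).isdisjoint(patients[test_idx])
```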
TABLE V
CONFUSION MATRIX FOR ILD CLASSIFICATION (PATCH-LEVEL) WITH FIVE-FOLD CV USING GOOGLENET-TL.

Ground            Prediction
truth     NM     EM     GG     FB     MN     CD
NM        0.68   0.18   0.10   0.01   0.03   0.01
EM        0.03   0.91   0.00   0.02   0.03   0.01
GG        0.06   0.01   0.70   0.09   0.06   0.08
FB        0.01   0.02   0.05   0.83   0.05   0.05
MN        0.09   0.00   0.07   0.04   0.79   0.00
CD        0.02   0.01   0.10   0.18   0.01   0.68
TABLE III
COMPARISON OF INTERSTITIAL LUNG DISEASE CLASSIFICATION ACCURACIES ON BOTH SLICE-LEVEL (SLICE-CV5) AND PATCH-BASED (PATCH-CV5) CLASSIFICATION USING FIVE-FOLD CV. BOLD NUMBERS INDICATE THE BEST PERFORMANCE VALUES ON CLASSIFICATION ACCURACY.

Method      AlexNet-ImNet   AlexNet-RI   AlexNet-TL   GoogLeNet-RI   GoogLeNet-TL   Avg-All
Slice-CV5   0.45            0.44         0.46         0.41           0.57           0.53
Patch-CV5   0.76            0.74         0.76         0.75           0.76           0.79
F1-scores are reported on patch classification only (32 × 32 pixel patches extracted from manual ROIs) [38], [39], [54], as shown in Table IV. Both [38] and [39] use the
evaluation protocol of “leave-one-patient-out” (LOO), which
is arguably much easier and not directly comparable to 10-fold
CV [54] or our Patch-CV5. In this study, we classify six ILD
classes by adding a consolidation (CD) class to five classes
of healthy (normal - NM), emphysema (EM), ground glass
(GG), fibrosis (FB), and micronodules (MN) in [38], [39],
[54]. Patch-CV10 [54] and Patch-CV5 report similar medium-to-high F-scores. This implies that the ILD dataset (although one of the mainstream public medical image datasets) may not adequately represent ILD CT lung imaging patterns across a population of only 120 patients. Patch-CV5 nonetheless yields higher F-scores than [54] while also classifying the extra consolidation (CD) class. At present, the most pressing task is to drastically expand the dataset or to explore cross-dataset deep learning on the combined ILD and LTRC datasets [64].
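For reference, per-class F-scores and a row-normalized confusion matrix like Table V can be computed with scikit-learn; a sketch (the label arrays below are random placeholders, and normalize='true' requires scikit-learn >= 0.22):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

CLASSES = ['NM', 'EM', 'GG', 'FB', 'MN', 'CD']

# y_true / y_pred: per-patch ground-truth and predicted labels, encoded 0..5.
y_true = np.random.randint(0, 6, size=500)   # placeholder data
y_pred = np.random.randint(0, 6, size=500)

per_class_f1 = f1_score(y_true, y_pred, average=None, labels=list(range(6)))
cm = confusion_matrix(y_true, y_pred, labels=list(range(6)), normalize='true')
```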
Recently, Gao et al. [40] have argued that a new CADe protocol of classifying ILD diseases holistically and directly, using axial CT slice attenuation patterns and CNNs, may be more realistic for clinical applications. We refer to this as slice-level classification, as image patch sampling from manual ROIs can be completely avoided (hence, no manual ROI inputs are required). The experiments in [40] are conducted with a patient-level hard split of 100 (training) and 20 (testing) patients. The method's testing F-scores (i.e., Slice-Test) are given in Table IV. Note that the F-scores in [40] are not directly comparable to our results, due to different evaluation criteria: only Slice-Test is evaluated and reported in [40], and we find that F-scores can change drastically across different rounds of five-fold CV.
While it is a more practical CADe scheme, slice-level
CNN learning [40] is very challenging, as it is restricted
to only 905 CT image slices with tagged ILD labels. We
only benchmark the slice-level ILD classification results in
this section. Even with the help of data augmentation (de-
scribed in Sec. II), the classification accuracy of GoogLeNet-
TL from Table III is only 0.57. However, transfer learning from the ImageNet pre-trained model is consistently beneficial, as evidenced by AlexNet-TL (0.46) versus AlexNet-RI (0.44), and GoogLeNet-TL (0.57) versus GoogLeNet-RI (0.41). In particular, it prevents GoogLeNet from over-fitting on the limited CADe datasets. Finally, when cross-validation is conducted by randomly splitting the set of all 905 CT axial slices into five folds, markedly higher F-scores are obtained (Slice-Random in Table IV). This further validates the claim that the dataset does not adequately represent ILD appearance across different patients. Figure 10 shows examples of misclassified ILD patches (in axial view), with their ground-truth labels and the inaccurately predicted labels.
No existing work has reached the performance requirements of a realistic clinical setting [40], even for the simpler ROI-guided image patch extraction and classification scheme (which requires manual ROI selection by clinicians). The main goal
of this paper is to investigate the three factors (CNN architec-
tures, dataset characteristics and transfer learning) that affect
performance on a specific medical image analysis problem
and to ultimately deliver clinically relevant results. For ILD
classification, the most critical performance bottlenecks are
the challenge of cross-dataset learning and the limited patient
population size. We attempt to overcome these obstacles by
merging the ILD [37] and LTRC [64] datasets. Although the two datasets (the latter used in [19]) were generated and annotated separately, they contain many common disease labels. For instance, the disease classes emphysema (EM), ground glass (GG), fibrosis (FB), and micronodules (MN) belong to both datasets, and can thus be jointly trained and tested on to form a larger, unified dataset.
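A sketch of the label harmonization such merging requires (the label spellings and mappings below are hypothetical; only the four shared disease classes are kept):

```python
# Map each dataset's local label spelling onto a shared class code.
ILD_TO_SHARED = {'emphysema': 'EM', 'ground_glass': 'GG',
                 'fibrosis': 'FB', 'micronodules': 'MN'}
LTRC_TO_SHARED = {'EMPH': 'EM', 'GGO': 'GG', 'FIB': 'FB', 'MN': 'MN'}

def unify(samples, mapping):
    """Keep samples whose label both datasets share, renamed to the shared code."""
    return [(img, mapping[lbl]) for img, lbl in samples if lbl in mapping]

# merged = unify(ild_samples, ILD_TO_SHARED) + unify(ltrc_samples, LTRC_TO_SHARED)
```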
Adapting fully convolutional CNNs (FCNNs) to parse every pixel location in the ILD lung CT images or slices, or adapting other methods from CNN-based semantic image segmentation on PASCAL or ImageNet, may improve accuracy and efficiency. However, current FCNN approaches [65], [66] lack adequate spatial resolution in their direct output label space. A segmentation label propagation method was recently
proposed [47] to provide full pixel-wise labeling of the ILD
data images. In this work, we sample image patches from the
slice using the ROIs for the ILD provided in the dataset, in
order to be consistent with previous methods in patch-level
[38], [39], [54] and slice-level classification [40].
C. Evaluation of Five CNN Models using ILD Classification
In this work, we mainly focus on AlexNet and GoogLeNet. AlexNet is the first notably successful CNN architecture on the ImageNet challenge and has rekindled significant research interest in CNNs. GoogLeNet is the state-of-the-art deep model, which has outperformed other notable models, such as AlexNet, OverFeat, and VGGNet [67], [68], in various computer vision benchmarks. A reasonable assumption is therefore that OverFeat and VGGNet will produce quantitative performance results ranked between AlexNet's and GoogLeNet's. For completeness, we include OverFeat and VGGNet in the following evaluations, to test this hypothesis.
d) OverFeat: OverFeat is described in [67] as an integrated framework for using CNNs for classification, localization and detection. Its architecture is similar to that of AlexNet, but contains far more parameters (e.g., 1024 convolution filters in both the "conv4" and "conv5" layers, compared to 384 and 256 convolution kernels in the "conv4" and "conv5" layers of AlexNet), and operates more densely on the input image (e.g., a smaller kernel size of 2 in the "pool2" and "pool5" layers, compared to kernel size 3 in the "pool2" and "pool5" layers of AlexNet). OverFeat is the winning model of ILSVRC 2013 in the detection and classification tasks.

Fig. 10. Visual examples of misclassified ILD 64 × 64 patches (in axial view), with their ground truth labels and inaccurately classified labels.
e) VGGNet: The VGGNet architecture is introduced in [68]; it is designed to significantly increase the depth of existing CNN architectures, to 16 or 19 layers. Very small 3 × 3 convolutional filters are used in all convolution layers, with a convolutional stride of 1, in order to reduce the number of parameters in deeper networks. Since VGGNet is substantially deeper than the other CNN models, it is more susceptible to the vanishing gradient problem [69], [70], [71], and hence may be more difficult to train. Training the network also requires far more memory and computation time than AlexNet. We use the 16-layer variant as our default VGGNet model in this study.
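The parameter saving is easy to verify: two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as a single 5 × 5 convolution with roughly 28% fewer weights. A minimal sketch, ignoring biases and non-linearities (the channel count C is arbitrary):

```python
C = 256                             # arbitrary input/output channel count
stacked_3x3 = 2 * (3 * 3 * C * C)   # 1,179,648 weights, 5x5 receptive field
single_5x5 = 5 * 5 * C * C          # 1,638,400 weights, same receptive field
print(stacked_3x3 / single_5x5)     # 0.72: a 28% parameter reduction
```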
The classification accuracy results of the five CNN architectures (CifarNet, AlexNet, OverFeat, VGGNet and GoogLeNet) for ILD slice- and patch-level classification are shown in Table VI. Based on the analysis in Sec. IV-B, transfer learning is used for the slice-level classification task only. From Table VI, quantitative classification accuracy rates increase as the CNN model becomes more complex (CifarNet, AlexNet, OverFeat, VGGNet and GoogLeNet, in ascending order), for both the ILD slice- and patch-level classification problems. The reported results validate our assumption that OverFeat's and VGGNet's performance levels fall between AlexNet's and GoogLeNet's (an observation consistent with computer vision findings). CifarNet is designed for images of smaller dimensions (32 × 32), and is thus not suited to classification tasks involving 256 × 256 images.
To investigate the performance difference between the five-fold cross-validation (CV) of Sec. IV-B and leave-one-patient-out (LOO) validation, this experiment is performed under the LOO protocol. Comparing the results in Table III (CV-5) to those in Table VI (LOO), one can see that LOO's quantitative performance is markedly better than CV-5's. For example, in ILD slice-level classification, the accuracy drastically increases from 0.46 to 0.867 using AlexNet-TL, and from 0.57 to 0.902 using GoogLeNet-TL.
TABLE VI
CLASSIFICATION ACCURACY RESULTS ON ILD SLICE-LEVEL AND PATCH-LEVEL CLASSIFICATION WITH LOO.

Method         ILD-Slice   ILD-Patch
CifarNet       -           0.799
AlexNet-TL     0.867       0.865
OverFeat-TL    0.877       0.879
VGG-16-TL      0.900       0.893
GoogLeNet-TL   0.902       0.911
TABLE VII
TRAINING TIME AND MEMORY REQUIREMENTS OF THE FIVE CNN ARCHITECTURES ON ILD PATCH-BASED CLASSIFICATION UP TO 90 EPOCHS.

          CifarNet   AlexNet   OverFeat   VGG-16    GoogLeNet
Time      7m16s      1h2m      1h26m      20h24m    2h49m
Memory    2.25 GB    3.45 GB   4.22 GB    9.26 GB   5.37 GB
CNN training is implemented in the Caffe [56] deep learning framework, using an NVIDIA K40 GPU on Ubuntu 14.04 Linux. All models are trained for up to 90 epochs with an early stopping criterion, whereby the model snapshot with the lowest validation loss is taken as the final model. Other hyper-parameters are fixed as follows: momentum 0.9; weight decay 0.0005; and a step learning-rate schedule with a base learning rate of 0.01, decreased by a factor of 10 every 30 epochs. The image batch size is set to 128, except for GoogLeNet (64) and VGG-16 (32), which are the maximum batch sizes that fit in the NVIDIA K40 GPU's 12 GB of memory. Table VII lists the training time and memory requirements of the five CNN architectures on ILD patch-based classification, up to 90 epochs.
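A sketch of this setup using pycaffe, as a non-authoritative illustration: the file names are hypothetical, the 'loss' blob name depends on the network prototxt, and Caffe's stepsize is expressed in iterations, so "every 30 epochs" must be converted using the dataset size and batch size:

```python
import caffe

caffe.set_mode_gpu()
caffe.set_device(0)

# solver.prototxt (hypothetical) encodes the hyper-parameters above:
# momentum: 0.9, weight_decay: 0.0005, lr_policy: "step", base_lr: 0.01,
# gamma: 0.1, with stepsize derived from (num_samples / batch_size) * 30.
solver = caffe.SGDSolver('solver.prototxt')
solver.net.copy_from('bvlc_alexnet.caffemodel')  # transfer-learning variants only

max_iter, test_interval = 30000, 500  # illustrative values
best_val_loss = float('inf')
for it in range(1, max_iter + 1):
    solver.step(1)
    if it % test_interval == 0:
        # Early stopping: keep the snapshot with the lowest validation loss.
        val_loss = float(solver.test_nets[0].forward()['loss'])
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            solver.net.save('best_model.caffemodel')
```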
D. Training with “Equal Prior” vs. “Biased Prior”
Medical datasets are often "biased", in that the number of healthy samples is much larger than the number of diseased instances, or the numbers of images per class are uneven. In the ILD dataset, the number of fibrosis samples is about 3.5 times that of emphysema samples; in lymph node detection, the number of non-LNs is 3~4 times the number of LNs. Different sampling or re-sampling rates are routinely applied in both ILD and LN detection to balance the number of data samples per class, as in [22]. We refer to this as an "equal prior". Using the same sampling rate for every class instead leads to a "biased prior" across classes.
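A minimal sketch of the per-class sampling rates that produce an "equal prior" (the class counts below are illustrative only):

```python
def equal_prior_rates(class_counts):
    """Per-class sampling rate so every class contributes the same sample count."""
    n_min = min(class_counts.values())
    return {c: n_min / n for c, n in class_counts.items()}

# e.g., fibrosis roughly 3.5x more abundant than emphysema:
print(equal_prior_rates({'fibrosis': 3500, 'emphysema': 1000}))
# {'fibrosis': 0.2857..., 'emphysema': 1.0}
```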
Without loss of generality, after GoogLeNet is trained on the training sets under the "equal" or "biased" prior, we compare its classification results on balanced validation sets. Evaluating a classifier on a biased validation set would give an unfair assessment of its performance: for instance, a classifier that predicts every image patch as "non-LN" still achieves roughly 78% accuracy on a biased set with 3.5 times as many non-LN samples as LN samples. The classification accuracy results of GoogLeNet trained under the two configurations are shown in Table VIII. Overall, it achieves lower accuracy when trained with a "biased prior" in both tasks, though the accuracy difference for ILD patch-based classification is small.
TABLE VIII
CLASSIFICATION ACCURACIES FOR ILD SLICE-LEVEL AND PATCH-LEVEL CLASSIFICATION WITH "EQUAL PRIOR" AND "BIASED PRIOR", USING GOOGLENET-TL.

               ILD-Slice   ILD-Patch
Equal Prior    0.902       0.953
Biased Prior   0.872       0.952
V. ANALYSIS VIA CNN LEARNING TRACES & VISUALIZATION
In this section, we determine and analyze, via CNN visualization, the reasons why transfer learning is beneficial for achieving better performance in CADe applications.
Thoracoabdominal LN Detection. In Figure 12, the
first layer convolution filters from five different CNN ar-
chitectures are visualized. We notice that without trans-
fer learning [57], [6], somewhat blurry filters are learned
(AlexNet-RI (256x256), AlexNet-RI (64x64), GoogLeNet-
RI (256x256) and GoogLeNet-RI (64x64)). However, in
AlexNet-TL (256x256), many higher orders of contrast- or
edge-preserving patterns (that enable capturing image ap-
pearance details) are evidently learned through fine-tuning
from ImageNet. With a smaller input resolution, AlexNet-RI
(64x64) and GoogLeNet-RI (64x64) can learn image contrast
filters to some degree, whereas GoogLeNet-RI (256x256) and AlexNet-RI (256x256) exhibit overly smooth low-level filters throughout.
ILD classification. We focus on analyzing visual CNN
optimization traces and activations from the ILD dataset, as
its slice-level setting is most similar to ImageNet’s. Indeed,
both datasets use full-size images. The traces of the training
loss, validation loss and validation accuracy of AlexNet-RI and
AlexNet-TL, are shown in Figure 11. For AlexNet-RI in Figure
11 (a), the training loss significantly decreases as the number
of training epochs increases, while the validation loss notably
increases and the validation accuracy does not improve much
before reaching a plateau. With transfer learning and fine-
tuning, much better and consistent performances of training
loss, validation loss and validation accuracy traces are obtained
(see Figure 11 (b)). We begin the optimization problem – that
of fine-tuning the ImageNet pre-trained CNN to classify a
comprehensive set of images – by initializing the parameters
close to an optimal solution. One could compare this process
to making adults learn to classify ILDs, as opposed to babies.
Throughout this process, the validation loss remains at lower values, and a higher final validation accuracy is achieved than with random initialization. Meanwhile, the training losses in both cases
decrease to values near zero. This indicates that both AlexNet-
RI and AlexNet-TL over-fit on the ILD dataset, due to its small
instance size. The quantitative results in Table III indicate
that AlexNet-TL and GoogLeNet-TL have consistently better
classification accuracies than AlexNet-RI and GoogLeNet-RI,
respectively.
The last pooling layer (pool5) activation maps of the ImageNet pre-trained AlexNet [4] (analogous to AlexNet-ImNet) and of AlexNet-TL, obtained by processing the two input images of Figure 2 (b,c), are shown in Figure 13 (a,b). The last pooling layer activation map summarizes the entire input image by highlighting which neural reception fields, at their relative image locations, are activated. There are a total of 256 (6 × 6) reception fields in AlexNet [4]. Pooling units whose relative image locations coincide with the disease regions are highlighted with green boxes. Next, we reconstruct
the original ILD images using the process of de-convolution,
back-propagating with convolution and un-pooling from the
activation maps of the chosen pooling units [72]. From the
reconstructed images (Figure 13, bottom), we observe that, with fine-tuning, AlexNet-TL detects and localizes the objects of interest (the ILD disease regions depicted in Figure 2 (b) and (c)) better than AlexNet-ImNet. The filters shown in Figure 13 that better localize regions on the input images (Figure 2 (b) and (c)) produce relatively higher activations (in the top 5%) among all 256 reception field responses in the fine-tuned AlexNet-TL model. As observed in [73], the final CNN classification score cannot be driven solely by a single strong activation in the reception fields, but rather by a sparse set of high activations (i.e., varying selective or sparse activations per input image).
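For illustration, such pool5 activations can be read out in pycaffe roughly as follows (a sketch: file paths and the preprocessed input are placeholders, and the standard Caffe AlexNet takes 227 × 227 crops):

```python
import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'alexnet_tl.caffemodel', caffe.TEST)  # hypothetical

# preprocessed: a 3 x H x W array, already mean-subtracted and resized to the
# network input size (227 x 227 for the standard Caffe AlexNet crop).
preprocessed = np.zeros((3, 227, 227), dtype=np.float32)  # placeholder input
net.blobs['data'].reshape(1, *preprocessed.shape)
net.blobs['data'].data[0] = preprocessed
net.forward()

pool5 = net.blobs['pool5'].data[0]                          # shape (256, 6, 6)
top_units = np.argsort(pool5.max(axis=(1, 2)))[::-1][:13]   # top ~5% of 256 units
```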
VI. FINDINGS AND FUTURE DIRECTIONS
We summarize our findings as follows.
• Deep CNN architectures with 8 or even 22 layers [4], [33] can be useful even for CADe problems where the available training datasets are limited. Previously, the CNN models used in medical image analysis applications were often 2~5 orders of magnitude smaller.
Fig. 11. Traces of training and validation loss (blue and green lines) and validation accuracy (orange lines) during (a) training AlexNet from random
initialization and (b) fine-tuning from ImageNet pre-trained CNN, for ILD classification.
Fig. 12. Visualization of first layer convolution filters of CNNs trained on abdominal and mediastinal LNs in RGB color, from random initialization (AlexNet-RI
(256x256), AlexNet-RI (64x64), GoogLeNet-RI (256x256) and GoogLeNet-RI (64x64)) and with transfer learning (AlexNet-TL (256x256)).
• The trade-off between using better learning models and
using more training data [51] should be carefully consid-
ered when searching for an optimal solution to any CADe
problem (e.g., mediastinal and abdominal LN detection).
• Limited datasets can be a bottleneck to further ad-
vancement of CADe. Building progressively growing (in
scale), well annotated datasets is at least as crucial as
developing new algorithms. This has been accomplished,
for instance, in the field of computer vision. The well-
known scene recognition problem has made tremendous
progress, thanks to the steady and continuous develop-
ment of the Scene-15, MIT Indoor-67, SUN-397 and Places datasets [58].
• Transfer learning from the large scale annotated natural
image datasets (ImageNet) to CADe problems has been
consistently beneficial in our experiments. This sheds
some light on cross-dataset CNN learning in the medical
image domain, e.g., the union of the ILD [37] and LTRC
datasets [64], as suggested in this paper.
• Finally, applications of off-the-shelf deep CNN image
features to CADe problems can be improved by either
exploring the performance-complementary properties of
hand-crafted features [10], [9], [12], or by training CNNs
from scratch and better fine-tuning CNNs on the target
medical image dataset, as evaluated in this paper.
VII. CONCLUSION
In this paper, we exploit and extensively evaluate three important, previously under-studied factors: deep convolutional neural network (CNN) architecture, dataset characteristics, and transfer learning. We evaluate CNN performance on two different computer-aided diagnosis applications: thoraco-abdominal lymph node detection and interstitial lung disease classification. The empirical evaluation, CNN model visualization, CNN performance analysis, and conclusive insights can be generalized to the design of high-performance CAD systems for other medical imaging tasks.
ACKNOWLEDGMENT
This work was supported in part by the Intramural Re-
search Program of the National Institutes of Health Clinical
Center, and in part by a grant from the KRIBB Research
Initiative Program (Korean Biomedical Scientist Fellowship
Program), Korea Research Institute of Bioscience and Biotech-
nology, Republic of Korea. This study utilized the high-
performance computational capabilities of the Biowulf Linux
cluster at the National Institutes of Health, Bethesda, MD
(http://biowulf.nih.gov). We thank NVIDIA for the K40 GPU
donation.