Abstract

We develop a visualization tool suitable for deep neural networks (DNN). Although typical dimensionality reduction methods, such as principal component analysis (PCA), are useful to visualize high-dimensional data as 2- or 3-dimensional representations, most of those methods focus their attention on how to create essential subspaces based only on a single given feature representation. On the other hand, DNN naturally have consecutive multiple feature representations corresponding to their intermediate layers. In order to understand the relationships among those consecutive intermediate layers, we utilize canonical correlation analysis (CCA) to visualize them in a unified subspace. Our method (called consecutive CCA) can visualize the "feature flow", which represents the movement of samples between two consecutive layers of a DNN. By using standard benchmark datasets, we show that our visualization results contain much information that typical visualization methods (such as PCA) do not represent.
1 Introduction
Over the last decade, tremendous progress has been made in the field of artificial neural networks. Deep-layered convolutional neural networks (CNN) [2, 5, 6] have demonstrated state-of-the-art results on many machine learning problems, especially image recognition tasks [9].

Although deep CNN have proven to be extremely powerful and flexible tools, their properties and nature have not yet been sufficiently revealed. In order to understand the properties of deep neural networks, many visualization techniques have been proposed. Feature visualization [10], in which the weight coefficients or convolutional filters of trained networks are visually shown as the shapes of objects to which they respond, is one of the most helpful approaches to intuitively understand the characteristics of network responses.
On the other hand, there is another approach to the visualization of machine learning data and classifiers: many dimensionality reduction methods to visualize feature spaces, such as principal component analysis (PCA) [8], Fisher's linear discriminant analysis (FLDA) [1], and t-distributed stochastic neighbor embedding (t-SNE) [7], have been proposed and successfully utilized. However, most of those methods focus their attention on how to create an essential (consistent or discriminative) low dimensional space based only on a single feature representation. Meanwhile, as one of their good properties, CNN have consecutive multiple feature representations obtained from their intermediate layers, which are automatically organized through the machine learning process (i.e., back-propagation learning). In spite of this, typical dimensionality reduction methods (PCA, FLDA, and so on) visualize each feature representation individually. As a result, those multiple visualized results cannot be directly compared to each other.
To overcome this drawback, in this paper, we investigate a novel way to visualize feature spaces for CNN, where consecutive intermediate layers are projected into a common low dimensional space. In order to construct a common subspace for two consecutive layers of a CNN, we utilize canonical correlation analysis (CCA) [3], which is a method for exploring relationships between two multivariate sets of variables observed from the same instance.

Our method can visualize the "feature flow" of input samples from one layer to the next. By using standard benchmark datasets, we show that our visualization results contain much information that typical visualization methods (such as PCA) do not represent.
2 Consecutive Visualization of CNN
2.1 Convolutional Neural Networks
CNN are a class of artificial neural networks with the distinctive architecture shown in Fig. 1. The input data of a CNN are usually RGB images (3 channels) or gray-scale images (1 channel). Several convolutional or pooling layers (with or without activation functions) follow the input layer. For classification problems, one or more fully connected (FC) layers are often employed. The final layer outputs prediction values (such as posterior probabilities or likelihoods) for the K kinds of objects into which the input image should be classified.
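To make this architecture concrete, the following is a minimal PyTorch sketch of such a network. The layer sizes and names (SmallCNN, conv1, relu1, and so on) are illustrative assumptions and not the exact networks of Fig. 3.

```python
import torch
import torch.nn as nn

# A minimal sketch of the kind of architecture described above
# (layer sizes are assumptions, not the exact networks of Fig. 3).
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)   # 28x28 -> 24x24
        self.pool1 = nn.MaxPool2d(2)                   # 24x24 -> 12x12
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)  # 12x12 -> 8x8
        self.pool2 = nn.MaxPool2d(2)                   # 8x8 -> 4x4
        self.relu2 = nn.ReLU()
        self.fc = nn.Linear(50 * 4 * 4, num_classes)   # FC layer
        self.softmax = nn.Softmax(dim=1)               # probabilistic output

    def forward(self, x):
        x = self.relu1(self.pool1(self.conv1(x)))
        x = self.relu2(self.pool2(self.conv2(x)))
        x = self.fc(x.flatten(start_dim=1))
        return self.softmax(x)
```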
Fig. 1: An example of CNN architecture.

Fig. 2: A concept of our method.

Each layer of a CNN can have an activation function which controls the amount of the output value propagated to the next layer. For intermediate layers, the rectified linear unit (ReLU),

$$ f(a_i^l) = \max(0, a_i^l), \quad (1) $$

is often used. Note that $a_i^l \in \mathbb{R}$ is the sum of signals received by the $i$-th unit in the $l$-th intermediate layer.
Meanwhile, for the last layer, the soft-max function,

$$ f_k(z) = \frac{\exp(z_k)}{\sum_{\kappa=1}^{K} \exp(z_\kappa)}, \quad (2) $$

is often used to obtain probabilistic outputs. Note that $z$ is a $K$-dimensional vector where $z_k$ is the sum of signals received by the $k$-th unit in the last layer. Since the function is non-negative and has the unit sum property ($\sum_k f_k(z) = 1$), the value of $f_k$ implies the class posterior probability that an input datum belongs to the $k$-th class. Therefore, by using the soft-max function in the output layer, a CNN can act as a probability estimator for object classification problems.
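As a quick illustration, here is a minimal NumPy sketch of these two activation functions; subtracting the maximum before exponentiating in the soft-max is a standard numerical precaution on our part, not part of Eq. (2), and does not change the result.

```python
import numpy as np

def relu(a):
    """Rectified linear unit, Eq. (1): max(0, a), applied element-wise."""
    return np.maximum(0.0, a)

def softmax(z):
    """Soft-max function, Eq. (2). The shift by max(z) avoids overflow
    in exp() while leaving the output unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())  # non-negative values summing to 1
```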
As one of the distinctive properties of CNN, they have consecutive multiple feature representations which are automatically organized in each convolutional layer through training on given labeled instances. In spite of this interesting situation, typical dimensionality reduction methods (such as PCA) visualize each feature representation individually, without regarding the relationships between those consecutive features.
In order to construct a common subspace for two consecutive layers of a CNN, in this paper, we utilize canonical correlation analysis (CCA) [3], which is a method for exploring relationships between two multivariate sets of variables observed from the same instance.
2.2 Canonical Correlation Analysis
CCA [3] is known as a method to find linear combinations of two different data representations which have the maximum correlation with each other.

Let us denote $p$-dimensional vectors by $x = [x_1, \cdots, x_p]^T$ and $q$-dimensional vectors by $y = [y_1, \cdots, y_q]^T$. Given sets of $N$ samples $X = \{x_1, \cdots, x_N\}$ and $Y = \{y_1, \cdots, y_N\}$, where $x_i$ and $y_i$ are observed from the same $i$-th instance, CCA gives the linear transformations

$$ u_i = A^T (x_i - \bar{x}) \quad (3) $$
$$ v_i = B^T (y_i - \bar{y}), \quad (4) $$

which maximize the correlation coefficient between the sets of new features $U = \{u_1, \cdots, u_N\}$ and $V = \{v_1, \cdots, v_N\}$. $\bar{x}$ and $\bar{y}$ are the mean vectors of $x \in X$ and $y \in Y$, respectively. The coefficient matrices $A$ and $B$ are obtained by solving the following eigenequations,

$$ R_{XY} R_Y^{-1} R_{YX} A = R_X A \Lambda^2 \quad (A^T R_X A = I_{d^*}) \quad (5) $$
$$ R_{YX} R_X^{-1} R_{XY} B = R_Y B \Lambda^2 \quad (B^T R_Y B = I_{d^*}), \quad (6) $$

where $\Lambda^2$ is a diagonal matrix whose diagonal components are the eigenvalues, $d^* \leq \min(p, q)$ is the dimension of the canonical principal components (CPC) $u$ and $v$, and

$$ R_X = \sum_{i=1}^{N} x_i x_i^T \quad (7) $$
$$ R_Y = \sum_{i=1}^{N} y_i y_i^T \quad (8) $$
$$ R_{XY} = \sum_{i=1}^{N} x_i y_i^T = (R_{YX})^T. \quad (9) $$

As with PCA [8], we can arbitrarily select the dimension $d^*$ of $u$ and $v$. For the purpose of visualizing a feature space, $d^* = 2$ or $3$ is usually selected.
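As a concrete illustration, the following is a minimal NumPy/SciPy sketch of CCA solved through the eigenequations (5)-(9). The data are centered first (consistent with the mean subtraction in Eqs. (3)-(4)), and the small ridge term `reg` is a practical stabilizer of our own, not part of the equations above.

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, d_star, reg=1e-6):
    """CCA via the eigenequation (5): Rxy Ry^-1 Ryx A = Rx A Lambda^2.
    X: (N, p) and Y: (N, q) hold paired observations in their rows.
    Returns A, B, the canonical correlations rho, and the mean vectors
    needed for the projections of Eqs. (3)-(4)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    Rx = Xc.T @ Xc + reg * np.eye(X.shape[1])   # Eq. (7), with ridge
    Ry = Yc.T @ Yc + reg * np.eye(Y.shape[1])   # Eq. (8), with ridge
    Rxy = Xc.T @ Yc                             # Eq. (9); Ryx = Rxy.T
    M = Rxy @ np.linalg.solve(Ry, Rxy.T)        # Rxy Ry^-1 Ryx
    # Generalized symmetric eigenproblem; eigh normalizes eigenvectors
    # so that A^T Rx A = I, matching the constraint in Eq. (5).
    lam, A = eigh(M, Rx)
    order = np.argsort(lam)[::-1][:d_star]      # keep d* largest eigenvalues
    lam, A = lam[order], A[:, order]
    rho = np.sqrt(np.clip(lam, 0.0, 1.0))       # canonical correlations
    # B follows from A; this is equivalent to solving Eq. (6).
    B = np.linalg.solve(Ry, Rxy.T @ A) / np.maximum(rho, 1e-12)
    return A, B, rho, x_mean, y_mean
```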
2.3 Consecutive CCA for CNN Visualization
In this paper, we utilize CCA in order to construct a common subspace for two consecutive layers of a CNN. The idea of our method is illustrated in Fig. 2: for example, given two consecutive convolutional layers (Conv 1 and Conv 2), PCA creates two subspaces (PCA1 and PCA2) individually. However, these subspaces do not correspond to each other. On the other hand, CCA can create a common subspace for Conv 1 and Conv 2, where their correlation is maximized. Such a common representation will be helpful to intuitively analyze the relationship between these consecutive layers.
Assume that a set of $N$ training samples $I = \{x_1, \cdots, x_N\} \subset \mathbb{R}^d$ and a set of labels $T = \{t_1, \cdots, t_N\}$, where $t_i \in \{1, 2, \cdots, K\}$ is the class label of $x_i$, are given. Also, assume that a CNN which has already been trained using $I$ and $T$ is given.

Let $f^l = [f_1^l, \cdots, f_{d_l}^l]^T$ be the feature representation of an input image $x$ in the $l$-th CNN layer, where $f^1$ is the input $x$ itself. Note that $d_l$ is the feature dimension of $f^l$, and it is determined from the number of channels $c_l$ and the map size $s_l$ of the $l$-th layer, as follows:

$$ d_l = c_l (s_l)^2. \quad (10) $$

Fig. 3: CNN structures for MNIST (left) and CIFAR-10 (right).

For example, if we use 28 × 28 grayscale images as the input of the CNN, then $s_1 = 28$, $c_1 = 1$, and $d_1 = 784$. If we apply 20 convolutional filters with a kernel size of 5 × 5 to these images, then we obtain 20 different feature maps of 24 × 24 pixels. In this case, $s_2 = 24$, $c_2 = 20$, and $d_2 = 11{,}520$.
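The dimension bookkeeping of Eq. (10) is easy to check in code; this small sketch merely reproduces the two worked examples above.

```python
def feature_dim(channels, map_size):
    """Eq. (10): d_l = c_l * (s_l)^2."""
    return channels * map_size ** 2

print(feature_dim(1, 28))   # input layer: d_1 = 784
print(feature_dim(20, 24))  # 20 feature maps of 24x24: d_2 = 11520
```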
By inputting all training instances $x_i \in I$ to the given CNN, we can obtain the sets of intermediate features $F^l = \{f_1^l, \cdots, f_N^l\}$ for $l = 1$ to $L$, where $L$ is the number of CNN layers.
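One possible way to collect these feature sets is with forward hooks in PyTorch. In this sketch, `cnn` (e.g., a trained instance of the SmallCNN sketch above) and `loader` are assumed to exist, and the layer attribute names are assumptions.

```python
import torch

# Hypothetical setup: `cnn` is a trained torch.nn.Module and `loader`
# yields batches of images; layer names below are assumptions.
layer_names = ["relu1", "relu2", "softmax"]
features = {name: [] for name in layer_names}

def make_hook(name):
    def hook(module, inputs, output):
        # Flatten each sample's feature map into a d_l-dimensional vector.
        features[name].append(output.detach().flatten(start_dim=1).cpu())
    return hook

for name in layer_names:
    getattr(cnn, name).register_forward_hook(make_hook(name))

with torch.no_grad():
    for images, _ in loader:
        cnn(images)

# F[name] is an (N, d_l) matrix: the feature set F^l for that layer.
F = {name: torch.cat(chunks).numpy() for name, chunks in features.items()}
```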
Let us consider $f_i^m$ and $f_i^n$ ($1 \leq m < n \leq L$), which are different intermediate feature representations of $x_i$. Note that, in general, $f_i^m$ and $f_i^n$ have different feature dimensions ($d_m$ and $d_n$) depending on the design of each CNN layer. In this case, CCA using $F^m$ and $F^n$ gives the linear transformations

$$ u_i^{m,n} = (A^{m,n})^T (f_i^m - \bar{f}^m) \quad (11) $$
$$ v_i^{m,n} = (B^{m,n})^T (f_i^n - \bar{f}^n), \quad (12) $$

which maximize the correlation coefficient between the sets of new features $U^{m,n} = \{u_1^{m,n}, \cdots, u_N^{m,n}\}$ and $V^{m,n} = \{v_1^{m,n}, \cdots, v_N^{m,n}\}$. $\bar{f}^m$ and $\bar{f}^n$ are the mean vectors of $F^m$ and $F^n$, respectively. The coefficient matrices $A^{m,n}$ and $B^{m,n}$ are obtained by solving the following eigenequations,

$$ R^{m,n} (R^n)^{-1} R^{n,m} A^{m,n} = R^m A^{m,n} \Lambda^2 \quad (13) $$
$$ R^{n,m} (R^m)^{-1} R^{m,n} B^{m,n} = R^n B^{m,n} \Lambda^2 \quad (14) $$
where

$$ R^m = \sum_{i=1}^{N} f_i^m (f_i^m)^T \quad (15) $$
$$ R^n = \sum_{i=1}^{N} f_i^n (f_i^n)^T \quad (16) $$
$$ R^{m,n} = \sum_{i=1}^{N} f_i^m (f_i^n)^T = (R^{n,m})^T. \quad (17) $$
In this situation, the difference vector

$$ \delta_i^{m \to n} \equiv v_i^{m,n} - u_i^{m,n} \quad (18) $$

indicates the feature flow of the sample $x_i$ from the $m$-th layer to the $n$-th layer of the CNN, in the corresponding canonical space. This flow represents the movement of samples between two consecutive layers of the CNN, and this information will be helpful to analyze or understand the properties of each layer.
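Putting the pieces together, a sketch of computing and drawing the feature flow of Eq. (18) might look as follows, reusing the cca() helper sketched in Section 2.2. Here Fm and Fn stand for the feature matrices $F^m$ and $F^n$ (e.g., F["relu1"] and F["relu2"] from the extraction sketch above).

```python
import matplotlib.pyplot as plt

# Fm: (N, d_m) and Fn: (N, d_n) are paired feature matrices for layers m, n.
A, B, rho, fm_mean, fn_mean = cca(Fm, Fn, d_star=2)
U = (Fm - fm_mean) @ A   # Eq. (11): layer-m samples in the canonical space
V = (Fn - fn_mean) @ B   # Eq. (12): layer-n samples in the canonical space
delta = V - U            # Eq. (18): feature flow of each sample

plt.scatter(U[:, 0], U[:, 1], s=10, c="black")
plt.quiver(U[:, 0], U[:, 1], delta[:, 0], delta[:, 1],
           angles="xy", scale_units="xy", scale=1.0, width=0.002)
plt.title("Feature flow from layer m to layer n")
plt.show()
```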
3 Experiments
In this paper, we used the MNIST data [6] and the CIFAR-10 data [4] to confirm the properties of our method. MNIST consists of 70,000 images of handwritten digits from 0 to 9 (thus, K = 10). For each class, 6,000 images are used for training and 1,000 images are used for testing. CIFAR-10 consists of 60,000 images of K = 10 kinds of objects (airplane, automobile, bird, ...). For each class, 5,000 images are used for training and 1,000 images are used for testing. The CNN structures that we used here are illustrated in Fig. 3.
To remove redundancy in each feature representation, we apply a preliminary PCA to $f^l$ and reduce the feature dimension $d_l$ to the minimum dimension at which the accumulated contribution rate (cumulative explained variance ratio) exceeds 99.9%. Note that the projection coefficients of this preliminary PCA are calculated from only the training instances.
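In scikit-learn, this preliminary reduction can be written compactly, since a fractional n_components keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction. F_train and F_test below are hypothetical feature matrices for one layer.

```python
from sklearn.decomposition import PCA

# Keep the fewest components explaining more than 99.9% of the variance.
pca = PCA(n_components=0.999)
F_train_red = pca.fit_transform(F_train)  # coefficients fit on training data only
F_test_red = pca.transform(F_test)        # test data is merely projected
```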
In order to compare PCA (as the standard visualization method) and our consecutive CCA, we create two kinds of subspaces:

• By using PCA, we create subspaces for the input layer and all activation layers (PCA1, PCA2, ... in Fig. 3).

• By using consecutive CCA, we create subspaces for all consecutive pairs of the input and activation layers (CCA1, CCA2, ... in Fig. 3).

Note that the "activation layers" include the output (soft-max) layer. All coefficient matrices used to construct the subspaces are calculated from only the training images. In contrast, all visualized results (shown in Figs. 4 to 9) contain only testing images, which are simply mapped by using the obtained coefficient matrices.
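As a sketch of how the consecutive-CCA subspaces could be enumerated, reusing the cca() helper from Section 2.2 (the layer names and the F_train mapping from names to training feature matrices are again assumptions):

```python
# Build one common subspace per consecutive pair of layers (CCA1, CCA2, ...),
# fitting on training features only.
layer_names = ["input", "relu1", "relu2", "softmax"]  # assumed names
subspaces = {}
for m, n in zip(layer_names[:-1], layer_names[1:]):
    subspaces[(m, n)] = cca(F_train[m], F_train[n], d_star=2)
```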
Figs. 4 and 5 show a part of the visualized results of the CNN features obtained from the MNIST data. Note that the picked classes (4 and 9) in the MNIST data are relatively easily misclassified as each other. In general, PCA gives subspaces in which the distribution of the original samples is most faithfully kept. For example, it can be seen that PCA 1-4 in Fig. 4 seem to show the locality of the picked classes. On the other hand, PCA 5 seems to carry rather poor information: since the last (soft-max) layer of the CNN only consists of probabilistic information, and since the CNN used here can correctly classify 99.2% of the test images, almost all samples are located near the class centers (where the posterior probability becomes approximately 1).
In contrast, CCA 4 in Fig. 5 shows more informative visualized results; these are the results of CCA between the last two activation layers (ReLU3 and Softmax). In general, the last convolutional block (i.e., the set of conv, pool, and ReLU layers) of a CNN usually has the most important information for classification tasks. Since our consecutive CCA can integrate such an important layer (ReLU3) with the prediction (Softmax) layer, the obtained common subspace seems to be more informative than that of PCA. For example, in these figures, the feature flows $\delta_i^{m \to n}$ are illustrated as line segments attached to the circles. The feature flows indicate the movement of samples from the previous layer to the next layer. This may be helpful for analyzing which layer causes misclassifications of certain instances.
Figs. 6 and 7 show the visualized results of CIFAR-10, with the same contents as Figs. 4 and 5. The picked classes ("cat" and "dog") are relatively hard to discriminate. As with the MNIST data, consecutive CCA also seems to give a more informative subspace for the last layer(s) of the CNN than PCA.
It is interesting to note that, for higher-layer visualization (PCA4 and CCA3 in Figs. 4 and 5, or PCA5 and CCA4 in Figs. 6 and 7), PCA and consecutive CCA construct similar subspaces. Since consecutive CCA additionally provides the feature flows, it seems to be more suitable for visualizing the higher layers of a CNN than PCA.
Figs. 8 and 9 show all of the CCA subspaces for CIFAR-10. It can be seen that the separation of correctly classified samples (black circles) and misclassified samples (red circles) mainly occurs in CCA 4 (i.e., the CCA for ReLU3 and ReLU4). This may indicate that the layers up to ReLU3 play a role in extracting common visual features (such as edges, corners, and primitive shapes), while the layers from ReLU4 onward play a role in discrimination, i.e., in finding the features that differ between object classes.
4 Conclusions
In this paper, we developed a visualization tool suitable for deep neural networks, called consecutive CCA. By using standard benchmark datasets, we showed that our visualization results contain much information that typical visualization methods (such as PCA) do not represent. Our method can effectively integrate the higher feature layers with the prediction layer, and it yields more informative visualized subspaces.

Our consecutive CCA can visualize the "feature flow", which indicates the movement of samples between two consecutive layers of a CNN, and it may bring us new information for understanding the characteristics of deep neural networks. For example, if almost all samples move drastically in a specific canonical space, this might indicate that more intermediate layers are required in order to achieve staged and precise feature extraction.

As future work, we will utilize consecutive CCA to evaluate several properties of CNN (such as local invariance) in more detail. Visualization of extremely deep CNN (e.g., residual networks [2]) by using our method may also be interesting.