CNNs

This is part 4 of my applied DL notes. They will cover very high-level intuition about CNNs.

Motivation

Basics

Some desirable properties for processing image data:

1) Early layers should be translation equivariant.

A shift in a 2D input $X$ should simply lead to a shift in the hidden representation $H$. Neurons should not respond differently based on the global location of the same patch. That means we use convolutions! That is, if $V$ is a weight matrix, we have:

$$[H]_{i, j} = u + \sum_{a} \sum_{b} [V]_{a, b} [X]_{i + a, j + b}$$

Notice that $V$ does not depend on $(i, j)$, even though it could: the same weights are shared across all spatial locations. Pooling also helps with translational invariance.
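As a sketch of the equation above, here is a naive NumPy implementation of the (cross-correlation form of the) operation, with no padding or stride; the function name `corr2d` and the toy inputs are mine, not from any library:

```python
import numpy as np

def corr2d(X, V, u=0.0):
    """Naive 2D cross-correlation: H[i, j] = u + sum_{a,b} V[a, b] * X[i+a, j+b].

    The same weight matrix V is applied at every location (i, j),
    which is exactly the weight sharing discussed above.
    """
    kh, kw = V.shape
    H = np.empty((X.shape[0] - kh + 1, X.shape[1] - kw + 1))
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            H[i, j] = u + (V * X[i:i + kh, j:j + kw]).sum()
    return H

X = np.arange(16.0).reshape(4, 4)
V = np.ones((2, 2))          # one shared kernel for all positions
H = corr2d(X, V)             # shape (3, 3)
```

Because `V` is reused everywhere, the parameter count depends only on the kernel size, not on the image size.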

2) Early layers should capture local information.

We should not have to look far from $(i, j)$ to glean info about what is going on at $[H]_{i, j}$. That is, for some $\Delta > 0$, we want:

$$[H]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [V]_{a, b} [X]_{i + a, j + b}$$

This is exactly what we call a convolutional layer. The smaller $\Delta$, the smaller the number of parameters. In general, we will actually have $c$ input channels (e.g. RGB images) and may want $d$ output channels, so:

$$[H]_{i, j, d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_{c} [V]_{a, b, c, d} [X]_{i + a, j + b, c}$$

Intuitively, one channel may be a “feature map” which learns edge detection while another learns textures.
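The multi-channel sum above can be sketched in NumPy as well. This is a toy illustration, not an efficient implementation; the name `corr2d_multi` and the channel-last layout `(height, width, channels)` are my own conventions:

```python
import numpy as np

def corr2d_multi(X, V):
    """Multi-channel cross-correlation.

    X: input of shape (h, w, c)           -- c input channels
    V: kernel of shape (kh, kw, c, d)     -- d output channels
    Returns H of shape (h - kh + 1, w - kw + 1, d), where
    H[i, j, d] = sum over a, b, c of V[a, b, c, d] * X[i+a, j+b, c].
    """
    kh, kw, c, d = V.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    H = np.empty((oh, ow, d))
    for i in range(oh):
        for j in range(ow):
            patch = X[i:i + kh, j:j + kw, :]                        # (kh, kw, c)
            # Contract over the spatial offsets and input channels,
            # leaving one value per output channel.
            H[i, j] = np.tensordot(patch, V, axes=([0, 1, 2], [0, 1, 2]))
    return H

X = np.ones((4, 4, 3))        # e.g. a tiny "RGB" image
V = np.ones((2, 2, 3, 5))     # 5 output feature maps
H = corr2d_multi(X, V)        # shape (3, 3, 5); each entry is 2*2*3 = 12
```

Each of the $d$ slices of `V` acts as its own filter, so each output channel can learn a different feature map.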

3) Later layers should capture global information.

Since we are usually interested in global questions such as “Does this whole image contain a cat?”, we want later layers to aggregate information from earlier layers. Each hidden node should have a large receptive field with respect to the inputs. Pooling helps spatially downsample local representations.
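As a minimal sketch of the spatial downsampling idea, here is non-overlapping max pooling in NumPy (the helper name `max_pool2d` and the default window size are my assumptions):

```python
import numpy as np

def max_pool2d(X, k=2):
    """Non-overlapping k x k max pooling over a 2D array.

    Each spatial dimension shrinks by a factor of k, so stacking
    such layers quickly grows the receptive field of later units.
    """
    h, w = (X.shape[0] // k) * k, (X.shape[1] // k) * k
    # Reshape each k x k block onto its own axes, then take the max.
    blocks = X[:h, :w].reshape(h // k, k, w // k, k)
    return blocks.max(axis=(1, 3))

X = np.arange(16.0).reshape(4, 4)
P = max_pool2d(X)   # shape (2, 2): each output is the max of a 2x2 block
```

After a few rounds of pooling, each remaining unit summarizes a large region of the original image, which is what global questions like "does this image contain a cat?" require.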

Implementation Details

Architectures

Below is a summary of the history of top-performing CNNs for image classification.