2.1.3.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) [244] are a class of DNNs designed primarily for visual data and other data with a grid-like spatial structure, such as images. They are inspired by the visual cortex of animals, which contains neurons that are sensitive to small subregions of the visual field, called receptive fields. The receptive fields of different neurons partially overlap such that they cover the entire visual field, growing larger in deeper layers of the visual cortex.

[Figure 2.2: Sketch of a CNN architecture. The input is a 2D image, which is iteratively convolved with a set of learned filters detecting specific input features, e.g., edges, corners, blobs, to produce feature maps. Feature maps are then downsampled using a pooling operation.]

As illustrated in Figure 2.2, CNNs are composed of multiple convolutional layers, which hierarchically extract features from the input, followed by pooling and fully-connected layers that classify the input based on the downsampled features. A filter K ∈ R^{d×d} is a rectangular matrix of trainable weights, with width and height d typically smaller than the input x. A convolutional layer applies filters sliding over the input, with each filter producing a feature map:

    F = K ∗ x,    (2.6)

where the convolution operation ∗ computes a dot product between the filter entries and the covered portion of the input at each position. Thanks to the weight sharing of the convolution operation, the same filter detects a feature wherever it appears in the input; combined with pooling, this gives CNNs a degree of translation invariance, i.e., the ability to recognize an object regardless of its position in the image. This is particularly useful for object detection, where the position of the object in the image is unknown. This architecture was used for document image classification and document layout analysis (Section 6.3.2). A one-dimensional variant, the 1-D CNN, was applied to one-hot encoded text data in the text classification benchmarks (Section 3.4.3).
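To make the operation in Eq. (2.6) concrete, the following minimal NumPy sketch (not part of the original text; the function name conv2d and the toy edge filter are illustrative) computes a feature map by sliding a d × d filter over a 2D input with stride 1 and no padding. As in most deep learning frameworks, it implements the cross-correlation form of the operation, i.e., the filter is not flipped.

```python
import numpy as np

def conv2d(x: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Slide a d x d filter K over a 2D input x and return the feature map
    F = K * x (Eq. 2.6), computed as a dot product between the filter
    entries and each covered d x d patch of the input (stride 1, no padding).
    """
    d = K.shape[0]
    H, W = x.shape
    F = np.empty((H - d + 1, W - d + 1))
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            # Dot product between the filter and the covered input patch.
            F[i, j] = np.sum(K * x[i:i + d, j:j + d])
    return F

# Toy example: a 3x3 vertical-edge filter applied to a 5x5 input.
x = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 0.0, -1.0]] * 3)  # responds to vertical edges
print(conv2d(x, K))  # 3x3 feature map
```

Because the same weights K are reused at every spatial position, a feature learned in one part of the input is detected everywhere, which is the weight-sharing property discussed above; a convolutional layer simply applies several such filters in parallel, yielding one feature map per filter.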