In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1, ..., x_n)$ to another sequence of equal length $(z_1, ..., z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention, we consider three desiderata. One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies \citep{hochreiter2001gradient}. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

\begin{table}[t]
\caption{
  Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. $n$ is the sequence length, $d$ is the representation dimension, $k$ is the kernel size of convolutions and $r$ the size of the neighborhood in restricted self-attention.}
\label{tab:op_complexities}
\begin{center}
\begin{tabular}{lccc}
\toprule
Layer Type & Complexity per Layer & Sequential & Maximum Path Length \\
           &                      & Operations &                     \\
\midrule
Self-Attention              & $O(n^2 \cdot d)$         & $O(1)$ & $O(1)$         \\
Recurrent                   & $O(n \cdot d^2)$         & $O(n)$ & $O(n)$         \\
Convolutional               & $O(k \cdot n \cdot d^2)$ & $O(1)$ & $O(\log_k(n))$ \\
Self-Attention (restricted) & $O(r \cdot n \cdot d)$   & $O(1)$ & $O(n/r)$       \\
\bottomrule
\end{tabular}
\end{center}
\end{table}
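To make the first row of Table~\ref{tab:op_complexities} concrete, the sketch below computes single-head dot-product self-attention over a length-$n$ sequence. It is illustrative only and omits the learned projections and multi-head structure of our actual model: a constant number of matrix operations touches all $n^2$ position pairs of $d$-dimensional vectors, giving $O(n^2 \cdot d)$ work per layer, $O(1)$ sequential operations, and an $O(1)$ maximum path length.

\begin{verbatim}
import numpy as np

def self_attention(X):
    # X: (n, d) array of d-dimensional position representations.
    # Illustrative single head; learned query/key/value projections
    # and multiple heads are omitted.
    n, d = X.shape
    # All n^2 pairwise compatibility scores: O(n^2 * d) work, but a
    # constant number of sequential (matrix) operations.
    scores = X @ X.T / np.sqrt(d)                    # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    # Every output position aggregates over every input position,
    # so the maximum path length between any two positions is O(1).
    return weights @ X                               # (n, d)
\end{verbatim}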
As noted in Table~\ref{tab:op_complexities}, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case with the sentence representations used by state-of-the-art models in machine translation, such as word-piece \citep{wu2016google} and byte-pair \citep{sennrich2015neural} representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work.

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions \citep{NalBytenet2017}, increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$. Separable convolutions \citep{xception2016}, however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$. Even with $k = n$, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
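The restricted variant in the last row of Table~\ref{tab:op_complexities} admits a similarly short sketch. Again this is an illustration only, with a single head and no learned projections: each output position attends to a window of at most $r$ positions, so the per-layer cost drops to $O(r \cdot n \cdot d)$ while the maximum path length grows to $O(n/r)$.

\begin{verbatim}
import numpy as np

def restricted_self_attention(X, r):
    # X: (n, d). Each position attends only to a neighborhood of
    # at most r positions centered on it.
    n, d = X.shape
    out = np.empty_like(X)
    # The loop is written sequentially for clarity; the positions are
    # independent, so the minimum number of sequential operations
    # remains O(1) when they are computed in parallel.
    for i in range(n):
        lo, hi = max(0, i - r // 2), min(n, i + r // 2 + 1)
        window = X[lo:hi]                        # at most r positions
        scores = window @ X[i] / np.sqrt(d)      # O(r * d) per position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ window                # (d,)
    return out
\end{verbatim}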
There is a second argument for self-attention layers, which we call the unfiltered bottleneck argument. In both recurrent and convolutional layers, the information that position $i$ receives from the other positions is compressed into a single vector of dimension $d$ before it can be filtered by the content $x_i$. More precisely, we can express $z_i = F(i, x_i, G(i, \{x_{j \neq i}\}))$, where $G(i, \{x_{j \neq i}\})$ is a vector of dimension $d$. Intuitively, we would expect this to allow a large amount of irrelevant information to crowd out the relevant information. Self-attention does not suffer from this unfiltered bottleneck: the aggregation happens only after each position has been weighted by its compatibility with $x_i$, so, intuitively, much more of the relevant information can be transmitted (see the sketch at the end of this section).

As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, but many also appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
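To make the unfiltered bottleneck contrast concrete, the sketch below juxtaposes the two orderings of aggregation and filtering. It is an illustration only: here $G$ is instantiated as mean-pooling and $F$ as a simple additive nonlinearity, which stand in for, but do not correspond to, any particular recurrent or convolutional layer.

\begin{verbatim}
import numpy as np

def aggregate_then_filter(X, i):
    # Bottleneck pattern of recurrent and convolutional layers:
    # all other positions are compressed into one d-dimensional
    # summary before the content x_i can filter it.
    others = np.delete(X, i, axis=0)
    summary = others.mean(axis=0)      # stands in for G(i, {x_{j != i}})
    return np.tanh(X[i] + summary)     # stands in for F(i, x_i, .)

def filter_then_aggregate(X, i):
    # Self-attention pattern: x_i scores every position first, and
    # only the content-weighted result is aggregated.
    n, d = X.shape
    scores = X @ X[i] / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ X
\end{verbatim}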