---
title: "Activation functions"
notebook-links: false
crossref:
  lof-title: "List of Figures"
number-sections: false
format:
  html: default
---

When choosing an activation function, consider the following:

- **Non-saturation:** Avoid activations that saturate (e.g., sigmoid, tanh) in the hidden layers of deep networks; their gradients approach zero for inputs of large magnitude, which causes vanishing gradients.
- **Computational efficiency:** Choose activations that are cheap to compute (e.g., ReLU, Leaky ReLU) for large models or real-time applications.
- **Smoothness:** Smooth activations (e.g., GELU, Mish) can help with optimization and convergence.
- **Domain knowledge:** Select activations based on the problem domain and desired output (e.g., softmax for multi-class classification).
- **Experimentation:** Try different activations and evaluate their performance on your specific task.

[But what is a neural network?](https://youtu.be/aircAruvnKk?si=64sscTHzYeZ9x-5L)

[Slideshow](activations_slideshow.qmd)

{{< embed ActivationFunctions.ipynb#fig-overview >}}

## Sigmoid {#sec-sigmoid}

**Strengths:** Maps any real-valued number to a value between 0 and 1, making it suitable as a probability output for binary classification problems.

**Weaknesses:** Saturates (i.e., output values approach 0 or 1) for inputs of large magnitude, where the gradient approaches zero; this leads to vanishing gradients during backpropagation. Its outputs are also not zero-centered.

**Usage:** Output layer for binary classification, logistic regression.

::: columns
::: {.column width="50%"}
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

``` python
import numpy as np  # used by all snippets on this page

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-sigmoid >}}
:::
:::

## Hyperbolic Tangent (Tanh) {#sec-tanh}

**Strengths:** Similar to sigmoid, but maps to (-1, 1), so its outputs are zero-centered, which can be beneficial for some models.

**Weaknesses:** Also saturates for inputs of large magnitude, leading to vanishing gradients.

**Usage:** Similar to sigmoid, where a zero-centered output range of (-1, 1) is preferred.

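The saturation described for sigmoid and tanh can be checked numerically. The sketch below is only an illustration: the helper names `sigmoid_grad` and `tanh_grad` are not part of the original notebook, and the derivatives are the standard closed forms $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$.

```python
import numpy as np

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), largest (0.25) at x = 0.
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, largest (1.0) at x = 0.
    return 1.0 - np.tanh(x) ** 2

# Gradients shrink rapidly as |x| grows, which is what "saturation" means here.
xs = np.array([0.0, 2.0, 5.0, 10.0])
for x, gs, gt in zip(xs, sigmoid_grad(xs), tanh_grad(xs)):
    print(f"x={x:5.1f}  sigmoid'={gs:.2e}  tanh'={gt:.2e}")
```

In a deep network these small factors are multiplied layer by layer during backpropagation, which is why saturating activations tend to slow or stall learning in early layers.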
::: columns
::: {.column width="50%"}
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

``` python
def tanh(x):
    return np.tanh(x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-tanh >}}
:::
:::

## Rectified Linear Unit (ReLU)

**Strengths:** Cheap to compute and does not saturate for positive inputs, so gradients pass through active units unchanged.

**Weaknesses:** Output and gradient are both zero for all negative inputs, so units can get stuck at zero and stop learning ("dying neurons"). It is also not differentiable at x=0, although in practice frameworks simply pick a fixed subgradient there.

**Usage:** Common default choice for hidden layers, suitable for most neural networks.

::: columns
::: {.column width="50%"}
$$
\text{ReLU}(x) = \max(0, x)
$$

``` python
def relu(x):
    return np.maximum(0, x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-relu >}}
:::
:::

## Leaky ReLU

**Strengths:** Similar to ReLU, but lets a small fraction of negative inputs pass through, which keeps the gradient nonzero and helps with dying neurons.

**Weaknesses:** Still non-differentiable at x=0, and the slope α is an extra hyperparameter.

**Usage:** Alternative to ReLU, especially when dealing with dying neurons.

::: columns
::: {.column width="50%"}
$$
\text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}
$$

``` python
def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to negative inputs (e.g., 0.01)
    return np.where(x > 0, x, x * alpha)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-leaky_relu >}}
:::
:::

## Swish

**Formula:** In its general form, $\text{Swish}(x) = x \cdot \sigma(\beta x)$, where $\sigma$ is the [sigmoid](#sec-sigmoid) and $\beta$ is either a constant or a learned parameter. The definition below uses $\beta = 1$.

**Strengths:** Self-gated, smooth, and non-saturating for positive inputs.

**Weaknesses:** More expensive than ReLU because of the exponential; a learned $\beta$ adds one extra parameter.

**Usage:** Can be used in place of ReLU or other activations, but may not always outperform them.

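As a rough illustration of the role of $\beta$, the sketch below implements the general form with a hypothetical helper `swish_beta` (the name and the `beta` argument are assumptions for this example, not part of the notebook). With `beta=0` the gate is a constant 0.5, and as `beta` grows the curve approaches ReLU.

```python
import numpy as np

def swish_beta(x, beta=1.0):
    # x * sigmoid(beta * x), written as a single division.
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-4.0, 4.0, 9)
print(swish_beta(x, beta=0.0))   # linear: x / 2
print(swish_beta(x, beta=1.0))   # the beta = 1 case used on this page
print(swish_beta(x, beta=10.0))  # close to ReLU
print(np.maximum(0.0, x))        # ReLU, for comparison
```

Treating `beta` as a fixed constant keeps the function parameter-free; learning it per layer is optional.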
::: columns
::: {.column width="50%"}
$$
\text{Swish}(x) = x \cdot \sigma(x)
$$

``` python
def swish(x):
    # uses sigmoid() from the Sigmoid section
    return x * sigmoid(x)
```

See also: [sigmoid](#sec-sigmoid)
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-swish >}}
:::
:::

## Mish

**Strengths:** Smooth and non-saturating for positive inputs.

**Weaknesses:** More expensive than ReLU (it needs an exponential, a logarithm, and a tanh) and not as well-studied as ReLU or other activations.

**Usage:** Alternative to ReLU, especially in computer vision tasks.

::: columns
::: {.column width="50%"}
$$
\text{Mish}(x) = x \cdot \tanh(\text{Softplus}(x))
$$

``` python
def mish(x):
    # uses softplus() from the SoftPlus section
    return x * np.tanh(softplus(x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-mish >}}
:::
:::

See also: [softplus](#softplus), [tanh](#sec-tanh)

## Softmax

**Strengths:** Normalizes the outputs into probabilities that sum to 1, making it suitable for multi-class classification.

**Weaknesses:** Operates on a whole vector rather than elementwise, so its outputs are coupled; it is meant for output layers with multiple mutually exclusive classes.

**Usage:** Output layer activation for multi-class classification problems.

::: columns
::: {.column width="50%"}
$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}
$$

``` python
def softmax(x):
    # subtract the max for numerical stability; normalize over the last axis
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softmax >}}
:::
:::

## Softsign

**Strengths:** Similar to tanh (it also maps to (-1, 1)), but approaches its limits polynomially rather than exponentially, so it saturates more gradually.

**Weaknesses:** Not commonly used, may not provide significant benefits over sigmoid or tanh.

**Usage:** Alternative to sigmoid or tanh in certain situations.

::: columns
::: {.column width="50%"}
$$
\text{Softsign}(x) = \frac{x}{1 + |x|}
$$

``` python
def softsign(x):
    return x / (1 + np.abs(x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softsign >}}
:::
:::

## SoftPlus {#softplus}

**Strengths:** Smooth approximation of ReLU with a nonzero gradient everywhere (its derivative is the sigmoid); non-saturating for positive inputs.

**Weaknesses:** Not commonly used as a hidden activation; more expensive than ReLU and may not outperform other activations.

**Usage:** Experimental or niche applications, for example constraining an output to be positive.

::: columns
::: {.column width="50%"}
$$
\text{Softplus}(x) = \log(1 + e^x)
$$

``` python
def softplus(x):
    # log(1 + e^x) computed as log(e^0 + e^x), which avoids overflow for large x
    return np.logaddexp(0, x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softplus >}}
:::
:::

## ArcTan

**Strengths:** Smooth, continuous, and bounded in $(-\pi/2, \pi/2)$.

**Weaknesses:** Saturates for inputs of large magnitude (though more slowly than tanh); not commonly used, may not outperform other activations.

**Usage:** Experimental or niche applications.

::: columns
::: {.column width="50%"}
$$
f(x) = \arctan(x)
$$

``` python
def arctan(x):
    return np.arctan(x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-arctan >}}
:::
:::

## Gaussian Error Linear Unit (GELU)

**Strengths:** Smooth and non-saturating for positive inputs; weights each input by $\Phi(x)$, the standard normal cumulative distribution function.

**Weaknesses:** More expensive than ReLU; the exact form needs the Gaussian CDF, so the tanh approximation used in the code below is common.

**Usage:** Alternative to ReLU, widely used in Transformer models such as BERT and GPT.

::: columns
::: {.column width="50%"}
$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

``` python
def gelu(x):
    # tanh approximation of x * Phi(x); see the GELU Approximation section
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-gelu >}}
:::
:::

See also: [tanh](#sec-tanh)

## Sigmoid Linear Unit (SiLU)

**Strengths:** Smooth and non-saturating for positive inputs; identical to the Swish definition above, with no learned parameter.

**Weaknesses:** More expensive than ReLU because of the exponential.

**Usage:** Alternative to ReLU, especially in computer vision tasks.

::: columns
::: {.column width="50%"}
$$
\text{SiLU}(x) = x \cdot \sigma(x)
$$

``` python
def silu(x):
    # x * sigmoid(x), written as a single division
    return x / (1 + np.exp(-x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-silu >}}
:::
:::

## GELU Approximation (GELU Approx.)

$$
\text{GELU}(x) \approx 0.5\,x \left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)
$$

**Strengths:** Fast, smooth, and non-saturating for positive inputs.

**Weaknesses:** Approximation, not exactly equal to GELU.

**Usage:** Alternative to GELU, especially when computational efficiency is crucial.

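To get a feel for how close the approximation is, the sketch below compares it with the exact form $x \cdot \Phi(x)$, using $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. The function names `gelu_exact` and `gelu_tanh` are chosen for this example only, and `math.erf` from the standard library is vectorized with `np.vectorize` to avoid assuming SciPy is installed.

```python
import math
import numpy as np

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi expressed through the error function.
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation from the formula above.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 1001)
max_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print(f"max |exact - approx| on [-5, 5]: {max_err:.2e}")
```

The printed maximum error gives a concrete sense of what "not exactly equal" means on a typical input range.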
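## SELU (Scaled Exponential Linear Unit)

$$
f(x) = \lambda \begin{cases} x & x > 0 \\ \alpha e^x - \alpha & x \leq 0 \end{cases}
$$

where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ are fixed constants.

**Strengths:** Self-normalizing (under suitable conditions, activations are pushed toward zero mean and unit variance across layers) and non-saturating for positive inputs.

**Weaknesses:** The self-normalizing property relies on specific conditions, such as LeCun normal initialization and alpha dropout instead of standard dropout. α and λ are fixed rather than tuned, and the function saturates at $-\lambda\alpha$ for very negative inputs.

**Usage:** Alternative to ReLU, especially in deep fully connected neural networks.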
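
The other sections pair each formula with a NumPy snippet; a minimal sketch for SELU in the same style could look like this. The constant names `SELU_ALPHA` and `SELU_LAMBDA` are chosen for this example, and the values are the standard fixed constants for SELU.

```python
import numpy as np

# Fixed SELU constants (not tuned per model).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def selu(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    x = np.asarray(x, dtype=float)
    # alpha * (e^x - 1) for x <= 0; clamping at 0 avoids overflow in the unused branch.
    neg = alpha * np.expm1(np.minimum(x, 0.0))
    return lam * np.where(x > 0, x, neg)

print(selu(np.array([-10.0, -1.0, 0.0, 1.0])))
```

For very negative inputs the output levels off near $-\lambda\alpha$, and for positive inputs it is simply $\lambda x$.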