---
title: "Activation functions"
notebook-links: false
crossref:
  lof-title: "List of Figures"
number-sections: false
format:
  html: default
---

When choosing an activation function, consider the following:

- **Non-saturation:** In hidden layers, prefer activations that do not saturate. Saturating functions (e.g., sigmoid, tanh) have near-zero gradients for large-magnitude inputs, which leads to vanishing gradients.
- **Computational efficiency:** Choose activations that are cheap to evaluate (e.g., ReLU, Leaky ReLU) for large models or real-time applications.
- **Smoothness:** Smooth activations (e.g., GELU, Mish) can help with optimization and convergence.
- **Domain knowledge:** Select activations based on the problem domain and desired output (e.g., softmax for multi-class classification).
- **Experimentation:** Try different activations and evaluate their performance on your specific task.

[But what is a neural network?](https://youtu.be/aircAruvnKk?si=64sscTHzYeZ9x-5L)

[Slideshow](activations_slideshow.qmd)

{{< embed ActivationFunctions.ipynb#fig-overview >}}

## Sigmoid {#sec-sigmoid}

**Strengths:** Maps any real-valued number to a value between 0 and 1, making it suitable for binary classification problems.

**Weaknesses:** Saturates (i.e., output values approach 0 or 1) for large-magnitude inputs, leading to vanishing gradients during backpropagation.

**Usage:** Binary classification, logistic regression.

::: columns
::: {.column width="50%"}
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

``` python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-sigmoid >}}
:::
:::

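
To make the saturation point concrete, here is a minimal sketch of the sigmoid derivative, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. The helper name `sigmoid_grad` and the sample inputs are assumptions for illustration and are not taken from the accompanying notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Hypothetical helper: derivative of the sigmoid, s * (1 - s)
    s = sigmoid(x)
    return s * (1 - s)

xs = np.array([0.0, 2.0, 5.0, 10.0])
# The gradient peaks at x = 0 and shrinks toward 0 as x moves away from it
grads = sigmoid_grad(xs)
```

Because backpropagation multiplies these per-layer derivatives together, a stack of saturated sigmoid units can shrink the gradient very quickly.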
## Hyperbolic Tangent (Tanh) {#sec-tanh}

**Strengths:** Similar to sigmoid, but maps to (-1, 1), so its outputs are zero-centered, which can be beneficial for some models.

**Weaknesses:** Also saturates for large-magnitude inputs, leading to vanishing gradients.

**Usage:** The same roles as sigmoid when a zero-centered output with a larger range is preferred.

::: columns
::: {.column width="50%"}
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

``` python
def tanh(x):
    return np.tanh(x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-tanh >}}
:::
:::

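
Tanh is a rescaled and shifted sigmoid, which is one way to see why the two share the same saturation behaviour. A short derivation, using the sigmoid definition from the [sigmoid](#sec-sigmoid) section:

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{2}{1 + e^{-2x}} - 1 = 2\,\sigma(2x) - 1
$$

Its derivative is $1 - \tanh^{2}(x)$, which equals 1 at $x = 0$ and approaches 0 as $|x|$ grows.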
## Rectified Linear Unit (ReLU)

**Strengths:** Cheap to compute and non-saturating for positive inputs.

**Weaknesses:** Not differentiable at x=0 (frameworks pick a fixed value for the derivative there), and the gradient is exactly zero for negative inputs, so a unit that only receives negative inputs stops learning (a "dying" neuron).

**Usage:** Default activation function in many deep learning frameworks, suitable for most neural networks.

::: columns
::: {.column width="50%"}
$$
\text{ReLU}(x) = \max(0, x)
$$

``` python
def relu(x):
    return np.maximum(0, x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-relu >}}
:::
:::

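
The kink at x=0 is usually handled by choosing a fixed value for the derivative at that point. The sketch below assumes the common convention of using 0 there; `relu_grad` is a hypothetical helper name, not something defined in the notebook.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Hypothetical helper: 1 where x > 0, else 0 (the value at x = 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
outputs = relu(x)
grads = relu_grad(x)
```

Every non-positive input gets a zero gradient, which is the mechanism the Leaky ReLU section is designed to soften.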
## Leaky ReLU

**Strengths:** Similar to ReLU, but lets a small fraction of negative inputs pass through, which keeps the gradient non-zero and helps with dying neurons.

**Weaknesses:** Still non-differentiable at x=0.

**Usage:** Alternative to ReLU, especially when dealing with dying neurons.

::: columns
::: {.column width="50%"}
$$
\text{Leaky ReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$

``` python
def leaky_relu(x, alpha=0.01):
    # alpha is a small positive slope applied when x <= 0 (e.g., 0.01)
    return np.where(x > 0, x, x * alpha)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-leaky_relu >}}
:::
:::

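
The effect on dying neurons is easiest to see in the derivative. A minimal sketch, where `leaky_relu_grad` is a hypothetical helper and the slope at x=0 follows the $x \leq 0$ branch of the formula above:

```python
import numpy as np

def leaky_relu_grad(x, alpha=0.01):
    # Hypothetical helper: slope 1 for x > 0, slope alpha otherwise
    return np.where(x > 0, 1.0, alpha)

grads = leaky_relu_grad(np.array([-3.0, 2.0]))
```

Negative inputs still receive a gradient of `alpha` instead of 0, so the corresponding weights keep updating.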
## Swish

**Formula:** $x \cdot \sigma(\beta x)$, where $\beta$ is either a fixed constant or a learnable parameter. The version shown in this section fixes $\beta = 1$.

**Strengths:** Self-gated, smooth, and non-saturating for positive inputs.

**Weaknesses:** More expensive to compute than ReLU; the learnable-$\beta$ variant adds a parameter to train.

**Usage:** Can be used in place of ReLU or other activations, but may not always outperform them.

::: columns
::: {.column width="50%"}
$$
\text{Swish}(x) = x \cdot \sigma(x)
$$

``` python
def swish(x):
    # sigmoid is defined in the Sigmoid section
    return x * sigmoid(x)
```

See also: [sigmoid](#sec-sigmoid)
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-swish >}}
:::
:::

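
The weaknesses above mention learnable parameters, which the fixed formula does not show. A common parameterisation, assumed here, is $x \cdot \sigma(\beta x)$. The sketch below treats $\beta$ as a plain argument rather than a trained value, and `swish_beta` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish_beta(x, beta=1.0):
    # Hypothetical generalisation: beta = 1 recovers x * sigmoid(x)
    return x * sigmoid(beta * x)

x = np.linspace(-5, 5, 11)
default = swish_beta(x)           # same as x * sigmoid(x)
sharp = swish_beta(x, beta=50.0)  # large beta pushes the curve toward ReLU
```

In a framework with automatic differentiation, `beta` would be registered as a trainable parameter; that wiring is left out of this sketch.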
## Mish

**Strengths:** Smooth, non-monotonic, and non-saturating for positive inputs.

**Weaknesses:** More expensive to compute than ReLU, and not as well-studied as ReLU or other activations.

**Usage:** Alternative to ReLU, especially in computer vision tasks.

::: columns
::: {.column width="50%"}
$$
\text{Mish}(x) = x \cdot \tanh(\text{Softplus}(x))
$$

``` python
def mish(x):
    # softplus is defined in the SoftPlus section
    return x * np.tanh(softplus(x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-mish >}}
:::
:::

See also: [softplus](#softplus), [tanh](#sec-tanh)

## Softmax

**Strengths:** Normalizes a vector of scores so that the outputs are non-negative and sum to 1, making it suitable for multi-class classification.

**Weaknesses:** Operates on a whole vector rather than element-wise, so it is only suitable for output layers with multiple classes.

**Usage:** Output layer activation for multi-class classification problems.

::: columns
::: {.column width="50%"}
$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}
$$

``` python
def softmax(x):
    # Subtracting the max avoids overflow and leaves the result unchanged
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softmax >}}
:::
:::

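
The function above normalizes over the whole array, which is what you want for a single vector of scores. For a batch of score vectors, a per-row version is usually needed. The sketch below is one way to write it; the `axis` argument and the sample `logits` are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row max for numerical stability; the result is unchanged
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(shifted)
    return e_x / e_x.sum(axis=axis, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
probs = softmax(logits)  # each row sums to 1, even with very large logits
```

Without the max subtraction, the second row would overflow in `np.exp`; with it, equal logits map cleanly to equal probabilities.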
## Softsign

**Strengths:** Similar to tanh in that it maps to (-1, 1), but it approaches its asymptotes more gradually (polynomially rather than exponentially).

**Weaknesses:** Not commonly used, may not provide significant benefits over sigmoid or tanh.

**Usage:** Alternative to sigmoid or tanh in certain situations.

::: columns
::: {.column width="50%"}
$$
\text{Softsign}(x) = \frac{x}{1 + |x|}
$$

``` python
def softsign(x):
    return x / (1 + np.abs(x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softsign >}}
:::
:::

## SoftPlus {#softplus}

**Strengths:** A smooth, continuous approximation of ReLU that is non-saturating for positive inputs.

**Weaknesses:** Not commonly used, may not outperform other activations.

**Usage:** Experimental or niche applications.

::: columns
::: {.column width="50%"}
$$
\text{Softplus}(x) = \log(1 + e^x)
$$

``` python
def softplus(x):
    # log(e^0 + e^x) = log(1 + e^x), computed without overflow for large x
    return np.logaddexp(0, x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softplus >}}
:::
:::

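
Softplus and sigmoid are directly related: differentiating the softplus formula gives the sigmoid from the [sigmoid](#sec-sigmoid) section.

$$
\frac{d}{dx}\,\text{Softplus}(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}} = \sigma(x)
$$

The gradient therefore stays close to 1 for large positive inputs and decays smoothly toward 0 for large negative inputs, instead of switching abruptly at 0 as it does for ReLU.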
## ArcTan

**Strengths:** Smooth, continuous, and zero-centered, with outputs in $(-\pi/2, \pi/2)$.

**Weaknesses:** Saturates for large-magnitude inputs, so it is prone to vanishing gradients; not commonly used, and may not outperform other activations.

**Usage:** Experimental or niche applications.

::: columns
::: {.column width="50%"}
$$
f(x) = \arctan(x)
$$

``` python
def arctan(x):
    return np.arctan(x)
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-arctan >}}
:::
:::

## Gaussian Error Linear Unit (GELU)

**Strengths:** Smooth and non-saturating for positive inputs.

**Weaknesses:** More expensive to compute than ReLU. The exact form needs the standard normal CDF $\Phi$, so implementations often use the tanh approximation shown in the code here.

**Usage:** Alternative to ReLU; it is the activation used in Transformer models such as BERT and GPT.

::: columns
::: {.column width="50%"}
$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

``` python
def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (
        1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3)))
    )
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-gelu >}}
:::
:::

See also: [tanh](#sec-tanh)

## Sigmoid Linear Unit (SiLU)

**Strengths:** Smooth and non-saturating for positive inputs. It is the same function as Swish with $\beta = 1$.

**Weaknesses:** More expensive to compute than ReLU.

**Usage:** Alternative to ReLU, especially in computer vision tasks.

::: columns
::: {.column width="50%"}
$$
\text{SiLU}(x) = x \cdot \sigma(x)
$$

``` python
def silu(x):
    # Equivalent to x * sigmoid(x)
    return x / (1 + np.exp(-x))
```
:::

::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-silu >}}
:::
:::

## GELU Approximation (GELU Approx.)

$$
\text{GELU}(x) \approx 0.5\,x \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} \left(x + 0.044715\,x^{3}\right)\right)\right)
$$

**Strengths:** Fast, non-saturating, and smooth.

**Weaknesses:** Approximation, not exactly equal to GELU.

**Usage:** Alternative to GELU, especially when computational efficiency is crucial.

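
To see how close the approximation is, the sketch below compares it with the exact form $x \cdot \Phi(x)$, writing $\Phi$ through the error function as $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x / \sqrt{2})\right)$. The names `gelu_exact` and `gelu_tanh` and the sample grid are assumptions for illustration.

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with the standard normal CDF written via math.erf
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

def gelu_tanh(x):
    # tanh approximation from the formula above
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5, 5, 1001)
# Largest absolute gap between the two curves on this grid
max_abs_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
```

Printing `max_abs_err` gives a quick check that the gap is small relative to typical activation values on this range.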
## SELU (Scaled Exponential Linear Unit)

$$
\text{SELU}(x) = \lambda
\begin{cases}
x & \text{if } x > 0 \\
\alpha e^{x} - \alpha & \text{if } x \leq 0
\end{cases}
$$

**Strengths:** Self-normalizing, non-saturating for positive inputs, and computationally efficient.

**Weaknesses:** The self-normalizing property depends on a matching weight initialization (LeCun normal). $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ are fixed constants rather than tuned hyperparameters.

**Usage:** Alternative to ReLU, especially in deep neural networks.
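
This section has no code, so here is a minimal NumPy sketch in the style of the other sections. The constant values are the ones commonly quoted for SELU and are an assumption here, since the text above does not give them; the names `SELU_ALPHA`, `SELU_LAMBDA`, and `lam` are also illustrative.

```python
import numpy as np

# Assumed constants: the fixed values commonly quoted for SELU
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def selu(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    # np.expm1(z) computes e^z - 1; clamping at 0 avoids overflow in the unused branch
    negative_part = alpha * np.expm1(np.minimum(x, 0))
    return lam * np.where(x > 0, x, negative_part)

x = np.array([-100.0, 0.0, 1.0])
y = selu(x)  # approaches -lam * alpha on the left, equals lam * x on the right
```

The clamp with `np.minimum` is there because `np.where` evaluates both branches before selecting between them.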