STATISTICAL LEARNING | 15

Sigmoid Function:    σ(z) = 1 / (1 + exp(−z))

Softmax Function:    softmax(z) = exp(z) / Σ_{k=1}^{K} exp(z_k)

Table 2.1. Sigmoid and softmax activation functions for binary and multi-class classification, respectively.
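As a minimal sketch, the two activations in Table 2.1 can be implemented in NumPy; the stabilization tricks below (shifting by the maximum logit, exponentiating only non-positive values) are standard numerical practice and not part of the table itself:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z)), computed stably."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # exponent is never positive, so this cannot overflow
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def softmax(z):
    """Normalize a logit vector z to a probability distribution over K classes."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # shifting by max(z) leaves the result unchanged
    return e / e.sum()
```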
ready continuous-spatial signal, different DL architectures have been established, which will be discussed in Section 2.1.3.
A K-class classification function realized by an l-layer NN with d-dimensional input x ∈ R^d is written in shorthand as f_θ : R^d → R^K, with parameters θ = {θ_j}_{j=1}^l assumed to be optimized, either partially or fully, using backpropagation and a loss function. More specifically, this presents a non-convex optimization problem, with multiple feasible regions, each containing multiple locally optimal points. With maximum-likelihood estimation, the goal is to find the optimal parameters or weights that minimize the loss function, effectively interpolating the training data. This process involves traversing a high-dimensional loss landscape.
Upon convergence of model training, the optimized parameters form a solution in weight space, representing a unique mode (a specific function f_θ̂). However, when regularization techniques such as weight decay, dropout, or early stopping are applied, the objective shifts towards maximum-a-posteriori (MAP) estimation, which takes into account the prior probability of the parameters. This difference in parameter estimation forms the basis for several uncertainty estimation methods, covered in Section 2.2.5.
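The shift from maximum-likelihood to MAP estimation can be made concrete for a simple logistic-regression special case; the function and variable names below (nll_loss, map_loss, lam) are illustrative choices, not taken from the text:

```python
import numpy as np

def nll_loss(w, X, y):
    """Negative log-likelihood of a binary logistic model: the MLE objective."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))      # sigmoid of the linear scores
    eps = 1e-12                             # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def map_loss(w, X, y, lam=1e-2):
    """MAP objective: the NLL plus an L2 penalty (weight decay), which
    corresponds to a zero-mean Gaussian prior over the parameters w."""
    return nll_loss(w, X, y) + lam * np.sum(w ** 2)
```

Minimizing nll_loss alone recovers the MLE solution, while the lam * ||w||^2 term biases the optimum toward small weights, which is the effect of the parameter prior discussed above.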
A prediction is obtained by applying a standard decision rule to the model’s output, e.g., taking the top-1/k prediction (Equation (2.5)), or decoding structured output according to a function that maximizes total likelihood, optionally with additional diversity criteria.
ŷ = argmax_k f_θ̂(x)_k        (2.5)
Considering standard NNs, the last layer outputs a vector of real-valued logits z ∈ R^K, which in turn are normalized to a probability distribution over K classes using a sigmoid or softmax function (Table 2.1).
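As a hypothetical end-to-end sketch of this last step, logits are normalized with a softmax and the decision rule of Equation (2.5) is applied, here generalized to top-k; the helper name top_k_predictions is an illustrative choice:

```python
import numpy as np

def top_k_predictions(logits, k=1):
    """Normalize logits with a softmax, then apply the top-1/k decision rule."""
    e = np.exp(logits - np.max(logits))   # numerically stable softmax
    probs = e / e.sum()
    order = np.argsort(probs)[::-1]       # class indices, most probable first
    return order[:k], probs[order[:k]]
```

With k=1 this returns the top-1 prediction ŷ of Equation (2.5) together with its predicted probability.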
2.1.2 Probabilistic Evaluation
The majority of our work involves supervised learning with NNs, formulated generically as a probabilistic predictor in Definition 1.