# CARPE DIEM, SEIZE THE SAMPLES UNCERTAIN "AT THE MOMENT" FOR ADAPTIVE BATCH SELECTION

**Anonymous authors** Paper under double-blind review

ABSTRACT

The performance of deep neural networks is significantly affected by how well mini-batches are constructed. In this paper, we propose a novel adaptive batch selection algorithm called Recency Bias that exploits the uncertain samples predicted inconsistently in recent iterations. The historical label predictions of each sample are used to evaluate its predictive uncertainty within a sliding window. By taking advantage of this design, Recency Bias not only accelerates the training step but also achieves a more accurate network. We demonstrate the superiority of Recency Bias by extensive evaluation on two independent tasks. The results show that, compared with existing batch selection methods, Recency Bias reduced the test error by up to 20.5% within a fixed wall-clock training time. At the same time, it reduced the training time by up to 59.3% to reach the same test error.

1 INTRODUCTION

Stochastic gradient descent (SGD) on randomly selected mini-batch samples is commonly used to train deep neural networks (DNNs). However, many recent studies have pointed out that the performance of DNNs is heavily dependent on how well the mini-batch samples are selected (Shrivastava et al., 2016; Chang et al., 2017; Katharopoulos & Fleuret, 2018). In earlier approaches, a sample's difficulty is employed to identify proper mini-batch samples, and these approaches achieve a more accurate and robust network (Han et al., 2018) or expedite the training convergence of SGD (Loshchilov & Hutter, 2016). However, the two opposing difficulty-based strategies, i.e., preferring easy samples (Kumar et al., 2010; Han et al., 2018) versus hard samples (Loshchilov & Hutter, 2016; Shrivastava et al., 2016), work well in different situations. Thus, for practical reasons, to cover more diverse situations, recent approaches have begun to exploit a sample's uncertainty, which indicates the consistency of its previous predictions (Chang et al., 2017; Song et al., 2019).

An important question here is how to evaluate a sample's uncertainty based on its historical predictions during the training process. Intuitively, because a series of historical predictions can be seen as a series of data indexed in chronological order, the uncertainty can be measured based on two forms of handling time-series observations: (i) a growing window (Figure 1(a)) that consistently increases the size of a window to use all available observations and (ii) a sliding window (Figure 1(b)) that maintains a window of a fixed size on the most recent observations by deleting outdated ones. While the state-of-the-art algorithm, Active Bias (Chang et al., 2017), adopts the growing window, we propose to use the sliding window in this paper.

Figure 1: Two forms of handling time-series observations: (a) the growing window uses all available observations; (b) the sliding window keeps only the recent observations and discards the outdated ones.

In more detail, Active Bias recognizes uncertain samples based on the inconsistency of the predictions in the entire history of past SGD iterations. Then, it emphasizes such uncertain samples by choosing them with high probability for the next mini-batch.
However, according to our experiments presented in Section 5.2, such uncertain samples slowed down the convergence speed of training, though they ultimately reduced the generalization error. This weakness is attributed to the inherent limitation of the growing window, where older observations could be too outdated (Torgo, 2011). In other words, the outdated predictions no longer represent a network's current behavior. As illustrated in Figure 2, when the label predictions of two samples were inconsistent for a long time, Active Bias invariably regards them as highly uncertain, although their recent label predictions become consistent along with the network's training progress. This characteristic evidently entails the risk of emphasizing uninformative samples that are too easy or too hard at the current moment, thereby slowing down the convergence speed of training.

Figure 2: The difference in sample uncertainty estimated by Active Bias and Recency Bias. For two horse images whose outdated predictions were inconsistent (e.g., Horse, Deer, Horse, Deer, ...) but whose recent predictions are consistent (all Horse for the now too-easy sample and all Deer for the now too-hard sample), Active Bias still estimates high uncertainty, whereas Recency Bias estimates low uncertainty.

Therefore, we propose a simple but effective batch selection method, called Recency Bias, that takes advantage of the sliding window to evaluate the uncertainty from fresher observations. As opposed to Active Bias, Recency Bias excludes the outdated predictions by managing a sliding window of a fixed size and picks up the samples predicted inconsistently within the sliding window. Thus, as shown in Figure 2, the two samples uninformative at the moment are no longer selected by Recency Bias simply because their recent predictions are consistent. Consequently, since informative samples are effectively selected throughout the training process, this strategy not only accelerates the training speed but also leads to a more accurate network.

To validate the superiority of Recency Bias, two popular convolutional neural networks (CNNs) were trained for two independent tasks: image classification and fine-tuning. We compared Recency Bias with not only random batch selection (baseline) but also two state-of-the-art batch selection strategies. Compared with the three batch selection strategies, Recency Bias provided a relative reduction of test error by 1.81%–20.5% within a fixed wall-clock training time. At the same time, it significantly reduced the execution time by 24.6%–59.3% to reach the same test error.

2 RELATED WORK

Let $\mathcal{D} = \{(x_i, y_i) \mid 1 \le i \le N\}$ be the entire training dataset composed of a sample $x_i$ with its true label $y_i$, where $N$ is the total number of training samples. Then, a straightforward strategy to construct a mini-batch $\mathcal{M} = \{(x_i, y_i) \mid 1 \le i \le b\}$ is to select $b$ samples uniformly at random (i.e., $P(x_i \mid \mathcal{D}) = 1/N$) from the training dataset $\mathcal{D}$.

Because not all samples have an equal impact on training, many research efforts have been devoted to developing advanced sampling schemes. Bengio et al. (2009) first took easy samples and then gradually increased the difficulty of samples using heuristic rules. Kumar et al. (2010) determined the easiness of the samples using their prediction errors. Recently, Tsvetkov et al.
(2016) used Bayesian optimization to learn an optimal curriculum for training dense, distributed word representations. Sachan & Xing (2016) emphasized that the right curriculum must introduce a small number of samples dissimilar to those previously seen. Fan et al. (2017) proposed a neural data filter based on reinforcement learning to select training samples adaptively. However, it is common for deep learning to emphasize hard samples because of the plethora of easy ones (Katharopoulos & Fleuret, 2018).

Loshchilov & Hutter (2016) proposed a difficulty-based sampling scheme, called Online Batch, that uses the rank of the loss computed from previous epochs. Online Batch sorts the previously computed losses of samples in descending order and exponentially decays the sampling probability of a sample according to its rank $r$. Then, the $r$-th ranked sample $x_{(r)}$ is selected with a probability dropping by a factor of $\exp\big(\log(s_e)/N\big)$, where $s_e$ is the selection pressure parameter that affects the probability gap between the most and the least important samples. When normalized to sum to 1.0, the probability $P(x_{(r)} \mid \mathcal{D}; s_e)$ is defined by Eq. (1). It has been reported that Online Batch accelerates the convergence of training but deteriorates the generalization error because of overfitting to hard training samples (Loshchilov & Hutter, 2016).

$$P(x_{(r)} \mid \mathcal{D}; s_e) = \frac{1\big/\exp\big(\log(s_e)/N\big)^{r}}{\sum_{j=1}^{N} 1\big/\exp\big(\log(s_e)/N\big)^{j}} \qquad (1)$$

Most close to our work, Chang et al. (2017) devised an uncertainty-based sampling scheme, called Active Bias, that chooses uncertain samples with high probability for the next batch. Active Bias maintains the history $\mathcal{H}_i^{t-1}$ that stores all $h(y_i \mid x_i)$ before the current iteration $t$ (i.e., growing window), where $h(y_i \mid x_i)$ is the softmax probability of a given sample $x_i$ for its true label $y_i$. Then, it measures the uncertainty of the sample $x_i$ by computing the variance over all $h(y_i \mid x_i)$ in $\mathcal{H}_i^{t-1}$ and draws the next mini-batch samples based on the normalized probability $P(x_i \mid \mathcal{D}, \mathcal{H}_i^{t-1}; \epsilon)$ in Eq. (2), where $\epsilon$ is the smoothness constant that prevents low-variance samples from never being selected again. As mentioned earlier in Section 1, Active Bias slows down the training process because the oldest part of the history $\mathcal{H}_i^{t-1}$ no longer represents the current behavior of the network.

$$P(x_i \mid \mathcal{D}, \mathcal{H}_i^{t-1}; \epsilon) = \frac{\widehat{\mathrm{std}}(\mathcal{H}_i^{t-1}) + \epsilon}{\sum_{j=1}^{N}\big(\widehat{\mathrm{std}}(\mathcal{H}_j^{t-1}) + \epsilon\big)}, \qquad \widehat{\mathrm{std}}(\mathcal{H}_i^{t-1}) = \sqrt{\mathrm{var}\big(h(y_i \mid x_i)\big) + \frac{\mathrm{var}\big(h(y_i \mid x_i)\big)^{2}}{|\mathcal{H}_i^{t-1}|}} \qquad (2)$$

For the completeness of the survey, we include the recent studies on submodular batch selection. Joseph et al. (2019) and Wang et al. (2019) designed their own submodular objectives that cover diverse aspects, such as sample redundancy and sample representativeness, for more effective batch selection. Differently from their work, we explore the issue of truly uncertain samples from an orthogonal perspective. Our uncertainty measure can be easily injected into their submodular optimization frameworks as a measure of sample informativeness.

In Section 5, we will confirm that Recency Bias outperforms Online Batch and Active Bias, which are regarded as the two state-of-the-art adaptive batch selection methods for deep learning.
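To make the two baselines concrete, the following sketch computes the sampling probabilities of Eq. (1) and Eq. (2) with NumPy. The function names, the default smoothness constant, and the list-of-arrays history layout are illustrative assumptions, not the original implementations of those papers.

```python
import numpy as np

def online_batch_probs(losses, s_e):
    """Rank-based sampling probabilities of Online Batch (Eq. 1).

    losses: previously computed loss per sample (length N).
    s_e:    selection pressure; larger values concentrate sampling on the
            highest-loss (hardest) samples.
    """
    losses = np.asarray(losses, dtype=np.float64)
    N = len(losses)
    # Rank 1 = largest loss; ranks are 1-based as in Eq. (1).
    ranks = np.empty(N, dtype=np.int64)
    ranks[np.argsort(-losses)] = np.arange(1, N + 1)
    # Unnormalized probability 1 / exp(log(s_e)/N)^rank decays exponentially with rank.
    unnormalized = 1.0 / np.exp(np.log(s_e) / N) ** ranks
    return unnormalized / unnormalized.sum()

def active_bias_probs(prob_histories, eps=1e-2):
    """Variance-based sampling probabilities of Active Bias (Eq. 2).

    prob_histories: list of arrays; entry i holds all softmax probabilities
                    h(y_i|x_i) observed so far for sample i (growing window).
    """
    stds = np.array([np.sqrt(np.var(h) + np.var(h) ** 2 / len(h))
                     for h in prob_histories])
    return (stds + eps) / (stds + eps).sum()
```

Note that the growing window enters only through `prob_histories`, whose entries keep accumulating over training; this is exactly the behavior that Recency Bias replaces with a fixed-size window in the next section.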
3 Recency Bias COMPONENTS

3.1 CRITERION OF AN UNCERTAIN SAMPLE

The main challenge of Recency Bias is to identify the samples whose recent label predictions are highly inconsistent, i.e., those that are neither too easy nor too hard at the moment. Thus, we adopt the predictive uncertainty (Song et al., 2019) in Definition 3.1, which uses the information entropy (Chandler, 1987) to measure the inconsistency of recent label predictions. Here, a sample with high predictive uncertainty is regarded as uncertain and is selected with high probability for the next mini-batch.

**Definition 3.1. (Predictive Uncertainty)** Let $\hat{y}_t = \Phi(x_i, \theta_t)$ be the predicted label of a sample $x_i$ at time $t$ and $\mathcal{H}_{x_i}(q) = \{\hat{y}_{t_1}, \hat{y}_{t_2}, \ldots, \hat{y}_{t_q}\}$ be the label history of the sample $x_i$ that stores the predicted labels at the previous $q$ times, where $\Phi$ is a neural network. The label history $\mathcal{H}_{x_i}(q)$ corresponds to the sliding window of size $q$ used to compute the uncertainty of the sample $x_i$. Next, $p(y \mid x_i; q)$ is formulated such that it provides the probability of the label $y \in \{1, 2, \ldots, k\}$ being estimated as the label of the sample $x_i$ based on $\mathcal{H}_{x_i}(q)$, as in Eq. (3), where $[\cdot]$ is the Iverson bracket[1].

$$p(y \mid x_i; q) = \frac{\sum_{\hat{y} \in \mathcal{H}_{x_i}(q)} [\hat{y} = y]}{|\mathcal{H}_{x_i}(q)|} \qquad (3)$$

Then, to quantify the uncertainty of the sample $x_i$, the predictive uncertainty $F(x_i; q)$ is defined by Eq. (4), where $\delta$ is the standardization term that normalizes the value to $[0, 1]$. $\square$

$$F(x_i; q) = -\frac{1}{\delta} \sum_{j=1}^{k} p(j \mid x_i; q) \log p(j \mid x_i; q), \qquad \delta = -\log(1/k) \qquad (4)$$

[1] The Iverson bracket $[p]$ returns 1 if $p$ is true; 0 otherwise.

3.2 SAMPLING PROBABILITY FOR MINI-BATCH CONSTRUCTION

To construct the next mini-batch, we assign the sampling probability according to the predictive uncertainty in Definition 3.1. Motivated by Loshchilov & Hutter (2016), the sampling probability of a given sample $x_i$ is exponentially decayed with its predictive uncertainty $F(x_i; q)$. In detail, we adopt the quantization method (Chen & Wornell, 2001) and use the quantization index to decay the sampling probability. The index is obtained by the simple quantizer $Q$ in Eq. (5), where $\Delta$ is the quantization step size. Compared with the rank-based index (Loshchilov & Hutter, 2016), the quantization index is known to better reflect the difference in actual values (Widrow et al., 1996).

$$Q\big(F(x_i; q)\big) = \Big\lceil \big(1 - F(x_i; q)\big)/\Delta \Big\rceil, \qquad 0 \le F(x_i; q) \le 1 \qquad (5)$$

In Eq. (5), we set $\Delta$ to $1/N$ such that the index is bounded by $N$ (the total number of samples). Then, the sampling probability $P(x_i \mid \mathcal{D}; s_e)$ is defined as in Eq. (6). The higher the predictive uncertainty, the smaller the quantization index. Therefore, a higher sampling probability is assigned to uncertain samples in Eq. (6).

$$P(x_i \mid \mathcal{D}; s_e) = \frac{1\big/\exp\big(\log(s_e)/N\big)^{Q(F(x_i;q))}}{\sum_{j=1}^{N} 1\big/\exp\big(\log(s_e)/N\big)^{Q(F(x_j;q))}} \qquad (6)$$

Meanwhile, it is known that using only a part of the training data exacerbates the overfitting problem at a late stage of training (Loshchilov & Hutter, 2016; Zhou & Bilmes, 2018). Thus, to alleviate the problem, we include more training samples as the training progresses by exponentially decaying the selection pressure $s_e$ as in Eq. (7). At each epoch $e$ from $e_0$ to $e_{end}$, the selection pressure $s_e$ exponentially decreases from $s_{e_0}$ to 1. Because this technique gradually reduces the sampling probability gap between the most and the least uncertain samples, more diverse samples are selected for the next mini-batch at a later epoch. When the selection pressure $s_e$ becomes 1, the mini-batch samples are randomly chosen from the entire dataset.

$$s_e = s_{e_0} \cdot \Big(\exp\big(\log(1/s_{e_0})/(e_{end} - e_0)\big)\Big)^{e - e_0} \qquad (7)$$
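The following Python sketch mirrors Eqs. (3)–(7): it computes the predictive uncertainty over a sliding window of predicted labels, quantizes it, and turns the quantization indices into sampling probabilities. The function names and the `deque`-based window are our own illustrative choices under the stated equations, not the authors' released implementation.

```python
import math
from collections import Counter, deque

def predictive_uncertainty(label_history, num_classes):
    """Predictive uncertainty F(x; q) of Eqs. (3)-(4): normalized entropy of the
    empirical label distribution inside the sliding window `label_history`."""
    counts = Counter(label_history)
    q = len(label_history)
    entropy = -sum((c / q) * math.log(c / q) for c in counts.values())
    return entropy / (-math.log(1.0 / num_classes))  # delta = -log(1/k), so F lies in [0, 1]

def quantization_index(uncertainty, num_samples):
    """Quantizer Q of Eq. (5) with step size delta = 1/N; more uncertain samples get smaller indices."""
    return math.ceil((1.0 - uncertainty) * num_samples)

def sampling_probs(uncertainties, s_e):
    """Sampling probabilities of Eq. (6), exponentially decayed with the quantization index."""
    N = len(uncertainties)
    base = math.exp(math.log(s_e) / N)
    weights = [1.0 / base ** quantization_index(u, N) for u in uncertainties]
    total = sum(weights)
    return [w / total for w in weights]

def decayed_selection_pressure(s_e0, epoch, e_0, e_end):
    """Selection pressure decay of Eq. (7): s_e shrinks from s_e0 at e_0 down to 1 at e_end."""
    return s_e0 * math.exp(math.log(1.0 / s_e0) / (e_end - e_0)) ** (epoch - e_0)

# Example: a sliding window of size q = 10 for one sample, updated after each forward pass.
window = deque(maxlen=10)
for predicted_label in [3, 3, 7, 3, 7, 7, 3, 3, 7, 3]:
    window.append(predicted_label)
print(predictive_uncertainty(window, num_classes=10))  # about 0.29: two labels flip within the window
```

Because the window has a fixed size, a sample whose recent predictions have settled on a single label receives an uncertainty close to zero, regardless of how inconsistent its outdated predictions were.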
4 Recency Bias ALGORITHM

**Algorithm 1 Recency Bias Algorithm**

INPUT: $\mathcal{D}$: data, epochs, $b$: batch size, $q$: window size, $s_{e_0}$: initial selection pressure, $\gamma$: warm-up
OUTPUT: $\theta_t$: model parameter
1: t ← 1;
2: θ_t ← Initialize the model parameter;
3: for i = 1 to epochs do
4:   /* Sampling Probability Derivation */
5:   if i > γ then
6:     s_e ← Decay_Selection_Pressure(s_{e_0}, i);  /* Decaying s_e by Eq. (7) */
7:     for m = 1 to N do  /* Updating the index and the sampling probability in a batch */
8:       q_dict[x_m] = Q(F(x_m; q));  /* By Eq. (5) */
9:     p_table ← Compute_Prob(q_dict, s_e);  /* By Eq. (6) */
10:  /* Network Training */
11:  for j = 1 to N/b do  /* Mini-batch */
12:    if i ≤ γ then  /* Warm-up */
13:      {(x_1, y_1), ..., (x_b, y_b)} ← Randomly select next mini-batch samples;
14:    else  /* Adaptive batch selection */
15:      {(x_1, y_1), ..., (x_b, y_b)} ← Select next mini-batch samples based on p_table;
16:    losses, labels ← Inference_Step({(x_1, y_1), ..., (x_b, y_b)}, θ_t);  /* Forward */
17:    θ_{t+1} ← SGD_Step(losses, θ_t);  /* Backward */
18:    Update_Label_History(labels);  /* By Definition 3.1 */
19:    t ← t + 1;
20: return θ_t;

Algorithm 1 describes the overall procedure of Recency Bias. The algorithm requires a warm-up period of $\gamma$ epochs because the quantization index for each sample is not yet available. During the warm-up period, which should last at least $q$ epochs ($\gamma \ge q$) to obtain a label history of size $q$, randomly selected mini-batch samples are used for the network update (Lines 12–13). After the warm-up period, the algorithm decays the selection pressure $s_e$ and updates both the quantization index and the sampling probability in a batch at the beginning of each epoch (Lines 4–9). Subsequently, the uncertain samples are selected for the next mini-batch according to the updated sampling probability (Lines 14–15), and then the label history is updated along with the network update (Lines 16–19). Overall, the key technical novelty of Recency Bias is to incorporate the notion of a sliding window (Line 8), rather than a growing window, into adaptive batch selection, thereby improving both training speed and generalization error.

**Time Complexity:** The main "additional" cost of Recency Bias is the derivation of the sampling probability for each sample (Lines 4–9). Because only simple mathematical operations are needed per sample, its time complexity is linear in the number of samples (i.e., $O(N)$), which is negligible compared with that of the forward and backward steps of a complex network (Lines 16–17). Therefore, we contend that Recency Bias does not add to the complexity of the underlying optimization algorithm.
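To show how these pieces fit together, below is a schematic rendering of Algorithm 1 that reuses the helper functions from the previous sketch (`predictive_uncertainty`, `sampling_probs`, `decayed_selection_pressure`). The model interface (`train_on_batch` returning losses and predicted labels, `num_classes`), the use of `random.choices` for weighted sampling with replacement, and decaying the selection pressure from the end of the warm-up period onward are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import deque

def train_recency_bias(model, dataset, epochs, batch_size, q=10, s_e0=100, warm_up=15):
    """High-level sketch of Algorithm 1; `model` is a hypothetical wrapper around the network."""
    N = len(dataset)
    histories = [deque(maxlen=q) for _ in range(N)]   # one sliding window per sample
    indices = list(range(N))
    probs = [1.0 / N] * N                             # uniform until the warm-up ends

    for epoch in range(1, epochs + 1):
        if epoch > warm_up:
            # Lines 4-9: decay s_e and refresh the per-sample sampling probabilities.
            s_e = decayed_selection_pressure(s_e0, epoch, warm_up, epochs)  # assumes e_0 = warm-up epoch
            uncertainties = [predictive_uncertainty(h, model.num_classes) if h else 1.0
                             for h in histories]      # unseen samples treated as maximally uncertain
            probs = sampling_probs(uncertainties, s_e)

        for _ in range(N // batch_size):
            if epoch <= warm_up:                      # Lines 12-13: warm-up, uniform sampling
                batch_ids = random.sample(indices, batch_size)
            else:                                     # Lines 14-15: adaptive batch selection
                batch_ids = random.choices(indices, weights=probs, k=batch_size)
            batch = [dataset[i] for i in batch_ids]
            losses, predicted_labels = model.train_on_batch(batch)   # Lines 16-17: forward + backward
            for i, label in zip(batch_ids, predicted_labels):
                histories[i].append(label)            # Line 18: update the label history
    return model
```

The per-epoch bookkeeping touches every sample once, which matches the $O(N)$ overhead argued above.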
5 EVALUATION

We empirically show the improvement of Recency Bias over not only Random Batch (baseline) but also Online Batch (Loshchilov & Hutter, 2016) and Active Bias (Chang et al., 2017), which are two state-of-the-art adaptive batch selection methods. In particular, we elaborate on the effect of the sliding window approach (Recency Bias) compared with the growing window approach (Active Bias). Random Batch selects the next mini-batch samples uniformly at random from the entire dataset. Online Batch selects hard samples based on the rank of the loss computed from previous epochs. Active Bias selects uncertain samples with a high variance of true label probabilities in the growing window.

All the algorithms were implemented using TensorFlow 1.8.0 and executed on a single NVIDIA Titan Volta GPU. For reproducibility, we provide the source code at https://github.com/anonymized. Image classification and fine-tuning tasks were performed to validate the superiority of Recency Bias. Because fine-tuning is used to quickly adapt to a new dataset, it is suitable to reap the benefit of fast training speed. In support of reliable evaluation, we repeated every task three times and report the average and standard error of the best test errors. The best test error within a given time budget has been widely used in studies on fast and accurate training (Katharopoulos & Fleuret, 2018; Loshchilov & Hutter, 2016).

5.1 ANALYSIS ON SELECTED MINI-BATCH SAMPLES

For an in-depth analysis of the selected samples, we plot the loss distribution of mini-batch samples selected from CIFAR-10 by the four strategies in Figure 3. (i) The distribution of Online Batch is the most skewed toward high loss by the design principle of selecting hard samples. (ii) Active Bias emphasizes moderately hard samples at an early training stage, considering that its loss distribution lies between those of Random Batch and Online Batch. However, owing to the outdated predictions caused by the growing window, the proportion of easy samples with low loss increases at a late training stage. These easy samples, which are misclassified as uncertain at that stage, tend to slow down the convergence of training. (iii) In contrast to Active Bias, by virtue of the sliding window, the distribution of Recency Bias lies between those of Random Batch and Online Batch regardless of the training stage. Consequently, Recency Bias continues to highlight the moderately hard samples, which are likely to be informative, throughout the training process.

Figure 3: The loss distribution (log scale) of mini-batch samples selected by the four batch selection strategies: (a) and (b) show the loss distribution at 30% and 70% of the total training epochs, respectively.

5.2 TASK I: IMAGE CLASSIFICATION

**Experiment Setting:** We trained DenseNet (L=40, k=12) and ResNet (L=50) with a momentum optimizer and an SGD optimizer on three benchmark datasets: MNIST (10 classes)[2], classification of handwritten digits (LeCun, 1998), and CIFAR-10 (10 classes)[3] and CIFAR-100 (100 classes)[3], classification of a subset of the 80 million categorical images (Krizhevsky et al., 2014). Specifically, we used data augmentation, batch normalization, a momentum of 0.9, and a batch size of 128. As for the algorithm parameters, we fixed the window size $q = 10$ and the initial selection pressure $s_{e_0} = 100$,[4] which were the best values found by the grid search (see Appendix A for details). The warm-up epoch $\gamma$ was set to 15. To reduce the performance variance caused by randomly initialized model parameters, all parameters were shared by all algorithms during the warm-up period. Regarding the training schedule, we trained the network for 40,000 iterations and used an initial learning rate of 0.1, which was divided by 10 at 50% and 75% of the total number of training iterations.
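As a small illustration of the training schedule just described, the helper below returns the learning rate for a given iteration; the function name and interface are ours, not part of the authors' code.

```python
def learning_rate(iteration, total_iterations=40_000, base_lr=0.1):
    """Step schedule of Section 5.2: the base learning rate of 0.1 is divided
    by 10 at 50% and again at 75% of the total number of training iterations."""
    if iteration < 0.5 * total_iterations:
        return base_lr
    elif iteration < 0.75 * total_iterations:
        return base_lr / 10.0
    else:
        return base_lr / 100.0
```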
**Results:** Figure 4 shows the convergence curves of training loss and test error for the four batch selection strategies using DenseNet and a momentum optimizer. To highlight the improvement of Recency Bias over the baseline (Random Batch), their lines are dark colored. The best test errors in Figures 4(b), 4(d), and 4(f) are summarized on the left side of Table 1.

In general, Recency Bias achieved the most accurate network while accelerating the training process on all datasets. The training loss of Recency Bias converged faster (Figures 4(a), 4(c), and 4(e)) without an increase in the generalization error, thereby achieving the lower test error (Figures 4(b), 4(d), and 4(f)). In contrast, the test error of Online Batch was not the best even though its training loss converged the fastest among all strategies. As the training difficulty increased from CIFAR-10 to CIFAR-100, the test error of Online Batch became even worse than that of Random Batch. That is, emphasizing hard samples accelerated the training step but made the network overfit to hard samples. Meanwhile, Active Bias tended to make the network generalize better on test data. On CIFAR-10, despite its highest training loss, the test error of Active Bias was better than that of Random Batch. However, Active Bias slowed down the training process because of the limitation of growing windows, as discussed in Section 5.1. We note that, although both Recency Bias and Active Bias exploited uncertain samples, only Recency Bias, based on sliding windows, succeeded in not only speeding up the training process but also reducing the generalization error.

The results of the best test error for ResNet or an SGD optimizer are summarized in Tables 1 and 2 (see Appendix B for more details). Regardless of the neural network and the optimizer, Recency Bias achieved the lowest test error except on MNIST with an SGD optimizer. The improvement of Recency Bias over the others was higher with an SGD optimizer than with a momentum optimizer.

Table 1: The best test errors (%) of four batch selection strategies using DenseNet. The momentum results correspond to Figure 4; the SGD results correspond to Figure 7 (Appendix B.1).

|Method|MNIST (Momentum)|CIFAR-10 (Momentum)|CIFAR-100 (Momentum)|MNIST (SGD)|CIFAR-10 (SGD)|CIFAR-100 (SGD)|
|---|---|---|---|---|---|---|
|Random Batch|0.527 ± 0.03|7.33 ± 0.09|28.0 ± 0.16|1.23 ± 0.03|14.9 ± 0.09|40.2 ± 0.06|
|Online Batch|0.514 ± 0.01|7.00 ± 0.10|28.4 ± 0.25|0.765 ± 0.02|13.5 ± 0.02|40.7 ± 0.12|
|Active Bias|0.616 ± 0.03|7.07 ± 0.04|27.9 ± 0.11|0.679 ± 0.02|14.2 ± 0.25|42.9 ± 0.05|
|Recency Bias|0.490 ± 0.02|6.60 ± 0.02|27.1 ± 0.19|0.986 ± 0.06|13.2 ± 0.11|38.7 ± 0.11|

Table 2: The best test errors (%) of four batch selection strategies using ResNet. The momentum results correspond to Figure 8 (Appendix B.2); the SGD results correspond to Figure 9 (Appendix B.3).

|Method|MNIST (Momentum)|CIFAR-10 (Momentum)|CIFAR-100 (Momentum)|MNIST (SGD)|CIFAR-10 (SGD)|CIFAR-100 (SGD)|
|---|---|---|---|---|---|---|
|Random Batch|0.636 ± 0.04|10.2 ± 0.12|33.2 ± 0.07|1.16 ± 0.03|12.7 ± 0.09|40.1 ± 0.16|
|Online Batch|0.666 ± 0.05|10.1 ± 0.05|33.4 ± 0.01|0.890 ± 0.03|12.2 ± 0.08|40.7 ± 0.09|
|Active Bias|0.613 ± 0.04|10.6 ± 0.08|34.2 ± 0.07|0.804 ± 0.01|13.5 ± 0.07|45.6 ± 0.07|
|Recency Bias|0.607 ± 0.01|9.79 ± 0.04|32.4 ± 0.04|0.972 ± 0.03|11.6 ± 0.09|38.9 ± 0.14|

[2] http://yann.lecun.com/exdb/mnist
[3] https://www.cs.toronto.edu/~kriz/cifar.html
[4] Online Batch also used the same decaying selection pressure value.
Figure 4: Convergence curves of training loss and test error for the four batch selection strategies (Random Batch, Online Batch, Active Bias, Recency Bias) using DenseNet with momentum: (a)-(b) MNIST, (c)-(d) CIFAR-10, (e)-(f) CIFAR-100.

5.3 TASK II: FINE-TUNING

**Experiment Setting:** We prepared DenseNet (L=121, k=32) previously trained on ImageNet (Deng et al., 2009) and then fine-tuned the network on two benchmark datasets: MIT-67 (67 classes)[5], classification of indoor scenes (Quattoni & Torralba, 2009), and Food-100 (100 classes)[6], classification of popular foods in Japan (Kawano & Yanai, 2014). After replacing the last classification layer, the network was trained end-to-end for 50 epochs with a batch size of 32 and a constant learning rate of 2 × 10⁻⁴. Data augmentation was not applied here. The other configurations were the same as those in Section 5.2.

**Results on Test Error:** Figure 5 shows the convergence curves of training loss and test error for the fine-tuning task on MIT-67 and Food-100. Overall, all convergence curves showed trends similar to those of the classification task in Figure 4. Only Recency Bias converged faster than Random Batch in both training loss and test error. Online Batch converged the fastest in training loss, but its test error was rather higher than that of Random Batch owing to overfitting. Active Bias converged the slowest in both training loss and test error. Quantitatively, compared with Random Batch, Recency Bias reduced the test error by 2.88% and 1.81% on MIT-67 and Food-100, respectively.

Figure 5: Convergence curves for fine-tuning on two benchmark datasets: (a)-(b) MIT-67 training loss and test error, (c)-(d) Food-100 training loss and test error. The marked time reductions relative to Random Batch are 24.6% on MIT-67 and 26.1% on Food-100.

[5] http://web.mit.edu/torralba/www/indoor.html
[6] http://foodcam.mobi/dataset100.html
**Results on Training Time:** Moreover, to assess the performance gain in training time, we computed the reduction in the training time taken to reach the same error. For example, in Figure 5(b), the best test error of 28.8% achieved in 5,218 seconds by Random Batch was achieved in only 3,936 seconds by Recency Bias; thus, Recency Bias improved the training time by 24.6%. Table 3 summarizes the reduction in the training time of Recency Bias over the three other batch selection strategies. Notably, Recency Bias improved the training time by 24.6%–47.2% and 26.1%–59.3% in fine-tuning the MIT-67 and Food-100 datasets, respectively.

Table 3: Recency Bias's reduction in training time over other batch selection strategies.

|Method|MIT-67|Food-100|
|---|---|---|
|Random Batch|(5,218 − 3,936)/5,218 × 100 = 24.6%|(7,263 − 5,365)/7,263 × 100 = 26.1%|
|Online Batch|(6,079 − 3,823)/6,079 × 100 = 37.1%|(8,333 − 3,685)/8,333 × 100 = 55.8%|
|Active Bias|(5,738 − 3,032)/5,738 × 100 = 47.2%|(7,933 − 3,227)/7,933 × 100 = 59.3%|

6 CONCLUSION

In this paper, we presented a novel adaptive batch selection algorithm called Recency Bias that emphasizes predictively uncertain samples to accelerate the training of neural networks. Toward this goal, the predictive uncertainty of each sample is evaluated using its recent label predictions managed by a sliding window of a fixed size. Then, the samples uncertain at the moment are selected with high probability for the next mini-batch. We conducted extensive experiments on both classification and fine-tuning tasks. The results showed that Recency Bias is effective in reducing both the training time and the best test error. It is worth noting that using all historical observations to estimate the uncertainty has the side effect of slowing down the training process. Overall, the merger of uncertain samples and sliding windows greatly improves the power of adaptive batch selection.

REFERENCES

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pp. 41–48, 2009.

David Chandler. Introduction to modern statistical mechanics. Oxford University Press, 1987.

Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active Bias: Training more accurate neural networks by emphasizing high variance samples. In NeurIPS, pp. 1002–1012, 2017.

Brian Chen and Gregory W. Wornell. Quantization index modulation: A class of provably good methods for digital watermarking and information embedding. IEEE Transactions on Information Theory, 47(4):1423–1443, 2001.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009.

Yang Fan, Fei Tian, Tao Qin, and Tie-Yan Liu. Neural data filter for bootstrapping stochastic gradient descent. In ICLR, 2017.

Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp. 8527–8537, 2018.

KJ Joseph, Krishnakant Singh, Vineeth N Balasubramanian, et al. Submodular batch selection for training deep neural networks. In IJCAI, pp. 2677–3683, 2019.

Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In ICML, pp. 2525–2534, 2018.

Y. Kawano and K. Yanai. Food image recognition with deep convolutional features. In UbiComp, 2014.
Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 and CIFAR-100 datasets, 2014. https://www.cs.toronto.edu/~kriz/cifar.html.

M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NeurIPS, pp. 1189–1197, 2010.

Yann LeCun. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist.

Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. In ICLR, 2016.

Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In CVPR, pp. 413–420, 2009.

Mrinmaya Sachan and Eric Xing. Easy questions first? A case study on curriculum learning for question answering. In ACL, pp. 453–463, 2016.

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769, 2016.

Hwanjun Song, Minseok Kim, and Jae-Gil Lee. SELFIE: Refurbishing unclean samples for robust deep learning. In ICML, pp. 5907–5915, 2019.

Luis Torgo. Data mining with R: learning with case studies. Chapman and Hall/CRC, 2011.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, and Chris Dyer. Learning the curriculum with bayesian optimization for task-specific word representation learning. In ACL, pp. 130–139, 2016.

Shengjie Wang, Wenruo Bai, Chandrashekhar Lavania, and Jeff Bilmes. Fixing mini-batch sequences with hierarchical robust partitioning. In AISTATS, pp. 3352–3361, 2019.

Bernard Widrow, Istvan Kollar, and Ming-Chang Liu. Statistical theory of quantization. IEEE Transactions on Instrumentation and Measurement, 45(2):353–361, 1996.

Tianyi Zhou and Jeff Bilmes. Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity. In ICLR, 2018.

A HYPERPARAMETER SELECTION

Recency Bias receives two hyperparameters: (i) the initial selection pressure $s_{e_0}$, which determines the sampling probability gap between the most and the least uncertain samples, and (ii) the window size $q$, which determines how many recent label predictions are involved in estimating the uncertainty. To decide the best hyperparameters, we trained ResNet (L=50) on CIFAR-10 and CIFAR-100 with a momentum optimizer. The two hyperparameters were chosen from the grid $s_{e_0} \in \{1, 10, 100, 1000\}$ and $q \in \{5, 10, 15\}$.

Figure 6: Grid search on the CIFAR-10 and CIFAR-100 datasets using ResNet: the best test error for each combination of the initial selection pressure $s_{e_0}$ and the window size $q$; (a) CIFAR-10, (b) CIFAR-100.

Figure 6 shows the test errors of Recency Bias obtained by the grid search on the two datasets. Regarding the initial selection pressure $s_{e_0}$, the lowest test error was typically achieved when $s_{e_0}$ was 100. As for the window size $q$, the test error was almost always the lowest when $q$ was 10. Similar trends were observed for the other combinations of a neural network and an optimizer. Therefore, in all experiments, we set $s_{e_0}$ to 100 and $q$ to 10.
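The grid search of Appendix A can be outlined as the loop below; `train_and_evaluate` is a hypothetical callback standing in for a full training run of Recency Bias with the given hyperparameters, so this is only a sketch of the search procedure, not the authors' experiment code.

```python
import itertools

def grid_search(train_and_evaluate, pressures=(1, 10, 100, 1000), window_sizes=(5, 10, 15)):
    """Exhaustive search over the (s_e0, q) grid of Appendix A; returns the best pair
    together with the best test error recorded for every combination."""
    results = {}
    for s_e0, q in itertools.product(pressures, window_sizes):
        results[(s_e0, q)] = train_and_evaluate(s_e0=s_e0, q=q)  # best test error (%) for this setting
    best = min(results, key=results.get)
    return best, results
```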
B GENERALIZATION OF Recency Bias

B.1 CONVERGENCE CURVES USING DENSENET WITH SGD

Figure 7 shows the convergence curves of training loss and test error for the four batch selection strategies using DenseNet and an SGD optimizer, which corresponds to the right side of Table 1.

Figure 7: Convergence curves of four batch selection strategies using DenseNet with SGD: (a)-(b) MNIST, (c)-(d) CIFAR-10, (e)-(f) CIFAR-100.

B.2 CONVERGENCE CURVES USING RESNET WITH MOMENTUM

Figure 8 shows the convergence curves of training loss and test error for the four batch selection strategies using ResNet and a momentum optimizer, which corresponds to the left side of Table 2.

Figure 8: Convergence curves of four batch selection strategies using ResNet with momentum: (a)-(b) MNIST, (c)-(d) CIFAR-10, (e)-(f) CIFAR-100.

B.3 CONVERGENCE CURVES USING RESNET WITH SGD

Figure 9 shows the convergence curves of training loss and test error for the four batch selection strategies using ResNet and an SGD optimizer, which corresponds to the right side of Table 2.
Figure 9: Convergence curves of four batch selection strategies using ResNet with SGD: (a)-(b) MNIST, (c)-(d) CIFAR-10, (e)-(f) CIFAR-100.