Filtering single image super-resolution datasets with BHI

Community Article Published November 7, 2024

Intro

Having used and visually compared over 600 different upscaling models for my vitepress website, which I created so others can do the same, I have since trained and released over 100 sisr (single image super-resolution) models myself, based on over 15 different architectures such as MoSR, RealPLKSR, DRCT, SPAN, DAT or ATD and their respective architecture options.

These models can be found on my github models repo, huggingface profile or on openmodeldb, and can be tried out online on this ZeroGPU Huggingface Space.

For sisr training purposes, I have occasionally curated datasets; the earliest was in August 2023, when I made a curated version of FFHQ called FaceUp for my FaceUp model series, using the HyperIQA image quality metric for filtering.

In this post, I assess the influence of two dataset filtering techniques I have used in the past for sisr model training, namely HyperIQA for quality filtering and IC9600 for complexity filtering.

Approach

My goal was to find a simple dataset curation workflow that in general either improves quality (model training validation metric scores) or efficiency (storage savings by reducing the quantity of images while keeping similar validation metric scores).

The BHI (Blockiness, HyperIQA, IC9600) filtering method is what I came up with, and here I wanted to evaluate its effectiveness or non-effectiveness by running tests and looking at their results.

My approach is as follows:

  1. Train a sisr model on a standard dataset while generating validation metric scores; this will serve as the baseline model
  2. Score that dataset with HyperIQA and IC9600
  3. Filter the dataset at different thresholds with each of these two methods
  4. Train sisr models on each of these filtered datasets while generating validation metric scores
  5. Evaluate effectiveness based on quantity reduction and the metric scores in comparison to the baseline model
  6. Derive a good threshold for each of HyperIQA and IC9600 from the tests, then combine the filtering techniques to make a curated version of the dataset based on these thresholds
  7. Train a sisr model with the same options on that curated dataset while generating validation metric scores
  8. Evaluate effectiveness based on quantity reduction and the final metric scores vs the baseline model

DF2K Main Test

System Setup

All the tests were done on my home PC; here are the specs:

Ubuntu 20.04.6 LTS 64-bit
RTX 3060 (12 GB VRAM)
16 GiB RAM
AMD® Ryzen 5 3600 6-core processor × 12

Dataset

The dataset I chose is the DF2K dataset, which is a combination of the DIV2K and the Flickr2K datasets, often used as a standard training dataset for new sisr architectures in papers.

Moreover, in the PLKSR paper, the plksr_tiny models were trained from scratch (so no pretraining strategy used), and the model trained on DF2K reached better metric scores than the one trained on DIV2K only:

Table 4, Page 7, from the PLKSR paper. Also featured on their github repo.

To increase I/O speed during training, the tiling strategy is applied, as suggested in the training section of the real-esrgan repo. Tiling DF2K to 512x512 px results in a training dataset of 21'387 tiles.
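As a minimal sketch of this tiling step (assuming non-overlapping 512x512 crops with Pillow; folder names are placeholders, and this is not necessarily the exact script used for the uploaded dataset):

```python
# Minimal tiling sketch: cut each image into non-overlapping 512x512 crops.
# Folder names are placeholders; partial edge tiles are simply dropped here.
from pathlib import Path
from PIL import Image

TILE = 512
src_dir = Path("DF2K")        # hypothetical folder with the full-size HR images
dst_dir = Path("DF2K_tiled")  # hypothetical output folder for the tiles
dst_dir.mkdir(exist_ok=True)

for img_path in sorted(src_dir.glob("*.png")):
    img = Image.open(img_path).convert("RGB")
    w, h = img.size
    for top in range(0, h - TILE + 1, TILE):
        for left in range(0, w - TILE + 1, TILE):
            tile = img.crop((left, top, left + TILE, top + TILE))
            tile.save(dst_dir / f"{img_path.stem}_{top}_{left}.png")
```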

This tiled version of the DF2K dataset, which is used for training the base model and which all filtering is applied to, can be found here. Since all the filtering is done on this tiled dataset, which I uploaded to huggingface, all the filtered subsets used for training in this post are reproducible from the respective score files.

Edit: The BHI-filtered version of DF2K can be found here

Training

Plksr_tiny, which is a rather fast architecture option to train and scores higher metrics than SAFMN, DITN or SPAN in the paper, will be used as the architecture option for running these tests, at a 4x scale.

The low resolution (LR) counterpart for paired training will be created with bicubic downsampling only.
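A minimal sketch of this LR generation, assuming Pillow's bicubic resampling and placeholder folder names (the linked LR download below is the authoritative version):

```python
# Create the 4x bicubic LR counterpart of each HR tile.
# Folder names are placeholders; the linked LR download is the reference version.
from pathlib import Path
from PIL import Image

SCALE = 4
hr_dir = Path("DF2K_tiled_HR")  # hypothetical HR tile folder
lr_dir = Path("DF2K_tiled_LR")  # hypothetical output folder
lr_dir.mkdir(exist_ok=True)

for hr_path in sorted(hr_dir.glob("*.png")):
    hr = Image.open(hr_path).convert("RGB")
    lr = hr.resize((hr.width // SCALE, hr.height // SCALE), Image.Resampling.BICUBIC)
    lr.save(lr_dir / hr_path.name)
```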

For reproducibility, I provide links to download the HR (high resolution) and LR datasets for the baseline model:
Link to DF2K Tiled HR
Link to DF2K Tiled LR

As for the training framework, neosr is used for all these tests, at commit hash dc4e3742132bae2c2aa8e8d16de3a9fcec6b1a74, making use of deterministic training.

In general, fp16 with a batch size of 16 and a patch size of 32 is used for model training, together with adamw (lr 1e-4, betas [0.9, 0.99]) as the optimizer, multisteplr as the scheduler with 60k and 120k milestones, L1Loss only, and ema.
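These options are all set through the neosr config file; purely as an illustration of what they correspond to, here is a rough plain-PyTorch equivalent (the model is a stand-in, and the scheduler decay factor is not stated in this post):

```python
# Plain-PyTorch illustration of the training options listed above.
# neosr configures this via its config file; `model` is just a stand-in network.
import torch
from torch import nn

model = nn.Conv2d(3, 3, 3, padding=1)  # placeholder for plksr_tiny

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
# decay factor (gamma) is not stated in this post; PyTorch's default is 0.1
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60_000, 120_000])
criterion = nn.L1Loss()
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed-precision training
# neosr additionally keeps an exponential moving average (ema) of the weights.
```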

Training configs will generally be made available for reproducibility. Although there are a lot of options in the default neosr config, the provided configs have been shortened for visual clarity by removing all the commented-out options.

Validation

The DIV2K dataset, which in this case is a subset of DF2K, provides an official validation set of 100 images with their HR and corresponding LR counterparts. We will use this one for validation during training.
Validation during training happens every 10'000 iterations, which provides sufficient data points while not slowing down training too much through running inference, and uses the PSNR, SSIM and DISTS metrics.

The official DIV2K validation set can be downloaded here

Metrics

With each test I will provide the tensorboard graphs as a visualization of the model training with the PSNR, SSIM and DISTS validation metrics.

PSNR and SSIM are often used in papers as validation metrics. Since DISTS has been added to neosr, I get tensorboard graphs of this metric as well.

There are currently 25 full reference (and 45 no reference) metric options available in pyiqa, all of which I ran once when trying to find a release candidate among the checkpoints of a model training. On the curated model at the end of this test, I will additionally (next to psnr, ssim and dists) use the topiq_fr and AHIQ metrics, which have seemed to perform well in my experience so far.
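For reference, these metrics can also be computed outside the training framework with pyiqa; a small sketch (the file paths are placeholders):

```python
# Compute full-reference metrics for one model output / ground truth pair with pyiqa.
# File paths are placeholders; FR metrics take (distorted, reference).
import pyiqa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for name in ["psnr", "ssim", "dists", "topiq_fr", "ahiq"]:
    metric = pyiqa.create_metric(name, device=device)
    score = metric("model_output.png", "ground_truth.png")
    print(f"{name}: {float(score):.4f}")
```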

Floating-point format

To test the different options of using either fp32, fp16 or bf16 for training, the baseline model (on the full tiled DF2K dataset) has been trained with all of these formats for 200'000 iterations.

As can be seen in the following graphics from tensorboard, while there is little difference in validation metric scores, fp16 provided the biggest training time improvement, and will therefore henceforth be used for testing unless indicated otherwise.

Tensorboard: PSNR val metrics of baseline model: floating-point formats
Tensorboard: SSIM val metrics of baseline model: floating-point formats
Tensorboard: DISTS val metrics of baseline model: floating-point formats

The baseline models with their training configs can be found here

The BHI approach uses Blockiness, HyperIQA and IC9600 filtering for sisr training dataset curation, and I will present these filtering techniques in this order in the following sections.

Blockiness

I added blockiness filtering to this curation workflow with a threshold of 30. The Rethinking Image Super-Resolution from Training Data Perspectives paper has already shown that jpg compression within the dataset can become very detrimental to the sisr training process (their Figure 5, where jpg compression at 75% or lower is added to the training set), and also that in general the metric values improved with lower blockiness (except for the Manga109 test set), as shown in their Table 4. Since decreasing the blockiness threshold from 30 to 10 did not lead to an increase in validation metric scores, we use a blockiness threshold of <30 for our BHI filtering approach. These visuals are inserted here for convenience and are taken from their paper:

Figure 5, Page 12, from the Rethinking Image Super-Resolution from Training Data Perspectives paper.
Table 4, Page 10, from the Rethinking Image Super-Resolution from Training Data Perspectives paper.

For visualization, the lowest and highest blockiness scoring tiles:

Visual Example: Lowest blockiness scoring tiles of tiled DF2K, with their respective scores
Visual Example: Highest blockiness scoring tiles of tiled DF2K, with their respective scores
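As a rough illustration of what a blockiness score captures, the proxy below compares luminance differences across the 8x8 JPEG block boundaries with the differences inside the blocks. Note this is not the measure from the paper (whose scores and <30 threshold are what is actually used here), just a simplified sketch of the idea:

```python
# Rough blockiness proxy (NOT the paper's measure, different scale): compares
# luminance jumps across 8x8 JPEG block borders with jumps inside the blocks.
import numpy as np
from PIL import Image

def blockiness_proxy(path: str, block: int = 8) -> float:
    y = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    dx = np.abs(np.diff(y, axis=1))                  # horizontal neighbour differences
    boundary = dx[:, block - 1::block].mean()        # differences across block borders
    inner = np.delete(dx, np.s_[block - 1::block], axis=1).mean()  # differences within blocks
    return float(boundary / (inner + 1e-8))          # noticeably > 1 suggests visible block edges

print(blockiness_proxy("some_tile.png"))  # placeholder path
```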

HyperIQA Filtering

The purpose of Image Quality Assessment (IQA) is in general to evaluate the visually perceived quality of an image, typically by assigning it a score. My assumption here is that IQA can be used to increase the quality of the whole training dataset by scoring the tiles and removing the badly scoring ones (which removes, for example, blurry and noisy tiles).

We are going to test this assumption.

For Image Quality Assessment I use HyperIQA scoring on the DF2K Tiles dataset.

I scored the tiled DF2K dataset with HyperIQA; the scores can be found here.
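A sketch of how such per-tile scores can be produced, assuming pyiqa's hyperiqa implementation and a simple CSV output (folder and file names are placeholders):

```python
# Score every tile with HyperIQA (no-reference metric) and write a CSV of scores.
import csv
from pathlib import Path

import pyiqa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
hyperiqa = pyiqa.create_metric("hyperiqa", device=device)

with open("hyperiqa_scores.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tile", "hyperiqa"])
    for tile in sorted(Path("DF2K_tiled").glob("*.png")):  # placeholder folder
        writer.writerow([tile.name, f"{float(hyperiqa(str(tile))):.4f}"])
```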

For visualization I insert here the lowest and highest HyperIQA scoring tiles:

Visual Example: Lowest HyperIQA scoring tiles of tiled DF2K, with their respective scores
Visual Example: Highest HyperIQA scoring tiles of tiled DF2K, with their respective scores

From that scoring, I created the following filtered training subsets, listed together with the number of tiles left and the percentage of the full tiled dataset they represent (see the sketch after the list for how such subsets can be built):

HyperIQA score >= 0.1 -> unfiltered, full set = base model (100%)
HyperIQA score >= 0.2 -> 21’347 Tiles (99.8%)
HyperIQA score >= 0.3 -> 20’689 Tiles (96.7%)
HyperIQA score >= 0.4 -> 18’477 Tiles (86.4%)
HyperIQA score >= 0.5 -> 14’572 Tiles (68.1%)
HyperIQA score >= 0.6 -> 8’471 Tiles (39.6%)
HyperIQA score >= 0.7 -> 1’780 Tiles (8.3%)
HyperIQA score >= 0.8 -> 44 Tiles (0.2%)

I then trained fp16 models on each of these subsets for 100k iterations, except for the 0.8 subset, since there are simply too few tiles left to meaningfully train on. The results are shown in the following graphics together with the fp16 baseline model as a reference point:

Tensorboard: PSNR validation scores of HyperIQA filtering on DF2K
Tensorboard: SSIM validation scores of HyperIQA filtering on DF2K
Tensorboard: DISTS validation scores of HyperIQA filtering on DF2K

In all these metrics, training on the HyperIQA score >= 0.2 filtered subset gave the best results. We will use this as the threshold for our BHI filtered dataset.

What is surprising to me: I had assumed that the higher the general IQA score of the dataset (meaning filtering at a higher IQA threshold), the better the metrics would be. Looking at PSNR and SSIM, this does not seem to be the case. Instead, removing only the worst tiles (those scoring beneath 0.2) seems to have a positive effect on training validation metrics.

I also note that the models scoring better than the baseline on PSNR and SSIM were still trained on over 90% of the tiles from the tiled dataset, whereas with higher thresholds the number of tiles drops significantly, so the quantity of tiles might play a role in these validation metrics.

IC9600 Filtering

Another assumption is that increasing the general complexity of the dataset (increasing the amount of information on each training tile) would also be beneficial to sisr training, or rather to sisr training dataset curation.

For automatic image complexity assessment I use IC9600 and scored the tiled DF2K dataset; the scores can be found here.

For visualization the lowest and highest IC9600 scoring tiles:

Visual Example: Lowest IC9600 scoring tiles of tiled DF2K, with their respective scores
Visual Example: Highest IC9600 scoring tiles of tiled DF2K, with their respective scores

From the scoring, I created the following filtered subsets:

IC9600 score >= 0.1 -> 20’807 Tiles (97.3%)
IC9600 score >= 0.2 -> 19’552 Tiles (91.4%)
IC9600 score >= 0.3 -> 17’083 Tiles (79.9%)
IC9600 score >= 0.4 -> 12’784 Tiles (59.8%)
IC9600 score >= 0.5 -> 6’765 Tiles (31.6%)
IC9600 score >= 0.6 -> 1’918 Tiles (9.0%)
IC9600 score >= 0.7 -> 318 Tiles (1.5%)
IC9600 score >= 0.8 -> 44 Tiles (0.2%)

I then trained fp16 models on each of these subsets for 100k iterations, except for the 0.8 subset, since there are simply too few tiles left to meaningfully train on. The results are shown in the following graphics together with the fp16 baseline model as a reference point:

Tensorboard: PSNR validation scores of IC9600 filtering on DF2K
Tensorboard: SSIM validation scores of IC9600 filtering on DF2K
Tensorboard: DISTS validation scores of IC9600 filtering on DF2K

Training configs and model files of this IC9600 test

From these results, IC9600 filtering seems to have a positive effect on training. Not only does the model converge faster, reaching higher metric scores in earlier iterations of training, it also reaches higher validation metrics in general. In both PSNR and SSIM, the threshold of 0.5 reached the highest metric values. This hints at a higher IC9600 threshold being beneficial in general. The thresholds above 0.5 scoring worse could be due to the large reduction in tile quantity in the training set.

BHI Filtering

Now I combine the previous filtering methods into the BHI filtering approach, using the thresholds established in the previous tests:

Blockiness < 30
HyperIQA >= 0.2
IC9600 >= 0.5
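A sketch of how the three score lists can be combined into the final filter (the per-tile score files and their layout are assumptions carried over from the earlier sketches):

```python
# Keep only tiles that pass all three BHI criteria and copy them into a new folder.
import csv
import shutil
from pathlib import Path

def load_scores(path: str, column: str) -> dict[str, float]:
    with open(path) as f:
        return {row["tile"]: float(row[column]) for row in csv.DictReader(f)}

blockiness = load_scores("blockiness_scores.csv", "blockiness")  # placeholder score files
hyperiqa = load_scores("hyperiqa_scores.csv", "hyperiqa")
ic9600 = load_scores("ic9600_scores.csv", "ic9600")

src, dst = Path("DF2K_tiled"), Path("DF2K_tiled_BHI")  # placeholder folders
dst.mkdir(exist_ok=True)

kept = [t for t in blockiness
        if blockiness[t] < 30 and hyperiqa.get(t, 0.0) >= 0.2 and ic9600.get(t, 0.0) >= 0.5]
for t in kept:
    shutil.copy2(src / t, dst / t)
print(f"kept {len(kept)} of {len(blockiness)} tiles")
```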

I train a fp16 model on the now BHI filtered DF2K tiled dataset. The quantity of tiles is as follows:

Baseline DF2K model: 21’387 Tiles
Curated DF2K model: 6’620 Tiles (31%)

Here are the training validation results ("merged" in the graphs refers to the combined filtering technique, i.e. the BHI filtered DF2K tiled training set):

Tensorboard: PSNR validation scores DF2K baseline and DF2K-BHI
Tensorboard: SSIM validation scores DF2K baseline and DF2K-BHI
Tensorboard: DISTS validation scores DF2K baseline and DF2K-BHI

From the results we observe that BHI filtering the DF2K tiled dataset not only led to a 69% reduction in training dataset size, but it simultaneously achieved better PSNR, SSIM and DISTS validation metric scores on the DIV2K validation set.

Though I think 100k iterations generally suffice for these tests on a lightweight network option like plksr_tiny, in the PLKSR paper the plksr_tiny models were trained from scratch for up to 450k iterations. Since this is the final DF2K tiled test, I will also increase the training to 500k iterations, just to be able to catch anything that only happens with longer training.

Tensorboard: PSNR validation scores DF2K baseline and DF2K-BHI both to 500k iters
Tensorboard: SSIM validation scores DF2K baseline and DF2K-BHI both to 500k iters
Tensorboard: DISTS validation scores DF2K baseline and DF2K-BHI both to 500k iters

We can see that the metric scores improved and that, thanks to filtering techniques like blockiness filtering, the metrics continue to improve slightly with longer training.

To make sure that these results are not specific to the DIV2K validation set, I test these final models on multiple official test sets with multiple metrics: namely the Urban100, BSD100, DIV2K and LSDIR test sets, with the PSNR, SSIM, DISTS, AHIQ and TOPIQ_FR full reference (FR) IQA metrics.

Plksr_tiny models trained on DF2K tiled and BHI filtered for 500k, metrics on different testsets

The previous evaluation still holds true even with more test sets and metrics: the model trained on the BHI filtered DF2K tiled dataset was able to achieve better metric scores in general. The BHI filtering method is effective on the DF2K tiled dataset with the plksr_tiny architecture option, reducing training dataset size while achieving better metric results on multiple test sets with multiple metrics.

By the way, as an additional quick 100k iters test, I wanted to see what happens if we change a parameter, namely the patch size, and double it from 32 to 64. In general, increasing training patch size leads to better visual model outputs.

Tensorboard: PSNR validation scores DF2K baseline and DF2K-BHI on patch64
Tensorboard: SSIM validation scores DF2K baseline and DF2K-BHI on patch64
Tensorboard: DISTS validation scores DF2K baseline and DF2K-BHI on patch64

We achieve similar results to the previous patch32 100k iters test, with slightly better metrics.

ImageNet Additional Test

To test that these results are neither dataset nor architecture option specific, I repeat this filtering method with the ImageNet dataset, which is often used for the pretraining strategy in papers (like in PLKSR for the standard model).

After tiling the dataset to 512x512px, we are left with 197'436 tiles.
The corresponding LR is again created with bicubic downsampling at a 0.25 scale for 4x model training.
Training validation is done on the Urban100 testset.

As for the architecture option, this time we use SPAN, which is a bit faster than plksr_tiny and won 1st place in the CVPR 2024 NTIRE Efficient Super-Resolution (ESR) Challenge.

All the relevant files to this test are in the imagenet subfolder.

HyperIQA Filtering

As previously done, we score the ImageNet tiled dataset with HyperIQA and make subsets based on the different thresholds.

HyperIQA score >= 0.1 -> unfiltered, full set = base model (100%)
HyperIQA score >= 0.2 -> 195'500 Tiles (99.0%)
HyperIQA score >= 0.3 -> 162’991 Tiles (82.6%)
HyperIQA score >= 0.4 -> 105’809 Tiles (53.6%)
HyperIQA score >= 0.5 -> 54'819 Tiles (27.8%)
HyperIQA score >= 0.6 -> 18'397 Tiles (9.3%)
HyperIQA score >= 0.7 -> 1'592 Tiles (0.8%)
HyperIQA score >= 0.8 -> 16 Tiles (<0.1%)

I then trained fp16 models on each of these subsets for 100k iterations, except for the 0.8 subset, since there are simply too few tiles left to meaningfully train on. The results are shown in the following graphics together with the fp16 baseline model as a reference point:

Tensorboard: PSNR validation scores of HyperIQA filtering on ImageNet
Tensorboard: SSIM validation scores of HyperIQA filtering on ImageNet
Tensorboard: DISTS validation scores of HyperIQA filtering on ImageNet

We get a similar outcome to the tiled DF2K dataset, which is a good thing, since it means the previous results were neither dataset nor architecture option specific.

IC9600 Filtering

As previously done, we score the ImageNet tiled dataset with IC9600 and make subsets based on the different thresholds.

IC9600 score >= 0.1 -> 189'120 Tiles (95.8%)
IC9600 score >= 0.2 -> 166'824 Tiles (84.5%)
IC9600 score >= 0.3 -> 129'410 Tiles (65.5%)
IC9600 score >= 0.4 -> 74'987 Tiles (38.0%)
IC9600 score >= 0.5 -> 24'989 Tiles (12.7%)
IC9600 score >= 0.6 -> 3'607 Tiles (1.8%)
IC9600 score >= 0.7 -> 446 Tiles (0.2%)
IC9600 score >= 0.8 -> 49 Tiles (<0.1%)

I then trained fp16 models on each of these subsets for 100k iterations, except for the 0.8 subset, since there are simply too few tiles left to meaningfully train on. The results are shown in the following graphics together with the fp16 baseline model as a reference point:

Tensorboard: PSNR validation scores of IC9600 filtering on ImageNet
Tensorboard: SSIM validation scores of IC9600 filtering on ImageNet
Tensorboard: DISTS validation scores of IC9600 filtering on ImageNet

This time, the previous threshold of 0.5 reaches worse results, which again might be due to the bigger decrease in the quantity of training image tiles. We adjust the IC9600 filtering threshold to 0.4, which achieved the best metrics for SSIM and DISTS and the second best for PSNR.

BHI Filtering

We apply the BHI filtering method to the tiled ImageNet dataset with

Blockiness < 30
HyperIQA >= 0.2
IC9600 >= 0.4

I train a fp16 model on the now BHI filtered ImageNet tiled dataset. The quantity of tiles is as follows:

Baseline ImageNet model: 197’436 Tiles
Curated ImageNet model: 4’505 Tiles (2.3%)

Here are the training validation results on the Urban100 testing set:

Tensorboard: PSNR validation scores ImageNet baseline and ImageNet-BHI
Tensorboard: SSIM validation scores ImageNet baseline and ImageNet-BHI
Tensorboard: DISTS validation scores ImageNet baseline and ImageNet-BHI

While on the DISTS metric these are close together, in PSNR and SSIM there is a wider gap between those two models. I assume this to be caused by the strong reduction of tile quantity in the training set (-97.7%).

The biggest contributor to this strong quantity reduction is the blockiness filtering, since ImageNet is plagued by JPG artifacts. We can see the difference between the two datasets by visualizing their blockiness distributions:

Visualization of blockiness distribution in tiled DF2K
Visualization of blockiness distribution in tiled ImageNet

The darker blue part is the section preserved by our current blockiness filtering threshold. While in DF2K 20'873 out of 21'387 tiles meet this criterion (97.6%), in ImageNet only 6'787 out of 197'436 tiles fulfill the criterion of having a blockiness score below 30 (3.4%).
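Such a distribution plot can be produced from the score file with a simple histogram, for example (the score file name is a placeholder):

```python
# Histogram of per-tile blockiness scores with the <30 filtering threshold marked.
import csv
import matplotlib.pyplot as plt

with open("blockiness_scores.csv") as f:  # placeholder score file
    scores = [float(row["blockiness"]) for row in csv.DictReader(f)]

plt.hist(scores, bins=100)
plt.axvline(30, color="red", linestyle="--", label="threshold < 30")
plt.xlabel("blockiness score")
plt.ylabel("number of tiles")
plt.legend()
plt.savefig("blockiness_distribution.png", dpi=150)
```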

Through the strong reduction in training tiles, training information gets lost. In such a case, we would want to reach a higher number of tiles, either by merging this BHI filtered ImageNet dataset with another BHI filtered dataset (for example the previous DF2K-BHI set), or, even better, by using another dataset that is less plagued by JPG artifacts.

To assess if it is really the reduction in training tile quantity, I increase the blockiness threshold to 90; visually speaking, we now use the dark blue, orange and yellow parts for training. This leaves us with a bigger part of the full dataset, specifically 60'314 tiles (30.5%).

As expected, when the blockiness threshold is increased so that more training tiles survive the BHI filtering process, the validation metrics are close to the base model again, while the overall training dataset size is still reduced:

Tensorboard: PSNR validation scores ImageNet baseline, ImageNet-BHI, and ImageNet-BHI with blockiness threshold increased to 90
Tensorboard: SSIM validation scores ImageNet baseline, ImageNet-BHI, and ImageNet-BHI with blockiness threshold increased to 90
Tensorboard: DISTS validation scores ImageNet baseline, ImageNet-BHI, and ImageNet-BHI with blockiness threshold increased to 90

Out of curiosity, since I adjusted the IC9600 threshold from >= 0.5 to >= 0.4, I wanted to test if the previous DF2K tiled dataset benefits from this change, so I trained another plksr_tiny model with these adjustments, using the same DIV2K validation dataset for validation metrics during training:

Tensorboard: PSNR validation scores DF2K baseline, curated (meaning previous BHI thresholds), and DF2K-BHI with IC9600 threshold decreased to 0.4
Tensorboard: SSIM validation scores DF2K baseline, curated (meaning previous BHI thresholds), and DF2K-BHI with IC9600 threshold decreased to 0.4
Tensorboard: DISTS validation scores DF2K baseline, curated (meaning previous BHI thresholds), and DF2K-BHI with IC9600 threshold decreased to 0.4

These tests show that the BHI filtering method can be used to effectively curate a sisr training dataset by drastically reducing the training dataset size while keeping the validation metric scores similar, and that it is neither dataset nor architecture option specific.

LSDIR Quick Test

BHI Filtering

Out of curiosity, I made another test on the LSDIR dataset, but this time only a quick test with the final filtered result.

I tested BHI filtering with the current thresholds, and also with the previous IC9600 threshold of 0.5, which I mark here with "0.5".

LSDIR Baseline: 179'006 tiles
LSDIR BHI: 116'141 tiles (64.9%)
LSDIR BHI 0.5: 62'192 tiles (34.7%)

Tensorboard: PSNR validation scores LSDIR Quick Test
Tensorboard: SSIM validation scores LSDIR Quick Test
Tensorboard: DISTS validation scores LSDIR Quick Test

From these metrics we can see that the switch to the IC9600 threshold of 0.4 still holds up, giving better results than the previous 0.5 threshold. BHI filtering remains an effective way to reduce the quantity of training images while getting similar metrics to the unfiltered dataset.

Future Work

Here I simply list possible future work I might test and/or make a huggingface community post about. It's mostly ideas of things that I think could still be tested in the sisr training space:

  • Creating a BHI Filtered dataset that consists of multiple merged datasets (I am working on this one currently)
  • I only tested with HyperIQA, but there are other NR IQA or aesthetic models out there. In a quick test I did, qalign8bit and topiq_nr filtered to the same dataset size did not perform significantly better. Still, all NR metrics available in pyiqa could be tested to see whether one of them is more suited than HyperIQA for sisr dataset curation.
  • The same basically goes for IC9600: there are other complexity metrics. The paper mentioned in this post, for example, tested segmentation-based filtering (by counting the number of segments in an image). IC9600 could be compared against those.
  • Can rarity filtering be used for sisr training dataset curation to increase diversity, like in the paper Rarity Score: A New Metric to Evaluate the Uncommonness of Synthesized Images? github repo
  • Can we use clustering (like k-means) or something similar to get information about the object distribution in the training dataset? This could then be used to maybe increase diversity (like if we recognize that only 0.1% of the dataset includes portrait photos / human faces for example)
  • I only quickly tested multiscaling, where it didn't really improve the result. But whether multiscaling is beneficial or not for sisr model training could be tested more thoroughly.
  • From my tests I got the impression that noise present in the training dataset does not have the negative impact I would have assumed (I always tried to keep my training sets noise-free). It could be tested whether having noise in the training dataset has a negative impact on sisr training or not (and of course, to what degree of noisiness). (In contrast, I expect blur to have a negative impact, but this could also be tested.)
  • I was once working on a realistically degraded dataset, which led to my RealWebPhoto dataset; I could write a huggingface post about that
  • From training experience, the default otf values of the Real-ESRGAN pipeline, as used in their repo and then by HAT and other architectures from then on, seemed unoptimized or non-ideal, in other words too extreme. I once tested different values, which led to the standard otf values now used in neosr. I could write a huggingface post about that as well (and retest).
  • Like in my nature dataset, where I made a more heavily filtered version with a lower quantity of training tiles for lightweight architecture options. This could be tested: do lightweight architecture options (like SPAN), medium architecture options (like MAN) and heavy transformers/architecture options (like DAT) have an ideal quantity of training tiles? If yes, what is it? Or do architecture options always profit from more training information / a higher quantity of training tiles (keeping complexity the same)?
  • Normally there are papers about the single losses themselves. But what about combinations of losses? Which combination of losses gives which effects/results?
  • Neosr mssim loss vs pixel l1 loss, other neosr specific losses (consistencyloss)?
  • Which FR metrics would be best suited for sisr training validation metrics? Is it always just PSNR and SSIM? (speed and results)
  • A better otf degradation pipeline than Real-ESRGAN, which could also add, for example, DX1 compression for game texture upscalers, different video compressions for video upscalers, and so forth? (integrating something like umzi's wtp_dataset_destroyer into a training software)
  • I normally don't like diffusion-based upscalers much because they change the input image too much (the output is basically a different image, not an upscaled version of the input). But I do think they have strengths. This effect can be desired if the input image is too destroyed (from degradations) for a transformer alone to restore, so additional details need to be hallucinated, which is the strength of diffusers, but in a controlled manner, like I showed in my super workflow. I could make a huggingface post about that as well.

You made it to the end of my post :D Thank you for reading :)