Alex Strick van Linschoten committed
Commit 5002a2b
Parent: 02f3e99

update text

Files changed (2):
  1. app.py +3 -1
  2. article.md +37 -3
app.py CHANGED

@@ -206,7 +206,9 @@ demo = gr.Interface(
             label="Confidence",
             optional=False,
         ),
-        gr.inputs.Checkbox(label="Extract redacted images", default=True),
+        gr.inputs.Checkbox(
+            label="Analyse and extract redacted images", default=True
+        ),
     ],
     outputs=[
         gr.outputs.Textbox(label="Document Analysis"),
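
A minimal sketch of how a checkbox like this feeds the prediction function through the Gradio 2.x-era `gr.inputs`/`gr.outputs` API that app.py uses. The `analyze_document` function, the File and Number inputs, and their defaults are placeholders for illustration, not the app's actual signature.

import gradio as gr


def analyze_document(pdf_file, confidence, extract_redactions):
    # The checkbox value arrives as a plain bool; the slower object-detection
    # pass would only run when it is ticked.
    summary = f"Classifier pass run at confidence threshold {confidence}."
    if extract_redactions:
        summary += " Object-detection pass on redacted pages enabled."
    return summary


demo = gr.Interface(
    fn=analyze_document,
    inputs=[
        gr.inputs.File(label="PDF file"),
        gr.inputs.Number(label="Confidence", default=0.8),
        gr.inputs.Checkbox(
            label="Analyse and extract redacted images", default=True
        ),
    ],
    outputs=[gr.outputs.Textbox(label="Document Analysis")],
)

if __name__ == "__main__":
    demo.launch()
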
article.md CHANGED

@@ -6,7 +6,34 @@ models out in the world as some kind of demo or application.
 [Huggingface Spaces](https://huggingface.co/spaces) makes it super easy to get a
 prototype of your model on the internet.
 
-This model has an accuracy of ~96% on the validation dataset.
+This MVP app runs two models to mimic the experience of what a final deployed
+version of the project might look like.
+
+- The first model (a classification model trained with fastai, available on the
+  Huggingface Hub
+  [here](https://huggingface.co/strickvl/redaction-classifier-fastai) and
+  testable as a standalone demo
+  [here](https://huggingface.co/spaces/strickvl/fastai_redaction_classifier)),
+  classifies and determines which pages of the PDF are redacted. I've written
+  about how I trained this model [here](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).
+- The second model (an object detection model trained using [IceVision](https://airctic.com/), itself
+  built partly on top of fastai) detects which parts of the image are redacted.
+  This is a model I've been working on for a while and I described my process in
+  a series of blog posts (see below).
+
+This MVP app does several things:
+
+- it extracts any pages it considers to contain redactions and displays that
+  subset as an [image carousel](https://gradio.app/docs/#o_carousel). It also
+  displays some text alerting you to which specific pages were redacted.
+- if you click the "Analyse and extract redacted images" checkbox, it will:
+  - pass the pages it considered redacted through the object detection model
+  - calculate what proportion of the total area of the image was redacted as
+    well as what proportion of the actual content (i.e. excluding margins etc.
+    where there is no content)
+  - create a PDF that you can download that contains only the redacted images,
+    with an overlay of the redactions that it was able to identify along with
+    the confidence score for each item.
 
 ## The Dataset
 
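
The proportion-of-redacted-area and downloadable-PDF steps described in the added list above could be sketched roughly as follows. This assumes the object detection model yields (xmin, ymin, xmax, ymax) pixel boxes with confidence scores; the mask-based union avoids double-counting overlapping boxes, and Pillow's multi-page PDF export stands in for however the app actually builds its download. None of the names here come from the app itself.

import numpy as np
from PIL import Image, ImageDraw


def redacted_area_ratio(boxes, width, height):
    """Fraction of the page covered by redaction boxes (overlaps counted once)."""
    mask = np.zeros((height, width), dtype=bool)
    for xmin, ymin, xmax, ymax, _score in boxes:
        mask[int(ymin):int(ymax), int(xmin):int(xmax)] = True
    return mask.sum() / mask.size


def overlay_and_export(pages, detections, out_path="redacted_pages.pdf"):
    """Draw each detected redaction (with its confidence) and save one PDF."""
    annotated = []
    for page, boxes in zip(pages, detections):
        page = page.convert("RGB")
        draw = ImageDraw.Draw(page)
        for xmin, ymin, xmax, ymax, score in boxes:
            draw.rectangle([xmin, ymin, xmax, ymax], outline="red", width=4)
            draw.text((xmin, max(ymin - 14, 0)), f"{score:.2f}", fill="red")
        annotated.append(page)
    annotated[0].save(out_path, save_all=True, append_images=annotated[1:])
    return out_path

Called with the subset of PIL page images the classifier flagged and the per-page box lists from the detector, this would give an area ratio per page plus a single multi-page PDF of the annotated redacted pages.
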
@@ -14,14 +41,21 @@ I downloaded a few thousand publicly-available FOIA documents from a government
 website. I split the PDFs up into individual `.jpg` files and then used
 [Prodigy](https://prodi.gy/) to annotate the data. (This process was described
 in
-[a blogpost written last year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
+[a blogpost written last
+year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
+For the object detection model, the process was quite a bit more involved and I
+direct you to the series of articles referenced below in the 'Further Reading' section.
 
 ## Training the model
 
-I trained the model with fastai's flexible `vision_learner`, fine-tuning
+I trained the classification model with fastai's flexible `vision_learner`, fine-tuning
 `resnet18` which was both smaller than `resnet34` (no surprises there) and less
 liable to early overfitting. I trained the model for 10 epochs.
 
+The object detection model is trained using IceVision, with VFNet as the
+model and `resnet50` as the backbone. I trained the model for 50 epochs and
+reached 89% accuracy on the validation data.
+
 ## Further Reading
 
 This initial dataset spurred an ongoing interest in the domain and I've since
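
For reference, the classification recipe described in the updated 'Training the model' section reduces to something like the sketch below; the folder layout, image size, batch size, and validation split are assumptions for illustration, not taken from the project's actual training code.

from fastai.vision.all import *

# Assumed layout: data/redaction-classification/{redacted,unredacted}/*.jpg
path = Path("data/redaction-classification")
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42, item_tfms=Resize(224), bs=32
)

# Fine-tune a pretrained resnet18 for 10 epochs, as the diff describes
learn = vision_learner(dls, resnet18, metrics=accuracy)
learn.fine_tune(10)
learn.export("redaction-classifier.pkl")
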
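The object detection side follows IceVision's usual fastai-style training flow; the sketch below shows that pattern with VFNet and a `resnet50` FPN backbone, but the annotation format, transforms, backbone variant, batch size, and learning rate are all assumptions rather than the settings used for the model in this Space.

from icevision.all import *

# Assumed COCO-style bounding-box annotations for the redactions
parser = parsers.COCOBBoxParser(
    annotations_filepath="data/annotations.json", img_dir="data/images"
)
train_records, valid_records = parser.parse()

train_tfms = tfms.A.Adapter(
    [*tfms.A.aug_tfms(size=384, presize=512), tfms.A.Normalize()]
)
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(384), tfms.A.Normalize()])
train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)

# VFNet (via mmdet) with a resnet50 FPN backbone
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_mstrain_2x(pretrained=True)
model = model_type.model(backbone=backbone, num_classes=len(parser.class_map))

train_dl = model_type.train_dl(train_ds, batch_size=8, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=8, num_workers=4, shuffle=False)

learn = model_type.fastai.learner(
    dls=[train_dl, valid_dl],
    model=model,
    metrics=[COCOMetric(metric_type=COCOMetricType.bbox)],
)
learn.fine_tune(50, 1e-4, freeze_epochs=1)
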