Alex Strick van Linschoten committed
Commit 5002a2b
Parent: 02f3e99

update text

Files changed (2):
  1. app.py +3 -1
  2. article.md +37 -3
app.py CHANGED

@@ -206,7 +206,9 @@ demo = gr.Interface(
             label="Confidence",
             optional=False,
         ),
-        gr.inputs.Checkbox(label="Extract redacted images", default=True),
+        gr.inputs.Checkbox(
+            label="Analyse and extract redacted images", default=True
+        ),
     ],
     outputs=[
         gr.outputs.Textbox(label="Document Analysis"),
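
A minimal sketch of how a checkbox like this feeds the prediction function through the Gradio 2.x-era `gr.inputs`/`gr.outputs` API that app.py uses. The `analyze_document` function, the File and Number inputs, and their defaults are placeholders for illustration, not the app's actual signature.

import gradio as gr


def analyze_document(pdf_file, confidence, extract_redactions):
    # The checkbox value arrives as a plain bool; the slower object-detection
    # pass would only run when it is ticked.
    summary = f"Classifier pass run at confidence threshold {confidence}."
    if extract_redactions:
        summary += " Object-detection pass on redacted pages enabled."
    return summary


demo = gr.Interface(
    fn=analyze_document,
    inputs=[
        gr.inputs.File(label="PDF file"),
        gr.inputs.Number(label="Confidence", default=0.8),
        gr.inputs.Checkbox(
            label="Analyse and extract redacted images", default=True
        ),
    ],
    outputs=[gr.outputs.Textbox(label="Document Analysis")],
)

if __name__ == "__main__":
    demo.launch()
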
article.md CHANGED

@@ -6,7 +6,34 @@ models out in the world as some kind of demo or application.
 [Huggingface Spaces](https://huggingface.co/spaces) makes it super easy to get a
 prototype of your model on the internet.
 
-This model has an accuracy of ~96% on the validation dataset.
+This MVP app runs two models to mimic the experience of what a final deployed
+version of the project might look like.
+
+- The first model (a classification model trained with fastai, available on the
+  Huggingface Hub
+  [here](https://huggingface.co/strickvl/redaction-classifier-fastai) and
+  testable as a standalone demo
+  [here](https://huggingface.co/spaces/strickvl/fastai_redaction_classifier)),
+  classifies and determines which pages of the PDF are redacted. I've written
+  about how I trained this model [here](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).
+- The second model (an object detection model trained using [IceVision](https://airctic.com/), itself
+  built partly on top of fastai) detects which parts of the image are redacted.
+  This is a model I've been working on for a while and I described my process in
+  a series of blog posts (see below).
+
+This MVP app does several things:
+
+- it extracts any pages it considers to contain redactions and displays that
+  subset as an [image carousel](https://gradio.app/docs/#o_carousel). It also
+  displays some text alerting you to which specific pages were redacted.
+- if you click the "Analyse and extract redacted images" checkbox, it will:
+  - pass the pages it considered redacted through the object detection model
+  - calculate what proportion of the total area of the image was redacted as
+    well as what proportion of the actual content (i.e. excluding margins etc.
+    where there is no content)
+  - create a PDF that you can download that contains only the redacted images,
+    with an overlay of the redactions that it was able to identify along with
+    the confidence score for each item.
 
 ## The Dataset
 
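
The proportion-of-redacted-area and downloadable-PDF steps described in the added list above could be sketched roughly as follows. This assumes the object detection model yields (xmin, ymin, xmax, ymax) pixel boxes with confidence scores; the mask-based union avoids double-counting overlapping boxes, and Pillow's multi-page PDF export stands in for however the app actually builds its download. None of the names here come from the app itself.

import numpy as np
from PIL import Image, ImageDraw


def redacted_area_ratio(boxes, width, height):
    """Fraction of the page covered by redaction boxes (overlaps counted once)."""
    mask = np.zeros((height, width), dtype=bool)
    for xmin, ymin, xmax, ymax, _score in boxes:
        mask[int(ymin):int(ymax), int(xmin):int(xmax)] = True
    return mask.sum() / mask.size


def overlay_and_export(pages, detections, out_path="redacted_pages.pdf"):
    """Draw each detected redaction (with its confidence) and save one PDF."""
    annotated = []
    for page, boxes in zip(pages, detections):
        page = page.convert("RGB")
        draw = ImageDraw.Draw(page)
        for xmin, ymin, xmax, ymax, score in boxes:
            draw.rectangle([xmin, ymin, xmax, ymax], outline="red", width=4)
            draw.text((xmin, max(ymin - 14, 0)), f"{score:.2f}", fill="red")
        annotated.append(page)
    annotated[0].save(out_path, save_all=True, append_images=annotated[1:])
    return out_path

Called with the subset of PIL page images the classifier flagged and the per-page box lists from the detector, this would give an area ratio per page plus a single multi-page PDF of the annotated redacted pages.
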
@@ -14,14 +41,21 @@ I downloaded a few thousand publicly-available FOIA documents from a government
 website. I split the PDFs up into individual `.jpg` files and then used
 [Prodigy](https://prodi.gy/) to annotate the data. (This process was described
 in
-[a blogpost written last year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
+[a blogpost written last
+year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
+For the object detection model, the process was quite a bit more involved and I
+direct you to the series of articles referenced below in the 'Further Reading' section.
 
 ## Training the model
 
-I trained the model with fastai's flexible `vision_learner`, fine-tuning
+I trained the classification model with fastai's flexible `vision_learner`, fine-tuning
 `resnet18` which was both smaller than `resnet34` (no surprises there) and less
 liable to early overfitting. I trained the model for 10 epochs.
 
+The object detection model is trained using IceVision, with VFNet as the
+model and `resnet50` as the backbone. I trained the model for 50 epochs and
+reached 89% accuracy on the validation data.
+
 ## Further Reading
 
 This initial dataset spurred an ongoing interest in the domain and I've since
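
For reference, the classification recipe described in the updated 'Training the model' section reduces to something like the sketch below; the folder layout, image size, batch size, and validation split are assumptions for illustration, not taken from the project's actual training code.

from fastai.vision.all import *

# Assumed layout: data/redaction-classification/{redacted,unredacted}/*.jpg
path = Path("data/redaction-classification")
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42, item_tfms=Resize(224), bs=32
)

# Fine-tune a pretrained resnet18 for 10 epochs, as the diff describes
learn = vision_learner(dls, resnet18, metrics=accuracy)
learn.fine_tune(10)
learn.export("redaction-classifier.pkl")
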
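The object detection side follows IceVision's usual fastai-style training flow; the sketch below shows that pattern with VFNet and a `resnet50` FPN backbone, but the annotation format, transforms, backbone variant, batch size, and learning rate are all assumptions rather than the settings used for the model in this Space.

from icevision.all import *

# Assumed COCO-style bounding-box annotations for the redactions
parser = parsers.COCOBBoxParser(
    annotations_filepath="data/annotations.json", img_dir="data/images"
)
train_records, valid_records = parser.parse()

train_tfms = tfms.A.Adapter(
    [*tfms.A.aug_tfms(size=384, presize=512), tfms.A.Normalize()]
)
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(384), tfms.A.Normalize()])
train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)

# VFNet (via mmdet) with a resnet50 FPN backbone
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_mstrain_2x(pretrained=True)
model = model_type.model(backbone=backbone, num_classes=len(parser.class_map))

train_dl = model_type.train_dl(train_ds, batch_size=8, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=8, num_workers=4, shuffle=False)

learn = model_type.fastai.learner(
    dls=[train_dl, valid_dl],
    model=model,
    metrics=[COCOMetric(metric_type=COCOMetricType.bbox)],
)
learn.fine_tune(50, 1e-4, freeze_epochs=1)
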