Commit 5002a2b
Alex Strick van Linschoten committed
1 Parent(s): 02f3e99

update text

Files changed:
- app.py +3 -1
- article.md +37 -3
app.py
CHANGED
@@ -206,7 +206,9 @@ demo = gr.Interface(
             label="Confidence",
             optional=False,
         ),
-        gr.inputs.Checkbox(
+        gr.inputs.Checkbox(
+            label="Analyse and extract redacted images", default=True
+        ),
     ],
     outputs=[
         gr.outputs.Textbox(label="Document Analysis"),
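Before this commit the `gr.inputs.Checkbox(` call was left unclosed, which would have been a syntax error when app.py loaded; the new lines close it and give it a label and a default. For context, a minimal sketch of how the repaired checkbox sits inside the `gr.Interface` call, written against the old Gradio 2.x `gr.inputs`/`gr.outputs` API the app uses. The handler function, the `File` input, and the guess that the "Confidence" component is a `Number` are my assumptions; only the kwargs shown in the diff are confirmed.

```python
import gradio as gr

# Hypothetical handler: the real app.py defines its own analysis function.
def analyze_pdf(pdf_file, confidence, extract_redactions):
    # ... classify pages, optionally run the object detector ...
    return "Document Analysis placeholder"

demo = gr.Interface(
    fn=analyze_pdf,
    inputs=[
        gr.inputs.File(label="PDF"),  # assumed: the diff doesn't show this input
        gr.inputs.Number(             # assumed component type; diff shows only its kwargs
            label="Confidence",
            optional=False,
        ),
        gr.inputs.Checkbox(
            label="Analyse and extract redacted images", default=True
        ),
    ],
    outputs=[
        gr.outputs.Textbox(label="Document Analysis"),
    ],
)

demo.launch()
```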
article.md
CHANGED
@@ -6,7 +6,34 @@ models out in the world as some kind of demo or application.
 [Huggingface Spaces](https://huggingface.co/spaces) makes it super easy to get a
 prototype of your model on the internet.
 
-This
+This MVP app runs two models to mimic the experience of what a final deployed
+version of the project might look like.
+
+- The first model (a classification model trained with fastai, available on the
+  Huggingface Hub
+  [here](https://huggingface.co/strickvl/redaction-classifier-fastai) and
+  testable as a standalone demo
+  [here](https://huggingface.co/spaces/strickvl/fastai_redaction_classifier)),
+  classifies and determines which pages of the PDF are redacted. I've written
+  about how I trained this model [here](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).
+- The second model (an object detection model trained using [IceVision](https://airctic.com/) (itself
+  built partly on top of fastai)) detects which parts of the image are redacted.
+  This is a model I've been working on for a while and I described my process in
+  a series of blog posts (see below).
+
+This MVP app does several things:
+
+- it extracts any pages it considers to contain redactions and displays that
+  subset as an [image carousel](https://gradio.app/docs/#o_carousel). It also
+  displays some text alerting you to which specific pages were redacted.
+- if you click the "Analyse and extract redacted images" checkbox, it will:
+  - pass the pages it considered redacted through the object detection model
+  - calculate what proportion of the total area of the image was redacted as
+    well as what proportion of the actual content (i.e. excluding margins etc
+    where there is no content)
+  - create a PDF that you can download that contains only the redacted images,
+    with an overlay of the redactions that it was able to identify along with
+    the confidence score for each item.
 
 ## The Dataset
 
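The new text describes a two-stage flow: the classifier decides which pages are redacted, and only those pages go on to the slower object detection model. A minimal sketch of that gating step, assuming a fastai classifier exported to a `.pkl` file and a `redacted` label (neither the filename nor the label name appears in the commit):

```python
from fastai.vision.all import load_learner

# Assumed export name: the commit doesn't show how the app loads its models.
classifier = load_learner("redaction-classifier.pkl")

def pages_with_redactions(page_images):
    """Return indices of the pages the classifier flags as redacted."""
    flagged = []
    for i, page in enumerate(page_images):
        label, _, _ = classifier.predict(page)
        if label == "redacted":  # assumed label name
            flagged.append(i)
    return flagged
```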
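The "proportion redacted" figures reduce to summed bounding-box areas. A sketch under two assumptions the commit doesn't state: boxes arrive as `(xmin, ymin, xmax, ymax)` pixel tuples, and they don't overlap (overlapping boxes would need merging before summing to avoid double-counting):

```python
def redacted_proportions(boxes, image_size, content_box):
    """Share of the whole page, and of the content area, covered by redactions.

    boxes:       [(xmin, ymin, xmax, ymax), ...] detector output (assumed format)
    image_size:  (width, height) of the page image in pixels
    content_box: box around the page's actual content, excluding margins
    """
    redacted = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    width, height = image_size
    cx1, cy1, cx2, cy2 = content_box
    content_area = (cx2 - cx1) * (cy2 - cy1)
    return redacted / (width * height), redacted / content_area
```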
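The downloadable PDF step could look something like this Pillow sketch: draw each detected box and its confidence score onto the flagged pages, then bundle them into one multi-page PDF. The detection dict layout is an assumption, not the app's actual structure.

```python
from PIL import Image, ImageDraw

def build_redactions_pdf(pages, detections, out_path="redactions.pdf"):
    """pages: PIL images of the redacted pages only.
    detections: per page, a list like [{"box": (x1, y1, x2, y2), "score": 0.97}].
    """
    annotated = []
    for page, dets in zip(pages, detections):
        page = page.convert("RGB")  # PDF export needs RGB mode
        draw = ImageDraw.Draw(page)
        for det in dets:
            x1, y1, x2, y2 = det["box"]
            draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
            draw.text((x1, max(0, y1 - 12)), f"{det['score']:.2f}", fill="red")
        annotated.append(page)
    # Pillow writes multi-page PDFs directly from a list of images.
    annotated[0].save(out_path, save_all=True, append_images=annotated[1:])
```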
@@ -14,14 +41,21 @@ I downloaded a few thousand publicly-available FOIA documents from a government
 website. I split the PDFs up into individual `.jpg` files and then used
 [Prodigy](https://prodi.gy/) to annotate the data. (This process was described
 in
-[a blogpost written last
+[a blogpost written last
+year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
+For the object detection model, the process was quite a bit more involved and I
+direct you to the series of articles referenced below in the 'Further Reading' section.
 
 ## Training the model
 
-I trained the model with fastai's flexible `vision_learner`, fine-tuning
+I trained the classification model with fastai's flexible `vision_learner`, fine-tuning
 `resnet18` which was both smaller than `resnet34` (no surprises there) and less
 liable to early overfitting. I trained the model for 10 epochs.
 
+The object detection model is trained using IceVision, with VFNet as the
+model and `resnet50` as the backbone. I trained the model for 50 epochs and
+reached 89% accuracy on the validation data.
+
 ## Further Reading
 
 This initial dataset spurred an ongoing interest in the domain and I've since
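For reference, the `vision_learner` fine-tuning the updated text describes could look like the sketch below. The folder layout, image size, and validation split are my assumptions; the commit doesn't include the training code. (I haven't sketched the IceVision/VFNet run, since none of its setup appears here.)

```python
from fastai.vision.all import (
    ImageDataLoaders, Resize, accuracy, resnet18, vision_learner,
)

# Assumed layout: data/pages/redacted/*.jpg and data/pages/not_redacted/*.jpg,
# with labels taken from the parent folder names.
dls = ImageDataLoaders.from_folder(
    "data/pages",
    valid_pct=0.2,          # assumed split
    item_tfms=Resize(224),  # assumed input size
)

# Fine-tune a pretrained resnet18, as the article describes.
learn = vision_learner(dls, resnet18, metrics=accuracy)
learn.fine_tune(10)  # "10 epochs", per the article
```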