1-800-BAD-CODE committed
Commit
8a502c1
1 Parent(s): f25200c

Update README.md

Files changed (1):
  1. README.md +4 -3
README.md CHANGED
@@ -42,7 +42,7 @@ Therefore, language tags do not need to be used and a single batch can contain m
 ## Architecture
 This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier.
 
-Given that this is a relatively-easy NLP task, the model contains ~5M parameters (~4M of which are embeddings).
+Given that this is a relatively-easy NLP task, the model contains \~5M parameters (\~4M of which are embeddings).
 This makes the model very fast and cheap at inference time, as SBD should be.
 
 The BERT encoder is based on the following configuration:
@@ -119,7 +119,8 @@ For each input subword `t`, this model predicts the probability that `t` is the
 
 This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.
 
-The predictions are applied to the input by separating the token sequence where the predicted value exceeds a threshold for sentence boundary classification.
+This model runs with a script after checking out this repo; if there is any interest in it running in the HF API, let me know.
+For now, I assume no one cares.
 
 This model can be run directly with a couple of dependencies which most developers may already have installed.
 
@@ -129,7 +130,7 @@ The following snippet will install the dependencies, clone this repo, and run an
 $ pip install sentencepiece onnxruntime
 $ git clone https://huggingface.co/1-800-BAD-CODE/sentence_boundary_detection_multilang
 $ cd sentence_boundary_detection_multilang
-# Verify the content before running file
+# Inspect the content before running an arbitrary file
 # $ python run_example.py
 ```
 
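
For context on the line removed in the second hunk: it described applying the model's predictions by splitting the token sequence wherever the predicted boundary probability exceeds a threshold. A minimal sketch of that splitting step, assuming per-subword probabilities are already in hand (the function name and the 0.5 threshold are illustrative, not taken from `run_example.py`):

```python
def split_on_boundaries(tokens, boundary_probs, threshold=0.5):
    """Split a subword sequence into sentences at predicted boundaries.

    A token whose boundary probability exceeds `threshold` is treated as
    the last subword of a sentence.
    """
    sentences, current = [], []
    for token, prob in zip(tokens, boundary_probs):
        current.append(token)
        if prob > threshold:  # predicted sentence boundary after this token
            sentences.append(current)
            current = []
    if current:  # trailing subwords with no final boundary prediction
        sentences.append(current)
    return sentences


tokens = ["▁Hello", "▁world", ".", "▁How", "▁are", "▁you", "?"]
probs = [0.01, 0.02, 0.97, 0.01, 0.01, 0.02, 0.99]
print(split_on_boundaries(tokens, probs))
# → [['▁Hello', '▁world', '.'], ['▁How', '▁are', '▁you', '?']]
```

In practice the grouped subwords would then be detokenized with the bundled `SentencePiece` model to recover the sentence strings.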