monsoon-nlp/bert-base-thai

@@ -8,15 +8,23 @@ Adapted from https://github.com/ThAIKeras/bert for HuggingFace/Transformers libr
 ## Pre-tokenization
-You must run the original ThaiTokenizer to have your tokenization match that of the original model. If you skip this step, you will not do much better than
 mBERT or random chance!
 ```bash
-pip install pythainlp six sentencepiece==0.0.9
 git clone https://github.com/ThAIKeras/bert
-# download .vocab and .model files from ThAIKeras readme
 ```
 Then set up ThaiTokenizer class - this is modified slightly to
 remove a TensorFlow dependency.
@@ -116,12 +124,20 @@ Then pre-tokenizing your own text:
 from pythainlp import sent_tokenize
 tokenizer = ThaiTokenizer(vocab_file='th.wiki.bpe.op25000.vocab', spm_file='th.wiki.bpe.op25000.model')
-og_text = "กรุงเทพมหานคร..."
-split_sentences = ' '.join(sent_tokenize(txt))
-split_words = ' '.join(tokenizer.tokenize(split_sentences))
-split_words
-> "▁ร้าน อาหาร ใหญ่มาก กก กก กก ▁ <unk> เลี้ยว..."
 ```
 Original README follows:
@@ -279,4 +295,4 @@ python run_classifier.py \
   --spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
 ```
-Without additional preprocessing and further fine-tuning, the Thai-only BERT model can achieve 0.56612 and 0.57057 for public and private test-set scores respectively.

 ## Pre-tokenization
+You must run the original ThaiTokenizer to have your tokenization match that of the original model.
+If you skip this step, you will not do much better than
 mBERT or random chance!
+[Refer to this Colab notebook](https://colab.research.google.com/drive/1Ax9OsbTPwBBP1pJx1DkYwtgKILcj3Ur5?usp=sharing)
+ or follow these steps:
 ```bash
+pip install pythainlp six sentencepiece python-crfsuite
 git clone https://github.com/ThAIKeras/bert
+# download .vocab and .model files from ThAIKeras/bert > Tokenization section
 ```
+Alternatively, download them directly from these [.vocab](https://raw.githubusercontent.com/jitkapat/thaipostagger/master/th.wiki.bpe.op25000.vocab)
+ and [.model](https://raw.githubusercontent.com/jitkapat/thaipostagger/master/th.wiki.bpe.op25000.model) links.
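If you prefer to fetch the two files from a script rather than by hand, the following is a minimal sketch using only the Python standard library. The URLs are the ones linked above; the function name `fetch_tokenizer_files` and its `fetch` parameter are hypothetical, introduced here only so the download step can be swapped out or tried offline.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names taken from the .vocab and .model links above.
BASE_URL = "https://raw.githubusercontent.com/jitkapat/thaipostagger/master/"
FILENAMES = ["th.wiki.bpe.op25000.vocab", "th.wiki.bpe.op25000.model"]


def fetch_tokenizer_files(dest_dir=".", fetch=urlretrieve):
    """Download any missing tokenizer file into dest_dir and return the paths.

    `fetch` is a hypothetical hook: any callable taking (url, target_path).
    It defaults to urllib's urlretrieve.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in FILENAMES:
        target = dest / name
        if not target.exists():  # skip files that are already present
            fetch(BASE_URL + name, str(target))
        paths.append(target)
    return paths
```

Calling `fetch_tokenizer_files()` with no arguments should leave both files in the current directory, which is where the `ThaiTokenizer(vocab_file=..., spm_file=...)` call expects them.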
 Then set up ThaiTokenizer class - this is modified slightly to
 remove a TensorFlow dependency.
 from pythainlp import sent_tokenize
 tokenizer = ThaiTokenizer(vocab_file='th.wiki.bpe.op25000.vocab', spm_file='th.wiki.bpe.op25000.model')
+txt = "กรุงเทพมหานครเป็นเขตปกครองพิเศษของประเทศไทย มิได้มีสถานะเป็นจังหวัด คำว่า \"กรุงเทพมหานคร\" นั้นยังใช้เรียกองค์กรปกครองส่วนท้องถิ่นของกรุงเทพมหานครอีกด้วย"
+split_sentences = sent_tokenize(txt)
+print(split_sentences)
+"""
+['กรุงเทพมหานครเป็นเขตปกครองพิเศษของประเทศไทย ',
+ 'มิได้มีสถานะเป็นจังหวัด ',
+ 'คำว่า "กรุงเทพมหานคร" นั้นยังใช้เรียกองค์กรปกครองส่วนท้องถิ่นของกรุงเทพมหานครอีกด้วย']
+"""
+split_words = ' '.join(tokenizer.tokenize(' '.join(split_sentences)))
+print(split_words)
+"""
+'▁กรุงเทพมหานคร เป็นเขต ปกครอง พิเศษ ของประเทศไทย ▁มิ ได้มี สถานะเป็น จังหวัด ▁คําว่า ▁" กรุงเทพมหานคร " ▁นั้น...' # continues
+"""
 ```
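The two steps in the example (sentence splitting, then subword tokenization) can be wrapped in one helper so that every input string goes through the same path. This is only a sketch: `pretokenize` and its parameter names are hypothetical, and the two callables are expected to behave like `sent_tokenize` and `tokenizer.tokenize` from the example.

```python
from typing import Callable, List


def pretokenize(
    text: str,
    split_sentences: Callable[[str], List[str]],
    split_subwords: Callable[[str], List[str]],
) -> str:
    """Return `text` as one space-separated string of subword tokens."""
    sentences = split_sentences(text)          # e.g. pythainlp's sent_tokenize
    rejoined = " ".join(sentences)             # same join as in the example
    return " ".join(split_subwords(rejoined))  # e.g. ThaiTokenizer.tokenize
```

With the objects from the example in scope, `pretokenize(txt, sent_tokenize, tokenizer.tokenize)` should produce the same string as `split_words`.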
 Original README follows:
   --spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
 ```
+Without additional preprocessing and further fine-tuning, the Thai-only BERT model can achieve 0.56612 and 0.57057 for public and private test-set scores respectively.