nferruz committed
Commit 104a6f8
1 Parent(s): bb11f1d

Update README.md

Files changed (1):
  1. README.md +11 -1
README.md CHANGED
@@ -250,7 +250,17 @@ fn = open("./2.7.3.13_processed.txt",'w')
  for key,value in grouped_dataset.items():
      fn.write(value)
      fn.write("\n")
- fn.close()
+ fn.close()
+
+ fn = open("./2.7.3.13_processed.txt",'w')
+ for key,value in grouped_dataset.items():
+     padding_len = 1024 - len(tokenizer(value)['input_ids'])
+     padding = "<pad>"*padding_len
+     print(len(tokenizer(value+padding)['input_ids']))
+     fn.write(value+padding)
+     fn.write
+     fn.write("\n")
+ fn.close()
  ```
  The previous script will prepare a text file with the correct format for tokenization.
  Now we can use the tokenizer to convert its contents to tokens.
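The added lines pad every sequence with `"<pad>"` tokens until its tokenized length reaches 1024, so all examples share a fixed context length. A minimal sketch of that padding logic follows; the `tokenizer` here is a hypothetical stand-in (it treats `"<pad>"` as one token and any other whitespace-free run as one token), since the real Hugging Face tokenizer is not shown in this diff:

```python
import re

# Stand-in tokenizer for illustration only: "<pad>" counts as a single
# token, and any other run of non-space characters counts as one token.
# The README uses a Hugging Face tokenizer instead.
def tokenizer(text):
    tokens = re.findall(r"<pad>|(?:(?!<pad>)\S)+", text)
    return {"input_ids": tokens}

MAX_LEN = 1024  # fixed context length targeted by the diff


def pad_to_length(value, max_len=MAX_LEN):
    """Append "<pad>" tokens until the tokenized sequence reaches max_len."""
    padding_len = max_len - len(tokenizer(value)["input_ids"])
    padding = "<pad>" * padding_len
    return value + padding


padded = pad_to_length("MKVLINSTQ")
print(len(tokenizer(padded)["input_ids"]))  # 1024
```

The `print` in the diff serves the same role as the one here: a sanity check that every written line tokenizes to exactly 1024 ids before training.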