Update README.md
README.md
@@ -250,7 +250,16 @@ fn = open("./2.7.3.13_processed.txt",'w')
 for key,value in grouped_dataset.items():
     fn.write(value)
     fn.write("\n")
-fn.close()
+fn.close()
+
+fn = open("./2.7.3.13_processed.txt",'w')
+for key,value in grouped_dataset.items():
+    padding_len = 1024 - len(tokenizer(value)['input_ids'])
+    padding = "<pad>"*padding_len
+    print(len(tokenizer(value+padding)['input_ids']))
+    fn.write(value+padding)
+    fn.write("\n")
+fn.close()
 ```
 The previous script will prepare a text file with the correct format for tokenization.
 Now we can use the tokenizer to convert its contents to tokens.
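Below is a minimal sketch of that tokenization step. The diff does not say which tokenizer `tokenizer` refers to; this assumes a Hugging Face GPT-2 tokenizer (its 1024-token context window matches the padding target above) with `<pad>` registered as a single special token, since the `"<pad>"*padding_len` arithmetic only holds if `<pad>` maps to exactly one token.

```
# Minimal sketch, not the author's confirmed setup: assumes a Hugging Face
# GPT-2 tokenizer, and that each padded example occupies one line of the file.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# "<pad>" must be registered as a single special token, or the 1024-token
# padding arithmetic in the script above does not hold.
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# Read the padded examples back and convert each one to token ids.
with open("./2.7.3.13_processed.txt") as f:
    for line in f:
        ids = tokenizer(line.rstrip("\n"))["input_ids"]
        assert len(ids) == 1024, "each padded example should be exactly 1024 tokens"
```

Note that `padding_len` goes negative for any example longer than 1024 tokens, so such examples would need truncation before this padding scheme can apply.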