Pedro Ortiz Suarez
pjox
AI & ML interests
Language modeling, parsing, sequence tagging, NER, historical languages.
Recent Activity
updated a Space 3 days ago: oscar-corpus/README
liked a dataset 6 months ago: oscar-corpus/community-oscar
updated a dataset 7 months ago: pjox/tmp4c-index
pjox's activity
Set `sep="\s+"` for the duplicates file
2
#1 opened 9 months ago by lhoestq

Porn-related strings in the datasets (zh)
2
#8 opened about 1 year ago by kiwakwok
colab crashed after trying to load the dataset
1
#4 opened over 1 year ago by MhondGhod
Change foldernames
4
#3 opened over 1 year ago by hac541309
Unsafe Files
20
#12 opened almost 2 years ago by GetzPro

About the number of documents
6
#6 opened over 1 year ago by lixin4ever
Upload the rest of the data for 05-06-23
#1 opened over 1 year ago by pjox

Changing into Parquet
2
#5 opened over 1 year ago by hac541309
the link to RoBERTa base model directs us to bert-base-uncased
1
#1 opened almost 2 years ago by hurrial

Deduplicated English Corpus
2
#3 opened almost 2 years ago by conceptofmind

Data hosting on Huggingface
1
#2 opened almost 2 years ago by hieuhocnlp

How to download only one language?
2
#1 opened almost 2 years ago by musabg
full of sexy content and doesn't have 200G in zh corpus
1
#10 opened almost 2 years ago by Hzhiqiang