Large Dataset
Thanks for this great project!
I want to use it on the Yelp dataset, which contains almost 7 million review texts in total. How can I use this code on the Yelp CSV dataset?
Thank you, @erfankh. I'm happy to hear you find the model helpful!
In the Application section of the model card, you can find an "Open in Colab" button that demonstrates how you can classify your own CSV file on Google Colab (see screenshot below).
Yes, I saw that code, but I'm talking about a 4 GB text file, and I would need a lot of memory to run the code on that CSV. So my question is: how can I run this code efficiently on 7 million rows of long text?
```python
# Tokenize texts and create prediction data set
tokenized_texts = tokenizer(pred_texts, truncation=True, padding=True)
pred_dataset = SimpleDataset(tokenized_texts)
```
This part of the code uses a lot of memory, and I don't know how to change it so it uses less for a large dataset.
I see. Maybe batch tokenization will work for you? For example, see here: https://osmanfatihkilic.medium.com/fine-tuning-huggingface-models-without-overwhelming-your-memory-d33b8a206ae2
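In case it helps, here is a minimal sketch of that idea: read the CSV in chunks and tokenize/classify one small batch at a time, so the 7 million rows never sit in memory at once. This is not the exact code from the Colab notebook; the checkpoint name, the CSV path, the "text" column, and the chunk/batch sizes are placeholders you would need to adapt to your setup.

```python
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "your-checkpoint-here"   # assumption: replace with the actual model id
CSV_PATH = "yelp_reviews.csv"         # assumption: path to the large Yelp CSV
TEXT_COLUMN = "text"                  # assumption: name of the review-text column
BATCH_SIZE = 64                       # tune to your GPU/CPU memory

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device)
model.eval()

first_chunk = True
# Read the CSV lazily in chunks so the full file is never loaded at once.
for chunk in pd.read_csv(CSV_PATH, chunksize=10_000):
    texts = chunk[TEXT_COLUMN].astype(str).tolist()
    preds = []
    # Tokenize and classify one small batch at a time instead of all texts up front.
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i:i + BATCH_SIZE]
        enc = tokenizer(batch, truncation=True, padding=True, return_tensors="pt").to(device)
        with torch.no_grad():
            logits = model(**enc).logits
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
    chunk["predicted_label"] = [model.config.id2label[p] for p in preds]
    # Append each chunk's predictions to disk immediately so nothing accumulates in RAM.
    chunk.to_csv("predictions.csv", mode="w" if first_chunk else "a",
                 header=first_chunk, index=False)
    first_chunk = False
```

Peak memory then scales with the chunk and batch sizes rather than with the size of the whole dataset, so you can trade speed for memory by adjusting those two numbers.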