Large Dataset
Thanks for this great project!
I want to use it on the Yelp dataset, which contains almost 7 million review texts in total. How can I use this code on the Yelp CSV dataset?
Thank you, @erfankh. I'm happy to hear you find the model helpful!
In the Application section of the model card, you can find an "Open in Colab" button that demonstrates how you can classify your own CSV file on Google Colab (see screenshot below).
Yes, I saw that code, but I'm talking about a 4 GB text file, and I would need a lot of memory to run the code on that CSV. So my question is: how can I run this code efficiently on 7 million rows of long text?
```python
# Tokenize texts and create prediction data set
tokenized_texts = tokenizer(pred_texts, truncation=True, padding=True)
pred_dataset = SimpleDataset(tokenized_texts)
```
This part of the code uses a lot of memory, and I don't know how to change it so it uses less for a large dataset.
I see. Maybe batch tokenization will work for you? For example, see here: https://osmanfatihkilic.medium.com/fine-tuning-huggingface-models-without-overwhelming-your-memory-d33b8a206ae2
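In case it helps, here is a minimal sketch of that idea: read the CSV in chunks and tokenize/classify one small batch at a time, so the 7 million rows never sit in memory at once. This is not the exact code from the Colab notebook; the checkpoint name, the CSV path, the "text" column, and the chunk/batch sizes are placeholders you would need to adapt to your setup.

```python
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "your-checkpoint-here"   # assumption: replace with the actual model id
CSV_PATH = "yelp_reviews.csv"         # assumption: path to the large Yelp CSV
TEXT_COLUMN = "text"                  # assumption: name of the review-text column
BATCH_SIZE = 64                       # tune to your GPU/CPU memory

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device)
model.eval()

first_chunk = True
# Read the CSV lazily in chunks so the full file is never loaded at once.
for chunk in pd.read_csv(CSV_PATH, chunksize=10_000):
    texts = chunk[TEXT_COLUMN].astype(str).tolist()
    preds = []
    # Tokenize and classify one small batch at a time instead of all texts up front.
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i:i + BATCH_SIZE]
        enc = tokenizer(batch, truncation=True, padding=True, return_tensors="pt").to(device)
        with torch.no_grad():
            logits = model(**enc).logits
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
    chunk["predicted_label"] = [model.config.id2label[p] for p in preds]
    # Append each chunk's predictions to disk immediately so nothing accumulates in RAM.
    chunk.to_csv("predictions.csv", mode="w" if first_chunk else "a",
                 header=first_chunk, index=False)
    first_chunk = False
```

Peak memory then scales with the chunk and batch sizes rather than with the size of the whole dataset, so you can trade speed for memory by adjusting those two numbers.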