Post
1721
You can clean and format datasets entirely in the browser with a few lines of SQL.
In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.
The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts
https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset
Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.
The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts
https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset
Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned