hlky
hlky's activity
https://huggingface.co/blog/hlky/web-scraping-201
It's a theoretical lesson on limitations that apply to some services, such as Artsy, and how to handle them.
Edit: looks like that didn't post properly, I'll have to rewrite it.
Currently running it myself on an A40 with the CAPTION task and a streaming WebDataset @ 60k images/hour!
Thanks, that's helpful. Currently the title and description are not ideal for this kind of filtering. We'd need all the images captioned and classes/categories extracted from the captions. Captioning the set is planned; I'm building https://github.com/bigdata-pw/florence-tool for that purpose. Another (very exciting) project is my priority right now, but I will aim to get an initial version of this UI out soon. It will focus on image datasets like Flickr, with a gallery-type view for quick review and selection, plus filtering options. As mentioned, text-based filtering will be of limited use until captions/classes are available, but it will still be useful to filter on available image resolutions, view count (popularity), etc.
For reference, the image sizes (url_n, url_w, url_m etc.) are documented here: https://www.flickr.com/services/api/misc.urls.html
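As a rough idea of what filtering on resolution and view count could look like before captions exist, here is a minimal sketch. The row layout is an assumption: it borrows the Flickr API naming (url_m, url_n, url_w, views), and the actual dataset columns may differ. The helper names are hypothetical too.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Longest edge in pixels for the size suffixes mentioned above,
# per the Flickr URL documentation (m=240, n=320, w=400).
SIZE_SUFFIXES = {"m": 240, "n": 320, "w": 400}


def largest_available(row: dict) -> Optional[Tuple[str, str]]:
    """Return (suffix, url) for the largest size present in the row, or None."""
    for suffix in sorted(SIZE_SUFFIXES, key=SIZE_SUFFIXES.get, reverse=True):
        url = row.get(f"url_{suffix}")
        if url:
            return suffix, url
    return None


def filter_rows(rows: Iterable[dict], min_edge: int = 320,
                min_views: int = 100) -> Iterator[dict]:
    """Yield rows that have a large enough image and enough views.

    The "views" key is an assumed column name; it may arrive as a string,
    so it is converted to int before comparing.
    """
    for row in rows:
        best = largest_available(row)
        if best is None:
            continue  # no usable image URL in this row
        suffix, url = best
        views = int(row.get("views") or 0)
        if SIZE_SUFFIXES[suffix] >= min_edge and views >= min_views:
            yield {**row, "best_size": suffix, "best_url": url}
```

The thresholds are arbitrary defaults. Extending SIZE_SUFFIXES with the larger sizes from the linked page would work the same way.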
Sure, it's a fun and useful project. I've made a start already with some of the basic features. If you could tell me more about how you're expecting it to work and what the user interface should be like, that would help refine it.
Interesting use case, I can certainly cook something up for that over the weekend, I'll let you know when it's ready.
In the meantime you can browse online with the Dataset Viewer: https://huggingface.co/datasets/bigdata-pw/Flickr/viewer
This article should get you started: https://huggingface.co/blog/hlky/processing-parquets-102
We'll cover more advanced topics like downloading into WebDatasets, which is recommended if you want millions of images, in later articles.
If there's any specific kind of filtering you'd like to see covered or anything else just let me know, always happy to help!
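Until those articles are out, here is a minimal sketch of the WebDataset layout itself: a plain tar file in which the files of one sample share a basename and sit next to each other. It uses only the standard library and is not the method from the articles. The function names, the .jpg/.json pairing and the sample tuples are assumptions for illustration, and the actual downloading is left out.

```python
import io
import json
import tarfile
from typing import Iterable, Tuple


def add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Add an in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shard(path: str, samples: Iterable[Tuple[str, bytes, dict]]) -> int:
    """Write (key, image_bytes, metadata) samples into one tar shard.

    Each sample becomes <key>.jpg and <key>.json, written back to back so a
    WebDataset reader can group them. Keys should not contain dots, since
    the part after the first dot is treated as the extension.
    """
    count = 0
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, metadata in samples:
            add_member(tar, f"{key}.jpg", image_bytes)
            add_member(tar, f"{key}.json", json.dumps(metadata).encode("utf-8"))
            count += 1
    return count
```

For millions of images you would write many shards of a fixed size rather than one large tar, so they can be streamed and shuffled independently.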
Please refrain from advertising your service on my post, thanks!
In case you missed them, other recent drops include bigdata-pw/Dinosaurs - a small set of BIG creatures - and the first in a series of articles about the art of web scraping: https://huggingface.co/blog/hlky/web-scraping-101 and https://huggingface.co/blog/hlky/web-scraping-102
Stay tuned for exciting datasets and models coming soon:
- PC and Console game screenshots
- TV/Film actors biographies and photos (think facial recognition and automatic captioning!)
- bigdata-pw/lyrics-gpt v2
- and more!
Data acquisition for this project is still in progress, get ready for an update soon™
In case you missed them, other BIG data drops include Diffusion1B (bigdata-pw/Diffusion1B), ~1.23B images and generation parameters from a variety of diffusion models. And if you fancy practicing diffusion model training, check out Dataception (bigdata-pw/Dataception), a dataset of over 5000 datasets in WebDataset format!
Requests are always welcome so reach out if there's a dataset you'd like to see!