nyuuzyou

AI & ML interests

None yet

Recent Activity

New activity about 2 hours ago
nyuuzyou/tamago
reacted to AkimfromParis's post with ❤️ about 21 hours ago
posted an update about 23 hours ago

nyuuzyou's activity

reacted to AkimfromParis's post with ❤️ about 21 hours ago
🇯🇵 The Open Japanese LLM Leaderboard created by LLM-jp 🌸 in partnership with HuggingFace 🤗 was released today!

Blog: https://huggingface.co/blog/leaderboard-japanese
Space: llm-jp/open-japanese-llm-leaderboard

🌐 The leaderboard is available in both Japanese and English
📚 Based on the evaluation tool llm-jp-eval, with more than 20 datasets for Japanese LLMs
📊 The leaderboard showcases all the metrics for NLP experts, plus averages for NLP beginners
💻 For the comfort of users, we chose a horizontal UI and implemented it with light and dark themes in Gradio
🔬 The radar chart provides a very interesting visualization of metrics!
🌱 We are using the Japanese research platform MDX, so please be patient!
⚡ LLMs larger than 70B will be evaluated soon…

How do you say โ€œGPUs Go Brrrโ€ in Japanese - > GPUใŒใƒ–ใƒณใƒ–ใƒณ๏ฝž! (To pronounce "GPU ga bunbun!") ๐Ÿ”ฅ
posted an update about 23 hours ago
🎵 Introducing Tamago Music Dataset - nyuuzyou/tamago

A collection of 1,567 music tracks featuring:

- Complete metadata with audio files and cover artwork
- Rich track information including titles, descriptions, and genres
- User engagement metrics like play counts and reactions
- English language content from independent artists
- Released under Creative Commons Zero (CC0) license

Dataset structure includes:
- Track metadata (titles, descriptions, genres, tags)
- Associated media (audio files, cover images)
- Artist information and engagement metrics

Particularly valuable for:
- Music generation model training
- Cross-modal analysis
- Audio classification tasks
- Music style and genre analysis
replied to their post 2 days ago

Thanks! I license almost all of my datasets under CC0, with different modalities and tasks. Maybe somebody can find something else interesting for them in my profile 😉

posted an update 3 days ago
๐Ÿ–ผ๏ธ Introducing Public Domain Pictures Dataset - nyuuzyou/publicdomainpictures

Dataset highlights:
- 644,412 public domain images with comprehensive metadata from publicdomainpictures.net
- English language metadata including titles, descriptions, and keywords
- Each entry contains rich metadata including:
  - Unique image ID and full-size image URLs
  - Detailed titles and descriptions
  - Keyword/tag collections
  - Creator attribution
- Released to the public domain under Creative Commons Zero (CC0) license
posted an update 10 days ago
🎵 Introducing Suno Music Generation Dataset - nyuuzyou/suno

Dataset highlights:

- 659,788 AI-generated music samples with comprehensive metadata from suno.com
- Multilingual content with English as primary language, including Japanese and other languages
- Each entry contains rich metadata including:
  - Unique song ID, audio/video URLs, and thumbnail images
  - AI model version and generation parameters
  - Song metadata (tags, prompts, duration)
  - Creator information and engagement metrics
- Released to the public domain under Creative Commons Zero (CC0) license

The dataset structure includes detailed information about each generated piece, from technical parameters to user engagement metrics, making it particularly valuable for:
- Music generation model training
- Cross-modal analysis (text-to-audio relationships)
- User engagement studies
- Audio classification tasks
- Music style and genre analysis
posted an update 16 days ago
🎓 Introducing Kompy.info Uzbek Educational Dataset - nyuuzyou/kompy

Dataset highlights:
- 584,648 pages of educational content extracted from kompy.info, a comprehensive educational resource website
- Content exclusively in Uzbek language, focusing on technical and scientific topics
- Each entry contains: URL, page title, and extracted main text content
- Data extracted using trafilatura HTML extraction tool
- Covers a wide range of academic and educational materials
- Released to the public domain under Creative Commons Zero (CC0) license

The dataset presents a valuable resource for natural language processing tasks in the Uzbek language, particularly in educational and technical domains. It can be used for text classification, topic modeling, and content analysis of educational materials. The large-scale collection of Uzbek-language academic content makes it especially useful for developing educational technology applications and studying pedagogical approaches in Uzbek-language instruction. The dataset's monolingual nature provides a focused corpus for understanding technical and scientific terminology in Uzbek educational contexts.
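
As a rough illustration of the entry layout described above (URL, page title, extracted main text), here is a minimal sketch. The field names `url`, `title`, and `text`, the `keep_substantial` helper, the 200-character threshold, and the example.org URLs are all assumptions made for illustration; they are not taken from the dataset card.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Page:
    # Hypothetical field names mirroring "URL, page title, and extracted main text".
    url: str
    title: str
    text: str


def keep_substantial(pages: Iterable[Page], min_chars: int = 200) -> List[Page]:
    """Keep only pages whose stripped text has at least min_chars characters."""
    return [p for p in pages if len(p.text.strip()) >= min_chars]


# Placeholder records, not real dataset rows.
sample = [
    Page("https://example.org/a", "Short page", "   "),
    Page("https://example.org/b", "Long page", "x" * 500),
]
print([p.title for p in keep_substantial(sample)])
```

A length filter like this is a common first pass before text classification or topic modeling, since pages with almost no extracted text add noise rather than signal.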
reacted to m-ric's post with 🔥 19 days ago
> Oasis: First Real-Time Video Game Without a Game Engine! 🎮

DecartAI & Etched just released Oasis - a fully AI-generated video game running at 20 FPS (frames per second). The model takes keyboard inputs and generates everything - physics, rules, graphics - on the fly, without any game engine.

โšก๏ธ What makes this special? Current text-to-video models (Mochi-1, Sora, Kling) generate about 1 frame every 10-20 seconds (that's the kind of device I had to play LoL back in the day, thus my low rankings). Oasis is 200 times faster, making it the first playable AI-generated game.

โš™๏ธ Under the hood, it uses a vision transformer to encode space and a diffusion model to generate frames. The secret sauce is "dynamic noising" - a technique that keeps the video stable between frames.

Key insights:
โšก๏ธ Generates 20 FPS, vs 0.2 FPS for other DIT-based video models
โ€ฃ The specialized hardware Sohu developed by Etched allows to handle 10x more player than H100

๐ŸŽฎ Features real game mechanics
โ€ฃ Movement, jumping, item management
โ€ฃ Physics and lighting
โ€ฃ Procedurally generated worlds

โš ๏ธ Current limitations
โ€ฃ Blurry graphics at a distance
โ€ฃ Objects sometimes change appearance
โ€ฃ Memory issues in long sessions

Try it yourself, the playable demo is impressive! ๐Ÿ‘‰ https://oasis.decart.ai/welcome
Code ๐Ÿ‘‰ https://github.com/etched-ai/open-oasis
Read it in full ๐Ÿ‘‰ https://oasis-model.github.io/
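
The speed comparison in the post can be checked with simple arithmetic. The sketch below only uses the numbers quoted in the post (20 FPS for Oasis; baselines of 1 frame per 10 to 20 seconds, or 0.2 FPS); the `speedup` helper is a hypothetical name introduced here. The quoted baselines imply different ratios, so the "200 times faster" figure matches the 1-frame-per-10-seconds baseline.

```python
OASIS_FPS = 20.0  # frame rate quoted in the post


def speedup(baseline_fps: float, target_fps: float = OASIS_FPS) -> float:
    """Return how many times faster target_fps is than baseline_fps."""
    return target_fps / baseline_fps


# Baselines as quoted in the post, converted to frames per second.
baselines = {
    "1 frame per 10 s": 1 / 10,
    "1 frame per 20 s": 1 / 20,
    "0.2 FPS": 0.2,
}

for label, fps in baselines.items():
    print(f"{label}: about {round(speedup(fps))}x")
```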
reacted to Muhammadreza's post with ❤️ 19 days ago
Hey guys.
This is my first post here on huggingface. I'm glad to be a part of this amazing community!
posted an update 22 days ago
🎓 Introducing PPT4Web Educational Materials Dataset - nyuuzyou/ppt4web

Dataset highlights:
- 182,405 presentations from ppt4web.ru, a platform for storing and viewing presentations covering a wide range of educational materials
- Primarily in Russian, with content in English, Kazakh, Ukrainian, and Belarusian
- Each entry includes: URL, title, download URL, and filepath
- Contains original PPTX files (converted from PPT for consistency) in addition to metadata
- Data covers a broad spectrum of educational topics and subjects
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content across various subjects in multiple languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in education, teaching methodologies, and presentation materials used across different academic disciplines. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in educational settings, providing insights into the diverse range of subjects and teaching approaches.
posted an update about 1 month ago
🌐 Introducing Websim.ai User Projects Dataset - nyuuzyou/websim

Dataset highlights:
- 137,452 user projects from Websim.ai, a service for creating small sites using Large Language Models (LLMs)
- Primarily in English, with potential for multilingual content in generated websites
- Each entry includes: project metadata, user information, and generated HTML content
- Contains detailed information about project revisions, site generation, and user interactions
- Data covers a wide range of user-generated website projects created through AI assistance
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing AI-assisted web development trends, studying user behavior in LLM-powered creative tools, and exploring the capabilities of language models in web design.
posted an update about 1 month ago
🎓 Introducing Ukr-lit.com.ua Presentations Dataset - nyuuzyou/ukr-lit

Dataset highlights:
- 18,001 presentations from ukr-lit.com.ua, a platform for storing and viewing presentations covering a wide range of subjects in Ukrainian school education
- Primarily in Ukrainian, with some Russian and English content
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a broad spectrum of educational topics and subjects taught in Ukrainian schools
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content across various subjects in Ukrainian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in Ukrainian school education, teaching methodologies, and presentation materials used across different academic disciplines. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in Ukrainian educational settings, providing insights into the diverse range of subjects and teaching approaches in the Ukrainian school system.
reacted to erinys's post with 🚀 about 1 month ago
reacted to davidberenstein1957's post with ➕ about 1 month ago
You can now build a custom text classifier without days of human labeling!

👍 LLMs work reasonably well as text classifiers.
👎 They are expensive to run at scale and their performance drops in specialized domains.

👍 Purpose-built classifiers have low latency and can potentially run on CPU.
👎 They require labeled training data.

Combine the best of both worlds: the automatic labeling capabilities of LLMs and the high-quality annotations from human experts to train and deploy a specialized model.

Blog: https://huggingface.co/blog/sdiazlor/custom-text-classifier-ai-human-feedback
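
One way to picture the "best of both worlds" idea is a merge step where a human review, when present, overrides the LLM's suggested label. This is only a sketch under that assumption; `resolve_labels` and the sample texts and labels are hypothetical and do not reflect the tooling used in the linked blog post.

```python
from typing import Dict, Optional


def resolve_labels(
    llm_suggestions: Dict[str, str],
    human_reviews: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Use the human label when one exists, else fall back to the LLM suggestion."""
    resolved = {}
    for text, suggestion in llm_suggestions.items():
        reviewed = human_reviews.get(text)
        resolved[text] = reviewed if reviewed is not None else suggestion
    return resolved


# Hypothetical examples: the LLM mislabels the second text, a reviewer fixes it.
suggestions = {"refund please": "billing", "app crashes on login": "billing"}
reviews = {"app crashes on login": "bug"}

training_labels = resolve_labels(suggestions, reviews)
print(training_labels)
```

The resolved labels could then serve as training data for a small purpose-built classifier, which is the part that runs cheaply at inference time.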
posted an update about 1 month ago
replied to clem's post about 1 month ago

So why isn't OpenAI on this list? Are they not supporting open AI? ¯\_(ツ)_/¯

posted an update about 1 month ago
🎙 Introducing LiveATC Recordings (Partial 2024-08-26) Dataset - nyuuzyou/liveatc

Dataset highlights:

- 21,172 air traffic control audio recordings from LiveATC.net for August 26, 2024
- Multilingual content, primarily in English with potential for other languages
- Each entry includes: audio file, ICAO airport code, facility type, date, and time
- Contains original MP3 files stored in .tar.zst archives, organized by ICAO airport code
- Data covers various airports and ATC facilities worldwide
- Subject to LiveATC.net's Terms of Use for personal, non-commercial use only

The dataset can be used for audio classification, automatic speech recognition, and analysis of air traffic control communications. The inclusion of recordings from multiple airports allows for comparative analysis across different locations and facility types.
posted an update about 1 month ago
🎓 Introducing Svitppt.com.ua Presentations Dataset - nyuuzyou/svitppt

Dataset highlights:
- 18,001 presentations from svitppt.com.ua, a platform for storing and viewing presentations for Ukrainian school students
- Primarily in Ukrainian, with some Russian and English content
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a wide range of educational topics and presentation materials
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content in Ukrainian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in educational presentation materials and sharing practices in the Ukrainian-speaking student community. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in Ukrainian educational settings.
reacted to takeraparterer's post with 🚀 about 1 month ago
Check this out: I trained an AI on huggingface posts! all of these are AI generated:
----------
Hello!

I'm excited to share that my colleague @felipeebert and I have released the largest Spanish LLM benchmark to date.

We've developed the Spanish LLM Evaluation Benchmark (SLAB), a set of benchmarks designed to evaluate the ability of language models to understand, generate and translate in Spanish.

SLAB includes five different benchmarks:
- Sentiment Analysis: evaluate models' ability to detect and describe sentiment in natural language
- Fact Checking: evaluate models' ability to detect and refute factual errors in text
- Question Answering: evaluate models' ability to answer questions in Spanish
- Open-ended Questions: evaluate models' ability to generate coherent responses in Spanish
- Translation: evaluate models' ability to translate in Spanish

SLAB is aligned with the latest Spanish LLM industry developments and includes the most recent models available on the market. We aim to keep our benchmarks up-to-date and relevant to the Spanish language ecosystem.

SLAB is available at: https://huggingface.co/datasets/argilla/SLAB.

If you would like to collaborate on building additional Spanish LLM benchmarks, let's discuss in the comments.

🔗 SLAB Blog Post: https://argilla.com/blog/slab
----------
Hello everyone,

I'm thrilled to announce the release of

https://huggingface.co/01-AI/01AI-GPT-4o -

A new family of models that brings the power of transformer AI to the masses.

This model is designed to be accessible and easy to use, while still offering high-quality results.

Key features:
- Small model size: only 23M parameters
- Supports text generation, image generation, and text-to-image tasks
- Data-efficient training with a lightweight tokenizer
- Optimized for efficient on-device usage
- Uses the powerful transformer architecture to deliver high-quality results

Excited to see what you all think!

https://huggingface.co/01-AI/01AI-GPT-4o
reacted to huggingface0's post with 🤯 about 1 month ago
1+2=3
posted an update about 1 month ago
🎓 Introducing Bigslide.ru Presentations Dataset - nyuuzyou/bigslide

Dataset highlights:
- 50,872 presentations from bigslide.ru, a platform for storing and viewing presentations for school students
- Primarily in Russian, with some English and potentially other languages
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a wide range of educational topics and presentation materials
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content in Russian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in educational presentation materials and sharing practices in the Russian-speaking student community. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in educational settings.