Sébastien De Greef committed · Commit 318a4fe · 1 Parent(s): c8d3a05
feat: Update profile image and add synthetic emotions dataset generation guide
- src/_quarto.yml +2 -0
- src/about.qmd +2 -0
- src/about_resume.qmd +18 -7
- src/blog/about.qmd +1 -1
- src/llms/synthetic_emotions.qmd +50 -0
- src/profile.jpeg +0 -0
src/_quarto.yml
CHANGED
@@ -114,6 +114,8 @@ website:
|
|
114 |
text: "Auto-Train(ing) on HuggingFace"
|
115 |
- href: llms/rag_systems.qmd
|
116 |
text: "Retrival Augmented Generation"
|
|
|
|
|
117 |
|
118 |
- section: "Computer Vision Models"
|
119 |
contents:
|
|
|
114 |
text: "Auto-Train(ing) on HuggingFace"
|
115 |
- href: llms/rag_systems.qmd
|
116 |
text: "Retrival Augmented Generation"
|
117 |
+
- href: llms/synthetic_emotions.qmd
|
118 |
+
text: "Synthetic Emotions (Dataset Generation)"
|
119 |
|
120 |
- section: "Computer Vision Models"
|
121 |
contents:
|
src/about.qmd
CHANGED
@@ -6,6 +6,8 @@ title: "About"
|
|
6 |
|
7 |
This repository is my personal collection of recipes and notebooks, documenting my journey of learning and exploring various aspects of Artificial Intelligence (AI). As a self-taught AI enthusiast, I created this cookbook to serve as a knowledge base, a "how-to" guide, and a reference point for my own projects and experiments.
|
8 |
|
|
|
|
|
9 |
**The Story Behind**
|
10 |
|
11 |
Over the past year, I've been fascinated by the rapidly evolving field of AI and its endless possibilities. To deepen my understanding and skills, I embarked on a self-learning journey, diving into various AI-related projects and topics. As I progressed, I realized the importance of documenting my learnings, successes, and failures. This cookbook is the culmination of that effort, a centralized hub where I can quickly find and revisit previous projects, takeaways, and insights.
|
|
|
6 |
|
7 |
This repository is my personal collection of recipes and notebooks, documenting my journey of learning and exploring various aspects of Artificial Intelligence (AI). As a self-taught AI enthusiast, I created this cookbook to serve as a knowledge base, a "how-to" guide, and a reference point for my own projects and experiments.
|
8 |
|
9 |
+
[See my resume](https://sebdg-ai-cookbook.hf.space/about_resume.html) for more information about my all-in dive into AI.
|
10 |
+
|
11 |
**The Story Behind**
|
12 |
|
13 |
Over the past year, I've been fascinated by the rapidly evolving field of AI and its endless possibilities. To deepen my understanding and skills, I embarked on a self-learning journey, diving into various AI-related projects and topics. As I progressed, I realized the importance of documenting my learnings, successes, and failures. This cookbook is the culmination of that effort, a centralized hub where I can quickly find and revisit previous projects, takeaways, and insights.
|
src/about_resume.qmd
CHANGED
@@ -2,7 +2,7 @@
|
|
2 |
title: "Sebastien De Greef"
|
3 |
about:
|
4 |
template: trestles
|
5 |
-
image: profile.
|
6 |
image-shape: round
|
7 |
links:
|
8 |
- icon: linkedin
|
@@ -10,12 +10,23 @@ about:
|
|
10 |
href: https://www.linkedin.com/in/sebdg/
|
11 |
---
|
12 |
|
13 |
-
|
14 |
-
As a doting father and an unabashed tech and music addict, I juggle the joys and challenges of parenthood with the rhythms of code and chords.
|
15 |
|
16 |
-
|
17 |
|
18 |
-
|
19 |
-
|
20 |
-
- **Project Management and Development**: Experienced in overseeing projects from conception to deployment, ensuring alignment with business objectives and client satisfaction.
|
21 |
|
2 |
title: "Sebastien De Greef"
|
3 |
about:
|
4 |
template: trestles
|
5 |
+
image: profile.jpeg
|
6 |
image-shape: round
|
7 |
links:
|
8 |
- icon: linkedin
|
|
|
10 |
href: https://www.linkedin.com/in/sebdg/
|
11 |
---
|
12 |
|
13 |
+
## Professional Summary:
|
|
|
14 |
|
15 |
+
With over two decades of experience in software engineering, I have carved a niche in artificial intelligence (AI), natural language processing (NLP), and advanced data analysis. My journey into AI began over a decade ago, fueled by a fascination with early NLP technologies, back when the field relied heavily on rule-based systems like POS taggers and Bags of Words.
|
16 |
|
17 |
+
Over the years, I have honed my skills in sentiment analysis, named entity recognition (NER), and knowledge graph construction. My work has involved scraping vast datasets from the nascent stages of the Semantic Web, utilizing formats like RDF and OWL, and transforming these structured and unstructured data into insightful, actionable knowledge. This foundational work in data collection and analysis has set the stage for my later explorations into more complex AI domains.
|
18 |
+
My technical expertise extends to writing compilers, where I have deepened my understanding of language processing, tokenizers and parsers, turning programming and natural languages into executable instructions and vice versa. This expertise has been instrumental in developing advanced tokenization techniques, context window configurations, and sequence-to-sequence (Seq2Seq) models, many of which I have crafted for personal projects and training.
|
|
|
19 |
|
20 |
+
Driven by a commitment to self-improvement and to mastering state-of-the-art (SOTA) techniques, I recently dedicated four months to intensive training focused exclusively on cutting-edge AI technologies. This strategic pause was a leap of faith: a transition from a successful career in software development to pursuing my passion for AI full-time.
|
21 |
+
|
22 |
+
In addition, I developed [My AI Cookbook](https://huggingface.co/spaces/sebdg/ai-cookbook), a comprehensive guide that documents my journey and knowledge in AI. This resource covers theoretical foundations, practical applications, and curated educational materials, showcasing my ability to self-teach, document complex technical knowledge, and create valuable educational resources for the AI community. It includes my work with advanced tools and libraries such as CrewAI, LangChain, TensorFlow, TensorBoard, Torch, TTS (Text-to-Speech), and STT (Speech-to-Text). I am also proficient in Python, Pandas, and LabelStudio.
|
23 |
+
|
24 |
+
I have published an article on LinkedIn discussing the concept of ["good enough"](https://www.linkedin.com/pulse/what-good-enough-ai-s%2525C3%2525A9bastien-de-greef-euhfe/?trackingId=xwEq6HloQbugNnA6RTGcaQ%3D%3D) in AI, focusing on achieving efficiency without compromising performance. I earned a badge as a Machine Learning Top Voice and am very active in online communities like the Ollama Discord, where I am a go-to reference for other members. I also hold weekly hands-on live presentations about various aspects of AI.
|
25 |
+
|
26 |
+
My recent projects include developing an emotions classifier and fine-tuning Llama3 models for emotion and sentiment analysis, as well as creating a fine-tuning dataset for function calling with Llama and Phi3 models, published on [HuggingFace](https://huggingface.co/sebdg). These projects demonstrate my capability to adapt and innovate with the latest AI technologies.
|
27 |
+
I have created an advanced [Object-Detection Project](https://sebdg-portfolio.static.hf.space/object-detection.html) utilizing YOLO (You Only Look Once) and SAM (Segment Anything). This project demonstrates my expertise in implementing object detection algorithms, training models on diverse datasets, and achieving high accuracy in real-time detection tasks. The project showcases practical applications of object detection in various scenarios, highlighting my proficiency in this cutting-edge technology.
|
28 |
+
|
29 |
+
Furthermore, I developed a [Recruitment Assistant](https://chatgpt.com/g/g-BUfysmfX4-ruby-cruit) project, which leverages a custom GPT model tailored to present me as a candidate to companies. This AI tool enhances my professional presence by engaging potential employers and inviting them to have a conversation with "Ruby," my AI-powered representative.
|
30 |
+
Currently, I am working on a project to map supply chains through AI and information gathering. This involves using Retrieval-Augmented Generation (RAG) systems with vector stores, scrapers, and D3.js visualization techniques. This project will be presented at Windesheim University of Applied Sciences, underscoring my ability to integrate multiple technologies to solve complex real-world problems.
|
31 |
+
|
32 |
+
As I seek opportunities to bring my extensive background and fresh training to innovative AI projects, I am eager to contribute to a team that values pioneering solutions, continuous learning, and real-world impact. My career is a testament to my belief that with the right mix of data, technology, and creativity, the possibilities are limitless.
|
src/blog/about.qmd
CHANGED
@@ -1,6 +1,6 @@
|
|
1 |
---
|
2 |
title: "About"
|
3 |
-
image: profile.
|
4 |
about:
|
5 |
template: trestles
|
6 |
links:
|
|
|
1 |
---
|
2 |
title: "About"
|
3 |
+
image: profile.jpeg
|
4 |
about:
|
5 |
template: trestles
|
6 |
links:
|
src/llms/synthetic_emotions.qmd
ADDED
@@ -0,0 +1,50 @@
1 |
+
### Generating Synthetic Datasets for Fine-Tuning LLMs on Sentiment Detection
|
2 |
+
|
3 |
+
#### Introduction
|
4 |
+
Fine-tuning Large Language Models (LLMs) for sentiment detection and text generation is useful for many natural language processing (NLP) applications. GoEmotions, a human-annotated dataset of about 58,000 Reddit comments labeled with 27 emotion categories plus neutral, is a strong starting point, and synthetic examples can be generated to extend it. However, ensuring diversity and removing potentially biased information is essential for creating fair and accurate sentiment detection systems.
|
5 |
+
|
6 |
+
#### Leveraging the GoEmotions Dataset
|
7 |
+
The GoEmotions dataset, developed by Google, provides a rich resource for training LLMs to recognize and generate text with specific emotional tones. Key steps include:
|
8 |
+
|
9 |
+
1. **Data Preprocessing**:
|
10 |
+
- **Cleaning**: Remove redundant or noisy data.
|
11 |
+
- **Balancing**: Ensure an even distribution of emotions to prevent skewed model training.
|
12 |
+
|
13 |
+
2. **Synthetic Data Generation**:
|
14 |
+
- Generate sentences representing each emotion to diversify the training data.
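
The balancing step above can be sketched as follows. This is a minimal illustration, not the author's pipeline: it assumes the data is already loaded as `(text, label)` pairs, and the function name `balance_by_downsampling` and the toy sentences are hypothetical. It downsamples every label to the size of the rarest one; generating synthetic sentences for the rare labels is the complementary option.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(examples, seed=0):
    """Return a shuffled subset in which every label appears equally often.

    `examples` is a list of (text, label) pairs. Each label is downsampled
    to the size of the rarest label (hypothetical helper for illustration).
    """
    if not examples:
        return []
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


# Toy data standing in for a cleaned emotion dataset (made-up sentences).
examples = (
    [("I love this!", "joy")] * 5
    + [("This is awful.", "anger")] * 2
    + [("It is fine.", "neutral")] * 3
)
balanced = balance_by_downsampling(examples)
print(Counter(label for _, label in balanced))
```

Downsampling discards data, so for a dataset with very rare emotions it may be preferable to keep all real examples and fill the gap with generated sentences instead.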
|
15 |
+
|
16 |
+
#### Diversifying Prompts
|
17 |
+
Diverse prompts help models generalize better and avoid overfitting:
|
18 |
+
|
19 |
+
- **Variety of Contexts**: Include prompts from different topics and scenarios to teach the model how emotions manifest in various contexts.
|
20 |
+
- **Balanced Representation**: Ensure a mix of neutral, positive, and negative sentiments.
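
One way to get this variety is to combine emotions with contexts and writing styles when building generation prompts. The sketch below is an assumption about how that could look: the emotion, context, and style lists, the template wording, and `build_prompts` are all hypothetical and should be replaced with values that fit the target domain.

```python
import itertools
import random
from collections import Counter

# Hypothetical values chosen for illustration only.
EMOTIONS = ["joy", "anger", "sadness", "neutral"]
CONTEXTS = ["a delayed train", "a job interview", "a family dinner", "a software release"]
STYLES = ["casual", "formal", "terse"]

TEMPLATE = (
    "Write one {style} sentence about {context} that expresses {emotion}. "
    "Do not mention real people."
)


def build_prompts(emotions, contexts, styles, per_emotion, seed=0):
    """Build `per_emotion` distinct prompts for every emotion."""
    combos = list(itertools.product(contexts, styles))
    if per_emotion > len(combos):
        raise ValueError("per_emotion exceeds the number of context/style combinations")

    rng = random.Random(seed)
    prompts = []
    for emotion in emotions:
        # Sampling without replacement avoids duplicate prompts per emotion.
        for context, style in rng.sample(combos, per_emotion):
            prompts.append({
                "emotion": emotion,
                "prompt": TEMPLATE.format(style=style, context=context, emotion=emotion),
            })
    return prompts


prompts = build_prompts(EMOTIONS, CONTEXTS, STYLES, per_emotion=3)
for item in prompts[:3]:
    print(item["emotion"], "->", item["prompt"])
```

Because every emotion receives the same number of prompts, the generated data starts out balanced across the listed emotions, which supports the balanced-representation point above.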
|
21 |
+
|
22 |
+
#### Mitigating Bias in Datasets
|
23 |
+
Bias can lead to unfair models. Steps to mitigate this include:
|
24 |
+
|
25 |
+
1. **Remove Sensitive Information**:
|
26 |
+
- **Names**: Exclude names to prevent celebrity bias.
|
27 |
+
- **Religion, Gender, Race, and Minorities**: Remove references to avoid reinforcing stereotypes or biases.
|
28 |
+
|
29 |
+
2. **Anonymization Techniques**:
|
30 |
+
- Replace sensitive information with neutral placeholders (e.g., “Person A” instead of specific names).
|
31 |
+
|
32 |
+
3. **Bias Detection and Correction**:
|
33 |
+
- Regularly evaluate the model for biased outputs and retrain with balanced and anonymized data if needed.
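
The placeholder replacement described above can be sketched with a regular expression. This is a simplified illustration that assumes the names to remove are already known (the `names` list is hypothetical); in practice a named entity recognition model would typically supply them, and other sensitive attributes would need their own handling.

```python
import re


def anonymize(text, names):
    """Replace each known name with a stable placeholder such as "Person A".

    The same name always maps to the same placeholder within one call,
    so the relationships between people in the sentence are preserved.
    """
    if not names:
        return text

    mapping = {}

    def placeholder(match):
        key = match.group(0).lower()
        if key not in mapping:
            index = len(mapping)
            # Letters for the first 26 people, numbers after that.
            suffix = chr(ord("A") + index) if index < 26 else str(index + 1)
            mapping[key] = f"Person {suffix}"
        return mapping[key]

    # Longest names first so "Mary Ann" is matched before "Mary".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


print(anonymize("Alice thanked Bob, and Bob hugged Alice.", ["Alice", "Bob"]))
```

Keeping the mapping consistent within a sentence matters for emotion data, since who feels what toward whom is often part of the signal.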
|
34 |
+
|
35 |
+
#### Practical Implementation
|
36 |
+
To fine-tune an LLM for sentiment detection and generation:
|
37 |
+
|
38 |
+
1. **Data Preparation**:
|
39 |
+
- Use the cleaned and anonymized GoEmotions dataset.
|
40 |
+
- Generate diverse synthetic prompts to cover a wide range of emotions.
|
41 |
+
|
42 |
+
2. **Model Training**:
|
43 |
+
- Fine-tune the LLM with a balanced mix of real and synthetic data representing all sentiment categories.
|
44 |
+
|
45 |
+
3. **Evaluation**:
|
46 |
+
- Test the model on a separate validation set to ensure accuracy and fairness.
|
47 |
+
- Continuously refine the dataset and retrain the model as necessary.
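
For the evaluation step, overall accuracy can hide poor performance on rare emotions, so a per-label view is useful. The sketch below computes per-label F1 and macro-averaged F1 using only the standard library. The function names and the toy labels are hypothetical; a library such as scikit-learn offers equivalent metrics.

```python
from collections import Counter


def per_label_f1(y_true, y_pred):
    """Return {label: F1} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of per-label F1, so rare labels count equally."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy validation labels (made up for illustration).
y_true = ["joy", "joy", "anger", "neutral"]
y_pred = ["joy", "anger", "anger", "neutral"]
scores = per_label_f1(y_true, y_pred)
print(scores, round(macro_f1(scores), 3))
```

Comparing per-label scores before and after adding synthetic data is one simple way to check whether the generated examples helped the labels they were meant to help.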
|
48 |
+
|
49 |
+
#### Conclusion
|
50 |
+
Creating synthetic datasets for fine-tuning LLMs on sentiment detection involves careful data preparation and bias mitigation. By leveraging the GoEmotions dataset and implementing strategies to diversify prompts and remove sensitive information, we can develop robust models capable of accurately detecting and generating text with specific sentiments. This approach enhances the model’s performance while ensuring fairness and inclusivity in NLP applications.
|
src/profile.jpeg
ADDED