@Severian on Hugging Face: "Create and Train Your Own Expert LLM: Generating Synthetic, Fact-Based…"


Severian

posted an update May 4


Create and Train Your Own Expert LLM: Generating Synthetic, Fact-Based Datasets with LMStudio/Ollama and then fine-tuning with MLX and Unsloth

Hey everyone!

I know there are tons of videos and tutorials out there already but I've noticed a lot of questions popping up in community posts about using synthetic datasets for creative projects and how to transform personal content into more factual material. In my own work doing enterprise-level SFT and crafting my open-source models, I've enhanced a Python framework originally shared by the creator of the Tess models. This improved stack utilizes local language models and also integrates the Wikipedia dataset to ensure that the content generated is as accurate and reliable as possible.

I've been thinking of putting together a comprehensive, step-by-step course/guide on creating your own Expert Language Model, covering everything from dataset preparation and training to deployment on Hugging Face, and even using something like AnythingLLM for user interaction. I'll walk you through each phase, clarifying complex concepts and troubleshooting common pitfalls.

Let me know if this interests you!

Most of the datasets and models I've made were built using these scripts and my approach.

algorithm

May 4

I'd be very interested. To me most of the usual tutorials are missing the "verify" steps. For example:

How does the user verify the integrity of the dataset without having to read thousands of question pairs (if possible at all)?
How does the user verify the LLM is even utilizing the dataset? And with what amount of preference is it being used over the base model?
How does the user verify that all the possible unwanted entries are gone?

Those kinds of questions are often missing.
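As a rough illustration of the first and third questions, here is a minimal sketch of an automated audit pass over question/answer pairs. It only catches mechanical problems (empty fields, very short answers, duplicate questions, unwanted phrases) and is no substitute for spot-reading samples. The function name `audit_pairs`, the `question`/`answer` keys, and the thresholds are all assumptions for illustration, not part of the scripts discussed in the post.

```python
from collections import Counter


def audit_pairs(pairs, banned_terms=(), min_answer_chars=20):
    """Summarize obvious problems in a list of {"question", "answer"} dicts.

    Hypothetical helper: field names and thresholds are illustrative only.
    """
    report = {
        "total": len(pairs),
        "empty": 0,
        "short_answers": 0,
        "duplicates": 0,
        "banned_hits": [],  # (row index, matched term)
    }
    seen = Counter()
    for idx, pair in enumerate(pairs):
        question = (pair.get("question") or "").strip()
        answer = (pair.get("answer") or "").strip()
        if not question or not answer:
            report["empty"] += 1
            continue
        if len(answer) < min_answer_chars:
            report["short_answers"] += 1
        # Normalize case and whitespace so near-identical questions collide.
        key = " ".join(question.lower().split())
        seen[key] += 1
        if seen[key] > 1:
            report["duplicates"] += 1
        text = f"{question} {answer}".lower()
        for term in banned_terms:
            if term.lower() in text:
                report["banned_hits"].append((idx, term))
    return report
```

Running this over the whole dataset and then reading only the flagged rows (plus a small random sample) is one way to avoid reading thousands of pairs by hand.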

IndrasMirror

May 4

Yes please! I'm currently trying to create an industry/task-specific model. I'm working on a synthetic dataset but am mostly winging it. A comprehensive guide would be great, especially on the best formatting and processing for the final dataset.

win10

May 5

I'm very interested. I also have a rudimentary GitHub project.
Do you know if your project can synthesize pre-training data?

LeroyDyer

May 5

What can be considered pre-training data?

As with any fresh model, it's the pipeline that should be taught first, since this is what the model evolves from.
First, text generation:

LANGUAGE:
Here, if possible, use a large corpus of information, short stories, or short lessons (single posts), even children's books (the model needs to understand language first).
Utterances are useful at this stage too, i.e. prompts without responses and responses without queries.

The AIM is to generate language, as well as feeding in a correct corpus. For me, poetry, articles, etc. are the best way to pre-train a text-generation model.
As soon as we can generate good next-word predictions:

THOUGHT:
We can train for simple input/output sequences, i.e. question and answer.
So here we should begin with CONVERSATION: greetings (SO VITAL!), small talk, discussions on topics, movie scripts, character discussions in thematic personas, etc. Maybe not even knowledge-based QA (because it's pre-training only).
Then we can ADD THOUGHT:
i.e. the same QA and conversational data from the previous stage, with thoughts.

In this stage we can also begin to do some few-shot maths and problem-solving prompts (with chains of thought, using explanations and solutions as the thoughts).
CONTEXT:
Then we can begin with context-based queries, i.e. provide a context and query it for the answer. This task is the beginning of INSTRUCT.
INSTRUCT:
Using the context-style prompt, start adding task-based queries for code, medical, etc.

TASKS: when training for tasks, it's BEST TO OVERFIT??? Yes: overfit the model and merge it into the base model! That way you grab the fully embedded task while still retaining the original skills. You can even retrain on the same dataset with the merged model for a single epoch to align the new info into the merged model.
i.e. a specialized prompt: what is a <>:
with a specialized output: here is the definition >>>, which we may not want to come up every time we say hello??

Later, by adding reasoning and other stuff, your model will begin to converge into an AI model (a chatbot with knowledge that can perform tasks), ready as a base model!!

MAYBE!

iweavings

May 5

Hi,
I'm just learning about ML.
I'd like to build my module for a master's project, with all kinds of photos and videos (NeRF), and use AI to create art/games.
Would this be helpful to me? I have a lot of data.

LeroyDyer

May 7

To create a successful AI, you should train for a single task! Keep it simple. Or train two separate models to work together,
or merge them after training, etc.
Plus think BIGGER! (Don't be a CLONE.)

raincandy-u

May 5

Awesome! I'm already working on a better dataset generator. I'm thinking of making generation step-by-step, like an agent. It's good but too slow 😭

LeroyDyer

May 7

Yes, but a good dataset for training only needs to be 1000 samples! (You're not getting knowledge.) You're extracting a data structure from these models based on a specific prompt, hence it will be slower.

Hence you should focus on task-based datasets to train for a specific purpose, and use the query to produce the components required to build that task dataset.
e.g. chat, but the speaker should analyze the input for sentiment before responding to "hello".
So even for simple chat, we can extend it by adding a sentiment analysis to the internal mind, as well as a sarcastic response (inside), or even psychoanalyze the current situation.
So we can have these headings alongside the discussion! (This enables a deeper understanding of the conversation.) The responses stay the same, but the internal analysis has some action.
Later, when speaking to the model, it will do these analyses and respond accordingly, as it will have learned the pattern of analyze-then-respond rather than recalling the most probable responses.

So keeping the dataset to 1000 allows for the multiple epochs required to embed a new task into the model (it's not new, as it's just chat, but it is enhanced!).

We could also spellcheck each input before returning a response.

We could even analyze the speech to see if it is formal or informal.
These fields are for the internal mind: the thought processes.
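One way to picture this "analyze, then respond" pattern is a training sample where the internal fields sit between the user turn and the reply. This is only a sketch: the `[analysis]` markers, the field names, and the `render_sample` helper are hypothetical choices for illustration, and any consistent layout your chat template supports would do.

```python
def render_sample(user_msg, analysis, response):
    """Build one training text: user turn, internal analysis block, then the reply.

    `analysis` is a dict of internal-mind fields (sentiment, formality, ...).
    The [analysis] markers are an assumed convention, not a standard format.
    """
    lines = [f"User: {user_msg}", "[analysis]"]
    for field, value in analysis.items():
        lines.append(f"{field}: {value}")
    lines.append("[/analysis]")
    lines.append(f"Assistant: {response}")
    return "\n".join(lines)


sample = render_sample(
    "helo, how r u?",
    {
        "spellcheck": "hello, how are you?",
        "sentiment": "friendly",
        "formality": "informal",
    },
    "Hey! I'm doing well, thanks for asking.",
)
print(sample)
```

The visible response stays an ordinary greeting; only the block in the middle changes what the model learns to do before answering.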

So keep at it!

But if you create a Gradio rig for this, where we can just input a couple of models etc., then we could produce this in the cloud!
(Colab) ... (sorry, I'm not a Python programmer, I'm VB.NET)

capti0n

May 6

Awesome!!

pedi

May 6

😀😀😀

LeroyDyer

May 7

I would like to request a dataset:

One for coding (I use Visual Basic, but Python, JavaScript, VBScript, or Bash script is fine).

The main aim is to ask a question
to perform a coding task, e.g. create a tokenizer in Python.

The first column would be the first-try answer.
The next column would be the analysis of this response: does the code run, what is missing, does it align with the user query, and does it produce the expected output?

Then the second column would be the query (define this problem step by step).

The third column would be the response generated for the output using the steps.

The next column would be an explanation of the query (what is a tokenizer? i.e. the definition).

The next column would be the response following the definition.

With this we can create a detailed prompt template:
the definition of the concept,
which enables the first guess,
the self-analysis,
then the defined step-by-step process for this problem,
the new code following the guidelines...

Then a final output consisting of the step-by-step explanation, the definition of the concept, and the step-by-step output!

Hence the whole process of defining a problem by steps and describing the problem definition, the guess, and the final great output.
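To make the column idea concrete, here is a minimal sketch of one row and a template that renders it in the order described (definition, first guess, self-analysis, step-by-step plan, final code). The class name, field names, and section labels are all hypothetical; they are just one possible reading of the request.

```python
from dataclasses import asdict, dataclass


@dataclass
class CodingSample:
    """One dataset row; field names are illustrative assumptions."""
    query: str          # the coding task the user asked for
    definition: str     # what the concept is (e.g. "what is a tokenizer?")
    first_try: str      # first-attempt answer
    self_analysis: str  # does it run, what is missing, does it match the query
    steps: str          # the problem defined step by step
    final_code: str     # new code written by following the steps


# Section labels are an assumed layout, not a required format.
TEMPLATE = """### Task
{query}

### Definition
{definition}

### First attempt
{first_try}

### Self-analysis
{self_analysis}

### Step-by-step plan
{steps}

### Final answer
{final_code}"""


def render(sample):
    """Render one row into a single training text using TEMPLATE."""
    return TEMPLATE.format(**asdict(sample))
```

Each field could be produced by a separate model call (or a different model), and then `render` stitches them into one training example.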

This produces a thinking train organized in the thoughts of the model (i.e. all these parts are part of the self-checking and analysis).
I found that after training my models on this type of concept, and even reframing these inputs as outputs generated by internal agents, the model was able to generate these agents and have a conversation internally before outputting the response.
For those who do not update their template to use the thought patterns or chain of thought, it does not matter, as this will still happen internally. But if you allow the bot to show its thoughts, then you will see the whole process!!!

When it wanted to hallucinate an answer (it did not know, or could not generate quite the right answer), I noticed it arguing with itself!

We would like to add thinking to the mind, but first we must arrange its thoughts into many different thought-chain types.
We would even like the model to generate internal agents and have them discuss, hence creating datasets that allow us to model the internal thoughts around data. (I have been using DPO as a model, i.e. the rejected is the internal agents' output and the chosen is the bot output!)
When using the prompt day-to-day, it generated these agents internally and mimicked the training, producing lovely results! But because the rejected outputs are not always meant to be rejected, it does cause a bit of an IQ drop. This can be repaired again with good data.
I used a couple of your knowledge trees, but they did not take hold or perform the way we would have expected (it's the wrong setup!). We need to provide the outputs or internal thinking to mimic first; then it can generate outputs to match!
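For anyone wanting to try the DPO-style mapping described here, a rough sketch of the record layout might look like the following. The `prompt`/`chosen`/`rejected` keys follow the layout commonly used for preference datasets; treat that as an assumption and check what your trainer expects. The helper names are hypothetical.

```python
import json


def to_dpo_record(prompt, agent_discussion, final_answer):
    """Map one example onto a preference pair.

    Following the comment above: the internal agents' discussion goes in
    "rejected" and the final bot output in "chosen". Key names are assumed.
    """
    return {
        "prompt": prompt,
        "chosen": final_answer,
        "rejected": agent_discussion,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines text, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


record = to_dpo_record(
    "Create a tokenizer in Python.",
    "Agent A: split on whitespace. Agent B: that drops punctuation...",
    "Here is a tokenizer that handles punctuation: ...",
)
print(to_jsonl([record]))
```

As noted above, putting useful reasoning in the rejected slot can teach the model to avoid it, so it is worth comparing against a variant where the discussion is kept inside the chosen output instead.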

See how it goes anyway! If you can make some datasets like this using the models to produce the outputs, even better. If we could use different agents to produce each output for the query, we'd get a really diverse dataset and not a single trainer's opinion.

AtAndDev

May 8

I just love Unsloth!
