How I Trained Action Chunking Transformer (ACT) on SO-101: My Journey, Gotchas, and Lessons
I’ve been wanting to try out the Action Chunking Transformer (ACT) on a real robot since I saw the model cooking a shrimp 🦐. A month ago, I finally got my hands on a SO-101. After watching many videos of people training this arm to pick and place a block, I thought, how hard would it be for me to do the same?
Thus began a journey filled with pitfalls and obstacles at every turn! When I look back now, many of them look pretty damn straightforward.🤷🏻♀️ So I figured I should share these here - you all might find them helpful.
Here we go - my journey, gotchas, and lessons learned from training ACT on SO-101 to pick a block and place it in a container.
Try 1: My Dear Woodpecker
LeRobot’s tutorial was a pretty good starting point. I grabbed two webcams and set them up on tripods. Then I clamped down the two arms - a leader and a follower. Finally, I plugged everything into a gaming desktop that my boyfriend kindly allowed me to use for science. Voila! I was ready to collect data.
The plan was simple: use the leader arm to control the follower arm to pick up a block and place it in a container. I would record 50 episodes (10 per location at 5 fixed locations), kick off training for a few hours, and boom - robot butler.
Data collection & training
The data collection itself was… unexpectedly time-consuming. My cameras would disconnect every ~5 episodes and crash the recorder script. After a quick conversation with ChatGPT, I realized it was because my two webcams were identical models. That confused my computer, so the cameras’ USB paths would get shuffled every once in a while.
I decided to push forward instead of fixing it. It took me about 2 hours (half of an afternoon) to get 50 demonstrations in the bag.
Afterwards, the training script was pretty easy to set up, thanks to the training and logging pipeline provided by LeRobot. I kicked off training before going to sleep. My ACT model (52M parameters) trained in ~4 hours on a 12GB NVIDIA RTX 3080.
I woke up to find these beautiful loss curves on Weights & Biases. Can’t wait to try it on my SO-101!
Eval
The next day, I loaded up my freshly trained policy, placed the block at one of the five positions in the training set, and hit run.
Look what I’ve got here! 🐦
... a woodpecker!!
My SO-101 didn’t learn to pick up the block, but rather, to peck at the table over and over again… following some vague memory of “approach the block, close gripper, move to the container”, while completely missing the part where it was supposed to, you know, actually grab the block.
I tried moving the block closer, thinking maybe it just needed a little help. Nope! The arm still approached the block, getting much closer than it should have... Classic.
I was disappointed. People made it look so easy in the videos! What did I do wrong?
I sat down and made a list…
Laundry list... of rookie mistakes
- The camera POV wasn’t fixed! Upon some investigation I found that my cameras had, in fact, moved a little between training and testing. A small perturbation might have been fine, but I shouldn’t expect a model trained on only 50 episodes to handle a different camera POV.


- Camera angle was difficult for grasping: From the right camera’s POV, the gripper tips overlap with each other during grasping. This makes the right camera footage less helpful for grasping.
An episode where the gripper tips overlap with each other during grasping.
- Arm calibration mismatch: The arm calibration was different between training and testing! I didn’t question it when the eval record script complained about missing arm calibration files and asked me to recalibrate. It turns out I had accidentally lost my old calibration files during a power cycle, since they were saved in a temporary folder! And when I redid the calibration, I didn’t bring all the joints to their middle range, which messed up the homing offsets of some joints… so even if the model predicted the correct joint angles, they got mapped to incorrect control commands for the servos.
- Limited data diversity: My training set contained 10 episodes per location at 5 fixed locations, as suggested in the tutorial. That makes it easy for the model to overfit by memorizing trajectories instead of actually learning to pick up a block from locations outside the training set. Which brings me to my next point…
- No eval set during training: Back in college when I trained computer vision models, I’d always hold out a small eval set, separate from the training set, to monitor model performance. This helped with detecting overfitting and selecting checkpoints. But the training script from LeRobot only runs eval for policies trained in simulation.

- Cheating without knowing it: I was looking over my shoulder at the follower arm while recording, getting information the robot would never have from the camera footage alone. Thanks to this video for calling me out! Plus, I later found out that this was actually also called out in the LeRobot tutorial.
But wait, there’s more! The logistical annoyances were piling up too:
- Frequent webcam disconnection: Like I mentioned earlier, my cameras would disconnect every 10min or so. This was because Ubuntu likes to play musical chairs with USB addresses when you have identical webcams. After searching around, I realized that this was a common problem (with a clean solution - keep reading!).
- Debugging was painful: LeRobot provides rerun for visualization during recording, and its own online dataset visualizer. But I found it hard to visualize the time-series quantities of a recorded episode (camera frames, joint positions, time synchronization), which would have helped me filter low-quality data out of the dataset.
Try 2: Improved Training and Eval Pipelines
Time for some serious engineering. I rolled up my sleeves and got to work.
Improvement 1: Standardize the Hardware Setup
- Camera placement: I ditched the front-and-side setup for front-and-top. This causes less occlusion during grasping. Bonus: the top camera also gets to see where the gripper tip is relative to the block (top / middle / bottom), which helps me collect higher-quality demonstrations with grasps at the middle of the block (harder to see from the front camera).

- Fixed arm & camera locations: Tape and markers became my best friends. I made sure the cameras and arm were always fixed to the same spot.
- Fixed camera parameters: I was recording episodes from day to night, so it was important to lock the camera settings to keep the footage colors as consistent as possible. I fixed parameters like exposure and white balance (see the sketch after this list).
- Increase gripper friction: Probably just a nice-to-have, but I added tape on the gripper tips to increase friction, inspired by the original ACT paper.
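Here’s a rough sketch of how I’d pin down those camera settings with OpenCV. Treat the device index and the specific values as placeholders for my setup - webcams differ, and some V4L2 builds expect 0.25 / 0.75 instead of 1 / 3 for manual / auto exposure.
# Sketch: lock webcam exposure and white balance so footage stays consistent day to night.
# Device index and values are placeholders for my setup - check `v4l2-ctl --list-ctrls`.
import cv2

cap = cv2.VideoCapture(0, cv2.CAP_V4L2)

# Turn off auto exposure (on many V4L2 webcams 1 = manual, 3 = auto;
# some older OpenCV builds expect 0.25 / 0.75 instead).
cap.set(cv2.CAP_PROP_AUTO_EXPOSURE, 1)
cap.set(cv2.CAP_PROP_EXPOSURE, 200)         # absolute exposure, camera-specific units

# Turn off auto white balance and pin the color temperature.
cap.set(cv2.CAP_PROP_AUTO_WB, 0)
cap.set(cv2.CAP_PROP_WB_TEMPERATURE, 4500)  # roughly Kelvin, camera-specific range

ok, frame = cap.read()
print("captured:", ok, None if frame is None else frame.shape)
cap.release()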
Improvement 2: Better Data Collection Setup
- Raw dataset recorder: I modified LeRobot’s script to save a lot more raw data - raw images, synchronized timestamps, videos, metadata, etc. This data was saved alongside the LeRobotDataset that was still used for training, but now I could customize what extra info to record for visualization and debugging.
- Camera & arm udev rules: This is the solution to the frequent camera disconnection! I wrote some scripts to specify udev rules that map USB ports to some invariant property of the webcams. It turns out that they have identical serial numbers, so I ended up using the physical USB path to differentiate between them two and encoded them as udev rules. Later, I realized that this disconnection issue could also affect arms and did the same for them (luckily they have different serial numbers!)
- See what the robot sees: When recording demonstrations, I only looked at the camera footage instead of the follower arm. One slight wrinkle: for some reason, rerun would randomly hide some topic streams even when they existed, so I ended up coding my own OpenCV visualization of the two cameras’ footage.
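The disconnection fix boils down to addressing each camera by something that never changes. My actual fix was udev rules keyed on the physical USB port, but the same idea can be sketched by opening the /dev/v4l/by-path symlinks that udev already creates for each port (the paths below are placeholders - run ls -l /dev/v4l/by-path/ to find yours). The loop also doubles as my bare-bones two-camera viewer.
# Sketch: open each webcam via its physical-USB-port symlink instead of /dev/video0,
# which Ubuntu reshuffles when two identical cameras are plugged in.
# The by-path names are placeholders; assumes both cameras output the same resolution.
import cv2

CAMERAS = {
    "front": "/dev/v4l/by-path/pci-0000:00:14.0-usb-0:1:1.0-video-index0",
    "top": "/dev/v4l/by-path/pci-0000:00:14.0-usb-0:2:1.0-video-index0",
}

caps = {name: cv2.VideoCapture(path, cv2.CAP_V4L2) for name, path in CAMERAS.items()}

while True:
    frames = []
    for name, cap in caps.items():
        ok, frame = cap.read()
        if not ok:
            raise RuntimeError(f"lost camera: {name}")
        cv2.putText(frame, name, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        frames.append(frame)
    # Side-by-side view of exactly what the policy will see.
    cv2.imshow("robot_view", cv2.hconcat(frames))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

for cap in caps.values():
    cap.release()
cv2.destroyAllWindows()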
Improvement 3: Formal Task Definition
In order to pump up my data diversity, I needed to start by formally defining my task space and variations.
- Task definition: Pick up a block and place it in a container
- Variations: Block types, container types, start poses


- Reproducible episodes: With clearly defined variation dimensions, I could specify each one as an enum or a float range. Each task now has its own “task_config” that specifies the block and container types as well as their start locations. I also added ruler grids to the table. Why is this important? So I can run eval on the same episode over and over and compare different models, knowing that performance differences are truly model differences rather than differences in how the eval episode was set up.
# Example task_config for an episode.
task_name: "pick_and_place_block"
variations:
  # All blocks align longer edge with y axis at yaw = 0.
  block: "green"
  container: "tupperware"
  start_pose:
    # [x, y, yaw_deg]
    block: [0.077, 0.224, 0.0]
    container: [0.3, 0.2, 0.0]
- Stratified sampling: With configurable variations for each episode, I could sample more efficiently by defining bins for the block’s start poses (see the sketch at the end of this list). For the sake of simplicity, I fixed the container location for now and focused on varying the start poses of the block.

- Success criteria: Define “success” in different stages - inspired by the progress score used by pi0 (presented in this talk).
# Progress score for evaluating my pick-and-place-block task
task_progress_score:
  # Each stage's score is awarded once that stage is completed.
  # e.g. 0_reach_block means the arm has reached the block.
  0_reach_block: 0.2
  1_grasp_block: 0.4
  2_reach_container: 0.7
  3_release_block: 0.8
  4_block_in_container: 1.0
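Here’s the stratified sampling sketch promised above. The bin boundaries and counts are illustrative placeholders rather than my actual workspace numbers; each sampled dict mirrors the task_config format.
# Sketch: stratified sampling of block start poses over workspace bins.
# Bin boundaries are made-up placeholders, not my real table coordinates.
import random

BINS = [  # (x_range, y_range) in meters
    ((0.05, 0.15), (0.15, 0.25)), ((0.15, 0.25), (0.15, 0.25)), ((0.25, 0.35), (0.15, 0.25)),
    ((0.05, 0.15), (0.25, 0.35)), ((0.15, 0.25), (0.25, 0.35)), ((0.25, 0.35), (0.25, 0.35)),
]
EPISODES_PER_BIN = 12

def sample_task_configs(seed=0):
    rng = random.Random(seed)
    configs = []
    for bin_id, ((x_lo, x_hi), (y_lo, y_hi)) in enumerate(BINS):
        for _ in range(EPISODES_PER_BIN):
            configs.append({
                "task_name": "pick_and_place_block",
                "bin": bin_id,
                "variations": {
                    "block": "green",
                    "container": "tupperware",
                    "start_pose": {
                        # yaw fixed at 0 here; rotation perturbations came later
                        "block": [round(rng.uniform(x_lo, x_hi), 3),
                                  round(rng.uniform(y_lo, y_hi), 3), 0.0],
                        "container": [0.3, 0.2, 0.0],  # container fixed for simplicity
                    },
                },
            })
    return configs

print(len(sample_task_configs()))  # 6 bins x 12 episodes = 72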

Improvement 4: Eval Pipeline Setup
- Evaluation script: LeRobot uses the same record script for data collection and eval, but I wanted more automation. I wrote a script that automatically loads up an eval set, shows me the task config of each eval episode (block and container types, start poses), gives me time to set up, rolls out the episode, and lets me use teleop to reset the environment (since the robot can end an episode in all sorts of weird poses…).
- This sped up my eval process while allowing me to use the same eval set to compare different models and / or checkpoints!
- Episode scoring tool: I also built a tool to manually label the progress score of each episode and calculate the eval score for the entire eval set at the end 🙂
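A rough sketch of what the tally at the end looks like (the episode labels below are made up, not real eval results):
# Sketch: aggregate manually labeled progress scores into eval-set metrics.
eval_labels = [
    {"episode": 0, "progress": 1.0},  # block ended up in the container
    {"episode": 1, "progress": 0.4},  # grasped the block but dropped it
    {"episode": 2, "progress": 0.2},  # reached the block, never grasped it
]

success_rate = sum(lbl["progress"] >= 1.0 for lbl in eval_labels) / len(eval_labels)
avg_progress = sum(lbl["progress"] for lbl in eval_labels) / len(eval_labels)
print(f"success rate: {success_rate:.0%}, average progress score: {avg_progress:.2f}")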
Back to the journey! Data collection & Training
Given the formalized task definition, sampling criteria, and train / eval split, I defined my new dataset as:
- Collect: 12 start poses per bin x 6 bins = 72 episodes
- Train / eval split: use 5 of the bins as our “distribution” of block start poses, and hold out the remaining bin for “out of distribution” eval.
- Train: 10 episodes each from bins 0, 1, 2, 3, and 5 = 50 episodes in total
- In-distribution eval: 2 episodes each from bins 0, 1, 2, 3, and 5 = 10 episodes
- Out-of-distribution eval: 12 episodes from bin 4 (used for eval during training)
Training again finished within ~4 hours! From the eval loss, it didn’t seem like the model was overfitting. I used the checkpoint with the lowest eval loss for on-robot eval.


Moment of truth!

Results!
- in-distribution eval: 60% success rate (passing grade!). Average progress score 0.68.
- out-of-distribution eval: 10% success rate (ouch). Average progress score 0.28.
Success episode (In distribution)
Failure episode (Out of distribution)
Failure analysis
- Overreaching or underreaching: In most failure cases, the robot couldn’t get to the correct location for grasping the block, thus missing the grasp.
- Grasping at the top of the block instead of the middle: many times the gripper would reach for the top of the block, which made it hard to grasp the block tightly. After looking through my dataset, I realized there were many examples of me grasping closer to the top of the block. “Garbage in, garbage out,” as my master’s advisor used to say. It’s starting to feel like this SO-101 is my child and that I need to be a better parent - provide better examples!

- No recovery: Once it failed the first grasp, it would retry many times but wasn’t able to recover whatsoever.
- Fragility close to bin edges: The 4 failed in-distribution episodes were all close to the edges of the bins.
I spent some time investigating my dataset coverage. The uniform xy sampling I used when collecting data turned out not to spread the block locations evenly: many data points were clustered together, leaving big gaps in between. This explains why episodes close to the bin edges failed, and why the model wasn’t very good at generalizing to out-of-distribution episodes.
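The coverage check itself was just a scatter plot of the block start poses, roughly like this (episodes.json and its keys are placeholders that mirror my task_config format):
# Sketch: plot block xy start poses per bin to spot clustering and gaps.
# "episodes.json" is a hypothetical dump of the per-episode task_configs.
import json
import matplotlib.pyplot as plt

with open("episodes.json") as f:
    episodes = json.load(f)

xs = [ep["variations"]["start_pose"]["block"][0] for ep in episodes]
ys = [ep["variations"]["start_pose"]["block"][1] for ep in episodes]
bins = [ep.get("bin", 0) for ep in episodes]

plt.scatter(xs, ys, c=bins, cmap="tab10", s=40)
plt.xlabel("block x (m)")
plt.ylabel("block y (m)")
plt.title("block start pose coverage")
plt.colorbar(label="bin id")
plt.show()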

Time to get serious about data diversity. But before we move on, let me tell you about the great arm breakdown that almost stopped my project…
🥊 Plot Twist! The Robot Fights Back… 🥊
Halfway through data collection, disaster struck. My script crashed due to a camera issue (USB bandwidth problems). After that, my follower arm suddenly refused to cooperate. The error message was cryptic: Failed to sync read 'Present_Position' on ids=[1, 2, 3, 4, 5, 6]...[TxRxResult] Incorrect status packet. But calling sync_read by itself worked fine.
What the heck? 🤔
After a whole day of detective work 🕵🏻, I discovered that this failure only happened during high-frequency (30Hz) sync_read() and sync_write() operations. After ruling out the motors one by one, I found the culprit! The gripper motor couldn’t keep up. At high frequency, it sometimes failed to send back its status packet and blocked the comms on the entire motor bus.
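The “rule out the motors one by one” step was basically a per-motor stress test, something like the sketch below. The bus object and the sync_read call stand in for LeRobot’s Feetech motor bus - treat them as placeholders rather than the real API.
# Sketch: poll each motor individually at 30 Hz and count read failures
# to find which servo chokes under high-frequency polling.
# `bus` and its sync_read call are placeholders, not LeRobot's exact API.
import time

def stress_test(bus, motor_ids, seconds=10.0, hz=30):
    failures = {mid: 0 for mid in motor_ids}
    for mid in motor_ids:
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            t0 = time.monotonic()
            try:
                bus.sync_read("Present_Position", [mid])  # placeholder call
            except Exception:
                failures[mid] += 1
            # keep the same 30 Hz cadence as the recording loop
            time.sleep(max(0.0, 1.0 / hz - (time.monotonic() - t0)))
    return failures

# e.g. stress_test(bus, motor_ids=[1, 2, 3, 4, 5, 6]) -> the flaky motor racks up errors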
😮 How did the gripper motor break??
After ruminating over this for a long time... I realized that, during teleop, I was scared the block would fall out of the gripper, so I would grasp it very tightly. That put way too much force on the follower’s gripper motor, which eventually wore out.
After I swapped out the motor, my SO-101 started working again!
To prevent this from happening again, I practiced grasping the block by gently holding the gripper tip at about the width of the block. I also configured the follower joints’ max_relative_target to limit their velocity, further preventing them from overexerting themselves.
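For context, max_relative_target caps how far a joint command is allowed to move away from the joint’s current position on each control step, which effectively bounds velocity. A minimal illustration of that clamping idea (not LeRobot’s actual implementation):
# Sketch: what a max_relative_target-style safety clamp does.
# Each command may move at most max_rel away from the present position per step.
import numpy as np

def clamp_action(target_pos, present_pos, max_rel):
    target_pos = np.asarray(target_pos, dtype=float)
    present_pos = np.asarray(present_pos, dtype=float)
    return np.clip(target_pos, present_pos - max_rel, present_pos + max_rel)

print(clamp_action([120.0, 40.0], present_pos=[100.0, 39.0], max_rel=5.0))  # [105.  40.]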
Moral of the story: be nice to your robot... and always keep some spare servos on hand!
Try 3: “More data!!”

More Data, Better Data
This time, I got serious about data diversity:
- Density and coverage: Increased from 10 episodes per bin to 25, visualizing the xy locations to make sure I covered the gaps.
- Orientation perturbation: Added rotation variations (from -45 to 45 deg in yaw) because blocks don’t always sit perfectly aligned in the real world. I was hoping this would help with recovery from failed grasps.
- Better grasping behavior: Made sure to include more examples of grasping from the middle of the block. Also tried to reach the block from different angles.
Time lapse of me collecting data...
The Sweet Taste of Success

Results!
- In distribution: 90% success rate. Average progress score 0.92.
- Out of distribution: 75% success rate! 👏🏻 Average progress score 0.8.
Test its limits!
- Recovering from failed grasps
- Handling interruptions gracefully
- Handling some new block and container types, if they are similar enough to the ones used in training (generalization!)
Remaining failure modes
- One of the gripper tips gets stuck on top of the block, preventing a correct grasp.
- Approaching the block incorrectly (too far left / right, too close / far).
- Couldn’t recover from block angles of more than 45 degrees.
Lessons Learned
Here’s what I learned from 3 weeks of robot wrangling:
- Consistent setup is everything: The time spent on proper data collection and eval pipelines paid dividends
- Data diversity matters: More varied training data led to better generalization
- Debug infrastructure is crucial: Having flexible data structures and visualization tools helped me debug my data coverage and monitor potential latency issues.
- Things break! It’s the real world, not simulation. That means - motors could fail, cameras could disconnect, calibration could malfunction, and USB buses could get overwhelmed. Be prepared.
- Take care of your robot: Don’t be like me and break your robot with excessive force. These things are more delicate than they look.
What’s Next? 💡
Now that I have a working pick-and-place robot (mostly), here are my next steps:
- Switch to wrist camera: Replace the top camera with a wrist-mounted one for better grasp targeting
- More workspace coverage: Add left/right container poses to cover more of the workspace
- Full randomization: Maybe it’s time to randomize everything—start poses, container types, block types…
- Try a VLA: See if a Vision-Language-Action model can handle generalization better
- Async inference: Smooth out the jittery motion with asynchronous inference and fully exploit the benefits of action chunking.
The Bottom Line
I sometimes forget, while watching robot demo videos online, that real-world robots are messy, and that they break all the time in mysterious ways. (Even though I work at a robotics startup and deal with robot failures all day, lol) But when it finally works—when you see your robot successfully pick up a block and place it exactly where you wanted—it’s so magical!! All my work was worth it. 🤗
Which is why I tried to document it all in this blog. Hopefully it can help someone else new to the SO-101, LeRobot, and ACT!
Just don’t expect it to happen on the first try. Or the second. But hey, third time’s the charm, right?
P.S. If you’re planning to try this yourself, buy spare motors. Trust me on this one.