---
language: en
library_name: pytorch
license: apache-2.0
pipeline_tag: reinforcement-learning
tags:
- reinforcement-learning
- Generative Model
- GenerativeRL
- LunarLanderContinuous-v2
benchmark_name: Box2d
task_name: LunarLanderContinuous-v2
model-index:
- name: QGPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLanderContinuous-v2
      type: LunarLanderContinuous-v2
    metrics:
    - type: mean_reward
      value: '200.0'
      name: mean_reward
      verified: false
---

# Play **LunarLanderContinuous-v2** with **QGPO** Policy

## Model Description

This implementation applies **QGPO** (Q-Guided Policy Optimization) to the Box2d **LunarLanderContinuous-v2** environment using [GenerativeRL](https://github.com/opendilab/GenerativeRL).

## Model Usage

### Install the Dependencies
```shell
# install GenerativeRL with huggingface support
pip3 install GenerativeRL[huggingface]
# install environment dependencies if needed
pip3 install gym[box2d]==0.23.1
```
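To confirm the installation before running anything, a quick import check is enough. This is a minimal sketch: the `grl` package name comes from the imports used in the scripts below, and the version printout assumes the pinned `gym==0.23.1`.

```python
# optional sanity check: both packages should import cleanly after the pip installs above
import grl
import gym

print("gym version:", gym.__version__)  # expected: 0.23.1
```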
### Download the Model from Hugging Face and Run It
```shell
# run with the trained model
python3 -u run.py
```

**run.py**

```python
import gym

from grl.algorithms.qgpo import QGPOAlgorithm
from grl.datasets import QGPOCustomizedTensorDictDataset
from grl.utils.huggingface import pull_model_from_hub


def qgpo_pipeline():
    # pull the trained policy weights and their configuration from the Hub
    policy_state_dict, config = pull_model_from_hub(
        repo_id="zjowowen/LunarLanderContinuous-v2-QGPO",
    )

    qgpo = QGPOAlgorithm(
        config,
        dataset=QGPOCustomizedTensorDictDataset(
            numpy_data_path="./data.npz",
            action_augment_num=config.train.parameter.action_augment_num,
        ),
    )
    qgpo.model.load_state_dict(policy_state_dict)

    # ---------------------------------------
    # Customized train code ↓
    # ---------------------------------------
    # qgpo.train()
    # ---------------------------------------
    # Customized train code ↑
    # ---------------------------------------

    # ---------------------------------------
    # Customized deploy code ↓
    # ---------------------------------------
    agent = qgpo.deploy()
    env = gym.make(config.deploy.env.env_id)
    observation = env.reset()
    images = [env.render(mode="rgb_array")]
    for _ in range(config.deploy.num_deploy_steps):
        observation, reward, done, _ = env.step(agent.act(observation))
        image = env.render(mode="rgb_array")
        images.append(image)

    # save the collected frames as an mp4 file
    import imageio.v3 as imageio
    import numpy as np

    images = np.array(images)
    imageio.imwrite("replay.mp4", images, fps=30, quality=8)
    # ---------------------------------------
    # Customized deploy code ↑
    # ---------------------------------------


if __name__ == "__main__":
    qgpo_pipeline()
```
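Note that `run.py` still constructs the dataset from a local `./data.npz`. If that file is not available, you can pull the checkpoint and inspect its configuration first. The sketch below only uses `pull_model_from_hub` as shown in `run.py`; the printed values correspond to the configuration listed later in this card.

```python
from grl.utils.huggingface import pull_model_from_hub

# returns the policy weights and the attribute-accessible config used for training
policy_state_dict, config = pull_model_from_hub(
    repo_id="zjowowen/LunarLanderContinuous-v2-QGPO",
)
print(config.deploy.env.env_id)        # LunarLanderContinuous-v2
print(config.deploy.num_deploy_steps)  # 1000
```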
## Model Training

### Train the Model and Push It to the Hugging Face Hub
```shell
# train your own agent
python3 -u train.py
```

**train.py**

```python
import gym

from grl.algorithms.qgpo import QGPOAlgorithm
from grl.datasets import QGPOCustomizedTensorDictDataset
from grl.utils.log import log
from grl_pipelines.diffusion_model.configurations.lunarlander_continuous_qgpo import (
    config,
)


def qgpo_pipeline(config):
    qgpo = QGPOAlgorithm(
        config,
        dataset=QGPOCustomizedTensorDictDataset(
            numpy_data_path="./data.npz",
            action_augment_num=config.train.parameter.action_augment_num,
        ),
    )

    # ---------------------------------------
    # Customized train code ↓
    # ---------------------------------------
    qgpo.train()
    # ---------------------------------------
    # Customized train code ↑
    # ---------------------------------------

    # ---------------------------------------
    # Customized deploy code ↓
    # ---------------------------------------
    agent = qgpo.deploy()
    env = gym.make(config.deploy.env.env_id)
    observation = env.reset()
    for _ in range(config.deploy.num_deploy_steps):
        env.render()
        observation, reward, done, _ = env.step(agent.act(observation))
    # ---------------------------------------
    # Customized deploy code ↑
    # ---------------------------------------


if __name__ == "__main__":
    log.info("config: \n{}".format(config))
    qgpo_pipeline(config)
```
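The script above only covers training and local deployment; to push the result to the Hub as the section title suggests, one option is the plain `huggingface_hub` client. This is a hedged sketch, not part of the GenerativeRL pipeline itself: the repo id is a placeholder and `policy.pt` is a hypothetical filename. It assumes the `qgpo` object from `train.py` is in scope.

```python
import torch
from huggingface_hub import HfApi

# save the trained policy weights locally (hypothetical filename)
torch.save(qgpo.model.state_dict(), "./policy.pt")

# upload to a repo of your own (placeholder repo id)
api = HfApi()
api.create_repo(repo_id="your-username/LunarLanderContinuous-v2-QGPO", exist_ok=True)
api.upload_file(
    path_or_fileobj="./policy.pt",
    path_in_repo="policy.pt",
    repo_id="your-username/LunarLanderContinuous-v2-QGPO",
)
```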
**Configuration**
```python
{
    'train': {
        'project': 'LunarLanderContinuous-v2-QGPO-VPSDE',
        'device': 'cuda',
        'wandb': {'project': 'IQL-LunarLanderContinuous-v2-QGPO-VPSDE'},
        'simulator': {
            'type': 'GymEnvSimulator',
            'args': {'env_id': 'LunarLanderContinuous-v2'},
        },
        'model': {
            'QGPOPolicy': {
                'device': 'cuda',
                'critic': {
                    'device': 'cuda',
                    'q_alpha': 1.0,
                    'DoubleQNetwork': {
                        'backbone': {
                            'type': 'ConcatenateMLP',
                            'args': {
                                'hidden_sizes': [10, 256, 256],
                                'output_size': 1,
                                'activation': 'relu',
                            },
                        },
                    },
                },
                'diffusion_model': {
                    'device': 'cuda',
                    'x_size': 2,
                    'alpha': 1.0,
                    'solver': {
                        'type': 'DPMSolver',
                        'args': {'order': 2, 'device': 'cuda', 'steps': 17},
                    },
                    'path': {'type': 'linear_vp_sde', 'beta_0': 0.1, 'beta_1': 20.0},
                    'reverse_path': {'type': 'linear_vp_sde', 'beta_0': 0.1, 'beta_1': 20.0},
                    'model': {
                        'type': 'noise_function',
                        'args': {
                            't_encoder': {
                                'type': 'GaussianFourierProjectionTimeEncoder',
                                'args': {'embed_dim': 32, 'scale': 30.0},
                            },
                            'backbone': {
                                'type': 'TemporalSpatialResidualNet',
                                'args': {
                                    'hidden_sizes': [512, 256, 128],
                                    'output_dim': 2,
                                    't_dim': 32,
                                    'condition_dim': 8,
                                    'condition_hidden_dim': 32,
                                    't_condition_hidden_dim': 128,
                                },
                            },
                        },
                    },
                    'energy_guidance': {
                        't_encoder': {
                            'type': 'GaussianFourierProjectionTimeEncoder',
                            'args': {'embed_dim': 32, 'scale': 30.0},
                        },
                        'backbone': {
                            'type': 'ConcatenateMLP',
                            'args': {
                                'hidden_sizes': [42, 256, 256],
                                'output_size': 1,
                                'activation': 'silu',
                            },
                        },
                    },
                },
            },
        },
        'parameter': {
            'behaviour_policy': {
                'batch_size': 1024,
                'learning_rate': 0.0001,
                'epochs': 500,
            },
            'action_augment_num': 16,
            'fake_data_t_span': None,
            'energy_guided_policy': {'batch_size': 256},
            'critic': {
                'stop_training_epochs': 500,
                'learning_rate': 0.0001,
                'discount_factor': 0.99,
                'update_momentum': 0.005,
            },
            'energy_guidance': {'epochs': 1000, 'learning_rate': 0.0001},
            'evaluation': {'evaluation_interval': 50, 'guidance_scale': [0.0, 1.0, 2.0]},
            'checkpoint_path': './LunarLanderContinuous-v2-QGPO',
        },
    },
    'deploy': {
        'device': 'cuda',
        'env': {'env_id': 'LunarLanderContinuous-v2', 'seed': 0},
        'num_deploy_steps': 1000,
        't_span': None,
    },
}
```
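The same configuration is published as `policy_config.json` in the model repository (see the config link below). A minimal sketch for loading it programmatically, assuming `huggingface_hub` and `easydict` are installed:

```python
import json

from easydict import EasyDict
from huggingface_hub import hf_hub_download

# download the published config from the model repository
config_path = hf_hub_download(
    repo_id="OpenDILabCommunity/LunarLanderContinuous-v2-QGPO",
    filename="policy_config.json",
)
with open(config_path) as f:
    config = EasyDict(json.load(f))

print(config.train.parameter.action_augment_num)  # 16
```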
**Training Procedure**

- **Weights & Biases (wandb):** [monitor link](https://wandb.ai/zjowowen/IQL-LunarLanderContinuous-v2-QGPO-VPSDE)

## Model Information

- **Github Repository:** [repo link](https://github.com/opendilab/GenerativeRL/)
- **Doc:** [Algorithm link](https://opendilab.github.io/GenerativeRL/)
- **Configuration:** [config link](https://huggingface.co/OpenDILabCommunity/LunarLanderContinuous-v2-QGPO/blob/main/policy_config.json)
- **Demo:** [video](https://huggingface.co/OpenDILabCommunity/LunarLanderContinuous-v2-QGPO/blob/main/replay.mp4)
- **Parameters total size:** 8799.79 KB
- **Last Update Date:** 2024-12-04

## Environments

- **Benchmark:** Box2d
- **Task:** LunarLanderContinuous-v2
- **Gym version:** 0.23.1
- **GenerativeRL version:** v0.0.1
- **PyTorch version:** 2.4.1+cu121
- **Doc:** [Environments link](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)