Training Your First Balance Robot: A Guide to Cartpole Simulation in Robotics

Published on February 2, 2026, this article documents the process of training a balancing robot using the Cartpole environment on a Windows system.

How the Cartpole Code Works

When using Isaac Lab for reinforcement learning, one of the first hurdles is understanding how a task is "registered": where rewards, terminations, and observations are configured, and how the training script connects the environment to the Proximal Policy Optimization (PPO) algorithm. This post walks through how the code in the Cartpole project operates, following the path from registration to training.

Running the command python scripts/rsl_rl/train.py --task Template-Cartpole-v0 --num_envs 4096 trains the Cartpole task (Template-Cartpole-v0) with PPO across 4096 parallel environments. The process involves the following steps:

  1. Launching Isaac Sim
  2. Parsing command line arguments to locate “environment configuration” and “algorithm configuration” based on the task name (including the MDP and PPO hyperparameters for Cartpole)
  3. Creating a Gym environment (simulating multiple Cartpole environments in parallel)
  4. Using RSL-RL’s PPO Runner to collect data and update the policy in a loop until the specified number of iterations is reached

Task Registration

import gymnasium as gym
from . import agents

# Registers the task with Gymnasium so that gym.make("Template-Cartpole-v0") can find it.
# The configuration classes are referenced as strings and only imported when needed.
gym.register(
    id="Template-Cartpole-v0",
    entry_point="isaaclab.envs:ManagerBasedRLEnv",
    disable_env_checker=True,
    kwargs={
        "env_cfg_entry_point": f"{__name__}.cartpole_env_cfg:CartpoleEnvCfg",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:PPORunnerCfg",
    },
)

The id specifies the name of the environment, which Gym uses to locate it. The entry_point is a string directing the program to the specific module and class to use. The kwargs include:

  • env_cfg_entry_point: Indicates where the configuration class for the environment (CartpoleEnvCfg) is located, which defines the environment’s appearance, reward calculation, and termination conditions.
  • rsl_rl_cfg_entry_point: Tells the program where to find the configuration class for PPO (PPORunnerCfg). Both entries can be inspected through Gym's registry, as sketched below.
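
Because the registration only stores strings, nothing heavy is imported until the environment is created. A minimal sketch of how to inspect what Gym has recorded, assuming the package containing the gym.register call above has already been imported (the package name cartpole_task below is a placeholder):

import gymnasium as gym
import cartpole_task  # placeholder name: importing the task package runs gym.register(...)

spec = gym.spec("Template-Cartpole-v0")
print(spec.entry_point)                       # "isaaclab.envs:ManagerBasedRLEnv"
print(spec.kwargs["env_cfg_entry_point"])     # "...cartpole_env_cfg:CartpoleEnvCfg"
print(spec.kwargs["rsl_rl_cfg_entry_point"])  # "...rsl_rl_ppo_cfg:PPORunnerCfg"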

Environment Configuration

When gym.make("Template-Cartpole-v0", cfg=env_cfg) is called, Isaac Lab retrieves the CartpoleEnvCfg based on env_cfg_entry_point. This class provides a comprehensive overview of how the Cartpole environment appears, how to interact with it, how rewards are calculated, and when episodes end.

The structure of CartpoleEnvCfg is as follows:

CartpoleEnvCfg
├── scene: CartpoleSceneCfg
├── observations: ObservationsCfg
├── actions: ActionsCfg
├── events: EventCfg
├── rewards: RewardsCfg
├── terminations: TerminationsCfg
└── __post_init__: Simulation step size, episode length, and viewer settings

The components include:

  • scene: Defines the ground, the Cartpole robot (one for each environment), and lighting. The robot uses the CARTPOLE_CFG from Isaac Lab Assets, ensuring that each of the 4096 environments has a separate Cartpole.
  • actions: Applies an effort (force) to the cart's slider_to_cart joint; the policy output is scaled by a factor of 100 before being applied.
  • observations: Joint positions and velocities, concatenated into the vector fed to the policy network (see the sketch after this list).
  • events: Randomizes the cart's pose/velocity and pole's angle/angular velocity at reset to vary starting states across episodes.
  • rewards: Defines the reward terms, detailed in the Reward Calculation section below.
  • terminations: Defines conditions for ending an episode, such as timeouts or the cart moving out of bounds.
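
For reference, the observation group in cartpole_env_cfg.py typically looks like the sketch below, which follows Isaac Lab's own Cartpole example. The import paths reflect recent Isaac Lab releases and may differ slightly in the generated template, which usually re-exports these terms through a local mdp module.

from isaaclab.managers import ObservationGroupCfg as ObsGroup
from isaaclab.managers import ObservationTermCfg as ObsTerm
from isaaclab.utils import configclass
import isaaclab.envs.mdp as mdp

@configclass
class ObservationsCfg:
    """Observation specifications for the MDP."""

    @configclass
    class PolicyCfg(ObsGroup):
        """Observations for the policy group."""

        # joint positions and velocities relative to their defaults (order preserved)
        joint_pos_rel = ObsTerm(func=mdp.joint_pos_rel)
        joint_vel_rel = ObsTerm(func=mdp.joint_vel_rel)

        def __post_init__(self):
            self.enable_corruption = False
            self.concatenate_terms = True  # flatten into one vector per environment

    # the "policy" group is what the actor-critic network receives:
    # [cart position, pole angle, cart velocity, pole angular velocity]
    policy: PolicyCfg = PolicyCfg()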

Action Configuration and Joint Torque

The robot's behavior is entirely determined by the action space. The policy network outputs a scalar per environment, which is converted into an effort (force) applied to the cart joint; the physics engine then updates the state of the cart and pole accordingly. The action configuration in cartpole_env_cfg.py is simply defined as follows:

@configclass
class ActionsCfg:
    """Action specifications for the MDP."""

    # single effort action on the cart's prismatic joint;
    # the policy output is multiplied by `scale` before being applied
    joint_effort = mdp.JointEffortActionCfg(
        asset_name="robot",
        joint_names=["slider_to_cart"],
        scale=100.0,
    )
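
To make the scale concrete, here is an illustration of what the configuration means (this is not Isaac Lab's internal code, which handles the scaling inside its action manager):

# illustration only: the action term multiplies the policy output by `scale`
raw_action = 0.3                      # scalar produced by the policy for one environment
applied_effort = 100.0 * raw_action   # = 30.0, applied as a force on "slider_to_cart"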

Reward Calculation

In the RewardsCfg, five reward terms are defined (a configuration sketch follows the list):

  • alive: Provides a reward of 1.0 as long as the episode does not end due to a failure condition.
  • terminating: Penalizes with -2.0 if the episode ends due to failure (e.g., cart going out of bounds).
  • pole_pos: Aims to keep the pole vertical, using the joint angle’s deviation from zero as a penalty.
  • cart_vel and pole_vel: Impose small penalties on cart speed and pole angular speed to encourage stability.
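
The sketch below shows how these five terms are typically declared in cartpole_env_cfg.py, following Isaac Lab's Cartpole example; the specific mdp functions and weights shown here are illustrative and should be checked against the generated file.

from isaaclab.managers import RewardTermCfg as RewTerm
from isaaclab.managers import SceneEntityCfg
from isaaclab.utils import configclass
import isaaclab.envs.mdp as mdp

@configclass
class RewardsCfg:
    """Reward terms for the MDP."""

    # constant reward while the episode is running, penalty on failure termination
    alive = RewTerm(func=mdp.is_alive, weight=1.0)
    terminating = RewTerm(func=mdp.is_terminated, weight=-2.0)
    # keep the pole upright: penalize squared deviation of the pole joint from 0 rad
    pole_pos = RewTerm(
        func=mdp.joint_pos_target_l2,
        weight=-1.0,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]), "target": 0.0},
    )
    # small shaping penalties on cart speed and pole angular speed
    cart_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.01,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
    )
    pole_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.005,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"])},
    )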

PPO Configuration: Network and Hyperparameters

The training algorithm used is PPO from RSL-RL, with hyperparameters and network structures defined in agents/rsl_rl_ppo_cfg.py. This is where training behaviors can be modified.

@configclass
class PPORunnerCfg(RslRlOnPolicyRunnerCfg):
    num_steps_per_env = 16      # rollout length per environment before each PPO update
    max_iterations = 150        # total number of policy updates
    save_interval = 50          # checkpoint every 50 iterations
    experiment_name = "cartpole_direct"
    policy = RslRlPpoActorCriticCfg(
        init_noise_std=1.0,
        actor_obs_normalization=False,
        critic_obs_normalization=False,
        actor_hidden_dims=[32, 32],   # two hidden layers of 32 units for the actor
        critic_hidden_dims=[32, 32],  # same shape for the critic
        activation="elu",
    )
    algorithm = RslRlPpoAlgorithmCfg(
        value_loss_coef=1.0,
        use_clipped_value_loss=True,
        clip_param=0.2,
        entropy_coef=0.005,
        num_learning_epochs=5,
        num_mini_batches=4,
        learning_rate=1.0e-3,
        schedule="adaptive",          # learning rate adapted to keep KL near desired_kl
        gamma=0.99,
        lam=0.95,
        desired_kl=0.01,
        max_grad_norm=1.0,
    )

The size of the policy network is set by actor_hidden_dims (and critic_hidden_dims for the value function), which dictate the hidden layers of the MLP that maps observations to actions. The input dimension comes from the ObservationsCfg described earlier, while the output dimension is determined by ActionsCfg.
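
As a rough picture, the actor's mean network built from this configuration is equivalent to the plain PyTorch stack below, assuming the 4-dimensional Cartpole observation and the single effort action (RSL-RL constructs this internally in its ActorCritic module, together with a learned action noise):

import torch.nn as nn

# observations: [cart position, pole angle, cart velocity, pole angular velocity] -> 4 inputs
# action: one effort value for "slider_to_cart"                                   -> 1 output
actor = nn.Sequential(
    nn.Linear(4, 32), nn.ELU(),
    nn.Linear(32, 32), nn.ELU(),
    nn.Linear(32, 1),
)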

Training Script

The main responsibilities of the training script scripts/rsl_rl/train.py are to launch the application, retrieve the configurations based on the task name, and then create the environment and runner to execute learn().

  1. Launch the application and parse parameters. Isaac Lab requires the app to be started before importing any simulation-related modules.
  2. Hydra dynamically loads CartpoleEnvCfg and PPORunnerCfg based on the provided task and agent.
  3. Create the environment, wrap it for RSL-RL, build the runner, and start training (a condensed sketch follows).
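
Condensed, the core of the script looks roughly like the sketch below. Argument parsing, Hydra wiring, and logging are stripped out; the module paths follow recent Isaac Lab releases and should be checked against the installed version, and cartpole_task is again a placeholder for the project package that performs the registration.

# Launch Isaac Sim first; simulation modules can only be imported afterwards.
from isaaclab.app import AppLauncher
app_launcher = AppLauncher(headless=True)
simulation_app = app_launcher.app

import gymnasium as gym
from rsl_rl.runners import OnPolicyRunner
from isaaclab_rl.rsl_rl import RslRlVecEnvWrapper
from isaaclab_tasks.utils import load_cfg_from_registry

import cartpole_task  # placeholder: importing the project package runs gym.register(...)

# Resolve the configuration classes registered under the task id.
env_cfg = load_cfg_from_registry("Template-Cartpole-v0", "env_cfg_entry_point")
agent_cfg = load_cfg_from_registry("Template-Cartpole-v0", "rsl_rl_cfg_entry_point")
env_cfg.scene.num_envs = 4096  # same effect as --num_envs 4096

# Create the vectorized environment and wrap it for RSL-RL.
env = gym.make("Template-Cartpole-v0", cfg=env_cfg)
env = RslRlVecEnvWrapper(env)

# Build the PPO runner and run the training loop.
runner = OnPolicyRunner(env, agent_cfg.to_dict(), log_dir="logs/cartpole", device="cuda:0")
runner.learn(num_learning_iterations=agent_cfg.max_iterations)

simulation_app.close()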

In conclusion, the workflow involves modifying configurations (CartpoleEnvCfg / PPORunnerCfg), creating the environment with gym.make("Template-Cartpole-v0", cfg=env_cfg), and executing the training loop through the OnPolicyRunner.

Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/training-your-first-balance-robot-a-guide-to-cartpole-simulation-in-robotics/
