D4RL (MuJoCo)
~~~~~~~~~~~~~~~

Abstract
==========

D4RL is an open-source benchmark for offline reinforcement learning. It provides standardized environments and datasets for training and benchmarking algorithms. The datasets are collected with the following strategies:

1. datasets generated by hand-designed rules and expert demonstrations
2. multi-task datasets (agents performing different tasks in the same environment)
3. datasets collected with hybrid (mixed) policies

Specifically, the following seven sub-environments are included:

- Maze2D
- AntMaze
- Gym-MuJoCo
- Adroit
- FrankaKitchen
- Flow
- Offline CARLA

Note: offline RL algorithms are trained on D4RL datasets and evaluated by interacting with a specific RL environment, such as MuJoCo.

MuJoCo is a physics engine designed to facilitate research and development in areas such as robotics, biomechanics, graphics, and animation that require fast and accurate simulation, and it is often used as a benchmark environment for continuous-control reinforcement learning algorithms. It is a collection of 20 sub-environments; the ones used in D4RL are HalfCheetah, Hopper, and Walker2d. Each of these comes with five dataset variants (a loading sketch follows the image below):

- expert: train a SAC algorithm online until the policy reaches expert-level performance, then use the expert policy to collect 1 million samples
- medium-expert: mix equal amounts of data collected by the expert and medium policies
- medium: train a SAC algorithm online, stop training partway through, and use this partially trained policy to collect 1 million samples
- medium-replay: train a SAC algorithm online until the policy reaches a medium performance level, and collect all samples placed in the replay buffer during training
- random: use a randomly initialized policy to collect data

The animation below shows the Hopper environment.

.. image:: ./images/d4rl.gif
   :align: center
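The dataset variants above are exposed through Gym environment IDs of the form ``<environment>-<variant>-v0`` (for example, ``hopper-medium-v0``, which is also used in the configuration later on this page). The snippet below is a minimal loading sketch; it assumes d4rl and mujoco-py are already installed as described in the Installation section below, and the exact set of registered dataset IDs may vary slightly with the d4rl version.

.. code:: python

    import gym
    import d4rl  # noqa: F401  # the import registers the D4RL environments in gym

    # The five Hopper dataset variants described above; HalfCheetah and
    # Walker2d follow the same naming pattern.
    hopper_datasets = [
        'hopper-random-v0',
        'hopper-medium-v0',
        'hopper-medium-replay-v0',
        'hopper-medium-expert-v0',
        'hopper-expert-v0',
    ]

    for name in hopper_datasets:
        env = gym.make(name)
        dataset = env.get_dataset()  # dict with observations, actions, rewards, terminals, ...
        print(name, dataset['observations'].shape, dataset['actions'].shape)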
Installation
==============

Method
---------

You need to install d4rl, gym, and mujoco-py. d4rl can be installed with one click via pip, or by cloning the repository:

.. code:: shell

    # install via pip
    pip install git+https://github.com/rail-berkeley/d4rl@master#egg=d4rl

    # install by cloning the repository
    git clone https://github.com/rail-berkeley/d4rl.git
    cd d4rl
    pip install -e .

For MuJoCo, you only need to install gym and mujoco-py, either via pip or in combination with DI-engine.

1. The mujoco-py library no longer requires an activation license (``mujoco-py>=2.1.0``) and can be installed via ``pip install free-mujoco-py``.

2. If you install ``mujoco-py>=2.1``, you can do the following:

.. code:: shell

    # Installation for Linux
    # Download the MuJoCo version 2.1 binaries for Linux.
    wget https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz
    # Extract the downloaded mujoco210 directory into ~/.mujoco/mujoco210.
    tar xvf mujoco210-linux-x86_64.tar.gz && mkdir -p ~/.mujoco && mv mujoco210 ~/.mujoco/mujoco210
    # Install and use mujoco-py
    pip install gym
    pip install -U 'mujoco-py<2.2,>=2.1'

    # Installation for macOS
    # Download the MuJoCo version 2.1 binaries for macOS.
    wget https://mujoco.org/download/mujoco210-macos-x86_64.tar.gz
    # Extract the downloaded mujoco210 directory into ~/.mujoco/mujoco210.
    tar xvf mujoco210-macos-x86_64.tar.gz && mkdir -p ~/.mujoco && mv mujoco210 ~/.mujoco/mujoco210
    # Install and use mujoco-py
    pip install gym
    pip install -U 'mujoco-py<2.2,>=2.1'

3. If you install ``mujoco-py<2.1``, you can do the following:

.. code:: shell

    # Installation for Linux
    # Download the MuJoCo version 2.0 binaries for Linux.
    wget https://www.roboti.us/download/mujoco200_linux.zip
    # Extract the downloaded mujoco200 directory into ~/.mujoco/mujoco200.
    unzip mujoco200_linux.zip && mkdir -p ~/.mujoco && mv mujoco200_linux ~/.mujoco/mujoco200
    # Download the unlocked activation key.
    wget https://www.roboti.us/file/mjkey.txt -O ~/.mujoco/mjkey.txt
    # Install and use mujoco-py
    pip install gym
    pip install -U 'mujoco-py<2.1'

    # Installation for macOS
    # Download the MuJoCo version 2.0 binaries for macOS.
    wget https://www.roboti.us/download/mujoco200_macos.zip
    # Extract the downloaded mujoco200 directory into ~/.mujoco/mujoco200.
    unzip mujoco200_macos.zip && mkdir -p ~/.mujoco && mv mujoco200_macos ~/.mujoco/mujoco200
    # Download the unlocked activation key.
    wget https://www.roboti.us/file/mjkey.txt -O ~/.mujoco/mjkey.txt
    # Install and use mujoco-py
    pip install gym
    pip install -U 'mujoco-py<2.1'

Verifying the Installation
--------------------------------

Once the installation is complete, you can verify that it was successful by running the following from the Python command line.

.. code:: python

    import gym
    import d4rl  # Import required to register environments

    # Create the environment
    env = gym.make('maze2d-umaze-v1')

    # d4rl abides by the OpenAI gym interface
    env.reset()
    env.step(env.action_space.sample())

    # Each task is associated with a dataset
    # dataset contains observations, actions, rewards, terminals, and infos
    dataset = env.get_dataset()
    print(dataset['observations'])  # An N x dim_observation Numpy array of observations

    # Alternatively, use d4rl.qlearning_dataset, which also adds next_observations.
    dataset = d4rl.qlearning_dataset(env)

Images
-------------

DI-engine provides Docker images with the framework preinstalled. They can be obtained via ``docker pull opendilab/ding:nightly-mujoco``, or by visiting Docker Hub for more images.

Gym-MuJoCo's Space before Transformation (Original Environment)
=====================================================================

Observation space
---------------------

- The observation vector consists of physical information (3D position, orientation, joint angles, etc.). Its size is ``(N, )``, where N is determined by the environment, and the data type is ``float64``.
- Fujimoto et al. mention that normalizing the observations of the D4RL dataset improves the stability of offline training (a minimal sketch follows below).
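Observation normalization is not part of the environment itself: the statistics are typically computed from the offline dataset and then applied to every observation. The following is a minimal sketch, assuming the ``hopper-medium-v0`` dataset and an illustrative epsilon term; whether to subtract the mean is an empirical choice (see the Key Facts section below).

.. code:: python

    import gym
    import d4rl  # noqa: F401  # registers the D4RL environments

    env = gym.make('hopper-medium-v0')
    dataset = d4rl.qlearning_dataset(env)  # adds next_observations, as shown above

    # Per-dimension statistics computed over the whole offline dataset.
    obs_mean = dataset['observations'].mean(axis=0)
    obs_std = dataset['observations'].std(axis=0) + 1e-3  # epsilon avoids division by zero


    def normalize(obs):
        """Normalize an observation with the dataset statistics."""
        return (obs - obs_mean) / obs_std


    norm_obs = normalize(dataset['observations'])
    print(norm_obs.mean(axis=0).round(3), norm_obs.std(axis=0).round(3))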
Action Space
----------------

- The action vector consists of physical quantities such as joint torques. It is usually an N-dimensional continuous action space (N varies with the specific sub-environment), and the data type is ``np.float32`` (for example, an action may be ``array([-0.9266078 , -0.4958926 , 0.46242517], dtype=float32)``).
- For example, in Hopper N is 3, and each action component takes values in ``[-1, 1]``.

Reward space
----------------

- The reward depends on the specific game content; its scale varies considerably and it is usually a ``float`` value. Typical magnitudes can be found in the benchmark algorithm performance section at the bottom.

Others
--------

- The end of the game marks the end of the current environment episode.

Key Facts
============

1. Vector input of physical information; empirically, it is not advisable to subtract the mean when normalizing
2. Continuous action space
3. Dense rewards
4. Large variation in reward scale

The Space after Transformation (RL Environment)
==============================================================

Observation space
-----------------------

- Basically no change

Action space
----------------

- Basically no transformation; still a continuous action space of size N with values in ``[-1, 1]``, shape ``(N, )``, and data type ``np.float32``

Reward space
-------------

- Basically no change

The above spaces can be expressed with the gym space definitions as follows.

.. code:: python

    import gym
    import numpy as np

    obs_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(11, ), dtype=np.float64)
    act_space = gym.spaces.Box(low=-1, high=1, shape=(3, ), dtype=np.float32)
    rew_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(1, ), dtype=np.float32)

Others
---------

- The ``info`` returned by the environment's ``step`` method must contain the ``final_eval_reward`` key-value pair, which represents the evaluation metric of the whole episode; in MuJoCo it is the cumulative sum of the rewards over the entire episode (see the sketch below).
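The ``final_eval_reward`` convention can be illustrated with a plain gym wrapper using the classic 4-tuple ``step`` API. This is only a sketch of the idea, not DI-engine's actual ``MujocoEnv`` implementation.

.. code:: python

    import gym


    class FinalEvalRewardWrapper(gym.Wrapper):
        """Accumulate the episode reward and expose it as ``final_eval_reward``
        in ``info`` when the episode terminates."""

        def reset(self, **kwargs):
            self._eval_reward = 0.0
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self._eval_reward += reward
            if done:
                info['final_eval_reward'] = self._eval_reward
            return obs, reward, done, info


    env = FinalEvalRewardWrapper(gym.make('Hopper-v3'))
    obs = env.reset()
    done = False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
    print('final_eval_reward:', info['final_eval_reward'])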
Others
========

Lazy initialization
----------------------

To support parallel operations such as environment vectorization, environment instances are generally created with lazy initialization: the ``__init__`` method does not initialize the actual original environment instance, but only sets the relevant parameters and configuration values. The concrete original environment instance is initialized the first time the ``reset`` method is called.

Video storage
----------------

After the environment is created and before it is reset, call the ``enable_save_replay`` method to specify the path where the game replays are saved. The environment automatically saves the replay video after each episode (by default this is implemented with ``gym.wrappers.RecordVideo``). The code below runs one episode and saves its replay in ``./video/``.

.. code:: python

    from easydict import EasyDict
    from dizoo.mujoco.envs import MujocoEnv

    env = MujocoEnv(EasyDict({'env_id': 'Hopper-v3'}))
    env.enable_save_replay(replay_path='./video')
    obs = env.reset()

    while True:
        action = env.random_action()
        timestep = env.step(action)
        if timestep.done:
            print('Episode is over, final eval reward is: {}'.format(timestep.info['final_eval_reward']))
            break

DI-ZOO RUNNABLE CODE EXAMPLE
=================================

The complete training configurations are available in the DI-engine repository on GitHub; a specific configuration, such as ``https://github.com/opendilab/DI-engine/blob/main/dizoo/d4rl/config/hopper_medium_cql_default_config.py``, works with the following demo:

.. code:: python

    from easydict import EasyDict

    hopper_medium_cql_default_config = dict(
        env=dict(
            env_id='hopper-medium-v0',
            norm_obs=dict(use_norm=False, ),
            norm_reward=dict(use_norm=False, ),
            collector_env_num=1,
            evaluator_env_num=8,
            use_act_scale=True,
            n_evaluator_episode=8,
            stop_value=6000,
        ),
        policy=dict(
            cuda=True,
            model=dict(
                obs_shape=11,
                action_shape=3,
                twin_critic=True,
                actor_head_type='reparameterization',
                actor_head_hidden_size=256,
                critic_head_hidden_size=256,
            ),
            learn=dict(
                data_path=None,
                train_epoch=30000,
                batch_size=256,
                learning_rate_q=3e-4,
                learning_rate_policy=1e-4,
                learning_rate_alpha=1e-4,
                ignore_done=False,
                target_theta=0.005,
                discount_factor=0.99,
                alpha=0.2,
                reparameterization=True,
                auto_alpha=False,
                lagrange_thresh=-1.0,
                min_q_weight=5.0,
            ),
            collect=dict(
                n_sample=1,
                unroll_len=1,
                data_type='d4rl',
            ),
            command=dict(),
            eval=dict(evaluator=dict(eval_freq=500, )),
            other=dict(replay_buffer=dict(replay_buffer_size=2000000, ), ),
        ),
    )
    hopper_medium_cql_default_config = EasyDict(hopper_medium_cql_default_config)
    main_config = hopper_medium_cql_default_config

    hopper_medium_cql_default_create_config = dict(
        env=dict(
            type='d4rl',
            import_names=['dizoo.d4rl.envs.d4rl_env'],
        ),
        env_manager=dict(type='base'),
        policy=dict(
            type='cql',
            import_names=['ding.policy.cql'],
        ),
        replay_buffer=dict(type='naive', ),
    )
    hopper_medium_cql_default_create_config = EasyDict(hopper_medium_cql_default_create_config)
    create_config = hopper_medium_cql_default_create_config

Note: offline RL algorithms such as TD3+BC and CQL need a dedicated entry function; refer to the corresponding entry example in the DI-engine repository (a minimal launch sketch is also given at the end of this page).

BENCHMARK ALGORITHM PERFORMANCE
=================================

- Walker2d

  - walker2d-medium-expert-v0 + CQL

    .. image:: ./images/walker2d_medium_expert_cql.png
       :align: center
       :scale: 40%

- A typical run of 1M iterations takes about 9 hours (NVIDIA V100)
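For the configuration shown in the DI-ZOO section above, a minimal launch sketch is given below. It assumes that DI-engine exposes the offline serial entry ``serial_pipeline_offline`` in ``ding.entry`` and that the linked config file defines ``main_config`` and ``create_config`` as in the demo; the exact entry point may differ across DI-engine versions, so check the entry example referenced above.

.. code:: python

    # Minimal launch sketch (assumes ``serial_pipeline_offline`` is available in ding.entry
    # and that the config module exports ``main_config`` and ``create_config``).
    from ding.entry import serial_pipeline_offline
    from dizoo.d4rl.config.hopper_medium_cql_default_config import main_config, create_config

    if __name__ == '__main__':
        # Offline RL: training reads the D4RL dataset, evaluation interacts with the MuJoCo env.
        serial_pipeline_offline((main_config, create_config), seed=0)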