
Hello World for DI

Decision intelligence is one of the most important directions in the field of artificial intelligence. Its general form is to use an agent to process information from an environment, give reasonable feedback and responses, and drive the state of the environment toward what the designer expects. For example, an autonomous driving car receives information about road conditions from the environment, makes real-time driving decisions, and steers the vehicle to the set destination.

We first use the “lunarlander” environment to introduce the agent in DI-engine and how it interacts with the environment. In this simulated environment, the agent needs to land the lunar lander safely and smoothly in the designated area while avoiding a crash.

../_images/lunarlander.gif

Let the Agent Run

An agent is an object that can interact with the environment on its own; in essence it is a mathematical model that accepts input and feeds back output. The model consists of a model structure and a set of model parameters. In general, we write the model to a file to save it, or read the model from that file to deploy it. Here we provide an agent model trained with the DI-engine framework using the DQN algorithm: final.pth.tar. Use the following code to make the agent run, and remember to replace the model path passed to the function (“ckpt_path=’./final.pth.tar’”) with the path of the locally saved model file, such as “’~/Download/final.pth.tar’”:

import gym # Load the gym library, which is used to standardize the reinforcement learning environment
import torch # Load the PyTorch library for loading the Tensor model and defining the computing network
from easydict import EasyDict # Load EasyDict for instantiating configuration files
from ding.config import compile_config # Load configuration related components in DI-engine config module
from ding.envs import DingEnvWrapper # Load environment related components in DI-engine env module
from ding.policy import DQNPolicy, single_env_forward_wrapper # Load policy-related components in DI-engine policy module
from ding.model import DQN # Load model related components in DI-engine model module
from dizoo.box2d.lunarlander.config.lunarlander_dqn_config import main_config, create_config # Load DI-zoo lunarlander environment and DQN algorithm related configurations


def main(main_config: EasyDict, create_config: EasyDict, ckpt_path: str):
    main_config.exp_name = 'lunarlander_dqn_deploy' # Set the name of the experiment to be run in this deployment, which is the name of the project folder to be created
    cfg = compile_config(main_config, create_cfg=create_config, auto=True) # Compile and generate all configurations
    env = DingEnvWrapper(gym.make(cfg.env.env_id), EasyDict(env_wrapper='default')) # Add the DI-engine environment decorator upon the gym's environment instance
    env.enable_save_replay(replay_path='./lunarlander_dqn_deploy/video') # Enable the video recording of the environment and set the video saving folder
    model = DQN(**cfg.policy.model) # Import model configuration, instantiate DQN model
    state_dict = torch.load(ckpt_path, map_location='cpu') # Load model parameters from file
    model.load_state_dict(state_dict['model']) # Load model parameters into the model
    policy = DQNPolicy(cfg.policy, model=model).eval_mode # Import policy configuration, import model, instantiate DQN policy, and turn to evaluation mode
    forward_fn = single_env_forward_wrapper(policy.forward) # Wrap the DQN policy's forward (decision) method with the single-environment forward wrapper
    obs = env.reset() # Reset the environment and get the initial observation
    returns = 0. # Initialize the episode return
    while True: # Let the agent's policy interact with the environment in a loop until the episode ends
        action = forward_fn(obs) # Make a decision based on the current observation and generate an action
        obs, rew, done, info = env.step(action) # Execute the action in the environment to get the next observation, the reward of this interaction, the episode-termination signal, and other information
        returns += rew # Accumulate the reward into the episode return
        if done:
            break
    print(f'Deploy is finished, final episode return is: {returns}')

if __name__ == "__main__":
    main(main_config=main_config, create_config=create_config, ckpt_path='./final.pth.tar')

As shown in the code, torch.load reads the PyTorch checkpoint that stores the model parameters. These parameters are then loaded into DI-engine's DQN model with load_state_dict, which rebuilds the trained model. The model is next passed into the DQN policy, and the evaluation-mode forward function (wrapped as forward_fn) lets the agent produce a feedback action for the environment state, obs. Executing this action in the environment yields the environment state, obs, at the next moment, the reward, rew, of this interaction, the signal, done, indicating whether the episode has ended, and other information, info.
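If you are curious what such a checkpoint contains, you can inspect it directly. The sketch below only assumes what the deployment code above already relies on, namely that the checkpoint is a dictionary whose 'model' entry stores the network weights (it may contain other entries as well, depending on how it was saved):

import torch

state_dict = torch.load('./final.pth.tar', map_location='cpu')  # load the checkpoint onto the CPU
print(state_dict.keys())           # the 'model' entry holds the network weights used above
print(state_dict['model'].keys())  # names of the parameter tensors of the DQN network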

Note

The environment state is generally a set of vectors or tensors. The reward is generally a real value. The signal indicating whether the environment has ended is a boolean variable. Other information is any additional message the creator of the environment wants to pass along, in an arbitrary format.

The rewards at every step are accumulated as the total score (episode return) of the agent on this task.
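As a quick illustration of these types, here is a minimal, standalone sketch using gym directly. It assumes the classic gym API in which step returns a 4-tuple, as in the deployment code above, and uses the standard LunarLander-v2 environment id:

import gym

env = gym.make('LunarLander-v2')
obs = env.reset()                                            # the initial observation
obs, rew, done, info = env.step(env.action_space.sample())   # take one random action
print(type(obs), obs.shape)  # a numpy array observation, shape (8,) for LunarLander
print(type(rew))             # a real-valued reward
print(type(done))            # a boolean episode-termination signal
print(type(info))            # a dict of extra information
env.close()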

Note

You can see the total score of the deployed agent in the log, and you can see the replay video in the experiment folder.

../_images/evaluator_info.png
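To locate the replay videos mentioned in the note above, a few lines of Python are enough. This is a minimal sketch; the folder matches the replay_path set in the deployment code, while the .mp4 extension is an assumption that may depend on the recorder backend:

import glob

# List the replay videos produced by the deployment run above
for path in glob.glob('./lunarlander_dqn_deploy/video/*.mp4'):
    print(path)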

To Better Evaluate Agents

In many reinforcement learning settings, the initial state of the environment is not always exactly the same, and the performance of the agent may fluctuate across different initial states. In the “lunarlander” environment, for example, the lunar surface is different in every episode.
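You can observe this fluctuation directly by repeating the single-environment deployment above with different seeds. The sketch below is minimal: env and forward_fn are assumed to be built exactly as in the deployment code, and the classic gym seeding API is assumed.

# Evaluate the same policy over several episodes with different seeds and
# watch how the episode return fluctuates from run to run.
returns = []
for seed in range(5):
    env.seed(seed)                # a different seed produces a different lunar surface
    obs = env.reset()
    episode_return, done = 0., False
    while not done:
        action = forward_fn(obs)  # the policy's decision for the current observation
        obs, rew, done, info = env.step(action)
        episode_return += rew
    returns.append(episode_return)
print(f'episode returns: {returns}, mean: {sum(returns) / len(returns):.2f}')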

Therefore, we need to set up multiple environments and run several evaluation episodes to score the agent more reliably. DI-engine provides the environment manager env_manager for exactly this purpose, and we can use it with the following, slightly more complex code:

import os
import gym
import torch
from tensorboardX import SummaryWriter
from easydict import EasyDict

from ding.config import compile_config
from ding.worker import BaseLearner, SampleSerialCollector, InteractionSerialEvaluator, AdvancedReplayBuffer
from ding.envs import BaseEnvManager, DingEnvWrapper
from ding.policy import DQNPolicy
from ding.model import DQN
from ding.utils import set_pkg_seed
from ding.rl_utils import get_epsilon_greedy_fn
from dizoo.box2d.lunarlander.config.lunarlander_dqn_config import main_config, create_config

# Wrap the gym environment into a DI-engine environment instance
def wrapped_lunarlander_env():
    return DingEnvWrapper(
        gym.make(main_config['env']['env_id']),
        EasyDict(env_wrapper='default'),
    )


def main(cfg, seed=0):
    cfg['exp_name'] = 'lunarlander_dqn_eval'
    cfg = compile_config(
        cfg,
        BaseEnvManager,
        DQNPolicy,
        BaseLearner,
        SampleSerialCollector,
        InteractionSerialEvaluator,
        AdvancedReplayBuffer,
        save_cfg=True
    )
    cfg.policy.load_path = './final.pth.tar'

    # build multiple environments and use env_manager to manage them
    evaluator_env_num = cfg.env.evaluator_env_num
    evaluator_env = BaseEnvManager(env_fn=[wrapped_lunarlander_env for _ in range(evaluator_env_num)], cfg=cfg.env.manager)

    # Enable saving replay videos: either use the path from the config (commented line) or set one explicitly
    # evaluator_env.enable_save_replay(cfg.env.replay_path)
    evaluator_env.enable_save_replay(replay_path='./lunarlander_dqn_eval/video')

    # Set random seeds for all packages and instances
    evaluator_env.seed(seed, dynamic_seed=False)
    set_pkg_seed(seed, use_cuda=cfg.policy.cuda)

    # Set up RL Policy
    model = DQN(**cfg.policy.model)  # Instantiate the DQN model from the model config
    policy = DQNPolicy(cfg.policy, model=model)  # Instantiate the DQN policy
    policy.eval_mode.load_state_dict(torch.load(cfg.policy.load_path, map_location='cpu'))  # Load the trained parameters into the evaluation-mode policy

    # Evaluate
    tb_logger = SummaryWriter(os.path.join('./{}/log/'.format(cfg.exp_name), 'serial'))
    evaluator = InteractionSerialEvaluator(
        cfg.policy.eval.evaluator, evaluator_env, policy.eval_mode, tb_logger, exp_name=cfg.exp_name
    )
    evaluator.eval()

if __name__ == "__main__":
    main(main_config)

Note

When evaluating in multiple environments in parallel, the environment manager of DI-engine will also report the average, maximum and minimum episode returns, as well as other indicators related to specific algorithms.

Training Stronger Agents from Scratch

The agent model used in the tests above was trained with DI-engine by running the following code. Try training an agent model yourself; it may turn out even stronger:

import gym
from ditk import logging
from ding.model import DQN
from ding.policy import DQNPolicy
from ding.envs import DingEnvWrapper, BaseEnvManagerV2, SubprocessEnvManagerV2
from ding.data import DequeBuffer
from ding.config import compile_config
from ding.framework import task, ding_init
from ding.framework.context import OnlineRLContext
from ding.framework.middleware import OffPolicyLearner, StepCollector, interaction_evaluator, data_pusher, \
    eps_greedy_handler, CkptSaver, online_logger, nstep_reward_enhancer
from ding.utils import set_pkg_seed
from dizoo.box2d.lunarlander.config.lunarlander_dqn_config import main_config, create_config

def main():
    logging.getLogger().setLevel(logging.INFO)
    cfg = compile_config(main_config, create_cfg=create_config, auto=True)
    ding_init(cfg)
    with task.start(async_mode=False, ctx=OnlineRLContext()):
        collector_env = SubprocessEnvManagerV2(
            env_fn=[lambda: DingEnvWrapper(gym.make(cfg.env.env_id)) for _ in range(cfg.env.collector_env_num)],
            cfg=cfg.env.manager
        )
        evaluator_env = SubprocessEnvManagerV2(
            env_fn=[lambda: DingEnvWrapper(gym.make(cfg.env.env_id)) for _ in range(cfg.env.evaluator_env_num)],
            cfg=cfg.env.manager
        )

        set_pkg_seed(cfg.seed, use_cuda=cfg.policy.cuda)

        model = DQN(**cfg.policy.model)
        buffer_ = DequeBuffer(size=cfg.policy.other.replay_buffer.replay_buffer_size)
        policy = DQNPolicy(cfg.policy, model=model)

        task.use(interaction_evaluator(cfg, policy.eval_mode, evaluator_env))  # Periodically evaluate the current policy in the evaluator environments
        task.use(eps_greedy_handler(cfg))  # Schedule the epsilon value for epsilon-greedy exploration
        task.use(StepCollector(cfg, policy.collect_mode, collector_env))  # Collect environment steps with the collect-mode policy
        task.use(nstep_reward_enhancer(cfg))  # Compute n-step rewards for the DQN learning target
        task.use(data_pusher(cfg, buffer_))  # Push the collected data into the replay buffer
        task.use(OffPolicyLearner(cfg, policy.learn_mode, buffer_))  # Sample from the buffer and update the policy
        task.use(online_logger(train_show_freq=10))  # Log training information periodically
        task.use(CkptSaver(policy, cfg.exp_name, train_freq=100))  # Save model checkpoints periodically
        task.run()  # Run the pipeline until the stop condition in the config is reached

if __name__ == "__main__":
    main()

Note

With an Intel i5-10210U 1.6 GHz CPU and no GPU device, the code above takes about 10 minutes to train to the default termination point. If you want a shorter training time, try the simpler CartPole environment.
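Switching to CartPole mainly amounts to importing a different configuration; a minimal sketch is shown below. The config module path follows the usual DI-zoo layout but may differ slightly across DI-engine versions, so treat it as an assumption and check your local dizoo package:

# In the training script above, replace the lunarlander config import with the CartPole one:
# from dizoo.box2d.lunarlander.config.lunarlander_dqn_config import main_config, create_config
from dizoo.classic_control.cartpole.config.cartpole_dqn_config import main_config, create_config
# The rest of the pipeline (env managers, DQN model, middleware) stays the same,
# since the environment id and network sizes are all read from this config.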

Note

DI-engine integrates the TensorBoard component to record key information during the training process. You can open it while training to see information updated in real time, such as the average episode return recorded by the evaluator.

Well done! You have now completed the Hello World task of DI-engine: you used the provided code and model, and learned how a reinforcement learning agent interacts with its environment. Please continue with the next document, First Reinforcement Learning Program, to understand how the RL pipeline is built in DI-engine.