Taxi
~~~~~~~~~~~~~~

Overview 
=============

Taxi is a classic reinforcement learning environment with discrete state and action spaces. It simulates carrying a passenger through a city: the taxi departs from one location, picks up the passenger at another, and delivers the passenger to the destination. The Taxi environment is demonstrated below:

.. image:: ./images/taxi.gif
   :align: center
   :scale: 80%

Installation
=============

Method
--------

The Taxi environment is available directly through the gym library under the id ``Taxi-v3``.

.. code:: shell

    pip install gym

Verify Installation
-------------------

Run the following commands in Python to check that the installation was successful.

.. code:: python

    import gym
    from gym.spaces import Discrete
    env = gym.make("Taxi-v3", render_mode="rgb_array")
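    # gym >= 0.26 returns (obs, info) from reset(); older versions return obs only
    reset_result = env.reset()
    obs = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    print(obs)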
    assert env.observation_space == Discrete(500)
    assert env.action_space == Discrete(6)

Introduction 
=============

Action space
-------------

The Taxi environment has a discrete action space. Each action is a single integer (shape (1,)) in {0, 1, ..., 5}, and each value represents one operation of the taxi:

- \ ``0`` \: move downward

- \ ``1`` \: move upward

- \ ``2`` \: move right

- \ ``3`` \: move left

- \ ``4`` \: pick up passenger

- \ ``5`` \: drop off passenger

In gym, the action space is defined as:

.. code:: python 

    action_space = gym.spaces.Discrete(6)
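
Bare integers are easy to mix up in training or debugging code. As a minimal sketch, the six action indices listed above can be given readable names. The ``TaxiAction`` class is a hypothetical helper for illustration and is not part of gym or DI-zoo:

.. code:: python

    from enum import IntEnum


    class TaxiAction(IntEnum):
        """Hypothetical helper: readable names for the six action indices."""
        MOVE_DOWN = 0
        MOVE_UP = 1
        MOVE_RIGHT = 2
        MOVE_LEFT = 3
        PICKUP = 4
        DROPOFF = 5


    # IntEnum members compare equal to plain ints, so they can be used
    # wherever an integer action index is expected.
    action = TaxiAction.PICKUP
    print(int(action), action.name)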

Observation space
-------------------

Like the action space, the state space is discrete. Each state is a single integer (shape (1,)) in {0, 1, ..., 499}, for a total of 500 states.

The Taxi environment uses a 5×5 map on which 4 locations are marked in colors. The passenger starts at one of the 4 marked locations or in the taxi, and the passenger's destination is also one of the 4 marked locations. Each state jointly describes the taxi and the passenger: 25 taxi positions × 5 passenger locations × 4 destinations gives 500 states. In gym, the observation space is defined as:

.. code:: python

    observation_space = gym.spaces.Discrete(500)

Each state is determined by four components. You can decode a state number into these components as follows:

.. code:: python

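    reset_result = env.reset()
    # gym >= 0.26 returns (obs, info) from reset(); older versions return obs only
    obs = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    taxi_row, taxi_col, pass_loc, dest_idx = env.unwrapped.decode(obs)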

The range of each component is as follows:

- \ ``taxi_row`` \: The row index of the taxi, one of 0, 1, 2, 3, 4.

- \ ``taxi_col`` \: The column index of the taxi, one of 0, 1, 2, 3, 4.

- \ ``pass_loc`` \: The passenger's location, one of 0, 1, 2, 3, 4. The values 0, 1, 2, 3 represent red, green, yellow, blue respectively, while 4 indicates that the passenger is in the taxi.

- \ ``dest_idx`` \: The passenger's destination, one of 0, 1, 2, 3, representing red, green, yellow, blue respectively.

The encoded state value is computed as ``taxi_row * 100 + taxi_col * 20 + pass_loc * 4 + dest_idx``.
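
The encoding formula above can be checked without gym. The sketch below implements it in plain Python together with its inverse. The function names ``encode`` and ``decode`` are chosen here for illustration and only mirror the formula given in this document:

.. code:: python

    def encode(taxi_row, taxi_col, pass_loc, dest_idx):
        """Combine the four components into a single state number (0..499)."""
        return taxi_row * 100 + taxi_col * 20 + pass_loc * 4 + dest_idx


    def decode(state):
        """Invert encode() by peeling off each component with divmod."""
        taxi_row, rest = divmod(state, 100)
        taxi_col, rest = divmod(rest, 20)
        pass_loc, dest_idx = divmod(rest, 4)
        return taxi_row, taxi_col, pass_loc, dest_idx


    # Taxi at row 3, column 1; passenger waiting at yellow (2); destination red (0)
    state = encode(3, 1, 2, 0)
    print(state, decode(state))

Because each component is smaller than the multiplier of the component before it, every combination maps to a distinct number in 0..499.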

Reward space
--------------

- \ ``-1`` \: for every step the taxi takes, including a legal pickup of the passenger at the passenger's location.

- \ ``-10`` \: for an illegal pickup or drop-off. This covers attempting a pickup where the passenger is not waiting, attempting a pickup when the passenger is already on board, attempting a drop-off when no passenger is on board, and attempting a drop-off while the taxi is not at the destination.

- \ ``+20`` \: for successfully dropping off the passenger at the correct destination.
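
The three rules above can be written as a small function. This is a simplified sketch that follows the rules exactly as listed here, so edge cases in the actual gym implementation may be scored differently. The function name ``reward_for`` and the coordinates in ``MARKED_LOCATIONS`` are assumptions made for illustration:

.. code:: python

    # Assumed (row, col) coordinates of the marked locations, indexed like
    # pass_loc and dest_idx: 0 = red, 1 = green, 2 = yellow, 3 = blue.
    MARKED_LOCATIONS = [(0, 0), (0, 4), (4, 0), (4, 3)]
    IN_TAXI = 4
    PICKUP, DROPOFF = 4, 5


    def reward_for(action, taxi_row, taxi_col, pass_loc, dest_idx):
        """Return the reward for taking `action` in the given state."""
        taxi_pos = (taxi_row, taxi_col)
        if action == PICKUP:
            # Legal only if the passenger is waiting where the taxi stands.
            waiting_here = pass_loc != IN_TAXI and taxi_pos == MARKED_LOCATIONS[pass_loc]
            return -1 if waiting_here else -10
        if action == DROPOFF:
            # Successful only if the passenger is on board and the taxi is at the destination.
            delivered = pass_loc == IN_TAXI and taxi_pos == MARKED_LOCATIONS[dest_idx]
            return 20 if delivered else -10
        # Any movement action costs one step.
        return -1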

Termination Condition
----------------------
Each episode terminates when one of the following conditions is satisfied:

- The passenger is successfully delivered. If there is no limit on the number of steps, this is the only way an episode can terminate.
- The maximum number of steps per episode is reached. This limit can be set through the variable ``max_episode_steps`` in the environment.
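
To illustrate the second condition, the sketch below wraps an environment so that an episode ends once a step budget is used up. ``StepLimit`` and ``NeverDelivers`` are hypothetical classes written for this example, and the four-value return of ``step`` is an assumption; this is not the mechanism gym or DI-zoo actually uses:

.. code:: python

    class StepLimit:
        """Hypothetical wrapper that ends an episode after a fixed number of steps."""

        def __init__(self, env, max_episode_steps):
            self.env = env
            self.max_episode_steps = max_episode_steps
            self._steps = 0

        def reset(self):
            self._steps = 0
            return self.env.reset()

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self._steps += 1
            if self._steps >= self.max_episode_steps:
                done = True  # step budget exhausted, even without a delivery
            return obs, reward, done, info


    class NeverDelivers:
        """Toy stand-in environment in which the passenger is never delivered."""

        def reset(self):
            return 0

        def step(self, action):
            return 0, -1, False, {}


    env = StepLimit(NeverDelivers(), max_episode_steps=60)
    env.reset()
    steps, done = 0, False
    while not done:
        _, _, done, _ = env.step(0)
        steps += 1
    print(steps)

Without the wrapper, the loop over ``NeverDelivers`` would never finish, which is why a step limit is useful when the agent has not yet learned to deliver the passenger.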

Implementation example inside DI-zoo
=========================================

The following code is the configuration for training on the Taxi-v3 environment with the DQN algorithm:

.. code:: python

    from easydict import EasyDict

    taxi_dqn_config = dict(
        exp_name='taxi_dqn_seed0',
        env=dict(
            collector_env_num=8,
            evaluator_env_num=8,
            n_evaluator_episode=8,   
            stop_value=20,           
            max_episode_steps=60,    
            env_id="Taxi-v3" 
        ),
        policy=dict(
            cuda=True,
            model=dict(
                obs_shape=34,
                action_shape=6,
                encoder_hidden_size_list=[128, 128]
            ),
            random_collect_size=5000,
            nstep=3,
            discount_factor=0.99,
            learn=dict(
                update_per_collect=10,
                batch_size=64,
                learning_rate=0.0001,
                learner=dict(
                    hook=dict(
                        log_show_after_iter=1000,
                    )
                ),
            ),
            collect=dict(n_sample=32),
            eval=dict(evaluator=dict(eval_freq=1000, )), 
            other=dict(
                eps=dict(
                    type="linear",
                    start=1,
                    end=0.05,
                    decay=3000000
                ),
                replay_buffer=dict(replay_buffer_size=100000, ),
            ),
        )
    )
    taxi_dqn_config = EasyDict(taxi_dqn_config)
    main_config = taxi_dqn_config

    taxi_dqn_create_config = dict(
        env=dict(
            type="taxi",
            import_names=["dizoo.taxi.envs.taxi_env"]
        ),
        env_manager=dict(type='base'),
        policy=dict(type='dqn'),
        replay_buffer=dict(type='deque', import_names=['ding.data.buffer.deque_buffer_wrapper']),
    )

    taxi_dqn_create_config = EasyDict(taxi_dqn_create_config)
    create_config = taxi_dqn_create_config

    if __name__ == "__main__":
        from ding.entry import serial_pipeline
        serial_pipeline((main_config, create_config), max_env_step=3000000, seed=0)
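
The config above sets ``obs_shape=34`` although the raw gym observation is a single integer, which suggests that the state is converted into a vector before it reaches the network. The sketch below shows one layout that yields exactly 34 entries: a one-hot block of 25 for the taxi cell, 5 for the passenger location, and 4 for the destination. Both the layout and the function name ``one_hot_state`` are assumptions made for illustration; this document does not state how the DI-zoo environment builds its observation.

.. code:: python

    def one_hot_state(taxi_row, taxi_col, pass_loc, dest_idx):
        """Assumed encoding: three concatenated one-hot blocks (25 + 5 + 4 = 34)."""
        vec = [0.0] * 34
        vec[taxi_row * 5 + taxi_col] = 1.0  # entries 0..24: taxi cell on the 5x5 map
        vec[25 + pass_loc] = 1.0            # entries 25..29: passenger location
        vec[30 + dest_idx] = 1.0            # entries 30..33: destination
        return vec


    vec = one_hot_state(3, 1, 2, 0)
    print(len(vec), [i for i, v in enumerate(vec) if v == 1.0])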

Benchmark Algorithm Performance
=================================

The total number of environment steps is set to 3,000,000 (matching ``max_env_step=3000000`` in the code above), and three different random seeds are used. The results of the DQN algorithm are shown in the figure below. The average evaluation reward begins to converge after about 700k to 800k steps and is essentially stable after 1M steps, at which point every evaluation episode successfully picks up and delivers the passenger.

.. image:: ./images/taxidqn.png
   :align: center
   :scale: 80%