Worker

MuZeroCollector

class lzero.worker.muzero_collector.MuZeroCollector(collect_print_freq: int = 100, env: BaseEnvManager = None, policy: namedtuple = None, tb_logger: SummaryWriter = None, exp_name: str | None = 'default_experiment', instance_name: str | None = 'collector', policy_config: policy_config = None)[source]

Bases: ISerialCollector

Overview:

The Collector for MCTS+RL algorithms, including MuZero, EfficientZero, Sampled EfficientZero, and Gumbel MuZero. It manages the data collection process for training these algorithms using a serial mechanism.

Interfaces:

__init__, reset, reset_env, reset_policy, _reset_stat, envstep, __del__, _compute_priorities, pad_and_save_last_trajectory, collect, _output_log, close

Properties:

envstep

__init__(collect_print_freq: int = 100, env: BaseEnvManager = None, policy: namedtuple = None, tb_logger: SummaryWriter = None, exp_name: str | None = 'default_experiment', instance_name: str | None = 'collector', policy_config: policy_config = None) None[source]
Overview:

Initialize the MuZeroCollector with the given parameters.

Parameters:
  • collect_print_freq (int) – Frequency (in training steps) at which to print collection information.

  • env (BaseEnvManager) – Instance of the subclass of vectorized environment manager.

  • policy (namedtuple) – namedtuple of the collect_mode policy API.

  • tb_logger (SummaryWriter) – TensorBoard logger instance.

  • exp_name (str) – Name of the experiment, used for logging and saving purposes.

  • instance_name (str) – Unique identifier for this collector instance.

  • policy_config (policy_config) – Configuration object for the policy.

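Example:

A minimal construction sketch. The env_manager, collect_policy, and policy_config objects are assumed to have been built elsewhere (e.g., from a LightZero experiment config); they are placeholders here, not part of this API.

    from tensorboardX import SummaryWriter
    from lzero.worker.muzero_collector import MuZeroCollector

    # env_manager, collect_policy and policy_config are placeholders created elsewhere.
    tb_logger = SummaryWriter('./default_experiment/log/collector')
    collector = MuZeroCollector(
        collect_print_freq=100,
        env=env_manager,          # vectorized BaseEnvManager subclass instance
        policy=collect_policy,    # collect_mode policy API namedtuple
        tb_logger=tb_logger,
        exp_name='default_experiment',
        instance_name='collector',
        policy_config=policy_config,
    )
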
_compute_priorities(i: int, pred_values_lst: List[float], search_values_lst: List[float]) ndarray[source]
Overview:

Compute the priorities for transitions based on prediction and search value discrepancies.

Parameters:
  • i (int) – Index into the value lists for which to compute priorities.

  • pred_values_lst (List[float]) – List of predicted values.

  • search_values_lst (List[float]) – List of search values obtained from MCTS.

Returns:

Array of computed priorities.

Return type:

  • priorities (np.ndarray)

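Example:

The exact formula is internal to the collector; the following is only a plausible sketch, assuming the priority of each transition is the absolute distance between the predicted value and the MCTS search value, with a small epsilon for numerical stability, and that each list holds per-environment value sequences.

    import numpy as np

    def compute_priorities_sketch(i, pred_values_lst, search_values_lst, eps=1e-6):
        # Illustrative only: priority = |predicted value - searched value| + eps.
        pred_values = np.asarray(pred_values_lst[i], dtype=np.float32)
        search_values = np.asarray(search_values_lst[i], dtype=np.float32)
        return np.abs(pred_values - search_values) + eps

    # Toy usage: values for a single environment (index 0).
    print(compute_priorities_sketch(0, [[0.1, 0.5, 0.9]], [[0.2, 0.4, 1.0]]))
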
_output_log(train_iter: int) None[source]
Overview:

Log the collector’s data and output the log information.

Parameters:

train_iter (int) – Current training iteration number for logging context.

_reset_stat(env_id: int) None[source]
Overview:

Reset the collector's per-environment state, including the traj_buffer, obs_pool, policy_output_pool and env_info, for the environment specified by env_id. Refer to base_serial_collector for more details.

Parameters:

env_id (int) – ID of the environment whose collector state should be reset.

close() None[source]
Overview:

Close the collector. If end_flag is False (i.e., the collector has not been closed yet), close the environment, then flush and close the tb_logger.

collect(n_episode: int | None = None, train_iter: int = 0, policy_kwargs: dict | None = None, collect_with_pure_policy: bool = False) List[Any][source]
Overview:

Collect n_episode episodes of data using the given policy_kwargs, with the policy trained for train_iter iterations.

Parameters:
  • n_episode (Optional[int]) – Number of episodes to collect.

  • train_iter (int) – Number of training iterations completed so far.

  • policy_kwargs (Optional[dict]) – Additional keyword arguments for the policy.

  • collect_with_pure_policy (bool) – Whether to collect data using the pure policy, without MCTS.

Returns:

Collected data in the form of a list.

Return type:

  • return_data (List[Any])

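Example:

A sketch of one collection step in a serial training loop. The collector, learner, and replay_buffer objects are placeholders created elsewhere, the temperature/epsilon keys are illustrative exploration settings (the exact keys depend on the policy), and push_game_segments is a hypothetical buffer method shown only for illustration.

    # One collection step inside a training loop (placeholders throughout).
    policy_kwargs = {'temperature': 1.0, 'epsilon': 0.0}
    new_data = collector.collect(
        n_episode=8,
        train_iter=learner.train_iter,
        policy_kwargs=policy_kwargs,
    )
    # new_data is a list of collected segments plus metadata; pushing it into a
    # replay buffer is shown with a hypothetical method name.
    replay_buffer.push_game_segments(new_data)
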
config = {}
classmethod default_config() EasyDict
Overview:

Get the collector's default config. The collector's default config is merged with other default configs and the user's config to produce the final config.

Returns:

The collector's default config.

Return type:

  • cfg (EasyDict)

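Example:

Because default_config is a classmethod returning an EasyDict, the defaults can be inspected and overridden before being merged into the experiment's final configuration (a minimal sketch; the overridden field is only illustrative).

    from lzero.worker.muzero_collector import MuZeroCollector

    # Inspect the collector's defaults and override a field before use.
    cfg = MuZeroCollector.default_config()
    cfg.collect_print_freq = 50
    print(cfg)
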
property envstep: int
Overview:

Get the total number of environment steps collected.

Returns:

Total number of environment steps collected.

Return type:

  • envstep (int)

pad_and_save_last_trajectory(i: int, last_game_segments: List[GameSegment], last_game_priorities: List[ndarray], game_segments: List[GameSegment], done: ndarray) None[source]
Overview:

Save the game segment to the pool if the current game is finished, padding it if necessary.

Parameters:
  • i (int) – Index of the current game segment.

  • last_game_segments (List[GameSegment]) – List of the last game segments to be padded and saved.

  • last_game_priorities (List[np.ndarray]) – List of priorities of the last game segments.

  • game_segments (List[GameSegment]) – List of the current game segments.

  • done (np.ndarray) – Array indicating whether each game is done.

Note

(last_game_segments[i].obs_segment[-4:][j] == game_segments[i].obs_segment[:4][j]).all() is True, i.e., the trailing observations of the previous segment coincide with the leading observations of the current segment.

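Example:

The note above states the overlap invariant between consecutive segments. A check of that invariant could be written as the following sketch, assuming obs_segment supports NumPy-style slicing and pad is the number of overlapping observations (4 in the note).

    import numpy as np

    def check_segment_overlap(last_segment, current_segment, pad=4):
        # Verify that the trailing observations of the previous segment equal
        # the leading observations of the current segment.
        last_obs = np.asarray(last_segment.obs_segment[-pad:])
        first_obs = np.asarray(current_segment.obs_segment[:pad])
        return bool((last_obs == first_obs).all())
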
reset(_policy: namedtuple | None = None, _env: BaseEnvManager | None = None) None[source]
Overview:

Reset the collector with the given policy and/or environment. If _env is None, reset the old environment; otherwise, replace the old environment in the collector with the newly passed-in environment and launch it. Likewise, if _policy is None, reset the old policy; otherwise, replace the old policy in the collector with the newly passed-in policy.

Parameters:
  • _policy (Optional[namedtuple]) – the API namedtuple of the collect_mode policy

  • _env (Optional[BaseEnvManager]) – instance of the subclass of vectorized env_manager (BaseEnvManager)

reset_env(_env: BaseEnvManager | None = None) None[source]
Overview:

Reset or replace the environment managed by this collector. If _env is None, reset the old environment; otherwise, replace the old environment in the collector with the newly passed-in environment and launch it.

Parameters:

_env (Optional[BaseEnvManager]) – New environment to manage, if provided.

reset_policy(_policy: namedtuple | None = None) None[source]
Overview:

Reset or replace the policy used by this collector. If _policy is None, reset the old policy; otherwise, replace the old policy in the collector with the newly passed-in policy.

Parameters:

_policy (Optional[namedtuple]) – the API namedtuple of the collect_mode policy

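Example:

After the learner updates its weights, only the policy usually needs to be refreshed; both components can also be swapped in a single call. A sketch, where new_collect_policy and new_env_manager are placeholders created elsewhere.

    # Replace only the policy; the current environment keeps running.
    collector.reset_policy(new_collect_policy)

    # Replace both the policy and the environment in one call.
    collector.reset(_policy=new_collect_policy, _env=new_env_manager)
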
MuZeroEvaluator

class lzero.worker.muzero_evaluator.MuZeroEvaluator(eval_freq: int = 1000, n_evaluator_episode: int = 3, stop_value: int = 1000000.0, env: BaseEnvManager = None, policy: namedtuple = None, tb_logger: SummaryWriter = None, exp_name: str | None = 'default_experiment', instance_name: str | None = 'evaluator', policy_config: policy_config = None)[source]

Bases: ISerialEvaluator

Overview:

The Evaluator class for MCTS+RL algorithms, such as MuZero, EfficientZero, and Sampled EfficientZero.

Interfaces:

__init__, reset, reset_policy, reset_env, close, should_eval, eval

Properties:

env, policy

__init__(eval_freq: int = 1000, n_evaluator_episode: int = 3, stop_value: int = 1000000.0, env: BaseEnvManager = None, policy: namedtuple = None, tb_logger: SummaryWriter = None, exp_name: str | None = 'default_experiment', instance_name: str | None = 'evaluator', policy_config: policy_config = None) None[source]
Overview:

Initialize the evaluator with configuration settings for various components, such as the logger helper and timer.

Parameters:
  • eval_freq (int) – Evaluation frequency in terms of training steps.

  • n_evaluator_episode (int) – Number of episodes to evaluate in total.

  • stop_value (float) – A reward threshold above which the training is considered converged.

  • env (Optional[BaseEnvManager]) – An optional instance of a subclass of BaseEnvManager.

  • policy (Optional[namedtuple]) – An optional API namedtuple defining the policy for evaluation.

  • tb_logger (Optional[SummaryWriter]) – Optional TensorBoard logger instance.

  • exp_name (str) – Name of the experiment, used to determine the output directory.

  • instance_name (str) – Name of this evaluator instance.

  • policy_config (Optional[policy_config]) – Optional configuration for the game policy.

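Example:

A minimal construction sketch, analogous to the collector. The eval_env_manager, eval_policy, and policy_config objects are placeholders assumed to be built elsewhere from the experiment configuration.

    from tensorboardX import SummaryWriter
    from lzero.worker.muzero_evaluator import MuZeroEvaluator

    # eval_env_manager, eval_policy and policy_config are placeholders created elsewhere.
    tb_logger = SummaryWriter('./default_experiment/log/evaluator')
    evaluator = MuZeroEvaluator(
        eval_freq=1000,
        n_evaluator_episode=3,
        stop_value=int(1e6),
        env=eval_env_manager,
        policy=eval_policy,
        tb_logger=tb_logger,
        exp_name='default_experiment',
        instance_name='evaluator',
        policy_config=policy_config,
    )
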
close() None[source]
Overview:

Close the evaluator and the environment, then flush and close the TensorBoard logger if applicable.

config = {'eval_freq': 50}
classmethod default_config() EasyDict[source]
Overview:

Retrieve the default configuration for the evaluator by merging evaluator-specific defaults with other defaults and any user-provided configuration.

Returns:

The default configuration for the evaluator.

Return type:

  • cfg (EasyDict)

eval(save_ckpt_fn: Callable = None, train_iter: int = -1, envstep: int = -1, n_episode: int | None = None, return_trajectory: bool = False) Tuple[bool, float][source]
Overview:

Evaluate the current policy, storing the best policy if it achieves the highest historical reward.

Parameters:
  • save_ckpt_fn (Callable) – Optional function to save a checkpoint when a new best reward is achieved.

  • train_iter (int) – The current training iteration count.

  • envstep (int) – The current environment step count.

  • n_episode (Optional[int]) – Optional number of evaluation episodes; defaults to the evaluator’s setting.

  • return_trajectory (bool) – Return the evaluated trajectory game_segments in episode_info if True.

Returns:

  • stop_flag (bool) – Indicates whether the training can be stopped based on the stop value.

  • episode_info (Dict[str, Any]) – A dictionary containing information about the evaluation episodes.

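Example:

A single evaluation call, unpacking the two documented return values. This is a sketch: save_checkpoint, learner, and collector are placeholders created elsewhere.

    # Evaluate the current policy; a new best reward triggers save_ckpt_fn.
    stop_flag, episode_info = evaluator.eval(
        save_ckpt_fn=save_checkpoint,
        train_iter=learner.train_iter,
        envstep=collector.envstep,
    )
    if stop_flag:
        print('Stop value reached; training may be terminated.')
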
reset(_policy: namedtuple | None = None, _env: BaseEnvManager | None = None) None[source]
Overview:

Reset both the policy and the environment for the evaluator, optionally replacing them. If _env is None, reset the old environment; otherwise, replace the old environment in the evaluator with the newly passed-in environment and launch it. If _policy is None, reset the old policy; otherwise, replace the old policy in the evaluator with the newly passed-in policy.

Parameters:
  • _policy (Optional[namedtuple]) – An optional new policy namedtuple to replace the existing one.

  • _env (Optional[BaseEnvManager]) – An optional new environment instance to replace the existing one.

reset_env(_env: BaseEnvManager | None = None) None[source]
Overview:

Reset the environment for the evaluator, optionally replacing it with a new environment. If _env is None, reset the old environment; otherwise, replace the old environment in the evaluator with the newly passed-in environment and launch it.

Parameters:

_env (Optional[BaseEnvManager]) – An optional new environment instance to replace the existing one.

reset_policy(_policy: namedtuple | None = None) None[source]
Overview:

Reset the policy for the evaluator, optionally replacing it with a new policy. If _policy is None, reset the old policy; otherwise, replace the old policy in the evaluator with the newly passed-in policy.

Parameters:

_policy (Optional[namedtuple]) – An optional new policy namedtuple to replace the existing one.

should_eval(train_iter: int) bool[source]
Overview:

Determine whether to initiate evaluation based on the training iteration count and evaluation frequency.

Parameters:

train_iter (int) – The current count of training iterations.

Returns:

True if evaluation should be initiated, otherwise False.

Return type:

  • (bool)
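
Example:

Together with eval, should_eval gates how often evaluation runs according to eval_freq inside the main training loop. A sketch, where learner, collector, and max_iterations are placeholders.

    # Periodically evaluate the policy during training.
    for _ in range(max_iterations):
        if evaluator.should_eval(learner.train_iter):
            stop_flag, _ = evaluator.eval(
                train_iter=learner.train_iter,
                envstep=collector.envstep,
            )
            if stop_flag:
                break
        # ... continue collecting data and updating the policy ...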