Worker

MuZeroCollector

class lzero.worker.muzero_collector.MuZeroCollector(collect_print_freq: int = 100, env: BaseEnvManager | None = None, policy: namedtuple | None = None, tb_logger: SummaryWriter = None, exp_name: str = 'default_experiment', instance_name: str = 'collector', policy_config: policy_config = None, task_id: int | None = None)[source]

Bases: ISerialCollector

Overview:

The episode-based collector for MCTS-based reinforcement learning algorithms, including MuZero, EfficientZero, Sampled EfficientZero, and Gumbel MuZero. It orchestrates the data collection process in a serial manner, managing interactions between the policy and the environment to generate game segments for training.

Interfaces:

__init__, reset, reset_env, reset_policy, _reset_stat, collect, _compute_priorities, pad_and_save_last_trajectory, _output_log, close, __del__.

Properties:

envstep.
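
Example:

A minimal usage sketch. The environment manager (collector_env), the policy object, and the policy_config EasyDict are assumed to be built elsewhere (e.g. with DI-engine's env manager and LightZero's policy factory); those names are illustrative, not part of this class.

    from tensorboardX import SummaryWriter
    from lzero.worker.muzero_collector import MuZeroCollector

    tb_logger = SummaryWriter('./log/collector')
    collector = MuZeroCollector(
        env=collector_env,              # a launched BaseEnvManager
        policy=policy.collect_mode,     # the policy's collect-mode namedtuple
        tb_logger=tb_logger,
        exp_name='muzero_demo',
        policy_config=policy_config,    # EasyDict with MCTS/game-segment settings
    )
    # Collect a batch of episodes; `temperature` scales visit-count-based exploration.
    new_data = collector.collect(train_iter=0, policy_kwargs={'temperature': 1.0})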

__init__(collect_print_freq: int = 100, env: BaseEnvManager | None = None, policy: namedtuple | None = None, tb_logger: SummaryWriter = None, exp_name: str = 'default_experiment', instance_name: str = 'collector', policy_config: policy_config = None, task_id: int | None = None) None[source]
Overview:

Initializes the MuZeroCollector with the given configuration.

Parameters:
  • collect_print_freq (int) – The frequency (in training iterations) at which to print collection statistics.

  • env (BaseEnvManager | None) – An instance of a vectorized environment manager.

  • policy (namedtuple | None) – A namedtuple containing the policy’s forward pass and other methods.

  • tb_logger (SummaryWriter) – A TensorBoard logger instance for logging metrics.

  • exp_name (str) – The name of the experiment, used for organizing logs.

  • instance_name (str) – A unique name for this collector instance.

  • policy_config (policy_config) – The configuration object for the policy.

  • task_id (int | None) – The identifier for the current task in a multi-task setting. If None, operates in single-task mode.

_compute_priorities(i: int, pred_values_lst: List[float], search_values_lst: List[float]) ndarray | None[source]
Overview:

Computes priorities for experience replay based on the discrepancy between predicted values and MCTS search values.

Parameters:
  • i (int) – The index of the environment’s data in the lists.

  • pred_values_lst (List[float]) – A list containing lists of predicted values for each environment.

  • search_values_lst (List[float]) – A list containing lists of search values from MCTS for each environment.

Returns:

An array of priorities for the transitions. Returns None if priority is not used.

Return type:

  • priorities (Optional[np.ndarray])
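
Example:

The priority of each transition grows with the distance between the model’s predicted value and the value found by MCTS. A minimal sketch of that idea (the function name and the epsilon constant are illustrative, not the method’s exact implementation):

    import numpy as np

    def compute_priorities_sketch(pred_values, search_values, eps=1e-6):
        """Priority = |predicted value - MCTS search value| + eps."""
        pred = np.asarray(pred_values, dtype=np.float32)
        search = np.asarray(search_values, dtype=np.float32)
        # eps keeps every transition sampleable even when the two values agree exactly.
        return np.abs(pred - search) + eps

    # e.g. pred=[0.5, 0.1], search=[0.7, 0.1]  ->  priorities ≈ [0.2, 1e-6]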

_output_log(train_iter: int) None[source]
Overview:

Aggregates and logs collection statistics to the console, TensorBoard, and WandB. This method is only executed by the rank 0 process in a distributed setup.

Parameters:

train_iter (int) – The current training iteration number, used as the logging step.

_reset_stat(env_id: int) None[source]
Overview:

Resets the statistics for a specific environment, identified by env_id. This is typically called when an episode in that environment ends.

Parameters:

env_id (int) – The ID of the environment to reset statistics for.

close() None[source]
Overview:

Closes the collector, including the environment and any loggers. Ensures that all resources are properly released.

collect(n_episode: int | None = None, train_iter: int = 0, policy_kwargs: Dict | None = None, collect_with_pure_policy: bool = False) List[Any][source]
Overview:

Collects n_episode episodes of data. It manages the entire lifecycle of an episode, from getting actions from the policy and stepping the environment to storing transitions and saving completed game segments.

Parameters:
  • n_episode (int | None) – The number of episodes to collect. If None, uses the default from the policy config.

  • train_iter (int) – The current training iteration, used for logging.

  • policy_kwargs (Dict | None) – Additional keyword arguments to pass to the policy’s forward method, like temperature for exploration.

  • collect_with_pure_policy (bool) – If True, collects data using a pure policy (e.g., greedy action) without MCTS.

Returns:

A list containing the collected game segments and metadata.

Return type:

  • return_data (List[Any])
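
Example:

A call inside a training loop might look as follows (the temperature schedule and the n_episode value are illustrative; `collector` and `train_iter` come from the surrounding pipeline):

    # Anneal the visit-count temperature as training progresses (illustrative schedule).
    temperature = 1.0 if train_iter < 100_000 else 0.5
    new_data = collector.collect(
        n_episode=8,
        train_iter=train_iter,
        policy_kwargs={'temperature': temperature},
    )
    # `new_data` holds the collected game segments plus per-segment metadata
    # (e.g. priorities and done flags), ready to be stored in the replay buffer.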

config = {}
classmethod default_config() EasyDict
Overview:

Get the collector’s default config. The collector’s default config is merged with other default configs and the user’s config to produce the final config.

Returns:

An EasyDict object representing the collector’s default configuration.

Return type:

  • cfg (EasyDict)

property envstep: int
Overview:

Returns the total number of environment steps collected since the last reset.

Returns:

The total environment step count.

Return type:

  • envstep (int)

pad_and_save_last_trajectory(i: int, last_game_segments: List[GameSegment | None], last_game_priorities: List[ndarray | None], game_segments: List[GameSegment], done: ndarray) None[source]
Overview:

Pads the end of the last_game_segment with data from the start of the current game_segment. This is necessary to compute target values for the final transitions of a segment. After padding, the completed segment is stored in the game_segment_pool.

Parameters:
  • i (int) – The index of the environment being processed.

  • last_game_segments (List[GameSegment | None]) – List of game segments from the previous collection chunk.

  • last_game_priorities (List[ndarray | None]) – List of priorities corresponding to the last game segments.

  • game_segments (List[GameSegment]) – List of game segments from the current collection chunk.

  • done (ndarray) – Array indicating if the episode has terminated for each environment.

Note

An implicit assumption is that the start of the new segment’s observation history overlaps with the end of the last segment’s, e.g., (last_game_segments[i].obs_segment[-4:][j] == game_segments[i].obs_segment[:4][j]).all() is True.
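
Example:

Conceptually, the method copies the first few steps of the new segment onto the tail of the finished one so that value targets can still be formed for the segment’s final positions. A simplified illustration of that idea (the array names, pad length, and function name are illustrative, not the class’s actual attributes):

    import numpy as np

    def pad_last_segment_sketch(last_obs, last_rewards, next_obs, next_rewards, pad_len):
        """Append the head of the next segment to the tail of the finished one."""
        padded_obs = np.concatenate([last_obs, next_obs[:pad_len]])
        padded_rewards = np.concatenate([last_rewards, next_rewards[:pad_len]])
        return padded_obs, padded_rewards

    # Per the note above, the first frames of `next_obs` overlap with the tail of
    # `last_obs`; a real implementation skips that overlap when slicing observations.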

reset(_policy: namedtuple | None = None, _env: BaseEnvManager | None = None) None[source]
Overview:

Resets the collector, including the environment and policy. Also re-initializes internal state variables for tracking collection progress.

Parameters:
  • _policy (namedtuple | None) – The new policy to use.

  • _env (BaseEnvManager | None) – The new environment to use.

reset_env(_env: BaseEnvManager | None = None) None[source]
Overview:

Resets or replaces the environment managed by the collector. If _env is None, it resets the existing environment. Otherwise, it replaces the old environment with the new one and launches it.

Parameters:

_env (BaseEnvManager | None) – The new environment to be used. If None, resets the current environment.

reset_policy(_policy: namedtuple | None = None) None[source]
Overview:

Resets or replaces the policy used by the collector. If _policy is None, it resets the existing policy. Otherwise, it replaces the old policy with the new one.

Parameters:

_policy (namedtuple | None) – The new policy to be used.

MuZeroEvaluator

class lzero.worker.muzero_evaluator.MuZeroEvaluator(eval_freq: int = 1000, n_evaluator_episode: int = 3, stop_value: float = 1000000.0, env: BaseEnvManager | None = None, policy: namedtuple | None = None, tb_logger: SummaryWriter | None = None, exp_name: str = 'default_experiment', instance_name: str = 'evaluator', policy_config: EasyDict | None = None, task_id: int | None = None)[source]

Bases: ISerialEvaluator

Overview:

The Evaluator for MCTS-based reinforcement learning algorithms, such as MuZero, EfficientZero, and Sampled EfficientZero.

Interfaces:

__init__, reset, reset_policy, reset_env, close, should_eval, eval

Properties:

env, policy
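
Example:

A minimal construction sketch. As with the collector, the environment manager (evaluator_env), the policy object, and policy_config are assumed to be built elsewhere; those names are illustrative.

    from tensorboardX import SummaryWriter
    from lzero.worker.muzero_evaluator import MuZeroEvaluator

    evaluator = MuZeroEvaluator(
        eval_freq=1000,                  # evaluate every 1000 training iterations
        n_evaluator_episode=3,           # episodes per evaluation round
        stop_value=1e6,                  # reward threshold that ends training
        env=evaluator_env,               # a launched BaseEnvManager
        policy=policy.eval_mode,         # the policy's eval-mode namedtuple
        tb_logger=SummaryWriter('./log/evaluator'),
        exp_name='muzero_demo',
        policy_config=policy_config,
    )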

__init__(eval_freq: int = 1000, n_evaluator_episode: int = 3, stop_value: float = 1000000.0, env: BaseEnvManager | None = None, policy: namedtuple | None = None, tb_logger: SummaryWriter | None = None, exp_name: str = 'default_experiment', instance_name: str = 'evaluator', policy_config: EasyDict | None = None, task_id: int | None = None) None[source]
Overview:

Initializes the MuZeroEvaluator. This evaluator is compatible with MuZero, Sampled MuZero, Gumbel MuZero, EfficientZero, UniZero, and Sampled UniZero (i.e., all algorithms except AlphaZero).

Parameters:
  • eval_freq (int) – The frequency, in training iterations, at which to run evaluation.

  • n_evaluator_episode (int) – The total number of episodes to run during each evaluation.

  • stop_value (float) – The reward threshold at which training is considered converged and will stop.

  • env (BaseEnvManager | None) – An optional environment manager for evaluation.

  • policy (namedtuple | None) – An optional policy for evaluation.

  • tb_logger (SummaryWriter | None) – An optional TensorBoard logger.

  • exp_name (str) – The name of the experiment, used for logging.

  • instance_name (str) – The name of this evaluator instance.

  • policy_config (EasyDict | None) – Configuration for the policy.

  • task_id (int | None) – The unique identifier for the task. If None, the evaluator operates in single-task mode. In a multi-task setting, each task corresponds to a specific evaluator instance.

close() None[source]
Overview:

Close the evaluator, including the environment and the TensorBoard logger.

config = {'eval_freq': 5000}
classmethod default_config() EasyDict[source]
Overview:

Get the default configuration of the MuZeroEvaluator.

Returns:

An EasyDict object representing the default configuration.

Return type:

  • cfg (EasyDict)
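
Example:

The defaults can be fetched and selectively overridden before wiring values into the constructor (the class-level config above supplies eval_freq=5000 as a default):

    cfg = MuZeroEvaluator.default_config()   # EasyDict, e.g. {'eval_freq': 5000}
    cfg.eval_freq = 1000                     # evaluate more often than the default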

eval(save_ckpt_fn: Callable | None = None, train_iter: int = -1, envstep: int = -1, n_episode: int | None = None, return_trajectory: bool = False) Tuple[bool, Dict[str, Any]][source]
Overview:

Run a full evaluation process. It will evaluate the current policy, log the results, and save a checkpoint if a new best performance is achieved.

Parameters:
  • save_ckpt_fn (Callable | None) – A function to save a checkpoint. Called when a new best reward is achieved.

  • train_iter (int) – The current training iteration.

  • envstep (int) – The current total environment steps.

  • n_episode (int | None) – The number of episodes to evaluate. Defaults to the value set in __init__.

  • return_trajectory (bool) – Whether to return the collected game_segments in the result dictionary.

Returns:

  • stop_flag (bool) – A flag indicating whether the training should stop (e.g., if the stop value is reached).

  • episode_info (Dict[str, Any]) – A dictionary containing evaluation results, such as rewards and episode lengths.

Return type:

  • Tuple[bool, Dict[str, Any]]
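
Example:

Typical use inside a serial training loop, gated by should_eval (the learner and collector objects, and the checkpoint-saving callable, are assumed to come from the surrounding pipeline):

    if evaluator.should_eval(learner.train_iter):
        stop_flag, episode_info = evaluator.eval(
            save_ckpt_fn=learner.save_checkpoint,   # called when a new best reward is reached
            train_iter=learner.train_iter,
            envstep=collector.envstep,
        )
        if stop_flag:
            break   # stop_value reached; end the training loop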

reset(_policy: namedtuple | None = None, _env: BaseEnvManager | None = None) None[source]
Overview:

Reset both the policy and the environment.

Parameters:
  • _policy (namedtuple | None) – New policy to use.

  • _env (BaseEnvManager | None) – New environment manager to use.

reset_env(_env: BaseEnvManager | None = None) None[source]
Overview:

Reset the environment. If a new environment is provided, it replaces the old one.

Parameters:

_env (BaseEnvManager | None) – New environment manager to use. If None, resets the existing environment.

reset_policy(_policy: namedtuple | None = None) None[source]
Overview:

Reset the policy. If a new policy is provided, it replaces the old one.

Parameters:

_policy (namedtuple | None) – New policy to use. If None, resets the existing policy.

should_eval(train_iter: int) bool[source]
Overview:

Determine whether it’s time to run an evaluation based on the training iteration.

Parameters:

train_iter (int) – The current training iteration.

Returns:

True if evaluation should be run, otherwise False.

Return type:

  • (bool)