Worker
MuZeroCollector
- class lzero.worker.muzero_collector.MuZeroCollector(collect_print_freq: int = 100, env: BaseEnvManager = None, policy: namedtuple = None, tb_logger: SummaryWriter = None, exp_name: str | None = 'default_experiment', instance_name: str | None = 'collector', policy_config: policy_config = None)[source]
Bases:
ISerialCollector
- Overview:
The episode collector for MCTS+RL algorithms, including MuZero, EfficientZero, Sampled EfficientZero, and Gumbel MuZero. It manages the data collection process for training these algorithms using a serial mechanism.
- Interfaces:
__init__, reset, reset_env, reset_policy, _reset_stat, envstep, __del__, _compute_priorities, pad_and_save_last_trajectory, collect, _output_log, close
- Properties:
envstep
- __init__(collect_print_freq: int = 100, env: BaseEnvManager = None, policy: namedtuple = None, tb_logger: SummaryWriter = None, exp_name: str | None = 'default_experiment', instance_name: str | None = 'collector', policy_config: policy_config = None) None [source]
- Overview:
Initialize the MuZeroCollector with the given parameters.
- Parameters:
collect_print_freq (-) – Frequency (in training steps) at which to print collection information.
env (-) – Instance of the subclass of vectorized environment manager.
policy (-) – namedtuple of the collection mode policy API.
tb_logger (-) – TensorBoard logger instance.
exp_name (-) – Name of the experiment, used for logging and saving purposes.
instance_name (-) – Unique identifier for this collector instance.
policy_config (-) – Configuration object for the policy.
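- Example:
A minimal construction sketch, assuming a DI-engine BaseEnvManager and an already-created MuZero policy. The names cfg, env_fns and policy are placeholders for a full LightZero experiment config, a list of environment constructors, and the policy object; they are not defined here.

```python
from tensorboardX import SummaryWriter
from ding.envs import BaseEnvManager
from lzero.worker.muzero_collector import MuZeroCollector

# Placeholders (not defined here): a full LightZero experiment config, a list
# of callables that each build one environment, and a created MuZero policy.
cfg, env_fns, policy = ..., ..., ...

collector_env = BaseEnvManager(env_fn=env_fns, cfg=cfg.env.manager)
collector_env.seed(0)

tb_logger = SummaryWriter(f'./{cfg.exp_name}/log/serial')
collector = MuZeroCollector(
    env=collector_env,
    policy=policy.collect_mode,   # the collect-mode namedtuple API
    tb_logger=tb_logger,
    exp_name=cfg.exp_name,
    policy_config=cfg.policy,
)
```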
- _compute_priorities(i: int, pred_values_lst: List[float], search_values_lst: List[float]) ndarray [source]
- Overview:
Compute the priorities for transitions based on prediction and search value discrepancies.
- Parameters:
i (-) – Index of the values in the list to compute the priority for.
pred_values_lst (-) – List of predicted values.
search_values_lst (-) – List of search values obtained from MCTS.
- Returns:
Array of computed priorities.
- Return type:
priorities (np.ndarray)
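- Example:
The priority of a transition grows with the discrepancy between the value predicted by the network and the value returned by MCTS. The following is an illustrative numpy sketch of that idea only; the actual implementation may differ in its details (e.g. the exact distance, the epsilon, or a use_priority switch).

```python
import numpy as np
from typing import List

def compute_priorities_sketch(i: int,
                              pred_values_lst: List[List[float]],
                              search_values_lst: List[List[float]]) -> np.ndarray:
    """Illustrative only: priorities as the absolute gap between predicted
    values and MCTS search values for environment index `i`."""
    pred_values = np.asarray(pred_values_lst[i], dtype=np.float32)
    search_values = np.asarray(search_values_lst[i], dtype=np.float32)
    # Larger disagreement between prediction and search => higher replay priority.
    # A small constant keeps every transition sampleable.
    return np.abs(pred_values - search_values) + 1e-6

priorities = compute_priorities_sketch(0, [[0.1, 0.4, 0.3]], [[0.2, 0.1, 0.3]])
print(priorities)  # approximately [0.1, 0.3, 0.000001]
```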
- _output_log(train_iter: int) None [source]
- Overview:
Log the collector’s data and output the log information.
- Parameters:
train_iter (-) – Current training iteration number for logging context.
- _reset_stat(env_id: int) None [source]
- Overview:
Reset the collector's state for the given environment, including its traj_buffer, obs_pool, policy_output_pool and env_info, according to env_id. Refer to base_serial_collector for more details.
- Parameters:
env_id (-) – The id of the environment whose collector state should be reset.
- close() None [source]
- Overview:
Close the collector. If end_flag is False, close the environment, then flush and close the tb_logger.
- collect(n_episode: int | None = None, train_iter: int = 0, policy_kwargs: dict | None = None, collect_with_pure_policy: bool = False) List[Any] [source]
- Overview:
Collect n_episode episodes of data using the given policy_kwargs, with a policy that has been trained for train_iter iterations.
- Parameters:
n_episode (-) – Number of episodes to collect.
train_iter (-) – Number of training iterations completed so far.
policy_kwargs (-) – Additional keyword arguments for the policy.
collect_with_pure_policy (-) – Whether to collect data using pure policy without MCTS.
- Returns:
Collected data in the form of a list.
- Return type:
return_data (List[Any])
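- Example:
A hedged usage sketch of a single collection step, building on the construction sketch above. The replay_buffer object and its push_game_segments method are assumptions standing in for whichever buffer the surrounding training entry point uses.

```python
# `collector` comes from the construction sketch above; `learner` and
# `replay_buffer` are placeholders from the surrounding training pipeline.
new_data = collector.collect(
    n_episode=8,
    train_iter=learner.train_iter,           # current training iteration (placeholder)
    policy_kwargs={'temperature': 1.0},      # assumed keyword understood by the policy
)
replay_buffer.push_game_segments(new_data)   # hypothetical buffer method
```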
- config = {}
- classmethod default_config() EasyDict
- Overview:
Get the collector's default config. The collector's default config is merged with other default configs and the user's config to produce the final config.
- Returns:
The collector's default config.
- Return type:
cfg (EasyDict)
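- Example:
A minimal usage sketch (assuming LightZero is installed):

```python
from lzero.worker.muzero_collector import MuZeroCollector

cfg = MuZeroCollector.default_config()
print(cfg)  # an EasyDict with the collector's default settings
```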
- property envstep: int
- Overview:
Get the total number of environment steps collected.
- Returns:
Total number of environment steps collected.
- Return type:
envstep (int)
- pad_and_save_last_trajectory(i: int, last_game_segments: List[GameSegment], last_game_priorities: List[ndarray], game_segments: List[GameSegment], done: ndarray) None [source]
- Overview:
Save the game segment to the pool if the current game is finished, padding it if necessary.
- Parameters:
i (-) – Index of the current game segment.
last_game_segments (-) – List of the last game segments to be padded and saved.
last_game_priorities (-) – List of priorities of the last game segments.
game_segments (-) – List of the current game segments.
done (-) – Array indicating whether each game is done.
Note
Consecutive game segments overlap, i.e. (last_game_segments[i].obs_segment[-4:][j] == game_segments[i].obs_segment[:4][j]).all() is True.
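- Example:
A small self-contained numpy illustration of the overlap invariant stated in the note; the overlap length of 4 mirrors the note, and in practice it depends on the configured observation stack.

```python
import numpy as np

overlap = 4  # e.g. the number of stacked observations
# A toy stream of 2-D "observations" split into two consecutive segments.
stream = np.arange(20).reshape(10, 2)
last_obs_segment = stream[:7]            # previous (finished) segment
curr_obs_segment = stream[7 - overlap:]  # next segment starts with the overlap

# The invariant from the note: the last `overlap` observations of the previous
# segment equal the first `overlap` observations of the current one.
assert (last_obs_segment[-overlap:] == curr_obs_segment[:overlap]).all()
```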
- reset(_policy: namedtuple | None = None, _env: BaseEnvManager | None = None) None [source]
- Overview:
Reset the collector with the given policy and/or environment. If _env is None, reset the old environment. If _env is not None, replace the old environment in the collector with the newly passed-in environment and launch it. If _policy is None, reset the old policy. If _policy is not None, replace the old policy in the collector with the newly passed-in policy.
- Parameters:
_policy (-) – The API namedtuple of the collect_mode policy.
_env (-) – Instance of a subclass of the vectorized env manager (BaseEnvManager).
- reset_env(_env: BaseEnvManager | None = None) None [source]
- Overview:
Reset or replace the environment managed by this collector. If _env is None, reset the old environment. If _env is not None, replace the old environment in the collector with the newly passed-in environment and launch it.
- Parameters:
_env (-) – New environment to manage, if provided.
- reset_policy(_policy: namedtuple | None = None) None [source]
- Overview:
Reset or replace the policy used by this collector. If _policy is None, reset the old policy. If _policy is not None, replace the old policy in the collector with the newly passed-in policy.
- Parameters:
_policy (-) – The API namedtuple of the collect_mode policy.
MuZeroEvaluator
- class lzero.worker.muzero_evaluator.MuZeroEvaluator(eval_freq: int = 1000, n_evaluator_episode: int = 3, stop_value: int = 1000000.0, env: BaseEnvManager = None, policy: namedtuple = None, tb_logger: SummaryWriter = None, exp_name: str | None = 'default_experiment', instance_name: str | None = 'evaluator', policy_config: policy_config = None)[source]
Bases:
ISerialEvaluator
- Overview:
The Evaluator class for MCTS+RL algorithms, such as MuZero, EfficientZero, and Sampled EfficientZero.
- Interfaces:
__init__, reset, reset_policy, reset_env, close, should_eval, eval
- Properties:
env, policy
- __init__(eval_freq: int = 1000, n_evaluator_episode: int = 3, stop_value: int = 1000000.0, env: BaseEnvManager = None, policy: namedtuple = None, tb_logger: SummaryWriter = None, exp_name: str | None = 'default_experiment', instance_name: str | None = 'evaluator', policy_config: policy_config = None) None [source]
- Overview:
Initialize the evaluator with configuration settings for various components such as logger helper and timer.
- Parameters:
eval_freq (-) – Evaluation frequency in terms of training steps.
n_evaluator_episode (-) – Number of episodes to evaluate in total.
stop_value (-) – A reward threshold above which the training is considered converged.
env (-) – An optional instance of a subclass of BaseEnvManager.
policy (-) – An optional API namedtuple defining the policy for evaluation.
tb_logger (-) – Optional TensorBoard logger instance.
exp_name (-) – Name of the experiment, used to determine output directory.
instance_name (-) – Name of this evaluator instance.
policy_config (-) – Optional configuration for the game policy.
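- Example:
A minimal construction sketch, analogous to the collector example above; cfg, env_fns and policy are placeholders for a full LightZero experiment config, environment constructors, and a created MuZero policy, and the config field names used below are assumptions.

```python
from tensorboardX import SummaryWriter
from ding.envs import BaseEnvManager
from lzero.worker.muzero_evaluator import MuZeroEvaluator

# Placeholders (not defined here).
cfg, env_fns, policy = ..., ..., ...

evaluator_env = BaseEnvManager(env_fn=env_fns, cfg=cfg.env.manager)
evaluator_env.seed(0, dynamic_seed=False)    # fixed seeds for reproducible evaluation

evaluator = MuZeroEvaluator(
    eval_freq=cfg.policy.eval_freq,          # assumed config field names
    n_evaluator_episode=cfg.env.n_evaluator_episode,
    stop_value=cfg.env.stop_value,
    env=evaluator_env,
    policy=policy.eval_mode,                 # the eval-mode namedtuple API
    tb_logger=SummaryWriter(f'./{cfg.exp_name}/log/serial'),
    exp_name=cfg.exp_name,
    policy_config=cfg.policy,
)
```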
- close() None [source]
- Overview:
Close the evaluator and its environment, then flush and close the TensorBoard logger if applicable.
- config = {'eval_freq': 50}
- classmethod default_config() EasyDict [source]
- Overview:
Retrieve the default configuration for the evaluator by merging evaluator-specific defaults with other defaults and any user-provided configuration.
- Returns:
The default configuration for the evaluator.
- Return type:
cfg (EasyDict)
- eval(save_ckpt_fn: Callable = None, train_iter: int = -1, envstep: int = -1, n_episode: int | None = None, return_trajectory: bool = False) Tuple[bool, float] [source]
- Overview:
Evaluate the current policy, storing the best policy if it achieves the highest historical reward.
- Parameters:
save_ckpt_fn (-) – Optional function to save a checkpoint when a new best reward is achieved.
train_iter (-) – The current training iteration count.
envstep (-) – The current environment step count.
n_episode (-) – Optional number of evaluation episodes; defaults to the evaluator’s setting.
return_trajectory (-) – If True, return the evaluated trajectories' game_segments in episode_info.
- Returns:
stop_flag: Indicates whether the training can be stopped based on the stop value.
episode_info (Dict[str, Any]): A dictionary containing information about the evaluation episodes.
- Return type:
stop_flag (bool)
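- Example:
A hedged usage sketch of a single evaluation call, building on the construction sketch above; learner and collector are placeholders from the surrounding pipeline, and save_checkpoint stands in for whatever checkpoint hook it provides.

```python
stop_flag, episode_info = evaluator.eval(
    save_ckpt_fn=learner.save_checkpoint,    # assumed checkpoint hook
    train_iter=learner.train_iter,
    envstep=collector.envstep,
    return_trajectory=True,                  # also return the evaluated game_segments
)
if stop_flag:
    print('Evaluation reward reached stop_value; training can stop.')
```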
- reset(_policy: namedtuple | None = None, _env: BaseEnvManager | None = None) None [source]
- Overview:
Reset both the policy and environment for the evaluator, optionally replacing them. If _env is None, reset the old environment. If _env is not None, replace the old environment in the evaluator with the newly passed-in environment and launch it. If _policy is None, reset the old policy. If _policy is not None, replace the old policy in the evaluator with the newly passed-in policy.
- Parameters:
_policy (-) – An optional new policy namedtuple to replace the existing one.
_env (-) – An optional new environment instance to replace the existing one.
- reset_env(_env: BaseEnvManager | None = None) None [source]
- Overview:
Reset the environment for the evaluator, optionally replacing it with a new environment. If _env is None, reset the old environment. If _env is not None, replace the old environment in the evaluator with the newly passed-in environment and launch it.
- Parameters:
_env (-) – An optional new environment instance to replace the existing one.
- reset_policy(_policy: namedtuple | None = None) None [source]
- Overview:
Reset the policy for the evaluator, optionally replacing it with a new policy. If _policy is None, reset the old policy. If _policy is not None, replace the old policy in the evaluator with the newly passed-in policy.
- Parameters:
_policy (-) – An optional new policy namedtuple to replace the existing one.
- should_eval(train_iter: int) bool [source]
- Overview:
Determine whether to initiate evaluation based on the training iteration count and evaluation frequency.
- Parameters:
train_iter (-) – The current count of training iterations.
- Returns:
True if evaluation should be initiated, otherwise False.
- Return type:
(bool)