Red¶
red_irl_model¶
RedRewardModel¶
- class ding.reward_model.red_irl_model.RedRewardModel(config: Dict, device: str, tb_logger: SummaryWriter)[source]¶
- Overview:
The implement of reward model in RED (https://arxiv.org/abs/1905.06750)
- Interface:
estimate, train, load_expert_data, collect_data, clear_data, __init__, _train
- Config:

| ID | Symbol                 | Type | Default Value   | Description                                                          | Other (Shape)                                                             |
|----|------------------------|------|-----------------|----------------------------------------------------------------------|---------------------------------------------------------------------------|
| 1  | type                   | str  | red             | Reward model register name, refer to registry REWARD_MODEL_REGISTRY  |                                                                           |
| 2  | expert_data_path       | str  | expert_data.pkl | Path to the expert dataset                                           | Should be a '.pkl' file                                                   |
| 3  | sample_size            | int  | 1000            | Sample data from expert dataset with fixed size                      |                                                                           |
| 4  | sigma                  | int  | 5               | Hyperparameter of r(s,a)                                             | r(s,a) = exp(-sigma * L(s,a))                                             |
| 5  | batch_size             | int  | 64              | Training batch size                                                  |                                                                           |
| 6  | hidden_size            | int  | 128             | Linear model hidden size                                             |                                                                           |
| 7  | update_per_collect     | int  | 100             | Number of updates per collect                                        |                                                                           |
| 8  | clear_buffer_per_iters | int  | 1               | Clear buffer per fixed iters                                         | Make sure the replay buffer's data count isn't too few (code works in entry) |

- Properties:
online_net (:obj: SENet): The reward model, by default initialized once when training begins.
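The reward formula in the table above can be illustrated with a short sketch. The config dict below mirrors the table's default values (the exact nesting inside a full DI-engine config may differ), and `red_reward` is a hypothetical helper showing how r(s, a) = exp(-sigma * L(s, a)) maps the support network's prediction error L(s, a) to a reward in (0, 1]:

```python
import math

# Config mirroring the table defaults above (key names follow the table;
# the exact layout inside a full DI-engine config may differ).
red_config = dict(
    type='red',
    expert_data_path='expert_data.pkl',
    sample_size=1000,
    sigma=5,
    batch_size=64,
    hidden_size=128,
    update_per_collect=100,
    clear_buffer_per_iters=1,
)

def red_reward(support_loss: float, sigma: float) -> float:
    """RED reward: r(s, a) = exp(-sigma * L(s, a)), where L(s, a) is the
    prediction error of the trained support network on (s, a)."""
    return math.exp(-sigma * support_loss)
```

Transitions close to the expert data have a small loss, so their reward approaches 1; out-of-distribution transitions with a large loss get a reward near 0. A larger `sigma` makes this decay sharper.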
- __init__(config: Dict, device: str, tb_logger: SummaryWriter) None [source]¶
- Overview:
Initialize self. See help(type(self)) for accurate signature.
- Arguments:
cfg (Dict): Training config
device (str): Device usage, i.e. "cpu" or "cuda"
tb_logger (SummaryWriter): Logger, by default set as SummaryWriter for model summary
- clear_data()[source]¶
- Overview:
Clear collected training data. Not implemented here, since by default the reward model (i.e. online_net) is trained only once; if online_net is trained continuously, clear_data should be implemented accordingly.
- collect_data(data) None [source]¶
- Overview:
Collect training data. Not implemented here, since by default the reward model (i.e. online_net) is trained only once; if online_net is trained continuously, collect_data should be implemented accordingly.
- estimate(data: list) List[Dict] [source]¶
- Overview:
Estimate reward by rewriting the reward key
- Arguments:
data (list): the list of data used for estimation, with at least obs and action keys.
- Effects:
This is a side effect function which updates the reward values in place.
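The in-place rewrite can be sketched as follows. This is a simplified stand-in, not the actual implementation: the real model computes the per-sample loss with the trained online_net from each item's `obs` and `action`, whereas here a hypothetical precomputed `support_loss` field plays that role:

```python
import math
from typing import Dict, List

def estimate_rewards(data: List[Dict], sigma: float = 5.0) -> None:
    """Sketch of estimate's side effect: overwrite each item's 'reward'
    key in place with exp(-sigma * L(s, a)). The hypothetical
    'support_loss' field stands in for the online_net's prediction error."""
    for item in data:
        item['reward'] = math.exp(-sigma * item['support_loss'])
```

Because the rewrite happens in place, the environment reward stored in each transition is replaced, and downstream RL training consumes the estimated reward directly.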