Red
red_irl_model
RedRewardModel
- class ding.reward_model.red_irl_model.RedRewardModel(config: Dict, device: str, tb_logger: SummaryWriter)[source]
- Overview:
The implementation of the reward model in RED (Random Expert Distillation, https://arxiv.org/abs/1905.06750).
- Interface:
estimate, train, load_expert_data, collect_data, clear_data, __init__, _train
- Config:
| ID | Symbol | Type | Default Value | Description | Other (Shape) |
|----|--------|------|---------------|-------------|---------------|
| 1 | type | str | red | Reward model register name, refer to registry REWARD_MODEL_REGISTRY | |
| 2 | expert_data_path | str | expert_data.pkl | Path to the expert dataset | Should be a '.pkl' file |
| 3 | sample_size | int | 1000 | Sample data from the expert dataset with a fixed size | |
| 4 | sigma | int | 5 | Hyperparameter of r(s,a) | r(s,a) = exp(-sigma * L(s,a)); see the sketch below the Properties list |
| 5 | batch_size | int | 64 | Training batch size | |
| 6 | hidden_size | int | 128 | Linear model hidden size | |
| 7 | update_per_collect | int | 100 | Number of updates per collect | |
| 8 | clear_buffer_per_iters | int | 1 | Clear the buffer every fixed number of iterations | Makes sure the replay buffer's data count isn't too small (handled in the entry code) |
- Properties:
online_net (SENet): The reward model, by default initialized once when training begins.
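To make the sigma entry in the table concrete, below is a minimal sketch of the RED reward shaping r(s, a) = exp(-sigma * L(s, a)), where L(s, a) is the prediction loss of a trained predictor network against a fixed, randomly initialized target network. The names red_reward, predictor, and target are illustrative, not the class's actual attributes.

```python
import torch

# Illustrative sketch of r(s, a) = exp(-sigma * L(s, a)); `predictor` and
# `target` are hypothetical networks, not part of the library's API.
def red_reward(predictor: torch.nn.Module, target: torch.nn.Module,
               obs_action: torch.Tensor, sigma: float = 5.0) -> torch.Tensor:
    with torch.no_grad():
        # L(s, a): per-sample squared error between the trained predictor and
        # the frozen random target (low on expert-like state-action pairs).
        loss = ((predictor(obs_action) - target(obs_action)) ** 2).mean(dim=-1)
        # exp(-sigma * L) maps a low loss to a reward close to 1.
        return torch.exp(-sigma * loss)
```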
- __init__(config: Dict, device: str, tb_logger: SummaryWriter) → None[source]
- Overview:
Initialize self. See help(type(self)) for an accurate signature.
- Arguments:
cfg (Dict): Training config
device (str): Device usage, i.e. "cpu" or "cuda"
tb_logger (SummaryWriter): Logger, by default set as SummaryWriter for model summary
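A minimal construction sketch, assuming the default keys from the config table above. The hyperparameter values are the table defaults; the real class may require additional keys (e.g. observation/action sizes) not shown here, so treat this as a sketch rather than a recipe.

```python
from easydict import EasyDict
from tensorboardX import SummaryWriter

from ding.reward_model.red_irl_model import RedRewardModel

# Hypothetical config mirroring the defaults in the table above; the exact
# set of required keys may differ in the actual source.
cfg = EasyDict(dict(
    type='red',
    expert_data_path='expert_data.pkl',
    sample_size=1000,
    sigma=5,
    batch_size=64,
    hidden_size=128,
    update_per_collect=100,
    clear_buffer_per_iters=1,
))
reward_model = RedRewardModel(cfg, device='cpu', tb_logger=SummaryWriter('./log'))
```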
- clear_data()[source]
- Overview:
Clearing collected data. Not implemented if the reward model (i.e. online_net) is only trained once; if online_net is trained continuously, this method should be implemented (see the subclass sketch after collect_data below).
- collect_data(data) → None[source]
- Overview:
Collecting training data. Not implemented if the reward model (i.e. online_net) is only trained once; if online_net is trained continuously, this method should be implemented.
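A hypothetical subclass sketch of what continuous training could look like: collect_data buffers incoming transitions and clear_data drops them. The class name and buffer layout are illustrative only, not part of the library.

```python
from typing import Dict, List

from ding.reward_model.red_irl_model import RedRewardModel

# Hypothetical subclass: names and buffer layout are illustrative only.
class ContinualRedRewardModel(RedRewardModel):
    def __init__(self, config: Dict, device: str, tb_logger) -> None:
        super().__init__(config, device, tb_logger)
        self._buffer: List[dict] = []

    def collect_data(self, data: list) -> None:
        # Accumulate freshly collected transitions for later training passes.
        self._buffer.extend(data)

    def clear_data(self) -> None:
        # Drop stale transitions, e.g. every `clear_buffer_per_iters` iterations.
        self._buffer.clear()
```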
- estimate(data: list) → List[Dict][source]
- Overview:
Estimate the reward by rewriting the reward key.
- Arguments:
data (list): the list of data used for estimation, with at least obs and action keys.
- Effects:
This is a side-effect function which updates the reward values in place.
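A minimal usage sketch, assuming reward_model is an initialized RedRewardModel (as in the __init__ example above) and that transitions come from a collector; the tensor shapes below are placeholders.

```python
import torch

# Assumed: `reward_model` is an initialized RedRewardModel (see __init__ above).
# Each transition dict must carry at least 'obs' and 'action' keys.
transitions = [
    {'obs': torch.rand(4), 'action': torch.tensor(0), 'reward': torch.zeros(1)}
    for _ in range(8)
]
reward_model.estimate(transitions)
# estimate rewrites each dict's 'reward' value in place with the RED reward,
# so downstream policy updates consume the shaped reward transparently.
```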