Red¶
red_irl_model¶
RedRewardModel¶
- class ding.reward_model.red_irl_model.RedRewardModel(config: Dict, device: str, tb_logger: SummaryWriter)[source]¶
- Overview:
The implement of reward model in RED (https://arxiv.org/abs/1905.06750)
- Interface:
estimate, train, load_expert_data, collect_data, clear_data, __init__, _train
- Config:

| ID | Symbol                 | Type | Default Value   | Description                                                          | Other (Shape)                                                             |
|----|------------------------|------|-----------------|----------------------------------------------------------------------|---------------------------------------------------------------------------|
| 1  | type                   | str  | red             | Reward model register name, refer to registry REWARD_MODEL_REGISTRY  |                                                                           |
| 2  | expert_data_path       | str  | expert_data.pkl | Path to the expert dataset                                           | Should be a '.pkl' file                                                   |
| 3  | sample_size            | int  | 1000            | Sample data from expert dataset with fixed size                      |                                                                           |
| 4  | sigma                  | int  | 5               | Hyperparameter of r(s,a)                                             | r(s,a) = exp(-sigma * L(s,a))                                             |
| 5  | batch_size             | int  | 64              | Training batch size                                                  |                                                                           |
| 6  | hidden_size            | int  | 128             | Linear model hidden size                                             |                                                                           |
| 7  | update_per_collect     | int  | 100             | Number of updates per collect                                        |                                                                           |
| 8  | clear_buffer_per_iters | int  | 1               | Clear buffer per fixed iters                                         | Make sure the replay buffer's data count isn't too few (code works in entry) |

- Properties:
online_net (:obj: SENet): The reward model, by default initialized once when training begins.
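The reward formula in the table above can be illustrated with a short sketch. The config dict below mirrors the table's default values (the exact nesting inside a full DI-engine config may differ), and `red_reward` is a hypothetical helper showing how r(s, a) = exp(-sigma * L(s, a)) maps the support network's prediction error L(s, a) to a reward in (0, 1]:

```python
import math

# Config mirroring the table defaults above (key names follow the table;
# the exact layout inside a full DI-engine config may differ).
red_config = dict(
    type='red',
    expert_data_path='expert_data.pkl',
    sample_size=1000,
    sigma=5,
    batch_size=64,
    hidden_size=128,
    update_per_collect=100,
    clear_buffer_per_iters=1,
)

def red_reward(support_loss: float, sigma: float) -> float:
    """RED reward: r(s, a) = exp(-sigma * L(s, a)), where L(s, a) is the
    prediction error of the trained support network on (s, a)."""
    return math.exp(-sigma * support_loss)
```

Transitions close to the expert data have a small loss, so their reward approaches 1; out-of-distribution transitions with a large loss get a reward near 0. A larger `sigma` makes this decay sharper.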
- __init__(config: Dict, device: str, tb_logger: SummaryWriter) None [source]¶
- Overview:
Initialize self. See help(type(self)) for accurate signature.
- Arguments:
cfg (Dict): Training config
device (str): Device usage, i.e. "cpu" or "cuda"
tb_logger (SummaryWriter): Logger, by default set as SummaryWriter for model summary
- clear_data()[source]¶
- Overview:
Clear collected training data. Not implemented here, since by default the reward model (i.e. online_net) is trained only once; if online_net is trained continuously, clear_data should be implemented accordingly.
- collect_data(data) None [source]¶
- Overview:
Collect training data. Not implemented here, since by default the reward model (i.e. online_net) is trained only once; if online_net is trained continuously, collect_data should be implemented accordingly.
- estimate(data: list) List[Dict] [source]¶
- Overview:
Estimate reward by rewriting the reward key
- Arguments:
data (list): the list of data used for estimation, with at least obs and action keys.
- Effects:
This is a side effect function which updates the reward values in place.
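The in-place rewrite can be sketched as follows. This is a simplified stand-in, not the actual implementation: the real model computes the per-sample loss with the trained online_net from each item's `obs` and `action`, whereas here a hypothetical precomputed `support_loss` field plays that role:

```python
import math
from typing import Dict, List

def estimate_rewards(data: List[Dict], sigma: float = 5.0) -> None:
    """Sketch of estimate's side effect: overwrite each item's 'reward'
    key in place with exp(-sigma * L(s, a)). The hypothetical
    'support_loss' field stands in for the online_net's prediction error."""
    for item in data:
        item['reward'] = math.exp(-sigma * item['support_loss'])
```

Because the rewrite happens in place, the environment reward stored in each transition is replaced, and downstream RL training consumes the estimated reward directly.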