Pdeil¶
pdeil_irl_model¶
PdeilRewardModel¶
- class ding.reward_model.pdeil_irl_model.PdeilRewardModel(cfg: dict, device, tb_logger: SummaryWriter)[source]¶
- Overview:
The Pdeil reward model class (https://arxiv.org/abs/2112.06746)
- Interface:
estimate, train, load_expert_data, collect_data, clear_data, __init__, _train, _batch_mn_pdf
- Config:
ID | Symbol | Type | Default Value | Description | Other(Shape)
1 | type | str | pdeil | Reward model register name, refer to registry REWARD_MODEL_REGISTRY |
2 | expert_data_path | str | expert_data.pkl | Path to the expert dataset | Should be a '.pkl' file
3 | discrete_action | bool | False | Whether the action is discrete |
4 | alpha | float | 0.5 | Coefficient for the probability density estimator |
5 | clear_buffer_per_iters | int | 1 | Clear buffer per fixed iters | Make sure the replay buffer's data count isn't too few (handled in the entry code)
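The table above maps directly onto a plain Python config dict. A minimal sketch, assuming the default values listed above; the key names follow the Symbol column and would normally be merged into the entry config of your pipeline:

```python
# Hypothetical config sketch built from the defaults in the table above.
pdeil_reward_config = dict(
    type='pdeil',                        # register name in REWARD_MODEL_REGISTRY
    expert_data_path='expert_data.pkl',  # path to a '.pkl' expert dataset
    discrete_action=False,               # True if the action space is discrete
    alpha=0.5,                           # coefficient for the probability density estimator
    clear_buffer_per_iters=1,            # clear collected data every N train iterations
)
```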
- __init__(cfg: dict, device, tb_logger: SummaryWriter) → None [source]¶
- Overview:
Initialize self. See help(type(self)) for accurate signature. Some rules in naming the attributes of self.:
e_ : expert values
_sigma_ : standard deviation values
p_ : current policy values
_s_ : states
_a_ : actions
- Arguments:
cfg (Dict): Training config
device (str): Device usage, i.e. "cpu" or "cuda"
tb_logger (SummaryWriter): Logger, by default a 'SummaryWriter', used for model summary
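A hedged construction sketch (not from the source): it reuses the pdeil_reward_config dict from the Config section above, wraps it in an EasyDict on the assumption that attribute access is used internally, and assumes an expert_data.pkl file already exists at the configured path:

```python
# Sketch only: paths and logger directory are placeholders.
from easydict import EasyDict
from torch.utils.tensorboard import SummaryWriter

from ding.reward_model.pdeil_irl_model import PdeilRewardModel

cfg = EasyDict(pdeil_reward_config)       # config sketch from the Config section
tb_logger = SummaryWriter('./log/pdeil')  # any SummaryWriter instance
reward_model = PdeilRewardModel(cfg, device='cpu', tb_logger=tb_logger)
```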
- clear_data()[source]¶
- Overview:
Clearing training data. This is a side effect function which clears the data attribute in self.
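As the clear_buffer_per_iters entry in the Config table notes, clearing is expected to be driven from the entry code. A hedged sketch of that pattern, continuing the construction example above; the loop bound and variable names are illustrative:

```python
# Illustrative entry-loop pattern: max_iterations and train_iter are placeholders.
max_iterations = 1000
for train_iter in range(max_iterations):
    # ... collect new transitions, call collect_data() and train() here ...
    if train_iter % cfg.clear_buffer_per_iters == 0:
        reward_model.clear_data()  # drop accumulated policy data
```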
- collect_data(item: list)[source]¶
- Overview:
Collecting training data by iterating data items in the input list
- Arguments:
item (list): Raw training data (e.g. some form of states, actions, obs, etc.)
- Effects:
This is a side effect function which updates the data attribute in self by iterating over the items in the input list.
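A hedged usage sketch, continuing the construction example above; the transition fields shown (obs/action/reward shapes) are illustrative and should match your environment:

```python
# `new_data` is a list of transition dicts with at least 'obs' and 'action' entries.
import torch

new_data = [
    {'obs': torch.randn(4), 'action': torch.tensor([0.2]), 'reward': torch.zeros(1)},
    {'obs': torch.randn(4), 'action': torch.tensor([-0.5]), 'reward': torch.zeros(1)},
]
reward_model.collect_data(new_data)  # accumulate policy data inside the model
reward_model.train()                 # fit the internal estimators on the collected data
```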
- estimate(data: list) → List[Dict] [source]¶
- Overview:
Estimate reward by rewriting the reward keys.
- Arguments:
data (list): the list of data used for estimation, with at least obs and action keys.
- Effects:
This is a side effect function which updates the reward values in place.
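A hedged sketch of reward estimation, continuing the example above; per the signature, the processed list is returned while the reward values are rewritten in place:

```python
# Each item passed to estimate() must contain at least 'obs' and 'action'.
estimated = reward_model.estimate(new_data)
for item in estimated:
    print(item['reward'])  # reward values rewritten by the PDEIL model
```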