Basic Usage

Simple Example

The following code performs a deterministic action on the click-test-2 environment (Instruction: Click button ONE.).

import time
import gymnasium
from miniwob.action import ActionTypes

env = gymnasium.make('miniwob/click-test-2-v1', render_mode='human')

# Wrap the code in try-finally to ensure proper cleanup.
try:
  # Start a new episode.
  obs, info = env.reset()
  time.sleep(2)       # Only here to let you look at the environment.
  
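  # Find the HTML element with text "ONE".
  for element in obs["dom_elements"]:
    if element["text"] == "ONE":
      break
  else:
    # Fail loudly instead of silently clicking the last element in the list.
    raise RuntimeError('No element with text "ONE" was found.')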

  # Click on the element.
  action = env.action_space.sample()     # Template for the action.
  action["action_type"] = env.action_space_config.action_types.index(
      ActionTypes.CLICK_ELEMENT
  )
  action["ref"] = element["ref"]
  obs, reward, terminated, _, _ = env.step(action)

  # Check if the action was correct.
  assert reward >= 0      # Should be around 0.8 since 2 seconds have passed.
  assert terminated is True
  time.sleep(2)

finally:
  env.close()

Environment Initialization

env = gymnasium.make('miniwob/click-test-2-v1', render_mode='human')

Common arguments include:

  • render_mode: Render mode. Supported values are:

    • None (default): Headless Chrome, which does not show the browser window.

    • 'human': Show the browser window.

  • action_space_config: Configuration for the action space. See the Action Space section for details. Supported values are:

    • An ActionSpaceConfig object.

    • A preset name, which will instantiate an ActionSpaceConfig object.

Observation Space

Observation Object

In all MiniWoB++ environments, an observation is a dict with the following fields:

| Key | Type | Description |
|---|---|---|
| utterance | str | Task instruction string. |
| field_keys | tuple[str] | Environment-specific task field keys. (TODO: Implement this in code) |
| field_values | tuple[str] | Task field values extracted from the task instruction. (TODO: Implement this in code) |
| screenshot | np.ndarray with shape (height, width, 3) and type uint8 | Screenshot as RGB values for each pixel. Note that some elements, such as an opened dropdown, may not be captured in the screenshot. |
| dom_elements | list[dict] | List of feature dicts, each describing a DOM element (see below). |
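As a rough illustration of the structure described above, the sketch below checks a hand-built observation dict against the documented fields. The helper name `check_observation` and the mock values are hypothetical and are not part of MiniWoB++; a real observation would come from env.reset() or env.step(). The field_keys and field_values entries are left out because they are marked as not yet implemented.

```python
import numpy as np


def check_observation(obs):
    """Return a list of problems found in an observation dict (empty if none).

    Hypothetical helper: it only checks the fields documented above.
    """
    problems = []
    if not isinstance(obs.get("utterance"), str):
        problems.append("utterance should be a str")
    shot = obs.get("screenshot")
    if not (
        isinstance(shot, np.ndarray)
        and shot.ndim == 3
        and shot.shape[2] == 3
        and shot.dtype == np.uint8
    ):
        problems.append("screenshot should be a uint8 array of shape (height, width, 3)")
    elements = obs.get("dom_elements")
    if not (isinstance(elements, list) and all(isinstance(e, dict) for e in elements)):
        problems.append("dom_elements should be a list of dicts")
    return problems


# Mock observation with made-up values (not produced by a real environment).
mock_obs = {
    "utterance": "Click button ONE.",
    "screenshot": np.zeros((210, 160, 3), dtype=np.uint8),
    "dom_elements": [{"ref": 1, "parent": 0, "tag": "BODY", "text": ""}],
}
print(check_observation(mock_obs))
```

A check like this can be handy inside a custom wrapper, where a malformed observation is easier to catch early than to debug downstream.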

DOM Element Features

Each feature dict in dom_elements has the following fields:

| Key | Type | Description |
|---|---|---|
| ref | int | Non-zero integer ID. The ref for normal HTML elements starts from 1. Each HTML element retains the same ref during the same episode. Non-empty text nodes are converted into pseudo-elements with ref counting down from -1.[1] |
| parent | int | ref of the parent. For the root DOM element, parent will be 0. |
| left | np.ndarray with shape (1,) and type float32 | Left coordinate relative to the screen (can be negative). |
| top | np.ndarray with shape (1,) and type float32 | Top coordinate relative to the screen (can be negative). |
| width | np.ndarray with shape (1,) and type float32 | Element width. |
| height | np.ndarray with shape (1,) and type float32 | Element height. |
| tag | str | HTML tag. For normal elements, this is the uppercased tag name (e.g., "DIV"). For <input> elements, the input type is appended (e.g., "INPUT_text"). Non-empty text nodes become pseudo-elements with tag "t".[1] |
| text | str | Text content, which is non-empty only for leaf elements.[1] |
| value | str | Value of <input> element. |
| id | str | HTML id attribute. |
| classes | str | HTML class attribute (multiple classes are separated by spaces). |
| bg_color | np.ndarray with shape (4,) and type float32 | Background color as RGBA value. |
| fg_color | np.ndarray with shape (4,) and type float32 | Foreground color as RGBA value. |
| flags | np.ndarray with shape (4,) and type int8 | Binary flags: (focused) whether the element is currently focused; (tampered) whether the element has been tampered with (clicked, focused, typed, etc.); (targeted) whether the element is an event target (for recorded demonstrations); (is_leaf) whether the element is a leaf. |
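The feature dicts described above can be processed with ordinary Python. The following sketch shows one way to look up an element by text, compute the center of its bounding box, decode its flags, and group refs by parent. All function names are hypothetical helpers rather than MiniWoB++ APIs, the flag order is assumed to follow the order in which the flags are listed above, and the mock elements use plain lists in place of the NumPy arrays so the example runs without a browser.

```python
# Assumed index order of the flags vector (taken from the listing above).
FLAG_NAMES = ("focused", "tampered", "targeted", "is_leaf")


def find_by_text(dom_elements, text):
    """Return the first element whose text matches exactly, or None."""
    for element in dom_elements:
        if element["text"] == text:
            return element
    return None


def center(element):
    """Return the (x, y) center of an element's bounding box."""
    # Index [0] because each coordinate is stored as a length-1 array.
    x = float(element["left"][0]) + float(element["width"][0]) / 2
    y = float(element["top"][0]) + float(element["height"][0]) / 2
    return x, y


def decode_flags(element):
    """Map the flags vector to a dict of booleans keyed by flag name."""
    return {name: bool(value) for name, value in zip(FLAG_NAMES, element["flags"])}


def children_by_parent(dom_elements):
    """Group element refs under their parent's ref (the root's parent is 0)."""
    tree = {}
    for element in dom_elements:
        tree.setdefault(element["parent"], []).append(element["ref"])
    return tree


# Mock elements with made-up values; plain lists stand in for the NumPy arrays.
mock_elements = [
    {"ref": 1, "parent": 0, "tag": "BODY", "text": "",
     "left": [0.0], "top": [0.0], "width": [160.0], "height": [210.0],
     "flags": [0, 0, 0, 0]},
    {"ref": 2, "parent": 1, "tag": "BUTTON", "text": "ONE",
     "left": [10.0], "top": [60.0], "width": [40.0], "height": [20.0],
     "flags": [0, 0, 0, 1]},
]

button = find_by_text(mock_elements, "ONE")
print(button["ref"], center(button), decode_flags(button)["is_leaf"])
print(children_by_parent(mock_elements))
```

Returning None from the lookup, instead of relying on a loop variable after the loop ends, makes a missing element explicit to the caller.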

Action Space

Supported Actions

MiniWoB++ environments support the following actions.

| Name | Description |
|---|---|
| NONE | Do nothing for the current step. |
| CLICK_COORDS | Click on the specified coordinates. |
| DBLCLICK_COORDS | Double-click on the specified coordinates. |
| MOUSEDOWN_COORDS | Start dragging on the specified coordinates. |
| MOUSEUP_COORDS | Stop dragging on the specified coordinates. |
| CLICK_ELEMENT | Click on the specified element. |
| DBLCLICK_ELEMENT | Double-click on the specified element. |
| MOUSEDOWN_ELEMENT | Start dragging on the specified element. |
| MOUSEUP_ELEMENT | Stop dragging on the specified element. |
| SCROLL_UP | Scroll up on the mouse wheel. |
| SCROLL_DOWN | Scroll down on the mouse wheel. |
| PRESS_KEY | Press the specified key or key combination. |
| TYPE_TEXT | Type the specified string. |
| TYPE_FIELD | Type the value of the specified task field. |
| FOCUS_ELEMENT_AND_TYPE_TEXT | Click on the specified element, and then type the specified string. |
| FOCUS_ELEMENT_AND_TYPE_FIELD | Click on the specified element, and then type the value of the specified task field. |

Action Configs

The list of selected actions, along with other configurations, can be customized by passing a miniwob.action.ActionSpaceConfig object to the action_space_config argument during environment construction. The ActionSpaceConfig object has the following fields:

| Key | Type | Description |
|---|---|---|
| action_types | Sequence[ActionTypes] or Sequence[str] | An ordered sequence of action types to include. |
| screen_width | float | Screen width. Will be overridden by the environment constructor. |
| screen_height | float | Screen height. Will be overridden by the environment constructor. |
| coord_bins | tuple[int, int] | If specified, bin the x and y coordinates to these numbers of bins. Mouse actions will be executed at the middle of the specified partition. |
| allowed_keys | Sequence[str] | An ordered sequence of allowed keys and key combinations for the PRESS_KEY action. |
| text_max_len | int | Maximum text length for the TYPE_TEXT action. |
| text_charset | str or set[str] | Character set for the TYPE_TEXT action. |
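The coord_bins description says that mouse actions are executed at the middle of the specified partition. One plausible reading, assuming the screen is split into equal-sized bins along each axis, is sketched below. The function `bin_to_coords` is hypothetical, the screen size and bin counts are made up, and the library's exact arithmetic may differ.

```python
def bin_to_coords(bin_index, coord_bins, screen_width, screen_height):
    """Map an (x_bin, y_bin) index to the point at the center of that bin.

    Assumes equal-sized bins; this is an illustration, not the library's code.
    """
    n_cols, n_rows = coord_bins
    col, row = bin_index
    if not (0 <= col < n_cols and 0 <= row < n_rows):
        raise ValueError(f"bin index {bin_index} is outside {coord_bins}")
    x = (col + 0.5) * screen_width / n_cols
    y = (row + 0.5) * screen_height / n_rows
    return x, y


# Made-up screen size and bin counts, for illustration only.
print(bin_to_coords((0, 0), (4, 5), 160.0, 200.0))  # (20.0, 20.0)
print(bin_to_coords((3, 4), (4, 5), 160.0, 200.0))  # (140.0, 180.0)
```

Binning turns a continuous coordinate into a small discrete choice, which is usually easier for a learning agent to explore, at the cost of only being able to click bin centers.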

Action Object

An action is a dict. Which fields it includes depends on the selected actions:

| Key | Type | Description |
|---|---|---|
| action_type | int | Action type index from the action_types list in the config. |
| coords | np.ndarray of shape (2,) | Coordinates. Included when any *_COORDS action is selected. Depending on the coord_bins config, coords can be of type int8 (binned) or float32 (unbinned). |
| ref | int | Element ref ID. Included when any *_ELEMENT* action is selected. If no element has the specified ref, the action becomes a no-op. |
| key | int | Key index from the allowed_keys list in the config. Included when the PRESS_KEY action is selected. |
| text | str | Text to type. Included when any *_TYPE_TEXT action is selected. |
| field | int | Task field index. Included when any *_TYPE_FIELD action is selected. |
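The inclusion rules above can be restated as a small function that, given the selected action type names, lists the fields an action dict is expected to carry. This is a sketch of the rules as documented here, not the library's implementation. `required_fields` is a hypothetical name, action types are written as plain strings, and the text and field patterns are assumed to also cover the bare TYPE_TEXT and TYPE_FIELD actions.

```python
def required_fields(action_types):
    """List the action dict fields implied by the selected action types."""
    fields = ["action_type"]
    if any(name.endswith("_COORDS") for name in action_types):
        fields.append("coords")
    if any("_ELEMENT" in name for name in action_types):
        fields.append("ref")
    if "PRESS_KEY" in action_types:
        fields.append("key")
    # Assumption: the bare TYPE_TEXT / TYPE_FIELD actions count here too.
    if any(name.endswith("TYPE_TEXT") for name in action_types):
        fields.append("text")
    if any(name.endswith("TYPE_FIELD") for name in action_types):
        fields.append("field")
    return fields


# Example selection of action types (an arbitrary choice for illustration).
selected = ["NONE", "CLICK_ELEMENT", "PRESS_KEY"]
print(required_fields(selected))  # ['action_type', 'ref', 'key']

# action_type is the index of the chosen action within the selected list.
click = {"action_type": selected.index("CLICK_ELEMENT"), "ref": 2}
print(click)  # {'action_type': 1, 'ref': 2}
```

In a live environment it is probably simpler to start from env.action_space.sample(), as in the simple example, so that every required field is already present before the relevant ones are overwritten.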

Presets

The following preset names can be specified in place of the ActionSpaceConfig object: (TODO: Implement this in code)