lightrft.models.monkey_patch.qwen¶
This module provides an adapted implementation of the Qwen2 attention mechanism with Ulysses sequence parallelism support. It extends the standard transformer attention to work with sequence-parallel processing, enabling more efficient handling of long sequences by distributing them across multiple devices.
Note
This implementation has been tested only on transformers versions between 4.48.0 and 4.49.0.
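Because the patch is only validated on a narrow range of transformers releases, a caller may want to gate it behind a runtime version check. The helper below is a minimal, stdlib-only sketch of such a guard; the function name `version_in_range` and the inclusive bounds are illustrative assumptions, not part of this module's API.

```python
def version_in_range(version: str, low: str = "4.48.0", high: str = "4.49.0") -> bool:
    """Return True if `version` falls within [low, high] (inclusive).

    Only the leading numeric release segment is compared, so a dev build
    like "4.48.0.dev0" is treated as (4, 48, 0).
    """
    def parse(v: str) -> tuple:
        parts = []
        for piece in v.split("."):
            if piece.isdigit():
                parts.append(int(piece))
            else:
                break  # stop at the first non-numeric segment
        return tuple(parts)

    return parse(low) <= parse(version) <= parse(high)
```

At patch time one could then compare `version_in_range(transformers.__version__)` and skip or warn when the installed release is outside the tested window.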
- lightrft.models.monkey_patch.qwen.qwen2_attn_forward(self, hidden_states: torch.Tensor, position_embeddings: Tuple[torch.Tensor, torch.Tensor], attention_mask: torch.Tensor | None, past_key_value: transformers.cache_utils.Cache | None = None, cache_position: torch.LongTensor | None = None, **kwargs) Tuple[torch.Tensor, torch.Tensor | None][source]¶
Forward pass for Qwen2 attention with sequence parallelism support.
This function implements the attention mechanism for the Qwen2 model with added support for Ulysses sequence parallelism. It handles the projection of input hidden states into query, key, and value tensors, applies rotary position embeddings, and performs the attention computation with optional sliding-window attention.
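To make the Ulysses layout concrete: before attention, each sequence-parallel rank holds a shard of the sequence with every attention head; an all-to-all exchange then gives each rank the full sequence but only a shard of the heads, so standard attention can run locally. The helper below is a pure-shape sketch of that exchange (the name `ulysses_all_to_all_shapes` is hypothetical and not part of this module).

```python
def ulysses_all_to_all_shapes(seq_len: int, num_heads: int, sp_size: int):
    """Per-rank (tokens, heads) shapes before and after the Ulysses all-to-all.

    Before: each rank holds seq_len // sp_size tokens with all heads.
    After:  each rank holds all tokens with num_heads // sp_size heads.
    """
    assert seq_len % sp_size == 0, "sequence length must divide evenly across ranks"
    assert num_heads % sp_size == 0, "head count must divide evenly across ranks"
    before = (seq_len // sp_size, num_heads)   # sequence-sharded, all heads
    after = (seq_len, num_heads // sp_size)    # full sequence, head-sharded
    return before, after
```

For example, a 4096-token sequence with 32 heads on 4 ranks goes from a local (1024, 32) layout to (4096, 8), after which each rank computes ordinary attention over its head shard.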
- Parameters:
hidden_states (torch.Tensor) – Input tensor containing token embeddings
position_embeddings (Tuple[torch.Tensor, torch.Tensor]) – Tuple of (cos, sin) tensors for rotary position embeddings
attention_mask (Optional[torch.Tensor]) – Optional mask to prevent attention to certain positions
past_key_value (Optional[Cache]) – Optional cached key and value tensors for incremental decoding
cache_position (Optional[torch.LongTensor]) – Optional tensor indicating positions in the cache
kwargs – Additional keyword arguments passed to the attention implementation
- Returns:
Tuple containing:
- Output tensor after attention and projection
- Optional attention weights if output_attentions is True
- Return type:
Tuple[torch.Tensor, Optional[torch.Tensor]]
- Note:
This implementation is specifically adapted for transformers versions 4.48.0-4.49.0 and includes special handling for Ulysses sequence parallelism.
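As a monkey patch, this forward is meant to be rebound onto the model's attention class rather than called directly; in transformers 4.48–4.49 the target would be `Qwen2Attention.forward` in `transformers.models.qwen2.modeling_qwen2`. The sketch below demonstrates the rebinding mechanics on a stand-in class so it stays self-contained; `StandInAttention` and `patched_forward` are illustrative stand-ins, not names from this module.

```python
class StandInAttention:
    """Stand-in for the real patch target (Qwen2Attention in
    transformers.models.qwen2.modeling_qwen2)."""

    def forward(self, hidden_states):
        return "original"


def patched_forward(self, hidden_states):
    # A sequence-parallel-aware forward (like qwen2_attn_forward) goes here.
    return "patched"


# Monkey-patch: rebinding the class attribute makes every instance,
# including ones created before the patch, use the new forward.
attn = StandInAttention()
StandInAttention.forward = patched_forward
print(attn.forward(None))  # -> patched
```

Rebinding on the class (not on a single instance) is what lets one assignment cover every attention layer in an already-constructed model.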