Model Testing Guide¶

Guide for testing and evaluating trained models, particularly Vision-Language Models (VLMs).

Overview¶

LightRFT provides tools for interactive model testing with support for:

Interactive Chat: Real-time conversation testing
Multimodal Support: Text and image inputs
Batch Testing: Automated testing with JSON files
Performance Optimization: Flash Attention 2 and bfloat16
Command-line Interface: Convenient testing commands

Quick Start¶

Basic Text Conversation¶

python test_chat.py --model_path <checkpoint-path>

Image-based Testing¶

# Start interactive mode
python test_chat.py --model_path <checkpoint-path>

In interactive mode:

[You] /image <image-path>
✓ Image loaded: <image-path>
[You] What do you see in this image?
[Assistant] ...

Batch Testing¶

python test_chat.py \
  --model_path <checkpoint-path> \
  --batch <test-file.json> \
  --output <results.json>

Custom Generation Parameters¶

python test_chat.py \
  --model_path <checkpoint-path> \
  --max_tokens 4096 \
  --temperature 0.5 \
  --top_p 0.9

Interactive Commands¶

Available commands in interactive mode:

Command	Description
`/image <path>`	Load image for next query
`/clear`	Clear conversation history
`/reset`	Reset loaded images
`/help`	Show help information
`/quit` or `/exit`	Exit program

Batch Test File Format¶

Text-only Tests¶

[
  {
    "query": "What is 2 + 2?",
    "expected": "4"
  },
  {
    "query": "Explain the Pythagorean theorem."
  }
]

Image-based Tests¶

[
  {
    "query": "Describe what you see in this image.",
    "images": ["<image-path-1>"],
    "expected": "Description of the image"
  },
  {
    "query": "Compare these two images.",
    "images": ["<image-path-1>", "<image-path-2>"]
  }
]

Configuration Parameters¶

Parameter	Default	Description
`--model_path`	-	Model checkpoint path (required)
`--device`	`cuda`	Inference device (cuda/cpu)
`--max_tokens`	`8192`	Maximum generation tokens
`--temperature`	`0.7`	Sampling temperature (0 for greedy)
`--top_p`	`0.95`	Top-p sampling parameter
`--system_prompt`	`None`	Custom system prompt
`--batch`	`None`	Batch test JSON file path
`--output`	`None`	Batch test results output file

Usage Examples¶

Example 1: Math Problem Solving¶

python test_chat.py --model_path <checkpoint-path>

Example interaction:

[You] If a triangle has sides 3, 4, and 5, what is its area?

[Assistant] <think>
This is a right triangle since 3² + 4² = 9 + 16 = 25 = 5².
For a right triangle, the area is (1/2) × base × height.
Using the two perpendicular sides: Area = (1/2) × 3 × 4 = 6
</think>

The area of the triangle is 6 square units.

Example 2: Geometry Recognition¶

python test_chat.py --model_path <checkpoint-path>

Example interaction:

[You] /image <geometry-image-path>
✓ Image loaded: <geometry-image-path>

[You] Solve the geometry problem shown in this image.

[Assistant] <think>
Looking at the diagram, I can see a triangle ABC with...
[detailed reasoning process]
</think>

The answer is [solution].

Example 3: Batch Performance Testing¶

Create test file test_questions.json:

[
  {
    "query": "Find the area of triangle with base 6 and height 8.",
    "expected": "24"
  },
  {
    "query": "What is the perimeter of a square with side length 5?",
    "expected": "20"
  }
]

Run batch test:

python test_chat.py \
  --model_path <checkpoint-path> \
  --batch test_questions.json \
  --output test_results.json \
  --temperature 0.0

Performance Optimizations¶

The testing script includes:

Flash Attention 2: Accelerated attention computation
BFloat16: Reduced memory usage and faster inference
Batch Processing: Improved throughput for batch tests
Memory Management: Automatic GPU memory cleanup

Troubleshooting¶

Out of Memory (OOM)¶

If encountering memory issues:

1. Reduce max_tokens:

python test_chat.py --model_path <checkpoint-path> --max_tokens 4096

2. Use CPU inference (slower):

python test_chat.py --model_path <checkpoint-path> --device cpu

Image Loading Failure¶

Ensure image path is correct and format is supported (JPG, PNG, etc.):

ls -lh <image-path>

Generation Quality Issues¶

Adjust sampling parameters:

More deterministic: --temperature 0.0 (greedy decoding)
More diverse: --temperature 1.0 --top_p 0.9
Balanced: --temperature 0.7 --top_p 0.95 (default)

Dependencies¶

Required packages:

pip install torch transformers pillow flash-attn

Best Practices¶

Model Loading: First run requires model loading time
Image Reset: Images auto-reset after each conversation
History Management: Use /clear to reset conversation history
Batch Independence: Each batch test runs independently