Use this skill when a generated image already exists and the user needs iterative, artifact-grounded refinement through direct manipulation, such as masking regions for inpainting or adjusting keyword attention based on visual feedback. It fits interfaces that combine image inspection, explainable attention views, and localized regeneration controls.
Paradigm: P12 — Interactive Artifact Refinement
Domain: text-to-image generation
Source: Wang et al. (2024). PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement.
The user has a generated image and wants to iteratively refine specific regions or semantic attributes using direct controls tied to the artifact, such as masking, inpainting, or token-level attention adjustment.
A→G, T→G, G→I, H→I→A
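
The flow notation above can be read as a small directed graph over the roles in the symbol table. The sketch below expands each chain into individual edges; reading an arrow as "feeds into" is an assumption, and `parse_flow` and `inputs_of` are hypothetical helper names used only for illustration.

```python
def parse_flow(notation: str) -> list[tuple[str, str]]:
    """Expand comma-separated chains such as 'H→I→A' into (source, target) edges."""
    edges: list[tuple[str, str]] = []
    for chain in notation.split(","):
        nodes = [node.strip() for node in chain.split("→")]
        # A chain with k nodes contributes k - 1 consecutive edges.
        edges.extend(zip(nodes, nodes[1:]))
    return edges


def inputs_of(node: str, edges: list[tuple[str, str]]) -> list[str]:
    """Return, in sorted order, every role that feeds directly into `node`."""
    return sorted(src for src, dst in edges if dst == node)


EDGES = parse_flow("A→G, T→G, G→I, H→I→A")
```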
| Symbol | Role |
|---|---|
| H | image creator inspecting and refining generated outputs |
| T | optional follow-up text guidance for inpainting or semantic emphasis |
| I | masking image regions, selecting tokens, adjusting attention weights, switching versions |
| Aug | implicit combination of artifact state and user controls during refinement |
| G | diffusion model plus attention visualization/refinement mechanisms |
| A | generated image and its editable regions/versions |

| Widget | Binds To | Example |
|---|---|---|
| image mask brush | selected region for localized regeneration | Paint over an undesired background area and mark it for inpainting |
| inpainting prompt box | text guidance for masked-region regeneration | Enter 'replace with misty forest' for the masked area |
| token attention slider | per-keyword diffusion attention weight | Increase attention for the token 'moon' to strengthen its visual prominence |
| attention heatmap overlay | token-to-image correspondence inspection | Hover over 'wolf' to highlight the image regions the model associates with that token |
| version history strip | artifact state comparison and rollback | Switch between prior image versions after each attention or inpainting edit |
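
As a rough illustration of two bindings in the widget table, the sketch below models the token attention slider as a clamped per-token multiplier and the mask brush as circular strokes rasterised into a binary mask. All names here (`TokenAttention`, `paint_mask`, the 0.0 to 2.0 slider range) are hypothetical, and the `(token:weight)` rendering borrows a convention from some Stable Diffusion front ends; it is not presented as PromptCharm's own mechanism, and a real system would pass the weights and mask to a diffusion backend.

```python
from dataclasses import dataclass, field


@dataclass
class TokenAttention:
    """Per-keyword attention multipliers set by the slider (1.0 means unchanged)."""
    weights: dict[str, float] = field(default_factory=dict)

    def set_weight(self, token: str, weight: float,
                   lo: float = 0.0, hi: float = 2.0) -> None:
        # Clamp to the assumed slider range so extreme values cannot be stored.
        self.weights[token] = max(lo, min(hi, weight))

    def render(self, prompt_tokens: list[str]) -> str:
        # Emit "(token:weight)" only for tokens whose weight differs from 1.0.
        parts = []
        for tok in prompt_tokens:
            w = self.weights.get(tok, 1.0)
            parts.append(tok if w == 1.0 else f"({tok}:{w:.2f})")
        return ", ".join(parts)


def paint_mask(width: int, height: int,
               strokes: list[tuple[int, int, int]]) -> list[list[int]]:
    """Rasterise circular brush strokes (cx, cy, radius) into a binary mask."""
    mask = [[0] * width for _ in range(height)]
    for cx, cy, r in strokes:
        # Only visit the stroke's bounding box, clipped to the image.
        for y in range(max(0, cy - r), min(height, cy + r + 1)):
            for x in range(max(0, cx - r), min(width, cx + r + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                    mask[y][x] = 1  # 1 marks a pixel for localized regeneration
    return mask
```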
Return type: composite
After generating an image, the user notices that the moon is too faint and that part of the background is undesirable. They inspect the token attention, raise the attention weight for 'full moon', mask the flawed region, provide a short inpainting prompt, and compare the new result against earlier versions.
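
The scenario above can be sketched as a sequence of commits to a version history that supports comparison and rollback. This is a minimal sketch under stated assumptions: `Version`, `History`, and `replay_scenario` are hypothetical names, the prompt text and weight value are invented for illustration, and image generation itself is left out, so each version records only the settings that would be sent to the generator.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Version:
    """One entry in the version history strip (settings only, no pixels)."""
    label: str                                   # what the user did to get here
    prompt: str                                  # main text prompt
    attention: dict[str, float] = field(default_factory=dict)
    inpaint_prompt: Optional[str] = None         # guidance for the masked region
    parent: Optional[int] = None                 # index of the version edited


@dataclass
class History:
    versions: list[Version] = field(default_factory=list)
    current: int = -1                            # -1 means nothing generated yet

    def commit(self, label: str, prompt: str,
               attention: Optional[dict[str, float]] = None,
               inpaint_prompt: Optional[str] = None) -> int:
        # Append-only: earlier versions stay available for comparison.
        parent = self.current if self.current >= 0 else None
        self.versions.append(
            Version(label, prompt, dict(attention or {}), inpaint_prompt, parent))
        self.current = len(self.versions) - 1
        return self.current

    def switch_to(self, index: int) -> Version:
        if not 0 <= index < len(self.versions):
            raise IndexError(f"no version {index}")
        self.current = index
        return self.versions[index]


def replay_scenario() -> History:
    """Replay the three edits from the scenario, then roll back to compare."""
    prompt = "wolf, full moon, forest"           # hypothetical prompt
    h = History()
    h.commit("initial generation", prompt)
    h.commit("raise attention", prompt, attention={"full moon": 1.5})
    h.commit("inpaint masked region", prompt,
             attention={"full moon": 1.5}, inpaint_prompt="misty forest")
    h.switch_to(0)                               # view the original again
    return h
```

Recording a `parent` index for each version is a design choice of this sketch: it keeps the strip linear for display while still showing which earlier image an edit started from after a rollback.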