Generate or edit research figures with text-to-image and text+image-to-image workflows. Use when the user wants scientific figure drawing, paper illustration generation, edits based on an existing image, a unified figure style, or a polished scientific schematic.
Determine whether the user supplied an input image path.
If there is a valid local image path, use --mode image2image and pass that path with --input.
Otherwise use --mode text2image.
Run a preprocessing stage before generation.
In preprocessing, analyze the request context and infer:
the figure category
the paper-style intent
the likely venue-style preset
the desired layout structure
the color semantics
the connection structure
the arrow density and arrow hierarchy
whether the request is a hero figure, overview figure, mechanism figure, architecture figure, or workflow figure
Automatically select the most suitable venue-style preset.
Optimize the layout and connection grammar before writing prompts.
From the preprocessing result, write two explicit prompts:
a Positive prompt that is very detailed, publication-oriented, and suitable for scientific figure generation
a Negative prompt that suppresses low-quality, non-academic, decorative, illegible, or off-style outputs
Rewrite the generation input so the backend script receives a strong, paper-style final prompt derived from the Positive prompt.
Choose an output directory automatically:
default to outputs/research-figure-edit/
if that is unsuitable for the current workspace, use a nearby outputs/ directory under the working directory
Run the backend script with Bash and let it complete the full workflow in one step: send the API request, extract the image bytes from the response, and save the final image file.
Return the saved output paths to the user together with:
detected figure category
selected venue-style preset
Positive prompt
Negative prompt
any major rewrite decisions (briefly)
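The output-directory step above can be sketched as follows. This is a minimal illustration, not part of the backend script: the function name `choose_output_dir` is hypothetical, and "unsuitable" is assumed here to mean that the default directory cannot be created or written to.

```python
import os
from pathlib import Path

# Default location named in the workflow above.
DEFAULT_SUBDIR = Path("outputs") / "research-figure-edit"


def choose_output_dir(workdir: Path) -> Path:
    """Return a writable output directory under workdir (hypothetical helper)."""
    candidates = [workdir / DEFAULT_SUBDIR, workdir / "outputs"]
    for candidate in candidates:
        try:
            candidate.mkdir(parents=True, exist_ok=True)
        except OSError:
            # Assumption: a blocked or uncreatable path counts as "unsuitable".
            continue
        if os.access(candidate, os.W_OK):
            return candidate
    raise RuntimeError(f"no writable outputs directory under {workdir}")
```

Calling `choose_output_dir(Path.cwd())` returns the default directory when it can be created, and falls back to the nearby `outputs/` directory otherwise.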
Mode selection rules
If the user provides only a prompt, run --mode text2image.
If the user provides a prompt and a local input image path, run --mode image2image.
Only treat a path as an image input if it points to an existing local file and has a supported image extension.
If the request is ambiguous, ask the user to specify the prompt text and image path separately.
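The mode selection rules above can be sketched in Python as follows. The helper names are hypothetical, the extension set is an assumption taken from the supported input types listed in the Notes, and treating an empty prompt as the "ambiguous" case is only one possible reading.

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumed extension set, based on the supported input types (PNG, JPG, JPEG).
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def is_image_input(path_text: Optional[str]) -> bool:
    """True only for an existing local file with a supported image extension."""
    if not path_text:
        return False
    path = Path(path_text).expanduser()
    return path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS


def select_mode(prompt: str, image_path: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (mode, input_path) for the backend script flags --mode and --input."""
    if not prompt.strip():
        # Assumption: a missing prompt is treated as an ambiguous request.
        raise ValueError("ask the user for the prompt text and image path separately")
    if is_image_input(image_path):
        return "image2image", image_path
    return "text2image", None
```

A path that does not exist, or that points to a non-image file, falls through to text2image rather than raising an error.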
Figure category detection
Before generating, classify the request into the closest figure category.
Categories
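flowchart: workflows, pipelines, training loops, inference loops, algorithm steps
architecture: model structures, module diagrams, system frameworks, encoder-decoder layouts
mechanism: principles, how-it-works explanations, reward or attention mechanisms
comparison: baselines, before/after views, ablations
concept: overviews, motivation, teasers, main-idea figures
dataflow: data flow, input/output paths, preprocessing and postprocessing, multimodal streams
experiment: experiment setups, evaluation, benchmarks, protocols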
Use lightweight intent matching from the user's wording.
If the request mentions 流程图 (flowchart), pipeline, workflow, step, loop, training pipeline, prefer flowchart
If it mentions 架构图 (architecture diagram), architecture, framework, module, encoder, decoder, prefer architecture
If it mentions 机制 (mechanism), principle, how it works, reward mechanism, attention mechanism, prefer mechanism
If it mentions 对比 (comparison), compare, baseline, before/after, ablation, prefer comparison
If it mentions overview, motivation, teaser, main idea, concept, prefer concept
If it mentions data flow, input/output, preprocess, postprocess, multimodal stream, prefer dataflow
If it mentions experiment setup, evaluation, benchmark, protocol, prefer experiment
If multiple categories match, prefer the most structurally specific one in this order:
flowchart > architecture > mechanism > comparison > concept > dataflow > experiment
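The lightweight intent matching above can be sketched as an ordered keyword table. This is an illustrative sketch only: the function name is hypothetical, plain substring matching is an assumption (so "step" also matches "stepwise"), and returning `None` when nothing matches is a choice made here because the rules do not name a fallback category.

```python
from typing import Optional

# Keywords copied from the matching rules; list order encodes the tie-break priority.
CATEGORY_KEYWORDS = [
    ("flowchart", ["流程图", "training pipeline", "pipeline", "workflow", "step", "loop"]),
    ("architecture", ["架构图", "architecture", "framework", "module", "encoder", "decoder"]),
    ("mechanism", ["机制", "principle", "how it works", "reward mechanism", "attention mechanism"]),
    ("comparison", ["对比", "compare", "baseline", "before/after", "ablation"]),
    ("concept", ["overview", "motivation", "teaser", "main idea", "concept"]),
    ("dataflow", ["data flow", "input/output", "preprocess", "postprocess", "multimodal stream"]),
    ("experiment", ["experiment setup", "evaluation", "benchmark", "protocol"]),
]


def detect_category(request: str) -> Optional[str]:
    """Return the highest-priority category whose keywords appear in the request."""
    text = request.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return None  # Assumption: the caller decides what to do when nothing matches.
```

Because the table is scanned in priority order, a request that mentions both a pipeline and an encoder resolves to flowchart, which matches the ordering rule.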
Venue-style preset system
After detecting the category, automatically select a venue-style preset.
Presets
paper-clean
Use for generic publication-quality figures when no venue-specific visual cue is clearly better.
Prompt guidance:
clean publication-quality scientific figure
white background
minimal modern academic style
balanced layout
crisp labels
subtle professional color palette
polished and highly readable
nature-like
Use for elegant overview figures, concept figures, and high-level mechanism illustrations.
Prompt guidance:
Nature-like scientific figure
elegant and refined
clean white background
subtle professional color palette
balanced composition
minimal clutter
polished scientific schematic
science-like
Use for explanatory mechanism figures and concept+experiment hybrid figures.
Prompt guidance:
Science-like publication figure
explanatory scientific illustration
refined but accessible design
strong visual storytelling
concise labels
polished multi-part composition
cvpr-like
Use for computer vision style architecture diagrams, technical pipelines, and comparison figures.
Prompt guidance:
CVPR-like technical figure
structured modular layout
clear blocks and arrows
compact and professional design
white background
strong readability
conference-ready diagram
iccv-like
Use for visually layered architecture diagrams and complex visual systems.
Prompt guidance:
ICCV-like vision paper diagram
professional modular architecture figure
clearly separated components
strong visual hierarchy
balanced layout
publication-quality technical illustration
neurips-like
Use for machine learning method figures, algorithm logic diagrams, and concise training schematics.
Prompt guidance:
NeurIPS-like machine learning figure
clean and rigorous academic style
precise structure
compact scientific layout
clear logical flow
minimal but highly legible
iclr-like
Use for reinforcement learning, LLM, alignment, agent, and training workflow figures.
Prompt guidance:
ICLR-like modern ML figure
clean research schematic
clear process flow
concise modular design
elegant but rigorous
highly readable labels
suitable for RL or LLM training diagrams
aaai-like
Use for formal AI system diagrams and logic-oriented technical figures.
Prompt guidance:
AAAI-like AI research diagram
clear and formal academic style
structured blocks
straightforward technical presentation
neat and concise conference-style figure
Category to preset mapping
Use these defaults unless the user explicitly asks for a different visual style:
flowchart -> iclr-like
architecture -> cvpr-like
mechanism -> nature-like
comparison -> cvpr-like
concept -> nature-like
dataflow -> paper-clean
experiment -> neurips-like
Override rules
If the user explicitly asks for Nature, Science, CVPR, ICCV, NeurIPS, ICLR, or AAAI style, honor that request.
If the user asks for a paper main figure, overview figure, or teaser figure, prefer nature-like unless they specify another venue.
If the user asks for a training pipeline, RL flowchart, agent workflow, or LLM training diagram, prefer iclr-like.
If the user asks for a computer vision architecture, vision pipeline, or multi-branch model diagram, prefer cvpr-like or iccv-like.
If the user gives no clear style signal, use the mapping table above.
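The mapping and override rules above can be sketched as a three-stage lookup: explicit venue names first, then intent hints, then the category default. The function and table names are hypothetical. Two choices here are assumptions: the vision hint resolves to cvpr-like although the rule allows cvpr-like or iccv-like, and an unknown category falls back to paper-clean. Word matching is naive, so a phrase such as "computer science" would also trigger the Science preset.

```python
import re
from typing import Optional

DEFAULT_PRESET = {
    "flowchart": "iclr-like",
    "architecture": "cvpr-like",
    "mechanism": "nature-like",
    "comparison": "cvpr-like",
    "concept": "nature-like",
    "dataflow": "paper-clean",
    "experiment": "neurips-like",
}

# Explicit venue names the user may ask for.
VENUE_PRESETS = {
    "nature": "nature-like", "science": "science-like", "cvpr": "cvpr-like",
    "iccv": "iccv-like", "neurips": "neurips-like", "iclr": "iclr-like",
    "aaai": "aaai-like",
}

# Phrases taken from the override rules; the phrase spellings are assumptions.
INTENT_HINTS = [
    (["main figure", "overview figure", "teaser figure"], "nature-like"),
    (["training pipeline", "rl flowchart", "agent workflow", "llm training"], "iclr-like"),
    (["computer vision architecture", "vision pipeline", "multi-branch"], "cvpr-like"),
]


def select_preset(request: str, category: Optional[str]) -> str:
    """Pick a venue-style preset: explicit venue > intent hint > category default."""
    text = request.lower()
    for venue, preset in VENUE_PRESETS.items():
        if re.search(rf"\b{venue}\b", text):
            return preset
    for phrases, preset in INTENT_HINTS:
        if any(phrase in text for phrase in phrases):
            return preset
    return DEFAULT_PRESET.get(category, "paper-clean")
```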
Layout and connection optimization rules
Before writing prompts, optimize the visual organization of the figure itself.
Narrative hierarchy
Prefer figures with explicit narrative hierarchy rather than flat equal-weight modules.
Use layered organization when appropriate, for example by placing validation mapping or experiment support at the bottom or side.
If one part of the figure is conceptually dominant, give it more visual weight.
Do not force all panels to have equal size when the content is not equally important.
Arrow and connector discipline
Reduce unnecessary arrows.
Only use arrows when they truly improve scientific readability.
Prefer three levels of connector strength:
primary arrows for the main process or argument flow
secondary arrows for local dependencies within one panel
minimal connectors or simple alignment for weak logical mapping
Avoid spider-web connection structures.
Avoid dense cross-panel arrows when color semantics, grouping, or alignment can communicate the same relationship more clearly.
Preferred connection structures
Prefer clean structures such as:
linear chains for process explanations
tree-like branching and merging for supervision or aggregation paths
parallel columns feeding one fusion block for multimodal reasoning
aligned mapping grids or tag panels for experiment-to-operator correspondence
Avoid complex global loops unless the loop itself is the scientific point.
If a loop is needed, keep it local to one panel instead of the whole figure.
Panel consistency
Within a multi-panel figure, keep panels internally consistent.
A strong default is:
title on top
main visual structure in the center
one short takeaway note below
Do not mix radically different alignment systems across neighboring panels unless there is a clear reason.
Cross-panel structure
Minimize direct cross-panel connectors.
When possible, communicate shared semantics by:
repeated color meaning
repeated visual tokens
aligned panel ordering
matching labels
mirrored or parallel local structures
For validation panels, prefer compact mapping layouts over long return arrows to earlier panels.
For example, use aligned tags, mini-box mappings, or table-like correspondences rather than many long crossing connectors.
Layout optimization objectives
During preprocessing, explicitly optimize for:
balanced visual weight
clean reading order
minimal connector clutter
high signal-to-noise ratio
strong alignment
clear grouping
scientific readability at paper scale
immediate recognition of the main claim
Prompt preprocessing and construction rules
Before any image generation, always perform prompt preprocessing.
Preprocessing outputs
The preprocessing stage must explicitly produce two artifacts:
Positive prompt
Negative prompt
Positive prompt requirements
The Positive prompt must be:
tailored to the user's scientific context
aligned with the detected figure category
aligned with the selected venue-style preset
detailed enough for high-quality academic figure generation
written in a publication-oriented style rather than a casual image-generation style
structured, specific, and visually directive
The Positive prompt should usually contain:
Figure identity
what kind of figure this is
whether it is a hero figure, overview figure, mechanism figure, architecture figure, workflow figure, comparison figure, or experiment figure
Scientific content layer
the exact entities, modules, stages, or claims to show
the intended scientific message
the causal or logical relationships between components
Layout layer
horizontal or vertical layout
number of panels or bands
relative grouping of information
alignment and reading order
narrative hierarchy across panels or bands
which panel or band should receive the most visual weight
Connection layer
where arrows are necessary
where alignment or grouping should replace arrows
primary vs secondary connector hierarchy
whether the structure should be linear, branching, merging, parallel, or mapping-based
how to avoid clutter and spider-web connectors
Visual semantics layer
color meaning per pathway or concept
which elements should be emphasized
which elements should stay neutral
Typography and labeling layer
short labels only
clean scientific sans-serif style
bold panel titles when needed
no decorative text treatment
Venue-style layer
include the chosen preset guidance explicitly
Quality layer
white background
publication-ready
vector-graphics look
flat design when appropriate
strict alignment
high readability
minimal clutter
disciplined connector usage
balanced layout and panel spacing
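One way to keep the Positive prompt layers explicit is to hold them in a small structure and join them in the order listed above. This is a sketch under assumptions: the class, field, and function names are hypothetical, the typography and quality defaults are copied from the lists above, and the one-line-per-layer output format is an arbitrary choice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PositivePromptSpec:
    """Hypothetical container for the Positive prompt layers."""
    figure_identity: str
    scientific_content: List[str]
    layout: List[str]
    connections: List[str]
    visual_semantics: List[str]
    venue_style: List[str]
    typography: List[str] = field(default_factory=lambda: [
        "short labels only", "clean scientific sans-serif style",
        "bold panel titles when needed", "no decorative text treatment",
    ])
    quality: List[str] = field(default_factory=lambda: [
        "white background", "publication-ready", "vector-graphics look",
        "flat design when appropriate", "strict alignment", "high readability",
        "minimal clutter", "disciplined connector usage",
        "balanced layout and panel spacing",
    ])


def build_positive_prompt(spec: PositivePromptSpec) -> str:
    """Join the non-empty layers into one prompt, one line per layer."""
    sections = [
        ("Figure identity", [spec.figure_identity]),
        ("Scientific content", spec.scientific_content),
        ("Layout", spec.layout),
        ("Connections", spec.connections),
        ("Visual semantics", spec.visual_semantics),
        ("Typography and labeling", spec.typography),
        ("Venue style", spec.venue_style),
        ("Quality", spec.quality),
    ]
    lines = [f"{title}: {'; '.join(items)}." for title, items in sections if items]
    return "\n".join(lines)
```

Layers with no content are skipped, so a short request still yields a well-formed prompt that carries the typography and quality defaults.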
Negative prompt requirements
The Negative prompt must explicitly suppress outputs that are visually incompatible with scientific paper figures.
The Negative prompt should usually include terms such as:
low quality
blurry text
unreadable labels
poster design
marketing infographic
cartoon
comic style
childish icons
3D rendering
glossy UI
neon colors
sci-fi interface
cluttered layout
dark background
heavy gradients
dramatic shadows
photographic realism
messy arrows
inconsistent alignment
decorative background
exaggerated icons
slide deck style
business infographic style
flashy presentation
hand-drawn style
sketch style
crowded composition
illegible academic labels
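The term list above can be kept as data and joined into a single Negative prompt string. A minimal sketch: the function name is hypothetical, and the comma-separated format and case-insensitive de-duplication of extra terms are assumptions, not requirements of the backend.

```python
from typing import Iterable

# Terms copied from the list above, in the same order.
NEGATIVE_TERMS = [
    "low quality", "blurry text", "unreadable labels", "poster design",
    "marketing infographic", "cartoon", "comic style", "childish icons",
    "3D rendering", "glossy UI", "neon colors", "sci-fi interface",
    "cluttered layout", "dark background", "heavy gradients", "dramatic shadows",
    "photographic realism", "messy arrows", "inconsistent alignment",
    "decorative background", "exaggerated icons", "slide deck style",
    "business infographic style", "flashy presentation", "hand-drawn style",
    "sketch style", "crowded composition", "illegible academic labels",
]


def build_negative_prompt(extra_terms: Iterable[str] = ()) -> str:
    """Join the default terms with request-specific extras, dropping duplicates."""
    seen = set()
    ordered = []
    for term in [*NEGATIVE_TERMS, *extra_terms]:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            ordered.append(cleaned)
    return ", ".join(ordered)
```

For example, `build_negative_prompt(["watermark"])` appends one request-specific term after the defaults.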
Final generation prompt rule
Construct the final generation prompt by combining these layers in order:
User intent layer — what the user actually wants drawn
Category layer — structural instructions for the detected category
Venue-style layer — one of the preset guidance blocks above
Do not merely forward the raw user request. Rewrite it into a stronger research-figure generation prompt.
The backend script should receive the rewritten Positive prompt as its --prompt content.
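Handing the rewritten prompt to the script can be sketched as an argument-list builder. Only the flags named in this document are used (--mode, --prompt, --input, --aspect-ratio, --clarity). The function name is hypothetical, the script path is deliberately left to the caller, and rejecting --clarity outside text2image is an assumption based on the note that it is supported for text2image.

```python
from typing import List, Optional

VALID_MODES = {"text2image", "image2image"}
VALID_CLARITY = {"standard", "high", "ultra"}


def build_script_args(
    prompt: str,
    mode: str,
    input_path: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
    clarity: Optional[str] = None,
) -> List[str]:
    """Build the flag list to append to the backend script invocation."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "image2image" and not input_path:
        raise ValueError("image2image requires an input image path")
    args = ["--mode", mode, "--prompt", prompt]
    if mode == "image2image":
        args += ["--input", input_path]
    if aspect_ratio:
        args += ["--aspect-ratio", aspect_ratio]
    if clarity:
        # Assumption: clarity is only passed in text2image mode.
        if mode != "text2image" or clarity not in VALID_CLARITY:
            raise ValueError("clarity must be standard, high, or ultra in text2image mode")
        args += ["--clarity", clarity]
    return args
```

Passing the result as a list (for example to `subprocess.run`) rather than as one shell string avoids quoting problems with long multi-line prompts.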
The script completes the whole pipeline in one run:
sends the generation request
extracts generated image bytes from the JSON response
saves the final image file directly
optionally saves raw API response JSON
It prints the saved paths after completion.
When responding to the user after generation, also report:
detected figure category
selected venue-style preset
Positive prompt
Negative prompt
any major prompt rewrite decisions (briefly)
Notes
For image2image, the input image is read locally and base64-encoded automatically.
Supported input image formats: PNG and JPEG (file extensions .png, .jpg, .jpeg).
--aspect-ratio is supported for both modes.
--clarity is supported for text2image and can be standard, high, or ultra.
Use clear prompts such as:
preserve the structure and improve visual consistency
convert into a clean Nature-style scientific figure
white background, paper-ready, minimal annotations
Before running the script, verify that at least one of NANOBANANA_API_KEY or NANOBANANA_BEARER_TOKEN is set in the environment. If neither is set, ask the user to configure one of them.
Even if the user only provides a short request, you must still run the preprocessing stage internally and expand it into detailed Positive and Negative prompts before generation.
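The credential check and the image2image encoding step from the Notes can be sketched as below. The backend script already performs the encoding itself, so this only illustrates what that step amounts to. The helper names are hypothetical, and the extension-to-MIME table is an assumption derived from the supported input formats.

```python
import base64
import os
from pathlib import Path
from typing import Mapping, Tuple

CREDENTIAL_VARS = ("NANOBANANA_API_KEY", "NANOBANANA_BEARER_TOKEN")

# Assumed mapping from supported extensions to standard MIME types.
MIME_BY_EXTENSION = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}


def has_credentials(env: Mapping[str, str] = os.environ) -> bool:
    """True when at least one credential variable is set and non-blank."""
    return any(env.get(name, "").strip() for name in CREDENTIAL_VARS)


def encode_input_image(path: Path) -> Tuple[str, str]:
    """Return (mime_type, base64_text) for a supported local image file."""
    suffix = path.suffix.lower()
    if suffix not in MIME_BY_EXTENSION:
        raise ValueError(f"unsupported image type: {suffix or 'no extension'}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return MIME_BY_EXTENSION[suffix], encoded
```

If `has_credentials()` returns `False`, stop and ask the user to configure one of the two variables before invoking the script.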