Rendering-Aware Reinforcement Learning for
Vector Graphics Generation

Juan A. Rodriguez1,2,3, Haotian Zhang5, Abhay Puri1, Aarash Feizi1,2,11, Rishav Pramanik7, Pascal Wichmann6, Arnab Mondal2,8, Mohammad Reza Samsami2,9,
Rabiul Awal1,2, Perouz Taslakian1, Spandana Gella1, Sai Rajeswar1,2, David Vazquez1, Christopher Pal1,2,4,10, Marco Pedersoli1,2,3
1ServiceNow Research 2Mila 3ÉTS Montréal 4Polytechnique Montréal 5Columbia University
6Independent Scholar 7Stony Brook University 8Apple 9Google Research 10Canada CIFAR AI Chair 11McGill University
NeurIPS 2025 · San Diego, California

How RLRF Works

Reinforcement Learning from Rendering Feedback

RLRF Method Overview

RLRF trains a vision-language model to generate SVG code using the rendered image as feedback. Unlike traditional supervised fine-tuning, which supervises only the code tokens, RLRF rasterizes the generated SVG and compares it with the target image. The model can therefore learn from its visual mistakes and improve generation quality, even for complex shapes and gradients that are hard to capture from code tokens alone.
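The render-and-compare loop can be illustrated with a toy sketch. The rasterizer below handles only axis-aligned `<rect>` elements and stands in for a real SVG renderer (e.g. CairoSVG); the pixel-L2 scoring is the rendering feedback, while the actual policy update that would consume these rewards is omitted.

```python
import re
import numpy as np

def render_svg_rects(svg: str, size: int = 32) -> np.ndarray:
    """Toy rasterizer for grayscale <rect> elements only -- a stand-in for a
    real SVG renderer so the feedback loop stays self-contained."""
    canvas = np.zeros((size, size))
    for m in re.finditer(
            r'<rect x="(\d+)" y="(\d+)" width="(\d+)" height="(\d+)"', svg):
        x, y, w, h = map(int, m.groups())
        canvas[y:y + h, x:x + w] = 1.0  # paint the filled rect as white
    return canvas

def rendering_feedback(candidates, target):
    """Score each candidate SVG by rendering it and comparing pixels against
    the target image; an RL step would then reinforce higher-scoring samples."""
    rewards = []
    for svg in candidates:
        rendered = render_svg_rects(svg, size=target.shape[0])
        rewards.append(-float(np.mean((target - rendered) ** 2)))  # pixel L2
    return rewards

# Target: a 16x16 white square at the origin of a 32x32 canvas.
target = render_svg_rects('<svg><rect x="0" y="0" width="16" height="16"/></svg>')
good = '<svg><rect x="0" y="0" width="16" height="16"/></svg>'
bad = '<svg><rect x="8" y="8" width="16" height="16"/></svg>'
rewards = rendering_feedback([good, bad], target)
# The exact match earns the higher (zero) reward; the offset square is penalized.
```

This is the key difference from token-level supervision: the misplaced square above would still look "close" in code space, but the rendered comparison penalizes it directly.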

Composite Reward Function

Three complementary signals guide the learning process

Reconstruction

Pixel-level fidelity measured via L2 distance between the rendered SVG and the input image.

Rrecon = -||I - Î||₂²

Semantic Similarity

High-level perceptual alignment using DreamSim or CLIP embeddings.

Rsem = cos(φ(I), φ(Î))

Code Efficiency

Encourages compact SVG code by penalizing token length deviation from ground truth.

Reff = -|len(ŷ) - len(y)|

Combined Reward: R = λ₁·Rrecon + λ₂·Rsem + λ₃·Reff
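The three terms combine in a few lines of NumPy. Everything below is a sketch: the λ weights are placeholders rather than the paper's tuned values, and the embedding vectors stand in for DreamSim/CLIP features.

```python
import numpy as np

def recon_reward(target, rendered):
    # R_recon = -||I - I_hat||_2^2 (mean-normalized here for scale stability)
    return -float(np.mean((target - rendered) ** 2))

def sem_reward(emb_target, emb_rendered):
    # R_sem = cos(phi(I), phi(I_hat)); phi would be a DreamSim or CLIP encoder
    num = float(np.dot(emb_target, emb_rendered))
    den = float(np.linalg.norm(emb_target) * np.linalg.norm(emb_rendered)) + 1e-8
    return num / den

def eff_reward(gen_tokens, ref_tokens):
    # R_eff = -|len(y_hat) - len(y)|, penalizing deviation from the reference length
    return -abs(gen_tokens - ref_tokens)

def composite_reward(target, rendered, emb_t, emb_r, gen_tokens, ref_tokens,
                     lam=(1.0, 1.0, 0.001)):
    # R = lam1*R_recon + lam2*R_sem + lam3*R_eff; lam values are illustrative
    l1, l2, l3 = lam
    return (l1 * recon_reward(target, rendered)
            + l2 * sem_reward(emb_t, emb_r)
            + l3 * eff_reward(gen_tokens, ref_tokens))
```

A perfect reconstruction with matching embeddings and matching code length scores λ₂ · 1.0, since the reconstruction and efficiency penalties both vanish.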

Performance Comparison

RLRF sets a new state-of-the-art with consistent gains across all metrics.

Figure: Bar charts comparing baselines against RLRF (ours) on MSE ↓ (lower is better), SSIM ↑, DINO ↑ (higher is better), LPIPS ↓, code efficiency (tokens), and time to sample (seconds).

Detailed Results

Full comparison across all models and metrics.

| Model | MSE ↓ | SSIM ↑ | DINO ↑ | LPIPS ↓ | Code Eff. (tokens) | Time (s) |
|---|---|---|---|---|---|---|
| **RLRF results on SVG base models** | | | | | | |
| StarVector-1B-Base | 4.60 | 87.00 | 96.00 | 9.22 | -800 | 64 |
| +RLRF (ours) | 3.46 | 88.00 | 98.00 | 7.51 | -127 | 23 |
| Δ Improvement | -1.14 | +1.0 | +2.0 | -1.71 | +763 | -41 |
| Qwen2.5-VL-3B-Instruct | 23.31 | 62.28 | 69.26 | 35.30 | +1.5k | 24 |
| +SVG-SFT (ours) | 9.48 | 78.40 | 92.60 | 17.44 | -2.5k | 67 |
| +RLRF (ours) | 4.79 | 88.76 | 95.97 | 10.97 | +199 | 48 |
| Δ Improvement | -4.69 | +10.36 | +3.37 | -6.47 | +2.7k | -19 |
| Qwen2.5-VL-7B-Instruct | 23.10 | 61.40 | 78.00 | 33.80 | +765 | 37 |
| +SVG-SFT (ours) | 8.60 | 79.40 | 93.00 | 16.58 | -2.8k | 73 |
| +RLRF (ours) | 1.03 | 95.10 | 98.70 | 3.08 | -334 | 63 |
| Δ Improvement | -7.57 | +15.70 | +5.70 | -13.50 | +2.5k | -10 |
| **VLMs (open-source)** | | | | | | |
| Qwen2.5-VL-32B-Instruct | 23.62 | 55.46 | 82.38 | 35.83 | +1.3k | 58 |
| Qwen2.5-VL-72B-Instruct | 23.20 | 55.72 | 81.68 | 34.14 | +1.4k | 62 |
| Llama4-Scout (109B) | 20.98 | 58.58 | 83.72 | 33.37 | +1.4k | 57 |
| Llama4-Maverick (400B) | 20.67 | 59.26 | 85.61 | 31.75 | +1.3k | 61 |
| **VLMs (closed-source)** | | | | | | |
| Gemini-Flash-1.5 | 20.38 | 59.65 | 84.70 | 33.27 | +1.2k | 59 |
| Gemini-Flash-2.0 | 19.31 | 60.21 | 86.53 | 32.10 | +1.1k | 63 |
| Gemini-1.5-Pro | 20.19 | 60.75 | 84.17 | 33.02 | +1.2k | 58 |
| Claude 3.7 Sonnet | 17.73 | 69.33 | 79.80 | 28.42 | +1.4k | 62 |
| GPT-4o-1120 | 16.92 | 66.91 | 89.00 | 27.55 | +1.3k | 60 |
| **Image-processing methods** | | | | | | |
| Im2VEC | 18.10 | 76.50 | 69.20 | 29.10 | -4.3k | <1 |
| Potrace | 8.15 | 77.28 | 89.23 | 19.10 | -7.3k | 12 |
| DiffVG | 6.64 | 81.23 | 86.12 | 20.50 | -19.7k | 31 |
| PyAutoTrace | 4.71 | 87.44 | 95.68 | 10.71 | -99.7k | <1 |
| VTracer | 4.25 | 87.94 | 95.75 | 11.66 | -12.9k | <1 |
| SuperSVG | 3.05 | 83.30 | 82.70 | 13.50 | -65.6k | <1 |
| LIVE | 2.22 | 88.11 | 93.45 | 7.23 | -18.3k | 1,243 |

Qualitative Results

High-fidelity generation with complex details

Figure: Example of SVG generation. RLRF produces cleaner paths and more accurate gradients compared to baseline methods.
Figure: Image-to-SVG generation results.
Figure: Image-to-SVG emoji generation results.
Figure: Text-to-SVG thinking process results.
Figure: Text-to-SVG generation results.

Citation

@inproceedings{rodriguez2025rendering,
  title={Rendering-Aware Reinforcement Learning for Vector Graphics Generation},
  author={Rodriguez, Juan A and Zhang, Haotian and Puri, Abhay and Feizi, Aarash and Pramanik, Rishav and Wichmann, Pascal and Mondal, Arnab and Samsami, Mohammad Reza and Awal, Rabiul and Taslakian, Perouz and others},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS 2025)},
  year={2025}
}