StarVector: Generating Scalable Vector Graphics Code From Images And Text

ServiceNow Research¹, Mila - Quebec AI Institute², Canada CIFAR AI Chair³, ETS, Montreal, Canada⁴, UBC, Vancouver, Canada⁵, Apple MLR⁶

Abstract

Figure 1: StarVector is a foundation model for SVG generation. It uses a Vision-Language Modeling architecture to understand images and text instructions. StarVector excels at vectorizing a wide range of visual inputs, from general icons and logotypes to more intricate vectors such as technical diagrams.

StarVector is a pioneering Foundation Model for Scalable Vector Graphics (SVG) generation that integrates both visual and textual inputs. Designed to address the limitations of traditional image processing algorithms, StarVector excels in complex tasks such as image vectorization and text-conditional SVG generation. It introduces a vision-language modeling approach within a transformer decoder architecture, offering a significant leap in generating sophisticated SVG elements. Accompanied by the introduction of SVG-Stack, a large-scale dataset, and SVG-Bench, a new benchmark for SVG generation, StarVector sets a new performance standard in this field.

  1. Innovative SVG Generation: StarVector leverages a transformer decoder architecture to generate SVG code from both images and textual descriptions, enabling advanced image vectorization and text-conditional SVG creation.
  2. Complex Task Proficiency: Unlike traditional algorithms, StarVector handles intricate SVG elements such as text and detailed primitives, making it suitable for generating sophisticated diagrams and visuals.
  3. Comprehensive Dataset and Benchmark: SVG-Stack, a dataset with over 2 million SVG samples, and SVG-Bench, a dedicated benchmark, provide the foundation for training and evaluating StarVector's capabilities.
  4. State-of-the-Art Performance: StarVector surpasses previous methods in both text-to-SVG and image-to-SVG generation, closing the gap with traditional image processing techniques while being fully open-sourced for the research community.
The repository can be accessed here: https://github.com/starvector
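Since SVG is plain XML text, a model that emits SVG code produces output that can be checked for well-formedness before rendering. The snippet below is an illustrative sketch, not the project's API: `generated_svg` is a hypothetical model output, and `validate_svg` is a helper introduced here for demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of an image-to-SVG model: SVG is plain XML text,
# so generated code can be parsed and validated like any other markup.
generated_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">'
    '<circle cx="12" cy="12" r="10" fill="#4a90d9"/>'
    '<path d="M6 12 L11 17 L18 7" stroke="white" stroke-width="2" fill="none"/>'
    '</svg>'
)

def validate_svg(svg_code: str) -> bool:
    """Return True if the generated SVG parses as well-formed XML."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    # The root element must be <svg> (possibly namespace-qualified).
    return root.tag.endswith("svg")

print(validate_svg(generated_svg))  # True
```

Working directly in code space like this is what lets the model represent primitives (paths, text, gradients) that pixel-based vectorizers struggle with.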

StarVector

Figure 2a) StarVector Architecture: StarVector projects images into embeddings via an image encoder, then maps these embeddings to the LLM hidden space using an LLM Adapter, generating Visual Tokens. Text conditioning is achieved with the LLM's tokenizer and embedder. The model learns to map token sequences (visual or textual) to SVG code. The symbol ⊕ denotes mutually exclusive operations (image-to-SVG or text-to-SVG), while ‖ indicates sequence concatenation. Figure 2b) Vision Model and Adapter: The image encoder employs a Vision Transformer (ViT) to process image patches sequentially. The LLM Adapter non-linearly projects embeddings into visual tokens for LLM integration.
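The data flow in Figure 2 can be sketched numerically. Everything below is illustrative: the dimensions, the random stand-in weights, and the ReLU adapter are assumptions, and the ViT's transformer blocks are omitted entirely; only the patchify → encode → adapt → concatenate pipeline mirrors the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only; the real model sizes are not given here.
PATCH, D_VIT, D_LLM = 16, 64, 128

def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = image.shape
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

# Stand-ins for learned weights (the real encoder is a full ViT).
W_embed = rng.normal(size=(PATCH * PATCH * 3, D_VIT)) * 0.02
W1 = rng.normal(size=(D_VIT, D_LLM)) * 0.02
W2 = rng.normal(size=(D_LLM, D_LLM)) * 0.02

def encode_image(image: np.ndarray) -> np.ndarray:
    """ViT-style patch embedding (transformer blocks omitted for brevity)."""
    return patchify(image) @ W_embed  # (num_patches, D_VIT)

def adapter(embeddings: np.ndarray) -> np.ndarray:
    """Non-linear projection into the LLM hidden space -> visual tokens."""
    h = np.maximum(embeddings @ W1, 0.0)  # assumed ReLU hidden layer
    return h @ W2  # (num_patches, D_LLM)

image = rng.random((64, 64, 3))
visual_tokens = adapter(encode_image(image))  # (16, 128)

# The '||' operation in Figure 2: visual tokens are concatenated with
# embedded text tokens along the sequence axis before entering the LLM.
text_tokens = rng.normal(size=(10, D_LLM))
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (26, 128): 16 visual tokens + 10 text tokens
```

From the LLM's point of view, the visual tokens are just another prefix in its hidden space, which is why a single decoder can serve both the image-to-SVG and text-to-SVG tasks.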


🔜[Coming Soon!] More details about dataset creation, experimental setup, and evaluation!