Are CLIP features all you need for Universal Synthetic Image Origin Attribution?

University of Florence · City, University of London · Queen Mary, University of London
TWYN @ ECCV 2024
[Figure: Overview of the proposed approach]

We propose to use general-purpose features extracted from large pre-trained vision encoders to perform Open-Set Origin Attribution of synthetic images produced by various generative models, including Diffusion Models. Our method outperforms existing frequency-based forensic classifiers, operates in the low-data regime, and is more robust to input perturbations.

Abstract

The steady improvement of Diffusion Models for visual synthesis has given rise to many new and interesting use cases of synthetic images, but it has also raised concerns about their potential abuse, which poses significant societal threats. To address this, fake images need to be detected and attributed to their source model, and given the frequent release of new generators, realistic applications need to consider an Open-Set scenario where some models are unseen at training time. Existing forensic techniques are either limited to Closed-Set settings or to GAN-generated images, relying on fragile frequency-based "fingerprint" features. By contrast, we propose a simple yet effective framework that incorporates features from large pre-trained foundation models to perform Open-Set origin attribution of synthetic images produced by various generative models, including Diffusion Models. We show that our method leads to remarkable attribution performance, even in the low-data regime, exceeding the performance of existing methods and generalizing better to images obtained from a diverse set of architectures.

Method

We address the problem of synthetic image attribution in the most general setting possible:
Using real images (from a set $\mathcal{R}$), synthetic images generated by a set of known generative models $\mathcal{O}_\mathcal{K}=\{\mathcal{M}_\mathcal{K}^1,\ldots,\mathcal{M}_\mathcal{K}^{N_{\mathcal{O}_\mathcal{K}}}\}$, and synthetic images generated by a set of unknown generative models $\mathcal{O}_\mathcal{U}=\{\mathcal{M}_\mathcal{U}^1,\ldots,\mathcal{M}_\mathcal{U}^{N_{\mathcal{O}_\mathcal{U}}}\}$, we optimize a classifier to assign images to either a known model from $\mathcal{O}_\mathcal{K}$ or "reject" such an assignment, classifying the images as synthetic-and-unknown ($y_u$).
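To make the decision rule concrete, below is a minimal sketch of how such an open-set assignment can look at inference time: an image is attributed to the highest-scoring known generator unless its top score falls below a threshold, in which case it is rejected to $y_u$. The function name, the score-thresholding criterion, and the threshold value are illustrative assumptions of this sketch, not necessarily the exact rejection rule used in the paper.

import numpy as np

# Illustrative open-set assignment: class ids 0..N_K-1 index the known
# generators in O_K, and y_u is the synthetic-and-unknown class. The
# thresholding criterion and the value of tau are assumptions of this sketch.
def open_set_assign(class_scores: np.ndarray, tau: float, y_u: int) -> int:
    """Return the most likely known generator, or y_u if the top score is below tau."""
    k = int(np.argmax(class_scores))
    return k if class_scores[k] >= tau else y_u

# Example with three known generators (y_u = 3):
print(open_set_assign(np.array([0.10, 0.75, 0.15]), tau=0.5, y_u=3))  # -> 1
print(open_set_assign(np.array([0.35, 0.40, 0.25]), tau=0.5, y_u=3))  # -> 3 (rejected)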

Motivated by the generality and expressiveness of the representations of modern vision foundation models, we propose to employ the Vision Transformer-based encoder of a foundation model and extract intermediate features. Next, we perform the classification task (of assigning each image to a class in $\mathcal{O}_\mathcal{K}\cup\{y_u\}$) following either a Linear Probe or a $k$-NN approach.
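As a concrete reference, the sketch below builds the two heads on top of frozen CLIP image embeddings using the open_clip and scikit-learn libraries. It is a simplified illustration under stated assumptions: it uses the final image embedding rather than the intermediate ViT features described above, the image paths and labels are placeholders, and the choice of backbone ("ViT-L-14") and hyperparameters is not prescriptive. At test time, the per-class scores of either head can be combined with a rejection rule such as the one sketched in the previous section.

import numpy as np
import torch
import open_clip
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Frozen pre-trained CLIP encoder; no fine-tuning is performed.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
model.eval()

@torch.no_grad()
def extract_features(paths):
    """Encode a list of image paths into L2-normalized CLIP image embeddings."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

# Placeholders: supply images from the known generators in O_K together
# with their integer labels 0..N_K-1.
train_paths = ["stable_diffusion_sample.png", "stylegan2_sample.png"]
train_labels = np.array([0, 1])

X_train = extract_features(train_paths)

# Linear Probe head: a logistic-regression classifier on the frozen features.
linear_probe = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# k-NN head: nearest neighbours in the same feature space (k is a hyperparameter).
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine").fit(X_train, train_labels)

# At test time, per-class scores (e.g. linear_probe.predict_proba(X_test))
# feed an open-set rule such as open_set_assign above.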


Results

Open Set Attribution

[Figure: Open-Set attribution results]

Architecture ablation

[Figure: Architecture ablation results]

Robustness

[Figure: Robustness to input perturbations]

BibTeX

@misc{cioni2024clip,
        title={Are CLIP features all you need for Universal Synthetic Image Origin Attribution?},
        author={Dario Cioni and Christos Tzelepis and Lorenzo Seidenari and Ioannis Patras},
        year={2024},
        eprint={2408.09153},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }