Negation is a fundamental linguistic concept that humans use to convey what they do not want. Despite this, there has been minimal research specifically focused on negation in vision-language tasks. As a consequence, vision-language models (VLMs) may struggle to understand negation and therefore fail to provide accurate results. One barrier to achieving human-level intelligence is the lack of a standard benchmark against which research on negation can be evaluated.
This paper presents the first large-scale dataset for studying negation in the vision-language domain, Negative Instruction (NeIn). Our dataset comprises 530,694 quadruples in total, each consisting of a source image, an original caption, a negative sentence, and a target image, split into 495,694 queries for training and 35,000 queries for benchmarking across multiple vision-language tasks.
Specifically, we automatically generate NeIn based on a large, existing vision-language dataset, MS-COCO, via two steps: generation and filtering.
In the generation phase, we leverage two VLMs, BLIP and MagicBrush, to generate the target image and a negative sentence that is consistent with the content of the source image. In the subsequent filtering phase, we apply BLIP to remove erroneous samples.
Additionally, we introduce an evaluation protocol for assessing the negation understanding of image editing models.
Extensive experiments using our dataset across multiple VLMs for instruction-based image editing tasks demonstrate that even recent state-of-the-art VLMs struggle to understand negative queries.
The creation of NeIn involves two primary stages: generation, which employs BLIP and InstructPix2Pix to produce the target samples, and filtering, in which BLIP is used to remove erroneous samples.
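To make the generation stage concrete, the sketch below shows how a target image could be synthesized by prompting an off-the-shelf instruction-based editor to insert an object that is absent from the source image, while the paired negative sentence requests that object's absence. The prompt templates, the public `timbrooks/instruct-pix2pix` checkpoint from the diffusers library, and the sampling parameters are illustrative assumptions, not the exact configuration used to build NeIn.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Instruction-based editor used here for illustration; NeIn's actual
# generation models and settings may differ.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

def generate_sample(source_image: Image.Image, absent_object: str):
    """Create one (negative sentence, target image) pair from a source image
    and an object category that is absent from it."""
    # Hypothetical templates: the negative sentence holds for the source
    # image, while the edit instruction inserts the object so that the
    # generated image violates it.
    negative_sentence = f"The image must not contain any {absent_object}."
    edit_instruction = f"Add a {absent_object} to the image."
    target_image = pipe(
        edit_instruction,
        image=source_image,
        num_inference_steps=20,
        image_guidance_scale=1.5,
    ).images[0]
    return negative_sentence, target_image
```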
The main idea is as follows: given an image \(\mathcal{I}\) and a corresponding caption \(\mathcal{T}_{o}\) describing which objects are present in \(\mathcal{I}\), we first find a negative sentence, termed \(\mathcal{T}_{n}\), that is satisfied by the content of the source image \(\mathcal{I}\). Next, we create an image \(\mathcal{G}\) that matches \(\mathcal{T}_{o}\) but not \(\mathcal{T}_{n}\),
which means the object specified in \(\mathcal{T}_{n}\) is present in \(\mathcal{G}\). We then eliminate generated samples \(\mathcal{G}\) that significantly alter the content of the query image \(\mathcal{I}\) or make it difficult to identify the object categories \(\mathcal{S}\), producing the final samples \(\mathcal{F}\).
Thus, in the context of image editing, given image \(\mathcal{F}\), \(\mathcal{T}_{n}\) serves as a query to remove certain objects from \(\mathcal{F}\), with \(\mathcal{I}\) being one of the best possible results.
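A minimal sketch of the filtering stage follows, assuming BLIP's image-text matching head flags samples whose content drifts from the original caption \(\mathcal{T}_{o}\) and BLIP's VQA head verifies that the inserted object remains identifiable; the checkpoints, threshold, and acceptance criteria are assumptions for illustration rather than the exact filtering rules.

```python
import torch
from PIL import Image
from transformers import (BlipForImageTextRetrieval, BlipForQuestionAnswering,
                          BlipProcessor)

# Public BLIP checkpoints, used here for illustration.
itm_processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
itm_model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def keep_sample(target_image: Image.Image, original_caption: str,
                inserted_object: str, itm_threshold: float = 0.5) -> bool:
    """Return True if a generated sample should be kept as a final sample."""
    # Check 1: the generated image should still match the original caption
    # (no significant content drift from the source image).
    itm_inputs = itm_processor(target_image, original_caption, return_tensors="pt")
    with torch.no_grad():
        itm_logits = itm_model(**itm_inputs).itm_score  # shape (1, 2)
    match_prob = itm_logits.softmax(dim=-1)[0, 1].item()

    # Check 2: the inserted object should be identifiable in the generated image.
    question = f"Is there a {inserted_object} in the picture?"
    vqa_inputs = vqa_processor(target_image, question, return_tensors="pt")
    with torch.no_grad():
        answer_ids = vqa_model.generate(**vqa_inputs)
    answer = vqa_processor.decode(answer_ids[0], skip_special_tokens=True)

    return match_prob >= itm_threshold and answer.strip().lower() == "yes"
```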
We evaluate whether image editing methods successfully eliminate the object categories specified in the negative sentence, and whether they preserve the objects not mentioned in it.
The former is measured by the removal score, while the latter is assessed by the retention score. We leverage OWLv2 for object identification during evaluation.
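As an illustration of this protocol, the sketch below computes per-sample removal and retention with OWLv2 from the transformers library; the detection threshold and the way the two scores are defined here are simplified assumptions rather than the paper's exact formulas.

```python
import torch
from PIL import Image
from transformers import Owlv2ForObjectDetection, Owlv2Processor

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
detector = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

def detect(image: Image.Image, labels: list[str], threshold: float = 0.3) -> set[str]:
    """Return the subset of label names that OWLv2 detects in the image."""
    inputs = processor(text=[labels], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes)[0]
    return {labels[i] for i in results["labels"].tolist()}

def removal_and_retention(edited_image: Image.Image, removed_object: str,
                          other_objects: list[str]) -> tuple[float, float]:
    """Removal: the mentioned object is no longer detected. Retention: fraction
    of the unmentioned objects that are still detected (simplified definitions)."""
    found = detect(edited_image, [removed_object] + other_objects)
    removal = 0.0 if removed_object in found else 1.0
    retention = (sum(obj in found for obj in other_objects) / len(other_objects)
                 if other_objects else 1.0)
    return removal, retention
```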
Image editing models generally struggle to understand the meaning of negation: they fail to remove the mentioned objects, as demonstrated by the low Removal and AUC-Removal scores, and they also distort the content of the source image, as shown by the low Retention scores.
Instead of removing the mentioned objects, image editing models tend to exhibit the following failure modes: (1) retaining the mentioned object in the edited image; (2) increasing the quantity of the mentioned object in the generated image, and even moving that object to the center of the image; and (3) completely replacing the content of the query image with that object. These observations demonstrate the failure of VLMs to understand negation in the image editing task, which may also affect other vision-language tasks.