OAK: Open Ad-hoc Categorization with Contextualized Feature Learning

1University of Michigan, 2UC Berkeley, 3Bosch Center for AI
CVPR 2025

*Indicates Equal Contribution

TL;DR: We introduce open ad-hoc categorization (OAK), a new task that requires discovering novel classes across diverse contexts, and tackle it by learning contextualized features with CLIP.


Open ad-hoc categorization


We study open ad-hoc categorization (OAK), where categories such as things to sell at a garage sale are created to achieve a specific goal (selling unwanted items). Given the context garage sale and labeled exemplars such as shoes, we need to recognize all items in the scene that can be sold at the garage sale, including novel ones. Supervised models like CLIP focus on 1) closed-world generalization, recognizing other shoes; 2) novel semantic categories can be discovered by contextual expansion from shoes to hats; and unsupervised methods like GCD discover 3) novel visual clusters, identifying suitcases.

Abstract

Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories, such as things to sell at a garage sale, are created dynamically to achieve specific tasks. We study open ad-hoc categorization, where the goal is to infer novel concepts and categorize images based on a given context, a small set of labeled exemplars, and some unlabeled data.

We have two key insights: 1) recognizing ad-hoc categories relies on the same perceptual processes as common categories; 2) novel concepts can be discovered semantically by expanding contextual cues or visually by clustering similar patterns. We propose OAK, a simple model that introduces a single learnable context token into CLIP, trained with CLIP's objective of aligning visual and textual features and GCD's objective of clustering similar images.
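As a rough illustration of the model described above, the sketch below prepends a single learnable context token to a frozen stack of CLIP ViT blocks. The class and argument names (ContextualizedEncoder, frozen_blocks, ctx_token) are our own, and the actual OAK implementation may insert or read out the token differently; this is a minimal sketch, not the authors' code.

```python
import torch
import torch.nn as nn

class ContextualizedEncoder(nn.Module):
    """Frozen CLIP ViT blocks modulated by a single learnable context token (sketch)."""

    def __init__(self, frozen_blocks: nn.ModuleList, embed_dim: int):
        super().__init__()
        self.blocks = frozen_blocks                    # pretrained CLIP transformer blocks
        for p in self.blocks.parameters():
            p.requires_grad = False                    # keep CLIP's perception frozen
        # The only new parameter: one context token per categorization context.
        self.ctx_token = nn.Parameter(0.02 * torch.randn(1, 1, embed_dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) CLS + patch embeddings from the frozen CLIP stem.
        # (Assumes blocks accept batch-first tensors; adjust for sequence-first CLIP code.)
        ctx = self.ctx_token.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([ctx, tokens], dim=1)       # prepend the context token
        for block in self.blocks:
            tokens = block(tokens)                     # context token steers attention
        return tokens[:, 1]                            # one possible readout: the CLS token
```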

On the Stanford and Clevr-4 datasets, OAK consistently achieves the state of the art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK generates interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling accurate and flexible categorization.

Task: Open Set Discovery + Context Switching

OAK task formulation

Open ad-hoc categorization (OAK) learns diverse categorization rules, dynamically adapting to the varying user needs at hand. The same image should be recognized differently depending on context, such as drinking for Action and residential for Location. We emphasize the ability to switch between multiple contexts in OAK. Specifically, given 1) a context defined by its classes, 2) a few labeled images, and 3) a set of unlabeled images, OAK reasons holistically over labeled and unlabeled images, spanning both known and novel classes, to infer novel concepts and propagate labels across the entire dataset. In the figure, the class names of labeled images are shown in colored boxes and those of unlabeled images in parentheses, reflecting that for unlabeled data only the images, not the class names, are available. OAK introduces unique challenges beyond generalized category discovery (GCD), requiring adaptation to diverse ad-hoc categorization rules based on context.
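To make the inputs concrete, here is a minimal sketch of the task setup under our reading of the formulation above. The container and field names (OAKTask, known_classes, labeled, unlabeled), the image paths, and the partial class lists are illustrative placeholders, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OAKTask:
    """One ad-hoc categorization task (illustrative container, not the benchmark format)."""
    context: str                    # e.g. "Action", "Location", "Mood"
    known_classes: List[str]        # class names given for the labeled exemplars
    labeled: List[Tuple[str, str]]  # (image path, class name) pairs from known classes
    unlabeled: List[str]            # image paths spanning both known and novel classes

# The same image pool is categorized differently per context; only the context,
# its known class names, and the exemplar labels change (class lists are partial).
tasks = [
    OAKTask("Action", ["drinking", "phoning"], [("img_001.jpg", "drinking")], ["img_002.jpg"]),
    OAKTask("Location", ["residential"], [("img_001.jpg", "residential")], ["img_002.jpg"]),
]
```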

Method: Context Tokens to CLIP with Visual Clustering

OAK method

OAK learns contextualized features while preserving CLIP's perceptual foundations: it introduces context tokens that modulate the frozen ViT encoder, achieving context-aware attention. This contextualized feature learning follows two key principles: 1) top-down text guidance, which leverages semantic knowledge from known class names, and 2) bottom-up image clustering, which captures visual similarity to infer categorization rules. OAK aligns visual clusters with semantic cues by inferring pseudo-labels with the text encoder and refining the clusters accordingly. This unified approach outperforms the individual methods, CLIP and GCD, by effectively combining their strengths.
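A simplified sketch of how the two principles could be combined into a training signal is shown below, assuming L2-normalized CLIP features. The loss weighting, pseudo-labeling schedule, and clustering details of the actual method are not reproduced here, and the function name oak_losses is ours.

```python
import torch
import torch.nn.functional as F

def oak_losses(img_feats, img_feats_aug, text_feats, labels, labeled_mask, tau=0.07):
    """Simplified combination of top-down text guidance and bottom-up clustering.

    img_feats, img_feats_aug: (B, D) L2-normalized features of two augmented views.
    text_feats:               (C, D) L2-normalized text features of known class names.
    labels:                   (B,) class indices, valid only where labeled_mask is True.
    """
    logits = img_feats @ text_feats.t() / tau          # image-to-text similarities

    # 1) Top-down text guidance: supervised alignment on labeled images,
    #    pseudo-labels inferred with the text encoder on unlabeled ones.
    pseudo = logits.argmax(dim=1)
    targets = torch.where(labeled_mask, labels, pseudo)
    align_loss = F.cross_entropy(logits, targets)

    # 2) Bottom-up image clustering: pull two views of the same image together and
    #    push different images apart (a simplified GCD-style contrastive term).
    sim = img_feats @ img_feats_aug.t() / tau          # (B, B) view-to-view similarities
    contrastive_loss = F.cross_entropy(sim, torch.arange(sim.shape[0], device=sim.device))

    return align_loss + contrastive_loss
```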

Naming Novel Concepts: CLIP-ZS with LLM-Inferred Vocabulary

Prompt: I have a dataset of images from the following classes: [KNOWN_CLASSES]. What are the most possible classes that will also be included in this dataset? Give me [NUMBER_OF_NOVEL_CLASSES] class names, only return class names separated by commas. Include quotation marks for each one.
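For reference, here is a sketch of how the LLM-inferred vocabulary could feed CLIP zero-shot (CLIP-ZS) classification. query_llm and encode_text are placeholder callables (any chat API and any CLIP text-encoder wrapper), the "a photo of ..." prompt is a generic assumption, and only the LLM prompt template is taken from the paragraph above.

```python
import torch

PROMPT = (
    "I have a dataset of images from the following classes: {known}. "
    "What are the most possible classes that will also be included in this dataset? "
    "Give me {k} class names, only return class names separated by commas. "
    "Include quotation marks for each one."
)

def build_vocabulary(known_classes, k, query_llm):
    """Expand the vocabulary with LLM-proposed novel class names."""
    reply = query_llm(PROMPT.format(known=", ".join(known_classes), k=k))
    novel = [name.strip().strip('"') for name in reply.split(",") if name.strip()]
    return list(known_classes) + novel

@torch.no_grad()
def clip_zero_shot(image_features, class_names, encode_text):
    """CLIP-ZS over the expanded vocabulary; image_features are normalized (B, D)."""
    text_features = encode_text([f"a photo of {name}" for name in class_names])  # (C, D)
    return (image_features @ text_features.t()).argmax(dim=1)   # class index per image
```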

OAK Naming Results

The class names our model associates with novel visual clusters show that it identifies reasonable words for contexts familiar to CLIP (Action, Location, Color, Shape), but less accurate ones for less familiar contexts (Mood, Texture, Count). Full lists and visual examples are included in Supplementary Material C.

OAK Naming Examples

Visual examples with true and predicted class names show that OAK assigns reasonable labels based on visual cues, such as jumping for people who appear to be dancing.

Novel Concepts Arise from Contextualized Features

OAK Contextualized Features

t-SNE visualizations of visual features and nearest-neighbor examples from CLIP (row 1) and OAK (row 2) on Clevr-4 Shape, Color, Texture, and Count. CLIP features are only effective for the context that aligns with its training data, such as Shape, whereas OAK contextualizes the feature space so that both known and novel classes form meaningful groups.
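For completeness, a plot like the one above can be produced with a standard t-SNE projection of the (normalized) image features; the settings below (perplexity, cosine metric) are generic defaults, not necessarily those used in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_2d(features, perplexity=30, seed=0):
    """Project (N, D) image features to 2-D for a visualization like the one above."""
    feats = np.asarray(features, dtype=np.float32)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)        # cosine-style geometry
    return TSNE(n_components=2, perplexity=perplexity, metric="cosine",
                random_state=seed).fit_transform(feats)
```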

Contextualized Attention Highlights Correct Regions

OAK Contextualized Attention

Saliency maps on the Stanford dataset show that OAK focuses on the regions of an image relevant to each context, while GCD is often distracted by arbitrary regions. We select two samples predicted correctly by OAK across all contexts and visualize the saliency maps of CLIP, GCD, and OAK using the approach of Chefer et al., guided by the predicted class (for CLIP we use an empty string). Correct and incorrect predictions are colored accordingly. OAK focuses on human behavior, such as hand movements, for Action, covers the entire scene for Location, and highlights the human face for Mood, closely aligning with human intuition. GCD produces reasonable saliency maps for Action, as seen in the phoning example, but confuses fixing a bike with riding a bike by focusing on the bike rather than on human behavior. CLIP focuses on salient objects such as humans without adapting to each context.
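As a rough idea of how such maps can be computed, below is a heavily simplified, rollout-style approximation in the spirit of Chefer et al. (gradient-weighted attention accumulated across blocks); it is not their exact relevance-propagation rule, and the function name is ours.

```python
import torch

def gradient_attention_saliency(attn_maps, attn_grads):
    """Rollout-style relevance accumulated across blocks (simplified approximation).

    attn_maps, attn_grads: per-block (heads, N, N) attention weights and their
    gradients w.r.t. the predicted-class logit (empty-string prompt for CLIP).
    """
    num_tokens = attn_maps[0].shape[-1]
    device = attn_maps[0].device
    relevance = torch.eye(num_tokens, device=device)
    for attn, grad in zip(attn_maps, attn_grads):
        cam = (grad * attn).clamp(min=0).mean(dim=0)             # gradient-weighted attention
        cam = cam + torch.eye(num_tokens, device=device)         # keep residual connections
        cam = cam / cam.sum(dim=-1, keepdim=True)
        relevance = cam @ relevance
    return relevance[0, 1:]                                      # CLS-to-patch relevance
```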

Effectiveness and Versatility

OAK Main Results

OAK consistently outperforms open-vocabulary classification (row group 1) and visual clustering (row group 2) baselines, particularly on novel classes and in prediction consistency. The advantage is most pronounced in less familiar contexts like Mood. We report known, novel, and overall accuracies for each context, as well as Omni accuracy, with the best results in bold. CLIP-ZS + LLM vocab performs poorly on novel classes, revealing the limitation of relying on class names alone. GCD addresses this by clustering visual features, but OAK goes further by contextualizing them with CLIP's semantic knowledge, achieving a 50% gain over both CLIP and GCD on Mood. In terms of Omni accuracy, OAK reaches 70.3%, outperforming all baselines by 2-30% and demonstrating consistency across contexts.
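For clarity, the per-context metrics reported above can be computed as simple masked accuracies, as in the sketch below; any cluster-to-class assignment step used in the actual protocol, and the exact definition of Omni accuracy, are not reproduced here.

```python
import numpy as np

def context_accuracies(preds, labels, known_class_ids):
    """Known / novel / overall accuracy for one context (evaluation sketch only)."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    known_mask = np.isin(labels, list(known_class_ids))
    correct = preds == labels
    return {
        "known": float(correct[known_mask].mean()),
        "novel": float(correct[~known_mask].mean()),
        "all": float(correct.mean()),
    }
```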

BibTeX

BibTex Code Here