
PhD Researcher

Canberra, Australia

Frederic Zhang | 张真

I'm currently a PhD student at the Australian National University, working with Prof. Stephen Gould and Dr. Dylan Campbell. My primary research focus is the understanding of human–object interactions from a computer vision and machine learning perspective, including visual recognition, detection and, potentially, generation.

Prior to my PhD, as part of an international partnership program, I received a Bachelor of Science in automation from the Beijing Institute of Technology and a Bachelor of Engineering in mechatronics (research and development) with first-class honours from the Australian National University, where I had the pleasure of working with Prof. Yuchao Dai and Prof. Richard Hartley.

I'm passionate about programming, so much so that I wrote a deep learning library called Pocket: a lightweight library built on top of PyTorch, featuring boilerplate learning engines and utilities for visualisation and evaluation. I'm also a photographer; as an enthusiast of the great outdoors, I mostly shoot nature-oriented subjects. Find out more in my gallery!
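
For a flavour of what a boilerplate learning engine does, here is a minimal sketch in plain PyTorch. It is illustrative only: the class name and interface are assumptions made for this example, not Pocket's actual API, which is documented in the repository.

import torch
from torch import nn
from torch.utils.data import DataLoader

class LearningEngine:
    """Minimal training-loop wrapper; a sketch, not Pocket's real engine."""
    def __init__(self, net: nn.Module, criterion: nn.Module,
                 loader: DataLoader, lr: float = 1e-3):
        self.net = net
        self.criterion = criterion
        self.loader = loader
        self.optim = torch.optim.SGD(net.parameters(), lr=lr)

    def __call__(self, num_epochs: int) -> None:
        self.net.train()
        for epoch in range(num_epochs):
            running_loss = 0.0
            for inputs, targets in self.loader:
                self.optim.zero_grad()
                loss = self.criterion(self.net(inputs), targets)
                loss.backward()
                self.optim.step()
                running_loss += loss.item()
            # report the average loss over the epoch
            print(f"epoch {epoch}: loss {running_loss / len(self.loader):.4f}")

The appeal of such an engine is that the model, loss and data become the only moving parts; the loop, optimiser stepping and logging are written once.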



Research

Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer
Frederic Z. Zhang, Dylan Campbell and Stephen Gould
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[abstract] [paper] [preprint] [code] [video] [bibtex]
Abstract: Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human–object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary–Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialize, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.
@inproceedings{zhang2022upt,
  author = {Frederic Z. Zhang and Dylan Campbell and Stephen Gould},
  title = {Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2022},
  pages = {20104--20112}
}
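
For readers curious about the unary–pairwise idea, the sketch below shows one way the two streams could be wired up in PyTorch. It is not the paper's implementation: the off-the-shelf encoder layers, the concatenation-based pairwise fusion and all dimensions are assumptions made for illustration.

import torch
from torch import nn

class UnaryPairwiseSketch(nn.Module):
    """Conceptual sketch of two-stage HOI scoring with a unary stream over
    per-instance tokens and a pairwise stream over human-object pairs.
    Not the authors' architecture; shapes and fusion are assumptions."""
    def __init__(self, d: int = 256, num_classes: int = 117):
        super().__init__()
        self.unary = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.pairwise = nn.TransformerEncoderLayer(d_model=2 * d, nhead=8, batch_first=True)
        self.classifier = nn.Linear(2 * d, num_classes)

    def forward(self, tokens: torch.Tensor, pairs: torch.Tensor) -> torch.Tensor:
        # tokens: (1, n, d) appearance features of n detected instances
        # pairs: (m, 2) indices of (human, object) for m candidate pairs
        x = self.unary(tokens)                        # refine unary tokens
        h = torch.cat([x[0, pairs[:, 0]],             # human token per pair
                       x[0, pairs[:, 1]]], dim=-1)    # object token per pair
        h = self.pairwise(h.unsqueeze(0))             # refine pairwise tokens jointly
        return self.classifier(h[0])                  # (m, num_classes) logits

In the paper's observation, the unary stream mainly pushes up the scores of positive pairs while the pairwise stream suppresses negatives; the sketch only shows where those two streams sit relative to the detector outputs.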
Spatially Conditioned Graphs for Detecting Human–Object Interactions
Frederic Z. Zhang, Dylan Campbell and Stephen Gould
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[abstract] [paper] [preprint] [code] [video] [bibtex]
Abstract: We address the problem of detecting human–object interactions in images using graphical neural networks. Unlike conventional methods, where nodes send scaled but otherwise identical messages to each of their neighbours, we propose to condition messages between pairs of nodes on their spatial relationships, resulting in different messages going to neighbours of the same node. To this end, we explore various ways of applying spatial conditioning under a multi-branch structure. Through extensive experimentation we demonstrate the advantages of spatial conditioning for the computation of the adjacency structure, messages and the refined graph features. In particular, we empirically show that as the quality of the bounding boxes increases, their coarse appearance features contribute relatively less to the disambiguation of interactions compared to the spatial information. Our method achieves an mAP of 31.33% on HICO-DET and 54.2% on V-COCO, significantly outperforming state of the art on fine-tuned detections.
@inproceedings{zhang2021scg,
  author = {Frederic Z. Zhang and Dylan Campbell and Stephen Gould},
  title = {Spatially Conditioned Graphs for Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2021},
  pages = {13319--13327}
}
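
To illustrate what spatial conditioning means here, the sketch below implements one message-passing step in which the message sent from node j to node i depends on their pairwise spatial encoding, so the same sender contributes different messages to different receivers. Layer shapes and the aggregation scheme are assumptions, not the paper's exact formulation.

import torch
from torch import nn

class SpatiallyConditionedMessage(nn.Module):
    """One graph message-passing step where messages are conditioned on
    pairwise spatial features; a sketch, not the paper's architecture."""
    def __init__(self, d_node: int = 256, d_spatial: int = 36):
        super().__init__()
        self.msg = nn.Linear(d_node + d_spatial, d_node)
        self.update = nn.Linear(2 * d_node, d_node)

    def forward(self, x: torch.Tensor, s: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n, d_node) node features; s: (n, n, d_spatial) spatial encodings
        # of each ordered pair (i, j); adj: (n, n) adjacency weights in [0, 1]
        n = x.size(0)
        src = x.unsqueeze(0).expand(n, n, -1)               # sender features x_j
        m = torch.relu(self.msg(torch.cat([src, s], -1)))   # message conditioned on s_ij
        agg = (adj.unsqueeze(-1) * m).sum(dim=1)            # weighted sum over senders j
        return torch.relu(self.update(torch.cat([x, agg], -1)))

Because the spatial encoding differs per pair, node j no longer sends scaled copies of one message; each receiver gets a message shaped by its spatial relationship to j, which is the departure from conventional message passing described in the abstract.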