
Research Fellow

Adelaide, Australia

Frederic Zhang | 张真

I'm currently a research fellow at the Centre for Augmented Reasoning, Australian Institute for Machine Learning (AIML), working with Dr. Ehsan Abbasnejad.

I did my PhD at the Australian National University, under the supervision of Prof. Stephen Gould and Dr. Dylan Campbell. My research focused on the visual and spatial understanding of human–object interactions, including their visual recognition and localisation.

Prior to my PhD, as part of an international partnership program, I received a Bachelor of Science in automation from the Beijing Institute of Technology and a Bachelor of Engineering in mechatronics (research and development) with first-class honours from the Australian National University, where I had the pleasure of working with Prof. Yuchao Dai and Prof. Richard Hartley.

I'm passionate about programming, so much so that I wrote a deep learning library called Pocket. It is a lightweight library built on top of PyTorch, featuring various boilerplate learning engines and utilities for visualisation and evaluation. I'm also a photographer and an enthusiast of the great outdoors, so my subjects are mostly nature-oriented. Find out more in my gallery!
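For a flavour of what such a boilerplate learning engine abstracts away, here is a minimal sketch in plain PyTorch. The class and method names are illustrative only and are not the actual Pocket API.

# A minimal sketch of a boilerplate learning engine in plain PyTorch.
# Names such as SimpleEngine are illustrative, not the Pocket API.
import torch
from torch import nn
from torch.utils.data import DataLoader

class SimpleEngine:
    """Wraps the repetitive training loop: forward, loss, backward, step."""
    def __init__(self, net: nn.Module, criterion: nn.Module, loader: DataLoader, lr: float = 1e-3):
        self.net = net
        self.criterion = criterion
        self.loader = loader
        self.optim = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)

    def __call__(self, num_epochs: int) -> None:
        self.net.train()
        for epoch in range(num_epochs):
            running = 0.0
            for inputs, targets in self.loader:
                self.optim.zero_grad()
                loss = self.criterion(self.net(inputs), targets)
                loss.backward()
                self.optim.step()
                running += loss.item()
            print(f"epoch {epoch}: mean loss {running / len(self.loader):.4f}")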



Research

Temporally Grounding Instructional Diagrams in Unconstrained Videos Jiahao Zhang, Frederic Z. Zhang, Cristian Rodriguez-Opazo, Yizhak Ben-Shabat, Anoop Cherian and Stephen Gould Winter Conference on Applications of Computer Vision (WACV), 2025. [abstract] [preprint] [bibtex]
We study the challenging problem of simultaneously localising a sequence of queries, given in the form of instructional diagrams, in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods ground one query at a time, ignoring the inherent structures among queries, such as their general mutual exclusiveness and temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, harming accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing up the visual content features of the step diagrams with a fixed number of learnable positional embeddings. Our insight is that, under self-attention, composite queries carrying different content features suppress one another to reduce timespan overlaps in the predictions, while cross-attention corrects temporal misalignment under joint content and position guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while simultaneously grounding multiple queries.
@misc{zhang2025compq,
  author = {Zhang, Jiahao and Zhang, Frederic Z. and Rodriguez-Opazo, Cristian and Ben-Shabat, Yizhak and Cherian, Anoop and Gould, Stephen},
  title = {Temporally Grounding Instructional Diagrams in Unconstrained Videos},
  eprint={2407.12066},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.12066},
  year = {2024},
}
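To make the composite-query construction above more concrete, here is a small PyTorch sketch that exhaustively pairs step-diagram content features with a fixed number of learnable positional embeddings. The additive combination and the tensor shapes are assumptions for illustration, not the paper's exact implementation.

# Sketch: composite queries from exhaustive pairing of content features and
# learnable positional embeddings (shapes and the additive combination are
# assumptions for illustration).
import torch
from torch import nn

class CompositeQueries(nn.Module):
    def __init__(self, num_positions: int, dim: int):
        super().__init__()
        # A fixed number of learnable positional embeddings.
        self.pos = nn.Parameter(torch.randn(num_positions, dim) * 0.02)

    def forward(self, content: torch.Tensor) -> torch.Tensor:
        # content: (num_steps, dim) visual features of the step diagrams.
        n, d = content.shape
        m = self.pos.shape[0]
        # Exhaustive pairing: every step diagram meets every position,
        # giving n * m composite queries of dimension d.
        composite = content[:, None, :] + self.pos[None, :, :]   # (n, m, d)
        return composite.reshape(n * m, d)

queries = CompositeQueries(num_positions=8, dim=256)(torch.randn(5, 256))
print(queries.shape)  # torch.Size([40, 256])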
Knowledge Composition using Task Vectors with Learned Anisotropic Scaling Frederic Z. Zhang*, Paul Albert*, Cristian Rodriguez-Opazo, Anton van den Hengel and Ehsan Abbasnejad Neural Information Processing Systems (NeurIPS), 2024. [abstract] [preprint] [code] [bibtex]
Pre-trained models produce strong generic representations that can be adapted via fine-tuning. The learned weight difference relative to the pre-trained model, known as a task vector, characterises the direction and stride of fine-tuning. The significance of task vectors is such that simple arithmetic operations on them can be used to combine diverse representations from different domains. This paper builds on these properties of task vectors and aims to answer (1) whether components of task vectors, particularly parameter blocks, exhibit similar characteristics, and (2) how such blocks can be used to enhance knowledge composition and transfer. To this end, we introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsic dimensionality of pre-trained models, with only a few coefficients being the learnable parameters. Furthermore, composition of parameter blocks leverages the already learned representations, thereby reducing the dependency on large amounts of data. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives. In particular, we show that (1) learned anisotropic scaling allows task vectors to be more disentangled, causing less interference in composition; (2) task vector composition excels with scarce or no labelled data and is less prone to domain shift, thus leading to better generalisability; (3) mixing the most informative parameter blocks across different task vectors prior to training can reduce the memory footprint and improve the flexibility of knowledge transfer. Moreover, we show the potential of aTLAS as a parameter-efficient fine-tuning (PEFT) method, particularly with less data, and demonstrate its scalability.
@misc{zhang2024atlas,
  author = {Zhang, Frederic Z. and Albert, Paul and Rodriguez-Opazo, Cristian and van den Hengel, Anton and Abbasnejad, Ehsan},
  title = {Knowledge Composition using Task Vectors with Learned Anisotropic Scaling},
  eprint={2407.02880},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.02880},
  year = {2024},
}
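The core of the linear-combination idea above is simple enough to sketch: treat each parameter block of each task vector as a unit and learn one scalar coefficient per block. The snippet below is a simplified illustration in PyTorch, not the released aTLAS code; the block granularity, initialisation and training objective are all assumptions.

# Sketch: anisotropic scaling of task vectors at the parameter-block level.
# Each (task, block) pair gets one learnable coefficient; this is a
# simplified illustration, not the released aTLAS implementation.
import torch
from torch import nn

def task_vector(finetuned: nn.Module, pretrained: nn.Module) -> dict:
    """Per-block weight difference between a fine-tuned and the pre-trained model."""
    theta0 = dict(pretrained.named_parameters())
    return {k: p.detach() - theta0[k].detach() for k, p in finetuned.named_parameters()}

class ComposedModel(nn.Module):
    def __init__(self, pretrained: nn.Module, task_vectors: list[dict]):
        super().__init__()
        self.base = pretrained
        self.task_vectors = task_vectors
        num_blocks = len(dict(pretrained.named_parameters()))
        # Only these coefficients are trained: one per task vector per block.
        self.coef = nn.Parameter(torch.zeros(len(task_vectors), num_blocks))

    def composed_parameters(self) -> dict:
        names = [k for k, _ in self.base.named_parameters()]
        params = {k: p.detach().clone() for k, p in self.base.named_parameters()}
        for t, tv in enumerate(self.task_vectors):
            for b, name in enumerate(names):
                params[name] = params[name] + self.coef[t, b] * tv[name]
        return params

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Functional call with the composed weights keeps gradients w.r.t. self.coef.
        return torch.func.functional_call(self.base, self.composed_parameters(), (x,))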
Visual and Spatial Understanding of Human–Object Interactions Frederic Z. Zhang PhD Thesis, 2024. [abstract] [thesis] [bibtex]
In the context of computer vision, human–object interactions (HOIs) are often characterised as a (subject, predicate, object) triplet, with humans being the subject. As such, to understand HOIs is to localise pairs of interactive instances and recognise the predicates that signify their interactions.

This interpretation naturally leads to a graph structure, where humans and objects are represented as nodes and their interactions as edges. We investigate this idea by employing off-the-shelf object detectors to obtain a set of human and object detections, and building a bipartite graph with human nodes on one side of the bipartition and object nodes on the other. Unlike conventional methods, wherein nodes send scaled but otherwise identical messages to each of their neighbours, we propose to condition the message passing between pairs of nodes on their spatial relationships. With spatial conditioning, the proposed method is able to suppress numerous negative pairs with incompatible spatial relationships, and particularly excels at inferring the correspondence between interactive humans and objects when there are many pairs in the same scene. In addition, we observe that the learned adjacency matrices spontaneously exhibit structures indicative of interactive pairs without explicit supervision.

Such emergent properties prompt us to investigate the graph formulation further. Apart from the unary representations (human and object instances), we incorporate human–object pairs into the graph structure by encoding each pair as a node. Utilising the popular transformer architecture, we propose the unary–pairwise transformer, wherein self-attention blocks serve as fully-connected graphs. We observe that when separate self-attention blocks are employed for the unary and pairwise representations, they specialise in complementary ways. Specifically, the unary layer preferentially increases the scores of positive human–object pairs, while the pairwise layer decreases the scores of negative pairs.

Despite the success in the graphical modelling of HOIs, their complexity and ambiguity still pose a challenge. Through extensive visualisations, we observe that the commonly used object features are often extracted from object extremities, thus lacking the fine-grained context needed to disambiguate certain interactions. In particular, we identify two types of visual context lacking in current feature formulations and propose to enrich the representations with spatially-guided cross-attention, where carefully designed box-pair positional embeddings serve as spatial biases. With rich visualisations, we demonstrate how the spatial guidance impacts the attention mechanism and supplements the model with key visual cues for the recognition of HOIs.
@phdthesis{zhang2024thesis,
  title = {Visual and Spatial Understanding of Human–Object Interactions},
  author = {Frederic Z. Zhang},
  year = {2024},
  month = {Mar},
  address = {Canberra, Australia},
  note = {Available at \url{https://openresearch-repository.anu.edu.au/items/2f2331b2-77d4-422a-8acd-093a8d894895}},
  school = {College of Engineering, Computing and Cybernetics, The Australian National University},
  type = {PhD thesis}
}
Exploring Predicate Visual Context in Detecting Human–Object Interactions Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong and Stephen Gould International Conference on Computer Vision (ICCV), 2023. [abstract] [paper] [preprint] [code] [video] [bibtex]
Recently, the DETR framework has emerged as the dominant approach for human–object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.
@inproceedings{zhang2023pvic,
  author = {Zhang, Frederic Z. and Yuan, Yuhui and Campbell, Dylan and Zhong, Zhuoyao and Gould, Stephen},
  title = {Exploring Predicate Visual Context in Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2023},
  pages = {10411-10421},
}
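As a rough illustration of how box-pair positional embeddings can act as spatial guidance in cross-attention, consider the sketch below. The bias construction, shapes and module names are assumptions for illustration, not the PViC implementation.

# Sketch: cross-attention from human-object pair queries to image features,
# with an additive spatial bias derived from box-pair positional embeddings.
# The bias construction is an assumption for illustration, not the PViC code.
import torch
from torch import nn
import torch.nn.functional as F

class SpatiallyGuidedCrossAttention(nn.Module):
    def __init__(self, dim: int, pe_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Maps a (pair embedding, feature-location embedding) pair to a scalar bias.
        self.bias_head = nn.Linear(2 * pe_dim, 1)

    def forward(self, pair_queries, image_feats, pair_pe, feat_pe):
        # pair_queries: (P, dim), image_feats: (N, dim)
        # pair_pe: (P, pe_dim) box-pair positional embeddings
        # feat_pe: (N, pe_dim) positional embeddings of image feature locations
        q = self.q_proj(pair_queries)             # (P, dim)
        k = self.k_proj(image_feats)              # (N, dim)
        v = self.v_proj(image_feats)              # (N, dim)
        # One scalar bias per (pair, location), added to the attention logits.
        joint = torch.cat([
            pair_pe[:, None, :].expand(-1, feat_pe.shape[0], -1),
            feat_pe[None, :, :].expand(pair_pe.shape[0], -1, -1),
        ], dim=-1)                                # (P, N, 2*pe_dim)
        bias = self.bias_head(joint).squeeze(-1)  # (P, N)
        attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5 + bias, dim=-1)
        return attn @ v                           # (P, dim)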
Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer Frederic Z. Zhang, Dylan Campbell and Stephen Gould Computer Vision and Pattern Recognition (CVPR), 2022. [abstract] [paper] [preprint] [code] [video] [bibtex]
Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human–object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary–Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialize, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.
@inproceedings{zhang2022upt,
  author = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
  title = {Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2022},
  pages = {20104-20112}
}
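A stripped-down sketch of the unary/pairwise split described above: separate self-attention stacks over instance tokens and over pair tokens, followed by an interaction classifier. The dimensions, pair encoding and scoring head are assumptions, not the released UPT code.

# Sketch: separate self-attention over unary (instance) tokens and pairwise
# (human-object pair) tokens; shapes and the scoring head are assumptions.
import torch
from torch import nn

class UnaryPairwiseSketch(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # dim must be divisible by the number of attention heads.
        self.unary_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.pairwise_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.pair_embed = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, unary_tokens, pair_index):
        # unary_tokens: (1, N, dim) detected human/object features
        # pair_index: (P, 2) indices of candidate human-object pairs
        refined = self.unary_layer(unary_tokens)                  # (1, N, dim)
        h = refined[0, pair_index[:, 0]]                          # (P, dim)
        o = refined[0, pair_index[:, 1]]                          # (P, dim)
        pair_tokens = self.pair_embed(torch.cat([h, o], dim=-1))  # (P, dim)
        pair_tokens = self.pairwise_layer(pair_tokens[None])[0]   # (P, dim)
        return self.classifier(pair_tokens)                       # interaction logits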
Spatially Conditioned Graphs for Detecting Human–Object Interactions Frederic Z. Zhang, Dylan Campbell and Stephen Gould International Conference on Computer Vision (ICCV), 2021. [abstract] [paper] [preprint] [code] [video] [bibtex]
We address the problem of detecting human–object interactions in images using graphical neural networks. Unlike conventional methods, where nodes send scaled but otherwise identical messages to each of their neighbours, we propose to condition messages between pairs of nodes on their spatial relationships, resulting in different messages going to neighbours of the same node. To this end, we explore various ways of applying spatial conditioning under a multi-branch structure. Through extensive experimentation we demonstrate the advantages of spatial conditioning for the computation of the adjacency structure, messages and the refined graph features. In particular, we empirically show that as the quality of the bounding boxes increases, their coarse appearance features contribute relatively less to the disambiguation of interactions compared to the spatial information. Our method achieves an mAP of 31.33% on HICO-DET and 54.2% on V-COCO, significantly outperforming the state of the art on fine-tuned detections.
@inproceedings{zhang2021scg,
  author = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
  title = {Spatially Conditioned Graphs for Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2021},
  pages = {13319-13327}
}
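A minimal sketch of spatially conditioned message passing on a bipartite human-object graph: appearance messages are modulated by features computed from each box pair, so neighbours of the same node receive different messages. The gating mechanism and feature choices below are assumptions for illustration, not the released SCG code.

# Sketch: bipartite message passing where messages from objects to humans are
# conditioned on pairwise spatial (box-pair) features; a simplified illustration.
import torch
from torch import nn

class SpatiallyConditionedMessages(nn.Module):
    def __init__(self, dim: int, spatial_dim: int):
        super().__init__()
        self.appearance = nn.Linear(dim, dim)
        # Spatial conditioning: a gate computed from box-pair features.
        self.spatial_gate = nn.Sequential(nn.Linear(spatial_dim, dim), nn.Sigmoid())

    def forward(self, human_feats, object_feats, spatial_feats):
        # human_feats: (H, dim), object_feats: (O, dim)
        # spatial_feats: (H, O, spatial_dim) encodings of each human-object box pair
        msg = self.appearance(object_feats)[None, :, :]   # (1, O, dim)
        gate = self.spatial_gate(spatial_feats)           # (H, O, dim)
        # Each human receives a differently modulated message from each object.
        conditioned = gate * msg                          # (H, O, dim)
        return human_feats + conditioned.mean(dim=1)      # updated human nodes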