
PhD Researcher

Canberra, Australia

Frederic Zhang | 张真

I'm currently a PhD student at the Australian National University, working with Prof. Stephen Gould and Dr. Dylan Campbell. My primary research focus is the understanding of human–object interactions from a computer vision and machine learning perspective, including visual recognition, detection and, potentially, generation.

Prior to my PhD, as part of an international partnership program, I received a Bachelor of Science in automation from the Beijing Institute of Technology and a Bachelor of Engineering in mechatronics (research and development) with first-class honours from the Australian National University, where I had the pleasure of working with Prof. Yuchao Dai and Prof. Richard Hartley.

I'm passionate about programming, so much so that I wrote a deep learning library called Pocket: a lightweight library built on top of PyTorch, featuring boilerplate learning engines and utilities for visualisation and evaluation. I'm also a photographer; as an enthusiast of the great outdoors, I mostly shoot nature-oriented subjects. Find out more in my gallery!
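
For a flavour of what a boilerplate learning engine does, here is a minimal sketch in plain PyTorch. It is illustrative only: the class name and interface are assumptions made for this example, not Pocket's actual API, which is documented in the repository.

import torch
from torch import nn
from torch.utils.data import DataLoader

class LearningEngine:
    """Minimal training-loop wrapper; a sketch, not Pocket's real engine."""
    def __init__(self, net: nn.Module, criterion: nn.Module,
                 loader: DataLoader, lr: float = 1e-3):
        self.net = net
        self.criterion = criterion
        self.loader = loader
        self.optim = torch.optim.SGD(net.parameters(), lr=lr)

    def __call__(self, num_epochs: int) -> None:
        self.net.train()
        for epoch in range(num_epochs):
            running_loss = 0.0
            for inputs, targets in self.loader:
                self.optim.zero_grad()
                loss = self.criterion(self.net(inputs), targets)
                loss.backward()
                self.optim.step()
                running_loss += loss.item()
            # report the average loss over the epoch
            print(f"epoch {epoch}: loss {running_loss / len(self.loader):.4f}")

The appeal of such an engine is that the model, loss and data become the only moving parts; the loop, optimiser stepping and logging are written once.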



Research

Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer
Frederic Z. Zhang, Dylan Campbell and Stephen Gould
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[abstract] [paper] [preprint] [code] [video] [bibtex]
Abstract: Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human–object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary–Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialize, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.
@inproceedings{zhang2022upt,
  author = {Frederic Z. Zhang and Dylan Campbell and Stephen Gould},
  title = {Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2022},
  pages = {20104--20112}
}
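
For readers curious about the unary–pairwise idea, the sketch below shows one way the two streams could be wired up in PyTorch. It is not the paper's implementation: the off-the-shelf encoder layers, the concatenation-based pairwise fusion and all dimensions are assumptions made for illustration.

import torch
from torch import nn

class UnaryPairwiseSketch(nn.Module):
    """Conceptual sketch of two-stage HOI scoring with a unary stream over
    per-instance tokens and a pairwise stream over human-object pairs.
    Not the authors' architecture; shapes and fusion are assumptions."""
    def __init__(self, d: int = 256, num_classes: int = 117):
        super().__init__()
        self.unary = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.pairwise = nn.TransformerEncoderLayer(d_model=2 * d, nhead=8, batch_first=True)
        self.classifier = nn.Linear(2 * d, num_classes)

    def forward(self, tokens: torch.Tensor, pairs: torch.Tensor) -> torch.Tensor:
        # tokens: (1, n, d) appearance features of n detected instances
        # pairs: (m, 2) indices of (human, object) for m candidate pairs
        x = self.unary(tokens)                        # refine unary tokens
        h = torch.cat([x[0, pairs[:, 0]],             # human token per pair
                       x[0, pairs[:, 1]]], dim=-1)    # object token per pair
        h = self.pairwise(h.unsqueeze(0))             # refine pairwise tokens jointly
        return self.classifier(h[0])                  # (m, num_classes) logits

In the paper's observation, the unary stream mainly pushes up the scores of positive pairs while the pairwise stream suppresses negatives; the sketch only shows where those two streams sit relative to the detector outputs.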
Spatially Conditioned Graphs for Detecting Human–Object Interactions
Frederic Z. Zhang, Dylan Campbell and Stephen Gould
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[abstract] [paper] [preprint] [code] [video] [bibtex]
Abstract: We address the problem of detecting human–object interactions in images using graphical neural networks. Unlike conventional methods, where nodes send scaled but otherwise identical messages to each of their neighbours, we propose to condition messages between pairs of nodes on their spatial relationships, resulting in different messages going to neighbours of the same node. To this end, we explore various ways of applying spatial conditioning under a multi-branch structure. Through extensive experimentation we demonstrate the advantages of spatial conditioning for the computation of the adjacency structure, messages and the refined graph features. In particular, we empirically show that as the quality of the bounding boxes increases, their coarse appearance features contribute relatively less to the disambiguation of interactions compared to the spatial information. Our method achieves an mAP of 31.33% on HICO-DET and 54.2% on V-COCO, significantly outperforming state of the art on fine-tuned detections.
@inproceedings{zhang2021scg,
  author = {Frederic Z. Zhang and Dylan Campbell and Stephen Gould},
  title = {Spatially Conditioned Graphs for Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2021},
  pages = {13319--13327}
}
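
To illustrate what spatial conditioning means here, the sketch below implements one message-passing step in which the message sent from node j to node i depends on their pairwise spatial encoding, so the same sender contributes different messages to different receivers. Layer shapes and the aggregation scheme are assumptions, not the paper's exact formulation.

import torch
from torch import nn

class SpatiallyConditionedMessage(nn.Module):
    """One graph message-passing step where messages are conditioned on
    pairwise spatial features; a sketch, not the paper's architecture."""
    def __init__(self, d_node: int = 256, d_spatial: int = 36):
        super().__init__()
        self.msg = nn.Linear(d_node + d_spatial, d_node)
        self.update = nn.Linear(2 * d_node, d_node)

    def forward(self, x: torch.Tensor, s: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n, d_node) node features; s: (n, n, d_spatial) spatial encodings
        # of each ordered pair (i, j); adj: (n, n) adjacency weights in [0, 1]
        n = x.size(0)
        src = x.unsqueeze(0).expand(n, n, -1)               # sender features x_j
        m = torch.relu(self.msg(torch.cat([src, s], -1)))   # message conditioned on s_ij
        agg = (adj.unsqueeze(-1) * m).sum(dim=1)            # weighted sum over senders j
        return torch.relu(self.update(torch.cat([x, agg], -1)))

Because the spatial encoding differs per pair, node j no longer sends scaled copies of one message; each receiver gets a message shaped by its spatial relationship to j, which is the departure from conventional message passing described in the abstract.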