When AI “thinks” like us
Even though so-called Vision Foundation Models, computer models for automated image recognition, have made enormous progress in recent years, they still differ significantly from human visual understanding. For example, they generally do not capture multi-level semantic hierarchies and have difficulty with relationships between semantically related but visually dissimilar objects. In a joint project with Google DeepMind, scientists from TU Berlin, the Max Planck Institute for Human Cognitive and Brain Sciences, and the MPI for Human Development have developed a new approach called ‘AligNet,’ which integrates human semantic structures into neural image-processing models for the first time, thereby bringing the visual understanding of computer models closer to that of humans. The results have now been published in the renowned journal Nature under the title ‘Aligning Machine and Human Visual Representations across Abstraction Levels’.

The eyes of AI have become increasingly sophisticated in recent years, but they still cannot match the visual understanding of humans.
Bernstein member involved: Klaus-Robert Müller
The scientists are investigating how visual representations in modern deep neural networks are structured compared to human perception and conceptual knowledge, and how these can be better aligned. Although artificial intelligence (AI) today achieves impressive performance in image processing, machines often generalize less robustly than humans, for instance, when faced with new types of images or unfamiliar relations.
“The central question of our study is: what do modern machine learning systems lack to show human-like behavior, not only in terms of performance, but also in how they organize and form representations?” explains lead author Lukas Muttenthaler, a scientist at MPI CBS and the BIFOLD institute of TU Berlin, and a former employee of Google DeepMind.
The researchers show that human knowledge is typically organized hierarchically, from fine-grained distinctions (e.g., “pet dog”) to coarse ones (e.g., “animal”). Machine learning systems, on the other hand, often fail to capture these different levels of abstraction and semantics. To align the models with human conceptual knowledge, the scientists first trained a teacher model to imitate human similarity judgments. This teacher model thus learned a representational structure that can be considered “human-like.” The learned structure was then used to improve already pretrained, high-performing Vision Foundation Models, the so-called student models, through a process called soft alignment. This fine-tuning requires several orders of magnitude less computational cost than retraining the models from scratch.
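To make the ‘soft alignment’ step concrete, the following minimal Python sketch (using PyTorch) shows one way a student model’s pairwise similarity structure could be nudged toward that of a human-aligned teacher. The function and parameter names are illustrative assumptions, not the authors’ published implementation.

    # Minimal sketch of similarity-structure distillation ("soft alignment").
    # Assumption: both models map an image batch to embedding vectors; the
    # teacher has already been trained to imitate human similarity judgments.
    import torch
    import torch.nn.functional as F

    def soft_alignment_loss(student_emb, teacher_emb, temperature=0.1):
        # student_emb, teacher_emb: (batch, dim) embeddings of the same images.
        s = F.normalize(student_emb, dim=-1)
        t = F.normalize(teacher_emb, dim=-1)
        sim_student = s @ s.T  # pairwise cosine similarities within the batch
        sim_teacher = t @ t.T
        # Interpret each row as a distribution over "neighbours" and softly
        # match the student's distribution to the teacher's (KL divergence).
        log_p_student = F.log_softmax(sim_student / temperature, dim=-1)
        p_teacher = F.softmax(sim_teacher / temperature, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    # During fine-tuning, a loss of this kind would be added for each image
    # batch, with the teacher's embeddings detached so only the student learns:
    # loss = soft_alignment_loss(student(images), teacher(images).detach())

Because only a distillation-style fine-tuning objective of this kind is added on top of an already pretrained student, the computational cost stays far below that of training a Vision Foundation Model from scratch.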
A crucial step toward interpretable, cognitively grounded AI
The student models were fine-tuned using ‘AligNet’, a large image dataset generated synthetically with the teacher model and labeled with similarity judgments that correspond to human perception. To evaluate the fine-tuned student models, the researchers used a specially collected dataset known as the ‘Levels’ dataset.
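As an illustration of how a teacher model can stand in for human raters when labelling such a synthetic dataset, the hypothetical sketch below simulates a single ‘odd-one-out’ similarity judgment for a triplet of images; the exact task format and all names are assumptions, not the published AligNet pipeline.

    # Hypothetical example: let the human-aligned teacher pick the image in a
    # triplet that fits least well with the other two (an "odd-one-out"
    # judgment), which can then serve as a synthetic, human-like label.
    import torch
    import torch.nn.functional as F

    def simulated_odd_one_out(teacher_emb_triplet):
        # teacher_emb_triplet: (3, dim) teacher embeddings of three images.
        e = F.normalize(teacher_emb_triplet, dim=-1)
        sim = e @ e.T              # 3x3 cosine-similarity matrix
        sim.fill_diagonal_(0.0)    # ignore self-similarity
        # The odd one out is the image least similar to the other two.
        return int(torch.argmin(sim.sum(dim=-1)))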
“For the ‘Levels’ dataset, around 500 participants performed an image-similarity task that covered multiple levels of semantic abstraction, from very coarse categorizations to fine-grained distinctions and category boundaries. For each judgment, we recorded both the full response distributions and reaction times to capture potential links with human decision uncertainty. The resulting dataset represents a newly established benchmark for human-machine alignment, which we have open-sourced,” reports Frieda Born, PhD student at BIFOLD and the MPI for Human Development.
The models trained with ‘AligNet’ show significant improvements in alignment with human judgments, including up to a 93.5% relative improvement in coarse semantic evaluations. In some cases, they even surpass the reliability of human ratings. Moreover, these models exhibit no loss in performance; on the contrary, they demonstrate consistent performance increases (25% to 150% relative improvement) across various complex real-world machine learning tasks, all at minimal computational cost.
Klaus-Robert Müller, Co-director at BIFOLD: “Our research methodologically bridges cognitive science (human levels of abstraction) and modern deep-learning practice (Vision Foundation Models), thus forming a link between the concept of representation in humans and in machines. This represents an important step toward more interpretable, cognitively grounded AI.”
AligNet demonstrates that hierarchical conceptual structures can be transferred to neural networks without explicit hierarchical training, with reorganization visible across network layers. These results suggest that AligNet achieves fundamental improvements in visual representations that better reflect the human level of conceptual understanding, thereby making AI less of a ‘black box’.
Andrew K. Lampinen from Google DeepMind adds: “For the first time, researchers have found an efficient way to teach computer vision models about the hierarchical structure of human conceptual knowledge. We show that this not only makes the representations of these models more human-like, and therefore more interpretable, but also improves their predictive power and robustness across a wide range of vision tasks.”





