The Critique of Transformer (1): Some Key Facts

智识神工
2025-12-03


Abstract: This paper examines several key facts about the Transformer from the intersection of philosophy and technology, systematically criticizing the fundamental limitations of large language models (LLMs) built on the Transformer architecture. It argues that the Transformer is essentially a “prisoner of experience,” whose capabilities are strictly confined to the “past” and the “known” as defined by its training data. The critique proceeds along three core dimensions. First, at the epistemological level, its learning paradigm based on maximum likelihood estimation is an extreme form of empiricism: it cannot reach a priori reason or logical necessity, and it remains entangled in the “problem of induction.” Second, at the ontological level, its word embeddings and attention mechanisms operate within a closed symbolic system, lacking intentionality toward the real world, and its tokenization process yields a fragmented understanding of concepts. Finally, at the level of philosophy of mind, its nature as a deterministic function approximator makes it a scaled-up version of the “Chinese room” thought experiment, lacking belief, intention, and true understanding. The paper concludes that while the Transformer is an excellent piece of engineering, the architecture itself cannot lead to artificial general intelligence (AGI), and future breakthroughs will require a new paradigm that transcends pure empiricism.


Keywords: Transformer, Large Language Models, Empiricism, Induction, Intentionality




Introduction



The rapid development of AI in recent years can largely be attributed to the introduction of the Transformer architecture (Figure 1). It has become the core architecture of contemporary large language models (LLMs) in natural language processing and has achieved disruptive successes in fields such as image, video, and audio processing (Figure 2). The vast capabilities of the Transformer often prompt widespread discussion about whether artificial intelligence has reached artificial general intelligence (AGI). However, when we look past its impressive exterior and examine its design philosophy and operational mechanisms, we find that Transformer models are fundamentally limited. They are not a step toward AGI but rather a sophisticated, large-scale “empirical prisoner,” strictly confined to the “past” and the “known.” This paper integrates philosophical perspectives with technical theory to examine several key facts about the Transformer, offering a systematic critique of the fundamental limitations of Transformer models, clarifying their capabilities, and providing critical reflections for future research directions.


Figure 1 – The Transformer model architecture [1], consisting primarily of Encoder and Decoder components.


Figure 2 – The Transformer is currently the mainstream architecture across large-model domains, for example: (1) the GPT models in natural language processing, (2) the Whisper model in speech recognition, (3) the ViT model in image understanding, (4) the ViViT model in video understanding, and (5) the BLIP-2 model in the multimodal domain [7, 8, 9, 10, 11].





1. The Epistemological Prisoner: An Extreme Empiricist Technological Realization




    The operating paradigm of the Transformer represents the ultimate technical embodiment of extreme empiricism in the history of philosophy. Its knowledge is derived entirely from training data; it cannot access a priori reason or logical necessity. This philosophical flaw is deeply embedded in its core technical principles.


Key Fact 1: The Essence of Parameter Learning – Statistics Rather Than Understanding

    The training objective of Transformer models (such as the GPT series) is maximum likelihood estimation over large-scale corpora. The objective function predicts the conditional probability of the next token, adjusting the model parameters to maximize the probability of the observed training data. This technically embodies David Hume’s philosophical view: the “causal” or “associative” knowledge learned by the model is merely a “constant conjunction” of token sequences in the training data. The model might learn the high-probability co-occurrence between “cat” and “playing with a ball,” but this is a frequency-based statistical regularity, not a conceptual understanding of “cat” as a living being [2].
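To make the point concrete, here is a minimal sketch of the next-token maximum-likelihood objective in NumPy, with an invented five-token vocabulary and made-up scores: minimizing the average negative log-likelihood is exactly maximizing the probability assigned to the observed sequences.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def next_token_nll(logits, targets):
    # Average negative log-likelihood of the observed next tokens.
    # Training drives this down, i.e., it maximizes the probability of the
    # corpus: frequency fitting, not conceptual understanding.
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# Illustrative only: raw scores over a toy 5-token vocabulary at two positions.
logits = np.array([[2.0, 0.5, 0.1, -1.0, 0.0],
                   [0.2, 1.5, 0.3,  0.0, -0.5]])
targets = np.array([0, 1])  # the tokens that actually came next in the "corpus"
print(next_token_nll(logits, targets))
```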


Key Fact 2: Limitations of Self-Attention Mechanism – Association Rather Than Reasoning

    The core of the self-attention mechanism is the calculation of dot-product similarity between Query and Key vectors; the resulting weights are used to sum the Value vectors into a contextual representation. This process is adept at capturing surface and deep co-occurrence statistics but fundamentally lacks the reasoning capability of symbolic logic, as it has no built-in logical rule engine. Its “reasoning” manifests as a chain of conditional probabilities learned from vast amounts of similar text patterns. Once it encounters logical structures or counterfactual conditions that are rare in the training data, its statistically based “reasoning” chain easily breaks down, producing “hallucinations.”
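The mechanism itself fits in a few lines. A minimal NumPy sketch of single-head, unmasked scaled dot-product attention, with random vectors standing in for learned projections, makes the point visible: the output is nothing but a similarity-weighted average, with no rule engine anywhere.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # Query-Key similarity
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # weighted sum of Values

rng = np.random.default_rng(0)   # random stand-ins for learned projections
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V))  # a weighted average, nothing more
```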


Key Fact 3: The Truth of Generalization Ability – Interpolation Rather Than Extrapolation

    The model's “generalization” ability is mathematically closer to complex interpolation in high-dimensional spaces than true extrapolation. The training data defines a high-dimensional manifold of the model’s capabilities, with all outputs residing within or near this manifold. It cannot reliably handle truly novel “black swan” events far from this manifold. This technically confirms the problem of induction: no matter how large the training data, it cannot cover all possibilities, and its knowledge base is inherently contingent and fragile[3].
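The interpolation-versus-extrapolation gap can be shown with a deliberately simple stand-in: a polynomial fit rather than a Transformer, but the statistical point is the same. A model fit on data from [0, π] tracks the target inside that range and fails badly outside it.

```python
import numpy as np

x_train = np.linspace(0, np.pi, 50)                 # the "training manifold"
coeffs = np.polyfit(x_train, np.sin(x_train), deg=9)

x_in, x_out = np.pi / 2, 3 * np.pi                  # inside vs. far outside the data
print(np.polyval(coeffs, x_in),  "vs true", np.sin(x_in))   # close: interpolation
print(np.polyval(coeffs, x_out), "vs true", np.sin(x_out))  # wildly off: extrapolation
```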


Conclusion from Facts 1-3: The training data and intermediate results of Transformer are purely empirical products. Its parameter updates, attention maps, and generalization behavior are strictly constrained by the statistical structure of the training data. No matter how novel Transformer’s outputs may seem, they are just complex interpolations and reorganizations of existing information elements within the training data manifold, reproductions of “being,” rather than true explorations of the “unknown” based on a priori reason.





2. Ontology and the Lost Symbol: Word Embeddings and Attention Mechanisms without “Worldness”




    The core dilemma of Transformer lies in its total separation from the “world.” It processes symbols (tokens) as shadows rather than the “things themselves” they represent. This ontological deficiency prevents it from forming true conceptual understanding.


Key Fact 4: The “Semantic Shadow” of Word Embeddings

    The model converts words into embedding vectors, which encode rich semantic and syntactic relationships in a high-dimensional space. However, these vectors are merely statistical “shadows” of symbols, quantifications of distributional patterns. As philosophers such as Husserl and Searle have emphasized, when the model operates on these vectors it lacks the intentionality that points to the external world [4, 5]. When the model processes the embedding of “cat,” that vector has no connection to the living being in the real world that has body heat and can meow. Its “understanding” is limited to the cosine similarity between the “cat” vector and those of “mammal,” “pet,” and so on.
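The entire content of this “understanding” is a number. A small sketch with invented 4-dimensional vectors (real models use hundreds or thousands of dimensions) shows everything the model has access to:

```python
import numpy as np

# Invented toy embeddings: purely illustrative, not from any real model.
emb = {
    "cat":    np.array([0.8, 0.1, 0.6, 0.2]),
    "mammal": np.array([0.7, 0.2, 0.5, 0.1]),
    "car":    np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["cat"], emb["mammal"]))  # high: distributionally close
print(cosine(emb["cat"], emb["car"]))     # lower: distributionally distant
# Nothing here points at a warm, meowing animal; it is geometry over statistics.
```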


Key Fact 5: Tokenization’s “Ontological Violence”

    A subword tokenizer might split “ribosome” into ["ri", "bo", "some"], breaking technical terms into meaningless fragments. This segmentation is driven entirely by data-compression efficiency, not semantic integrity. Technically, it destroys the wholeness of concepts as grasped by human cognition, creating a severe misalignment between the model’s internal representations and the concepts of the human life-world. The model must struggle to relearn holistic representations from these fragmented tokens, but what it learns is still statistical association, not the holistic understanding of “being” revealed by Heidegger [6].
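A toy greedy longest-match tokenizer with a deliberately tiny, invented vocabulary reproduces the example; real systems such as BPE learn their vocabularies from corpus statistics, but the fragmenting behavior is the same in kind.

```python
# Vocabulary chosen by "compression" convenience, not by meaning; invented here
# for illustration (real BPE vocabularies are learned from corpus statistics).
VOCAB = {"the", "cat", "ri", "bo", "some"}

def tokenize(word):
    # Greedy longest-match segmentation: purely mechanical, blind to semantics.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            piece = word[i:j]
            if piece in VOCAB or j - i == 1:   # single characters as fallback
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("ribosome"))  # ['ri', 'bo', 'some']: the concept arrives shattered
```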


Key Fact 6: The “Closed Feedback Loop” of Self-Attention

    Transformer’s “thinking” relies entirely on the Q, K, V matrices produced by linear transformations of the current input sequence. This forms a strict, immediate internal loop. The model does not have a long-term, updatable “world model” as background knowledge. Its context window is its entire “universe,” and any information outside this window (including its own previously generated content once it exceeds the window) is forgotten. This mechanism means its understanding is inherently fragmented and context-bound, unable to form a stable, coherent internal model of the world.
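A few lines suffice to show how hard the cutoff is. In this sketch the window size and token ids are arbitrary; the point is only that whatever falls outside the slice simply does not exist for the model at that step.

```python
CONTEXT_WINDOW = 8   # arbitrary illustrative size; real windows are just bigger

def visible_context(tokens):
    # Everything the model can condition on at this decoding step.
    return tokens[-CONTEXT_WINDOW:]

history = list(range(12))        # 12 tokens so far, including the model's own output
print(visible_context(history))  # [4, 5, ..., 11]: tokens 0-3 have ceased to exist
                                 # for the model, even the ones it generated itself
```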


    Conclusion from Facts 4-6: Transformer only learns formalized symbolic associations, not the concepts themselves. Its information processing is confined to the structural relationships within token and vector spaces (internal symbol relationships), and cannot touch the directional and substantive meaning of these symbols in the world (conceptual meaning of symbols).





3. The Paradox of Philosophy of Mind: A “Heartless” Device as a Deterministic Function Approximator




    We must face the reality that the Transformer, as an “information-processing device,” operates through linear transformations and attention-weighted summations. It can be seen as a machine that compresses and reconstructs information empirically, with its behavior determined by fixed programs and training data.


Key Fact 7: Absolute Determinism in Forward Propagation

    Technically, a trained Transformer model is a complex but fully deterministic function. Given a set of input tokens, layer-by-layer forward propagation (matrix multiplication, activation functions, Softmax) produces a unique output probability distribution; with a fixed random seed controlling sampling during inference, the generated text is likewise fully reproducible. The apparent “creativity” or “randomness” arises from sampling over the output distribution (e.g., nucleus sampling, temperature sampling), but this randomness is externally imposed and pseudorandom, not a result of the model’s “free will.” This technically demonstrates that its behavior is predetermined.
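This is directly checkable. In the sketch below (NumPy, with made-up logits), all of the “randomness” lives in an external, seeded pseudorandom generator; the model-side computation of the distribution is a fixed function of the input.

```python
import numpy as np

def sample_next(logits, temperature, seed):
    # The distribution is a deterministic function of the input; the only
    # randomness is the externally supplied, seeded pseudorandom generator.
    z = np.asarray(logits) / temperature
    z = z - z.max()
    p = np.exp(z); p = p / p.sum()
    return np.random.default_rng(seed).choice(len(p), p=p)

logits = [2.0, 1.0, 0.1, -1.0]   # made-up scores over a 4-token vocabulary
a = sample_next(logits, temperature=0.8, seed=42)
b = sample_next(logits, temperature=0.8, seed=42)
print(a == b)   # True: same input + same seed -> the same "creative" token
```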


Key Fact 8: Parameterized Symbolic Syntax Operations

    John Searle’s “Chinese Room” thought experiment is fully embodied in the Transformer [5]. The model’s billions of parameters are akin to an enormous “rulebook.” When the model processes the prompt “describe a cat,” it does not call upon experience or understanding of cats; it activates the parameter paths related to the input sequence and calculates the most likely token sequence in the given context. The entire process is an unconscious, uncomprehending symbolic-syntactic operation. It does not “believe” that cats are cute; it merely calculates that the token “cute” has a high conditional probability in the context of “cat.”
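A caricature of that calculation, using an invented four-pair “corpus” of (context, next-word) counts, shows what “it merely calculates” means: relative frequency, with no believer anywhere in the loop.

```python
from collections import Counter

# Invented (context, next-word) pairs standing in for a training corpus.
corpus = [("cat", "cute"), ("cat", "cute"), ("cat", "sleeps"), ("rock", "hard")]

pair_counts = Counter(corpus)
context_counts = Counter(ctx for ctx, _ in corpus)

def p_next(ctx, token):
    # P(token | ctx) as pure relative frequency: syntax without belief.
    return pair_counts[(ctx, token)] / context_counts[ctx]

print(p_next("cat", "cute"))  # ~0.67: "cute" is probable after "cat",
                              # yet nothing here believes that cats are cute
```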


Key Fact 9: The Deterministic Nature Amidst Noise

    Dropout during training and random sampling during inference can philosophically be seen as “noise.” However, this noise is not an expression of subjectivity; rather, it is an engineering technique deliberately introduced to improve the model’s robustness and output diversity. The core—the model function itself—is deterministic. This is fundamentally different from the truly non-deterministic “noise” driven by emotions, intentions, and the unconscious in human intelligence.
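The division of labor between the fixed function and the injected noise is visible in any framework; a PyTorch sketch with toy layer sizes (illustrative only) shows that the “noise” is a switchable training device, while the underlying function is fixed.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Dropout(p=0.5))
x = torch.randn(1, 16)

layer.train()                           # dropout on: engineering noise injected
print(torch.equal(layer(x), layer(x)))  # almost surely False: masks differ per call

layer.eval()                            # dropout off: the core function is fixed
print(torch.equal(layer(x), layer(x)))  # True: identical input -> identical output
```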


Conclusion from Facts 7-9: Transformer is a strictly programmed, deterministic symbolic information processing device based on empirical patterns. Its “intelligence” is not a result of intentionality or understanding, but rather a high-dimensional function mapping driven by empirical data.





Conclusion and Future Outlook




    In conclusion, we must recognize the fundamental limitation of the Transformer as an “empirical prisoner”: its training input, its intermediate results, and the processing device itself are all empirical, and it cannot understand linguistic concepts. Future paradigms may need to combine the statistical power of the Transformer with the logical reasoning of symbolic AI, with embodied cognition grounded in environmental interaction, and with goal-driven exploration mechanisms. Otherwise, merely scaling the Transformer up in size and data will only create more powerful “knowledgeable parrots” and more sophisticated “empirical prisoners,” not true “thinkers.”



References

[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.

[2] Hume, D. (1739). A Treatise of Human Nature.

[3] The Problem of Induction was formulated by Hume, noting that no particular experience can logically guarantee universal laws.

[4] Husserl, E. (1900). Logical Investigations.

[5] Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences.

[6] Heidegger, M. (1927). Being and Time.

[7] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[J]. 2018.

[8] Radford A, Kim J W, Xu T, et al. Robust speech recognition via large-scale weak supervision[C]//International conference on machine learning. PMLR, 2023: 28492-28518.

[9] Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020.

[10] Arnab A, Dehghani M, Heigold G, et al. Vivit: A video vision transformer[C] //Proceedings of the IEEE/CVF international conference on computer vision. 2021: 6836-6846.

[11] Li J, Li D, Savarese S, et al. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//International conference on machine learning. PMLR, 2023: 19730-19742.




