The Critique of Transformer (2): Can It Output Knowledge?

智识神工
2025-12-10
Overview: Transformers output information, not knowledge.


Abstract: The ‘Generalized BP + Transformer’ architecture forms the foundation of this generation of AI infrastructure. While Generalized BP has been enshrined by the Nobel Prize and is widely recognized as a ‘veteran technology’ with decades of history, those with deeper understanding know that the Transformer is the core magic prop, the true ‘new engine’ driving Large Language Models (LLMs). Critique here is not criticism: critique is a judgment born of deep insight, whereas criticism tends toward the negative. The importance of the Transformer cannot be overstated, and we must approach it with a critical rather than a casual attitude. We oppose Turing's definition of intelligence in terms of what humans can distinguish, for who can represent humanity? One person, a thousand, ten thousand? Can all of humanity truly serve as an object of study? Such a definition may have practical meaning for engineers working on PCBs or code, but as a definition of AI it is childish. Knowledge, as the result of innovation, is the basic criterion by which we assess AI, and we must therefore examine carefully the relationship between the Transformer's output and knowledge in order to better understand its connection to intelligence. This paper first shows that the attention mechanism is essentially a process of structured noise introduction (Noise Leading In, NLI), whose weight distribution is inherently unstable and biased. Second, it argues that the Transformer's operating mechanism relies on incomplete induction, its conclusions resting on statistical regularity rather than logical necessity, a problem extensively critiqued in philosophy by Hume and Popper. Finally, the Transformer's output is a highly complex, data-driven information structure that lacks the essential characteristics of knowledge, such as ‘justified true belief.’ The output of the Transformer is therefore not knowledge; we should regard it as a valuable informational tool rather than an authoritative source of knowledge, and always recognize that the final judgment of meaning, and the responsibility for it, rests with human cognitive agents.


Keywords: Transformer; Epistemology; Attention Mechanism; Induction; Large Language Models




Introduction



    In recent years, large language models built on the Transformer[1] architecture have demonstrated remarkable text generation capabilities, producing outputs that are often fluent, coherent, and seemingly insightful. This leap in ability prompts us to reconsider a fundamental question: can the output of Transformer models be regarded as ‘knowledge’? To answer it, we must not stop at the model's external performance but delve into its internal mechanisms and philosophical foundations. Based on the technical nature of the attention mechanism and the inductive character of the Transformer, combined with the strict philosophical distinction between knowledge and information, this paper argues that the output of the Transformer is, in essence, information rather than knowledge.






1. The Nature of the Attention Mechanism: Structured Noise Leading In




    Traditionally, the attention mechanism is often romanticized as the process by which a model ‘focuses’ on key information. However, from a technical perspective, attention is not a precise spotlight but rather a ‘Structured Noise Leading In’ (NLI) device that continually generates random perturbations.


    First, attention introduces biased, artificial noise. Its core operation, the dot product between Query and Key followed by a Softmax, is highly sensitive to small input variations. Because Softmax is exponential in the logits, even minor differences among them are amplified, producing highly concentrated attention distributions (entropy collapse) and unstable training[2]. The attention mechanism thus produces a weighted mixture that is strongly influenced by contextual disturbances, a ‘noisy’ outcome. More importantly, this noise is not pure white noise; its ‘artificial’ character is baked into the model parameters and the data distribution. Training aims to stabilize the noise statistically, but with finite data and a fixed architecture it can never reach an ideal unbiased state. In attempting to eliminate incidental ‘individual’ differences in the data, attention therefore inevitably imports the biases latent in the training data, which can yield one-sided associations.
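
    To make the Softmax sensitivity concrete, here is a minimal Python sketch (our own construction, not code from [2]): it scales the same set of hypothetical query-key scores and watches the attention distribution collapse toward one-hot, its entropy falling accordingly.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; the small constant guards log(0).
    return float(-(p * np.log(p + 1e-12)).sum())

# Hypothetical query-key dot products (logits); the values are made up.
logits = np.array([2.0, 1.5, 1.0, 0.5])

# Scaling the logits mimics score magnitudes growing during training.
for scale in (1.0, 4.0, 16.0):
    p = softmax(scale * logits)
    print(f"scale={scale:5.1f}  weights={np.round(p, 3)}  entropy={entropy(p):.3f}")
```

    At scale 1.0 the weights remain spread out; by 16.0 nearly all mass sits on the largest logit, which is exactly the entropy-collapse behavior described above.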


    Second, NLI is an essential characteristic that no practical information system can eliminate. Whenever a system must filter, compress, or combine information, it faces the question of how to allocate weights, and any allocation strategy built on limited resources and finite data can be neither absolutely objective nor unbiased. The attention mechanism in the Transformer simply makes this unavoidable NLI process explicit, differentiable, and central to the model's computation.


    Finally, NLI is inherently non-directive and carries no ‘meaning.’ The direction of attention's perturbation is guided by no internal ‘purpose’ or ‘understanding’; it is determined entirely by the current vector representations and the parameterized dot-product operation[3]. It operates as a random jittering mechanism in high-dimensional space, carrying no inherent meaning. Whatever ‘meaning’ the process has is assigned only in hindsight by the model's training objective (e.g., predicting the next word). The role of the attention mechanism is therefore not to ‘select meaning’ but to ‘introduce random variation’ that drives the model's exploration of the function space; the mechanism itself is purposeless.
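
    A toy illustration of this non-directedness (again our own sketch, with arbitrary dimensions and seed): a tiny random jitter to the query shifts the attention weights, and with them the output mixture, in a direction fixed purely by the geometry of the vectors, not by any purpose.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # arbitrary embedding dimension
q = rng.normal(size=d)             # one query vector
K = rng.normal(size=(5, d))        # five key vectors
V = rng.normal(size=(5, d))        # five value vectors

def attend(q, K, V):
    # Scaled dot-product attention for a single query.
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w, w @ V                # weights and the weighted value mixture

w0, out0 = attend(q, K, V)
w1, out1 = attend(q + 0.05 * rng.normal(size=d), K, V)  # tiny random jitter

print("weight shift :", np.round(w1 - w0, 4))
print("output drift :", float(np.linalg.norm(out1 - out0)))
```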




2. The Inductive Nature of the Transformer and Its Philosophical Limitations




    The powerful capabilities of the Transformer are grounded in vast amounts of data, and its operational mechanism can be philosophically classified as an inductive process.


    Under conditions of massive data input, the noise introduced ultimately serves to erase individual differences and extract commonalities. Through self-attention, the Transformer processes billions of samples, with NLI acting as a massive ‘averager’: random features, errors, and noise in individual samples are suppressed by weighted averaging over a large dataset, while frequent, stable statistical patterns (commonalities) are reinforced and extracted. This is the essence of induction: deriving general laws (model parameters) from limited, specific observations (training data). But since no dataset can be infinite, there will always be unseen ‘black swan’ situations, which makes such inductive conclusions logically incomplete.
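
    The ‘massive averager’ intuition can be shown in a few lines of Python (a back-of-the-envelope sketch with made-up numbers): per-sample noise shrinks as the sample count grows, yet the averaged estimate says nothing about regimes that were never observed.

```python
import numpy as np

rng = np.random.default_rng(1)
true_pattern = 3.0   # the stable 'commonality' hidden in the data (made up)

# Averaging suppresses per-sample noise as the dataset grows.
for n in (10, 1_000, 100_000):
    samples = true_pattern + rng.normal(scale=2.0, size=n)
    print(f"n={n:>7}: estimated commonality = {samples.mean():.4f}")

# However large n becomes, the estimate is an induction over what was seen;
# a 'black swan' regime never present in the samples is logically compatible
# with every observation above, so the conclusion is not logically necessary.
```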



    Philosophers have profoundly critiqued this approach. David Hume's critique of induction points out that inductive reasoning rests on ‘psychological habit,’ not ‘logical necessity’[4]. We assume the sun will rise tomorrow because it has always risen, yet there is no logical connection between the two events. Similarly, the Transformer learns the correlation between ‘cats’ and ‘mammals’ merely because they frequently co-occur in the training data, without grasping any causal or logical connection; it merely acquires a strong, data-driven ‘psychological habit.’ When the data distribution changes (e.g., the training set is polluted with internet slang), the habit breaks and the output turns absurd, revealing the fragility of its conclusions.
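
    A toy bigram model over a hypothetical corpus (entirely our construction) makes the point: the ‘habit’ that cats are mammals is nothing but a co-occurrence ratio, and it reverses the moment the distribution is polluted.

```python
from collections import Counter

# A hypothetical corpus: 'cats are mammals' dominates, slang is rare.
corpus = ["cats are mammals"] * 90 + ["cats are memes"] * 10

def habit(corpus):
    # The 'habit' is nothing but the co-occurrence ratio of the last token.
    counts = Counter(line.split()[-1] for line in corpus)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(habit(corpus))      # {'mammals': 0.9, 'memes': 0.1}

# Pollute the distribution with internet slang and the habit reverses.
corpus += ["cats are memes"] * 200
print(habit(corpus))      # 'memes' now dominates at 0.7
```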


    Moreover, Karl Popper's critique of induction undermines its very status as a source of knowledge. Popper holds that induction can provide neither necessary knowledge (because of the ever-present possibility of falsification) nor probabilistic knowledge (because probability estimates themselves require prior assumptions, leading to circular reasoning)[5]. The Transformer's output exemplifies this perfectly: the probability distribution it generates (e.g., over the next word) reflects only the statistical features of the training data, not the actual probabilities of the real world. More importantly, as a purely inductive system, the Transformer cannot be falsified. It generates the most ‘probable’ continuation of past data but cannot propose a testable hypothesis the way a scientific theory can. When it wrongly generalizes ‘all birds can fly’ to ostriches, it cannot recognize the error and revise its internal ‘theory’; it must wait for more data containing ostriches to overwrite the original pattern.
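
    This last point can be caricatured in a few lines (our toy, not Popper's or any formal model): a count-based predictor never ‘rejects’ its own generalization; its confidence only drifts as counter-examples pile up in the counts.

```python
from collections import Counter

# Every bird observed so far flies, so the induced 'theory' is certain.
obs = Counter({("bird", "flies"): 1000})
print(f"P(flies | bird) = {obs[('bird', 'flies')] / sum(obs.values()):.3f}")

# An ostrich appears. The system does not reject its generalization; the
# counter-example merely becomes one more count, nudging the ratio.
obs[("bird", "does_not_fly")] += 50
print(f"P(flies | bird) = {obs[('bird', 'flies')] / sum(obs.values()):.3f}")
# Still ~0.952: the 'theory' was never falsified, only diluted by new data.
```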






3. Does the Transformer Output Knowledge? The Distinction Between Knowledge and Information




    From the above analysis we arrive at a deeper philosophical conclusion: the output of the Transformer is a highly complex information structure, not knowledge in the philosophical sense. The core of the distinction is that knowledge is not statically present in symbol sequences; it requires ‘justified true belief’ (JTB) and alignment with the external world[6]. Equating the Transformer's output with ‘knowledge’ commits the same philosophical error as believing that books themselves contain knowledge.


    In Kant's epistemology, knowledge is a ‘construction’ of the subject's mind, not a passive ‘reception.’[7]

    Kant profoundly pointed out that knowledge is not passively imprinted onto the mind from the external world but is actively ‘synthesized’ by the subject, who uses a priori categories (such as causality and substance) to organize the sensory manifold (raw data from the senses). A book, like the output of the Transformer, merely provides a ‘sensory manifold,’ a combination of symbols. These symbols are in themselves meaningless and silent. The Transformer generates complex arrangements of symbols through its massive parameters and data, but this is still only ‘material’; it lacks a subject's ability to ‘synthesize’ those symbols into coherent knowledge. The Transformer's output is therefore like an unread book: raw ‘informational material’ awaiting processing, not a knowledge structure formed by the mind's synthesis.


    Gadamer's hermeneutics further clarifies that meaning and knowledge emerge in the ‘understanding event’ rather than pre-existing in texts.[8]

    Gadamer holds that texts (books and Transformer outputs alike) do not ‘contain’ fixed meanings or knowledge. They are merely ‘triggers’; true meaning is generated dynamically in the reader's ‘understanding event,’ in which the reader's ‘pre-understanding’ merges with the text in a ‘fusion of horizons.’ The Transformer's generation process, based on statistical regularities, is merely symbolic concatenation. It has no ‘pre-understanding’ and cannot engage in a genuine ‘fusion of horizons.’ Its fluent text simulates human expressions of knowledge but carries no actual understanding. Whether the Transformer's output contains ‘knowledge’ therefore depends entirely on whether a human reader can trigger a genuine ‘understanding event.’


    Finally, externalist semantics (e.g., Putnam, Burge) argues that the justification of knowledge depends on the external world and the community, not solely on the internal symbol system.[9]

    The classic definition of knowledge, ‘justified true belief,’ requires justification: a reason why a proposition is true. Externalist semantics emphasizes that the meaning and truth conditions of symbols (e.g., ‘water’) depend on the external world (reference to real H₂O) and the linguistic community (shared rules of use). The Transformer is confined to the textual world constructed by its training data; its symbolic associations come entirely from statistical co-occurrences within that data and cannot anchor to the external world. It outputs ‘The Earth is round’ not because it has confirmed this belief through observation or logical reasoning, but because ‘Earth’ and ‘round’ co-occur frequently in the training data. It cannot supply genuine, world-based ‘reasons’ for its output. The output of the Transformer is thus, at most, a claim about knowledge or an expression of information, whose truth and validity are ultimately determined by human cognitive agents embedded in the world and engaged in social practice.





Conclusion



    

    Ultimately, the power of the Transformer lies in its clever combination of large-scale inductive reasoning with structured noise leading in (NLI), which yields impressive feats of data fitting and pattern generation. From the perspective of philosophical epistemology, however, the purposeless perturbation of its attention mechanism and the unavoidable limits of its inductive reasoning dictate that its outputs are, in essence, information, not knowledge. We must recognize clearly that the Transformer's output is like a seemingly profound ‘book of heaven’ written by no one; the generation of meaning and the determination of knowledge rest ultimately with the human reader, who retains both the power and the responsibility of judgment.



References

[1] Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.

[2] Zhai, S., Likhomanenko, T., Littwin, E., et al. (2023). Stabilizing Transformer Training by Preventing Attention Entropy Collapse. International Conference on Machine Learning, PMLR, 40770-40803.

[3] Teo, R. S. Y., & Nguyen, T. (2024). Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis. Advances in Neural Information Processing Systems, 37, 101393-101427.

[4] Hume, D. (1748). An Enquiry Concerning Human Understanding. Oxford University Press.

[5] Popper, K. R. (1959). The Logic of Scientific Discovery. Hutchinson & Co.

[6] Gettier, E. L. (1963). Is Justified True Belief Knowledge? Analysis, 23(6), 121-123.

[7] Kant, I. (1781/1787). Critique of Pure Reason (N. Kemp Smith, Trans.). Palgrave Macmillan.

[8] Gadamer, H.-G. (1960). Truth and Method (J. Weinsheimer & D. G. Marshall, Trans., 2nd rev. ed.). Continuum.

[9] Putnam, H. (1975). The Meaning of ‘Meaning’. In Mind, Language and Reality: Philosophical Papers, Vol. 2. Cambridge University Press.




