08/12/2025
Educators' Reliance upon Detectors: Artificial Incompetence (AI)
(Sci-Tech Editorial)
"Harap-harapang ginagago," or being made a fool of right to one's face: imagine finishing a grueling assignment, hundreds of words in a single paragraph, only to be accused of using Large Language Models (LLMs). The teacher demands a complete rewrite, only for the new version to meet the same hair-pulling, head-bashing, eye-watering verdict from LLM detectors. Sentences are reordered, words rewritten, and ideas redefined, yet the detector still insists on its verdict. People who write in a formulaic, grammatically compliant fashion may produce text with overtones of automation, even when it is written by head and hand, plain and free of verbosity.
Democratizing Artificial Intelligence (AI) has its merits: it streamlines tasks, takes over when the bulb dims, and much more. Such innovation also carries its own pitfalls in education, namely generated assignments, plagiarism, and hallucinated data, all of which can be prevented through constructive use. Moreover, it is understandable that some educators want to check for dishonesty, because most curricula require a learner's cognitive engagement. However, many teachers rely on highly inaccurate tools. False accusations of academic dishonesty put a learner's cognitive, mental, and emotional well-being at risk. An institution that relies on erroneous tools loses its credibility and objectivity; even a well-established academy can collapse like a Roman temple's Corinthian columns.
Teachers should stop using LLM detectors because of their fallible and imprecise nature, and should instead strive for more reliable forms of feedback. It is no wonder that some Ivy League institutions discourage their use.
According to QuillBot, LLM detectors judge texts based on their "perplexity" and "burstiness." Perplexity answers the question, "How perplexing or surprising is this text?" LLMs are trained on vast amounts of text from the internet, so text that closely follows that training data is predictable and has low perplexity, while text that deviates from it has high perplexity. Take the fragment "We plug our devices into...": the typical completion would be "an outlet," "a power bank," or "a charger," whereas a more perplexing one would be "our noses." Searching for such a sentence yields little to no results on the internet, so it deviates from the training data. This divergence marks a perplexing sentence, and therefore earns a "human-written" verdict. The reverse explains why the U.S. Constitution has been flagged as AI-generated, one of many such false positives: the document is so ubiquitous online, especially in legal writing, that it reads as highly predictable to a detector.
Burstiness, on the other hand, refers to how much a text's sentence structure and length vary. Humans structure sentences variably: one sentence may be loaded with adjectives and description, lengthening it, while the next may be short and simple, a varying burst of words. AI-generated sentences, in contrast, tend to be consistent and uniform, so low burstiness pushes a detector toward an "AI-generated" verdict. These false positives are especially common in formal settings, where texts follow set patterns and a certain rigidity, and serious accusations of misconduct often follow.
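The two measures described above can be sketched in Python. This is a minimal illustration under stated assumptions, not how QuillBot or any particular detector works: perplexity is computed from per-token probabilities that a language model is assumed to have already produced (the probabilities below are made up), and burstiness is approximated by a hypothetical proxy, the coefficient of variation of sentence lengths in words.

```python
import math
import re
import statistics


def perplexity(token_probs):
    """Exponential of the average negative log-probability per token.

    token_probs: probabilities in (0, 1] that a language model assigned to
    each token given the tokens before it (assumed input; not computed here).
    Lower perplexity means the text was more predictable to the model.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


def burstiness(text):
    """Hypothetical proxy: coefficient of variation (stdev / mean) of
    sentence lengths, measured in words. Uniform lengths give 0.0."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


# Made-up probabilities for "We plug our devices into ...". The last value
# stands for the ending: likely ("an outlet") vs. unlikely ("our noses").
common = [0.2, 0.3, 0.4, 0.5, 0.6]
surprising = [0.2, 0.3, 0.4, 0.5, 0.0001]
print(perplexity(common) < perplexity(surprising))  # True

uniform = "The cat sat down. The dog ran off. The bird flew up."
varied = ("I ran. The old, tired, rain-soaked dog limped slowly home "
          "across the muddy field. Why?")
print(burstiness(uniform))  # 0.0
print(burstiness(varied) > burstiness(uniform))  # True
```

With these made-up numbers, the surprising ending scores a higher perplexity than the ordinary one, and the uniform three-sentence text scores a burstiness of zero. A formulaic human writer produces exactly that low-perplexity, low-burstiness profile, which is how false positives arise.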
This is especially troubling because LLM detectors are biased against non-native speakers of English. In a study by Stanford researchers (Liang et al.), seven LLM detectors unanimously identified 18 of 91 human-written TOEFL essays by non-native English speakers as AI-authored, and 89 of the 91 were flagged as AI-generated by at least one detector. The study also found that simple prompting could help AI-generated text bypass LLM checks, making detectors imprecise in both directions: false positives and false negatives. Academia is the root of trust in education for many people, but when impractical tools jeopardize that chain of credibility, they bring down the quality of education with them.
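Every essay in that set of 91 was written by a human, so any "AI" label is a false positive. The short Python calculation below only restates the study's reported counts as rates; it adds no new data.

```python
# Counts reported by Liang et al. for 91 human-written TOEFL essays.
TOTAL_ESSAYS = 91
FLAGGED_BY_ALL_SEVEN = 18     # unanimously labeled AI-authored
FLAGGED_BY_AT_LEAST_ONE = 89  # labeled AI-generated by one or more detectors


def false_positive_rate(flagged, total):
    """Share of human-written texts wrongly labeled as AI-generated."""
    return flagged / total


print(f"Flagged by all seven detectors: "
      f"{false_positive_rate(FLAGGED_BY_ALL_SEVEN, TOTAL_ESSAYS):.1%}")    # 19.8%
print(f"Flagged by at least one detector: "
      f"{false_positive_rate(FLAGGED_BY_AT_LEAST_ONE, TOTAL_ESSAYS):.1%}")  # 97.8%
```

In other words, roughly one in five essays was condemned by every detector, and nearly all of them were condemned by at least one.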
Instead, homework should be checked in a synergistic fashion, involving both subject-area and language educators. Learners should be questioned on their understanding of the subject and on how they constructed their sentences, rather than judged by tools that rely on writing style, which varies wildly from person to person. Most LLM detectors report results as percentages, and how those are read depends on the detector: some, for example, deem a 20% score human-written. If teachers insist on using these tools, they should raise their flagging threshold and, even then, take the verdict with a grain of salt.
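To see what raising the threshold does, here is a minimal Python sketch of a hypothetical triage policy. It assumes the detector reports a score between 0 and 1 (a percentage divided by 100); the function names, the default threshold of 0.8, and the sample scores are all illustrative and are not taken from any real detector. In keeping with the advice above, a score over the threshold only prompts a conversation with the learner, never an automatic penalty.

```python
def triage(ai_score, threshold=0.8):
    """Hypothetical policy for a detector score in [0, 1].

    Scores at or above the threshold prompt a follow-up conversation with
    the learner about the work; they never trigger a penalty by themselves.
    """
    if not 0.0 <= ai_score <= 1.0:
        raise ValueError("ai_score must be between 0 and 1")
    return "discuss with learner" if ai_score >= threshold else "no action"


def count_flagged(scores, threshold):
    """How many submissions would be sent for follow-up at this threshold."""
    return sum(triage(s, threshold) == "discuss with learner" for s in scores)


# Made-up detector scores for four human-written submissions.
scores = [0.15, 0.35, 0.55, 0.85]
print(count_flagged(scores, threshold=0.2))  # 3
print(count_flagged(scores, threshold=0.8))  # 1
```

Raising the threshold from 0.2 to 0.8 cuts the number of wrongly questioned learners in this made-up set from three to one, but it does not eliminate false positives, which is why the verdict still deserves a grain of salt.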
Human-written texts that resemble automation may be flagged because detectors rely on an inherently flawed mechanism: comparing a text against training data and typical writing patterns. Given the natural variation in writing styles, the use of detectors should be discouraged in academia. Out of ignorance, a grave accusation may harm a learner's well-being in many ways. No matter how technology advances, human educators must still shoulder the responsibility of thorough scrutiny, as befits their competence in academia. If LLMs are trained to write ever more human-like text, how can these detectors tell what is not?
***
🖼️✏️ DRAWN BY: Jomenabel Behagan | Editorial Cartoonist, Digital Arts
🗣️ WORDS BY: Ronald Campos | Photojournalist, Editorial Writer, & Managing Editor