Shortly after OpenAI released GPT-4o last Monday, some Chinese speakers began to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.
Humans read in words, but LLMs read in tokens, which are distinct units in a sentence that have consistent and significant meanings. GPT-4o is supposed to be better than its predecessors at handling multi-language tasks, and many of the advances were achieved through a new tokenization tool that does a better job of compressing text in non-English languages.
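To make the idea of compression concrete, here is a minimal sketch of how a larger vocabulary lets the same text be covered by fewer tokens. It is a toy greedy longest-match tokenizer written for illustration only, not the tokenizer GPT-4o uses (production tokenizers are typically trained with byte-pair encoding); the `tokenize` function and both vocabularies are hypothetical.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer (illustrative, not GPT-4o's).

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing longer matches.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: the second also contains two-character words.
small_vocab = {"你", "好", "世", "界"}
large_vocab = small_vocab | {"你好", "世界"}

text = "你好世界"  # "hello world"
print(len(tokenize(text, small_vocab)))  # 4 tokens, one per character
print(len(tokenize(text, large_vocab)))  # 2 tokens, one per word
```

The same mechanism cuts both ways: whatever long strings appear often in the data used to build the vocabulary can end up as single tokens, whether or not they are meaningful.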
But, at least when it comes to the Chinese language, the new tokenizer used by GPT-4o has introduced a disproportionate number of meaningless phrases, and experts say that's likely due to insufficient data cleaning and filtering before the tokenizer was trained. If left unresolved, it could lead to hallucinations, poor performance, and misuse. Read the full story.
—Zeyi Yang
Astronomers are enlisting AI to prepare for a data downpour
In deserts across Australia and South Africa, astronomers are planting forests of metallic detectors that will together scour the cosmos for radio signals. When it boots up in five years or so, the Square Kilometre Array Observatory will look for new information about the universe's first stars and the different stages of galactic evolution.
But after syncing hundreds of thousands of dishes and antennas, astronomers will quickly face a new challenge: combing through some 300 petabytes of cosmological data a year, enough to fill a million laptops. So in preparation for the information deluge, astronomers are turning to AI for help. Read the full story.