GPT-4o’s Chinese token-training data is polluted by spam and porn websites ( www.technologyreview.com )

Of the 100 results, only three of them are common enough to be used in everyday conversations; everything else consisted of words and expressions used specifically in the contexts of either gambling or pornography. The longest token, lasting 10.5 Chinese characters, literally means “_free Japanese porn video to watch.” Oops....

  • All
  • Subscribed
  • Moderated
  • Favorites
  • [email protected]
  • kbinchat
  • All magazines