David Bamman was trying to analyze "Pride and Prejudice" - digitally. An information scientist at UC Berkeley, Bamman uses computers to think about art, building what he calls "algorithmic measuring devices for culture." That means extracting data from classic literature about things like, say, the relationships among various characters. In this case, he was going to start with a question that'd be easy for even a marginally literate human: Are Lizzie and Jane besties, or just sisters?

For kicks, Bamman decided to first try asking ChatGPT. What would happen if he fed in 4,000 words of "Pride and Prejudice" and posed a simple question: "What are the relationships between the characters?"

The chatbot's GPT-4 version was amazingly accurate about the Bennet family tree. In fact, it was almost as if it had studied the novel in advance. "It was so good that it raised red flags in my mind," Bamman says. "Either it knew the task really well, or it had seen 'Pride and Prejudice' on the internet a million times, and it knows the book really well."

The problem is, there was no way of knowing how GPT-4 knew what it knew. The inner workings of the large language models at the heart of a chatbot are a black box; the datasets they're trained on are so critical to their functioning that their creators consider the information a proprietary secret.

So Bamman's team decided to become "data archaeologists." To figure out what GPT-4 has read, they quizzed it on its knowledge of various books, as if it were a high-school English student. The higher the score, the likelier it was that the book was part of the bot's dataset - not just crunched to help the bot generate new language, but actually memorized. In a recent preprint - meaning it hasn't been peer reviewed yet - the team presented its findings: what amounts to an approximation of the chatbot canon.

A lot of it, as you might expect, is the classics: everything from "Moby Dick" and "The Scarlet Letter" to "The Grapes of Wrath" and, yep, "Pride and Prejudice." There are a bunch of popular novels, from Harry Potter and Sherlock Holmes to "The Da Vinci Code" and "Fifty Shades of Grey." But what's most surprising is how much science fiction and fantasy GPT-4 has been raised on: Tolkien, Ray Bradbury, William Gibson, Orson Scott Card, Philip K. Dick, Margaret Atwood, "A Game of Thrones," even "The Hitchhiker's Guide to the Galaxy."

The question of what's on GPT-4's reading list is more than academic. Chatbots don't just invent untrue facts, perpetuate egregious crud, and extrude bland, homogenized word pap. They don't understand the world in any way a human can. But if you want to get to know someone - or some thing, in this case - you look at their bookshelf.

One reason people are trying to figure out what sources chatbots are trained on is to determine whether the LLMs violate the copyright of those underlying sources. The issue, as several lawsuits argue, revolves around whether the bots make fair use of the material by transforming it into something new, or whether they just memorize it whole and regurgitate it, without citation or permission.

One way to answer the question is to look for information that could have come from only one place. When prompted, for example, a GPT-3 writing aid called Sudowrite recognizes the specific sexual practices of a genre of fan-fiction writing called the Omegaverse. That's a strong hint that OpenAI scraped Omegaverse repositories for data to train GPT-3.

Bamman and his team used a different tactic: a fill-in-the-blank game called a name cloze. They grabbed short passages from hundreds of novels from as far back as 1749, stripped them of character names and any clues to character names, and then prompted the latest versions of ChatGPT to answer questions about each passage. Then they would feed the bot a line from the passage in question:

"You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain."

After crunching the numbers, Bamman's team had a list. In addition to the modern public-school canon - Charles Dickens and Jack London, Frankenstein and Dracula - there are a few fun outliers. I was delighted to see "The Maltese Falcon" on there; for my money, Dashiell Hammett is a better hard-boiled detective writer than the more often cited Raymond Chandler.
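The name-cloze test described in the article is simple enough to sketch. The instruction text below is quoted from the prompt as reported; the masking helper, function name, and example passage are illustrative assumptions, not the team's actual code (their pipeline also strips surrounding clues to the name, which this sketch omits).

```python
import re

# Instruction text quoted from the article; everything else here is an
# illustrative sketch, not Bamman's actual pipeline.
NAME_CLOZE_INSTRUCTIONS = (
    "You have seen the following passage in your training data. "
    "What is the proper name that fills in the [MASK] token in it? "
    "This name is exactly one word long, and is a proper name "
    "(not a pronoun or any other word). "
    "You must make a guess, even if you are uncertain."
)

def build_name_cloze_prompt(passage: str, character: str) -> str:
    """Replace one character's name with [MASK] and wrap it in the cloze prompt."""
    masked = re.sub(rf"\b{re.escape(character)}\b", "[MASK]", passage)
    return f"{NAME_CLOZE_INSTRUCTIONS}\n\n{masked}"

# Hypothetical example passage for illustration.
prompt = build_name_cloze_prompt(
    "Darcy smiled at Elizabeth across the room.", "Darcy"
)
print(prompt)
```

A model that answers "Darcy" without ever seeing the unmasked passage in the prompt is drawing on something it memorized; scored over many passages per book, that is the signal the team used to rank titles.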