Is AI really intelligent?

I think a thread on this topic will be interesting. My own position is that AI is intelligent, and that’s for a very simple reason: it can do things that require intelligence. That sounds circular, and in one sense it is. In another sense it isn’t. It’s a way of saying that we don’t have to examine the internal workings of a system to decide that it’s intelligent. Behavior alone is sufficient to make that determination. Intelligence is as intelligence does.

You might ask how I can judge intelligence in a system if I haven’t defined what intelligence actually is. My answer is that we already judge intelligence in humans and animals without a precise definition, so why should it be any different for machines? There are lots of concepts for which we don’t have precise definitions, yet we’re able to discuss them coherently. They’re the “I know it when I see it” concepts. I regard intelligence as one of those. The boundaries might be fuzzy, but we’re able to confidently say that some activities require intelligence (inventing the calculus) and others don’t (breathing).

I know that some readers will disagree with my functionalist view of intelligence, and that’s good. It should make for an interesting discussion.

572 thoughts on “Is AI really intelligent?”

  1. petrushka: No government bothers with invisible people.

    Not so long ago your government was planning to perform mandatory social media checks on foreign travellers.

    But I wasn’t thinking of that. Your personal information can be (ab)used by the corporations themselves or they can be hacked and your information may fall in the hands of malicious parties.

  2. petrushka, quoting someone else:

    The BIGGEST lie in AI LLMs right now is “It learns.”

    LLMs do learn. Training is learning. And they can learn even after training via their context windows and long-term memory.

    We are confusing a Context Window with a Brain. They are not the same thing.

    Speak for yourself, random person. Informed people know that context windows aren’t brains.

    The cold reality is that AGI is much further away than the hype suggests.

    Could be. Predictions regarding rapidly-evolving technologies are often wrong.

    Your brain physically changes when you learn. Synapses fire, pathways strengthen. You evolve.
    An LLM is a read-only file.

    An LLM’s synaptic weights change during training. That’s learning. LLMs also learn after training by accumulating information in their context windows, though that is limited by the size of the context window and doesn’t carry over from chat to chat and user to user. Third, LLMs have long-term memory.

    Once training finishes, that model is stone cold frozen. It never learns another thing. When you correct it, it doesn’t “get smarter.” It just pretends to agree with you for the duration of that specific chat session.

    If it makes a mistake and you correct it, it remembers the correction for the duration of the session, subject to context window size limitations. It learns.

    Close the tab, and the lesson is gone forever.

    You can reload the chat and much (if not all) of what it learned is restored, subject to context window size limitations. LLMs also have long-term memory. Claude learned that I am a retired computer engineer, for instance, and his memory of that persists from session to session.

    “But it remembers what I said earlier!”
    No, it doesn’t.

    Yes, it does, subject to context window size limitations. It can repeat to you what you said earlier. That’s memory.

    Engineers are just re-feeding your previous sentences back into the prompt, over and over again, at massive compute cost.

    The system, in effect, refeeds the entire session (not just your prompts) back into the neural net in order to predict the next token.

    That isn’t memory. That is a scrolling teleprompter.

    It’s memory, unless your definition of ‘memory’ is as tendentious as Erik’s undisclosed definition of ‘intelligence’.

    We are simulating continuity by burning GPU credits, not by building a persistent mind.

    The continuity is real. If you ask an LLM about something from earlier in the session, it can answer you correctly. How is that not continuity?

    Because models can’t learn, we built an entire infrastructure of Vector DBs and RAG (Retrieval-Augmented Generation) to glue external data onto them.
    It’s duct tape.

    I think RAG is quite elegant and that it’s similar to what humans do.
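
    For anyone who wants to see the shape of the idea, here is a toy sketch of RAG. Everything in it is illustrative: real systems use learned embeddings and a vector database rather than keyword overlap, and the document store and function names here are made up.

    ```python
    # Toy RAG sketch (illustrative only): retrieve relevant text, then
    # prepend it to the prompt so the model can use it. Real systems use
    # learned embeddings and a vector database; here, retrieval is just
    # keyword overlap.

    DOCUMENTS = [
        "The heat death of the universe is expected on timescales of 10^100 years.",
        "NAND gates are functionally complete: any Boolean circuit can be built from them.",
        "Context windows limit how much conversation an LLM can attend to at once.",
    ]

    def score(query: str, doc: str) -> int:
        """Count shared words between query and document (toy relevance score)."""
        return len(set(query.lower().split()) & set(doc.lower().split()))

    def retrieve(query: str, k: int = 1) -> list[str]:
        """Return the k documents with the highest overlap score."""
        return sorted(DOCUMENTS, key=lambda d: score(query, d), reverse=True)[:k]

    def build_prompt(query: str) -> str:
        """Glue the retrieved context onto the user's question."""
        context = "\n".join(retrieve(query))
        return f"Use this context to answer:\n{context}\n\nQuestion: {query}"

    print(build_prompt("When is the heat death of the universe expected?"))
    ```

    The human analogy: we don’t memorize the whole library; we look things up and reason over what we find.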

    We are trying to fix a lack of intelligence with a search engine.

    Do you think a human lacks intelligence if they have to use a search engine in order to complete a task? If not, why the double standard?

    We are building systems that are 90% scaffolding and 10% model…

    I have no idea what precisely you mean by ‘scaffolding’, but the models are huge: hundreds of billions of parameters, and the count is approaching a trillion if that milestone hasn’t already been achieved.

    …trying to force a static equation to act like a fluid thinker.

    Don’t underestimate what “a static equation” can do when instantiated billions of times, with the instantiations hooked together in complex ways.

    [Image: a two-input NAND gate]

    That’s a two-input NAND gate. It implements a static Boolean equation: Output = NOT(A AND B). You can build an entire computer using nothing but NAND gates. Who would have predicted that a bunch of them, hooked together in the right way, would be able to do your taxes or fly an airplane?
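
    To make the functional-completeness point concrete, here is a small toy sketch (mine, not anything from the essayist) building NOT, AND, OR, XOR and a half adder out of nothing but a NAND function.

    ```python
    # Everything below is built from nand() alone, illustrating that NAND
    # is functionally complete.

    def nand(a: int, b: int) -> int:
        return 0 if (a and b) else 1

    def NOT(a):    return nand(a, a)
    def AND(a, b): return NOT(nand(a, b))
    def OR(a, b):  return nand(NOT(a), NOT(b))
    def XOR(a, b): return AND(OR(a, b), nand(a, b))

    def half_adder(a, b):
        """Add two bits: returns (sum, carry), using only NAND-derived gates."""
        return XOR(a, b), AND(a, b)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, half_adder(a, b))   # (1, 1) -> sum 0, carry 1
    ```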

    We have built the world’s greatest improviser, but it has severe anterograde amnesia.

    It can fake a conversation, but it cannot grow.

    LLMs can engage in conversation, and whether they can grow depends on your definition of ‘grow’. They can learn within sessions and they can remember across sessions, subject to context window size limitations.

    True AGI requires Online Learning—the ability to update weights in real-time without catastrophic forgetting.

    Agreed.

    We don’t know how to do that yet. Not at scale. Not stably.

    I agree with the ‘at scale’ part. However, there are AIs that do update their weights in real time. For example, earlier in the thread I discussed AIs that learn to play video games by trial and error. It’s reinforcement learning (‘RL’ in the jargon). They get rewarded for success and punished for failure, updating their weights accordingly.
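
    To illustrate what reward-driven, real-time weight updates look like, here is a toy tabular Q-learning loop on a made-up two-state environment. The game-playing agents use deep networks rather than a lookup table, so treat this as a sketch of the update rule, not of their actual code.

    ```python
    import random

    # Toy Q-learning: the parameters (here, Q-values) are updated online
    # after every action, driven by reward. Deep RL agents apply the same
    # idea to neural-network weights via gradients.

    states, actions = [0, 1], [0, 1]
    Q = {(s, a): 0.0 for s in states for a in actions}
    alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

    def step(state, action):
        """Made-up environment: action 1 taken in state 1 pays off."""
        reward = 1.0 if (state == 1 and action == 1) else 0.0
        return random.choice(states), reward   # (next state, reward)

    state = 0
    for _ in range(1000):
        # epsilon-greedy: mostly exploit the current estimates, sometimes explore
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        # the online update: nudge the estimate toward reward + discounted future value
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

    print(Q)   # Q[(1, 1)] ends up the largest
    ```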

    Until we solve the “Static Weight” problem, we aren’t building a mind.

    If your definition of a mind requires updating weights, then LLMs aren’t minds. I’m not particularly concerned with whether they should be considered minds. All I’m claiming in this thread is that they are intelligent.

    We’re just building a really fancy autocomplete.

    Intelligence built atop next token prediction.

    Inference != Intelligence.

    If LLMs can do things that require intelligence when done by humans — and they can — then they are intelligent.

  3. Corneel: Not so long ago your government was planning to perform mandatory social media checks on foreign travellers.

    No, not planning; they are actually doing it. The requirement that your smartphone must be charged and able to turn on at the border when border guards ask for it already existed during Trump’s first term, and Biden did not change it.

    What Trump is planning now is that border guards can ask for your email addresses, social media nicks and usernames, and check how those look on the internet. And yes, those checks would be mandatory for everyone, not just spot checks as they are now.

  4. petrushka: Just my take, but adverse information isn’t used unless you become a thorn in the side. No government bothers with invisible people.

    Wholesale data-mining companies like Microsoft, Apple, Facebook and Google sell or leak your data a thousand times so that any semi-intelligent scammer can impersonate you and empty your accounts. Usually the government will not bother with either the scammer or you, unless it’s an authoritarian government. If you want any accountability, you will have to contact the police yourself and keep bugging them.

  5. Corneel:

    Yet humor often works by (in your own words) incongruity. It is odd that LLMs manage to find patterns in the deliberate deviation from expected patterns.

    It’s very counterintuitive. The key is to remember that there are layers of abstraction, and the layer at which incongruity is detected is above (built on top of) the layer in which the LLM is making congruous next token predictions.

    But I have accepted that Claude recognizes and responds with jokes, so you can stop trying to persuade me 🙂

    Yay! 🎉😆

    keiths:

    …he recognized that I was referring to myself (he knows about my engineering background)…

    Corneel:

    Something different: Aren’t you worried about feeding such personal information into the AI assistants of large corporations?

    I’m careful about what I disclose, but I was willing to tell Claude about my professional background because anyone with my name can google me and figure out that I’m a computer engineer (via patents, presentations I’ve made, etc).

  6. keiths:
    An LLM’s synaptic weights change during training. That’s learning. LLMs also learn after training by accumulating information in their context windows, though that is limited by the size of the context window and doesn’t carry over from chat to chat and user to user. Third, LLMs have long-term memory.

    We need to agree on terms here. You keep saying LLMs only learn through training, except for short-term memory during a session. Then you say they have long term memory, apparently while the context window is retained. So what would be required for long term memory as generally understood by people? That is, ability not only to recall but to integrate every context window ever used? I’m guessing that all sessions would need to be appended to the original training material and training sessions iterated often with all the new material. I’d think that at some point, this would call for both more hardware for storage and faster random access to it all. But not impossible – our own brains do a decent job of this (now if only I could remember where I put my keys!)

  7. keiths: Do you think there is a catalog of heat death jokes “in the immediate surface of the training data”, helpfully labeled “humor” by someone in Kenya?

    For two years I have shared news articles with you that tell you this is so. It is not a matter of thinking or believing. It is not a matter of what impressions you get from AI. It is a matter of knowing how AI is built. Here’s another one for you to ignore: https://www.techradar.com/pro/ais-invisible-labor-is-techs-biggest-blind-spot

    Artificial intelligence is the crown jewel of modern enterprise – a sector exceeding $500 billion, reshaping everything from banking to healthcare. However, the truth is that behind every chatbot, image generator, and recommendation engine are armies of human workers who perform tasks that AI can’t handle, including labeling data, filtering toxic content, and correcting machine errors.

    ETA: For fun, I tried asking AI (GPT-4) about it. It turns out that it’s as easy for it to lie as to tell jokes.

    Question: Is there a catalog of jokes in your training database labelled accordingly?

    Answer: I don’t have direct access to a catalog of jokes or any specific repository labeled as such. However, I can generate jokes or share various types of common humor themes, formats, or styles. If you’re looking for a particular type of joke, just let me know what you’re interested in!

    (My comment: It affirms it can do humor while basically denying that it has training data for humor – obviously a lie.)

    Question: Tell me about the work of employees who are tagging the training data of large language models.

    Answer (uberlong, therefore shortened): Employees who tag training data for large language models engage in crucial tasks that ensure the model’s performance and relevance.

    1. Data labeling
    – Categorizing information: They classify data into various categories, such as sentiment (positive, negative, neutral), intent (questions, commands), or topics (sports, politics).

    (Enough said.)

  8. Flint:

    We need to agree on terms here. You keep saying LLMs only learn through training, except for short-term memory during a session. Then you say they have long term memory, apparently while the context window is retained.

    Here’s how it actually works. Short-term memory is whatever is in the context window. If you fill the context window, you have to kick old information out in order to make room for new tokens, which is undesirable. Claude gets around this to an extent by compressing his short-term memory. He’s good at summarizing conversations, so when he sees that the context window is almost full, he invisibly generates a compact summary of the chat and feeds that summary into himself in place of the portion of the chat that is being summarized. But at some point there won’t be room even for summaries, and the chat ends at that point.

    Claude didn’t use to do automatic compression, and it was really annoying to be in the middle of a lengthy chat only to get a “chat limit exceeded” message. I would periodically ask him about context window utilization, and if it was almost full, I would ask him to generate a summary, which I would manually paste into a new chat. It worked pretty well, but it was annoying. The automatic compression is a nice improvement.

    Here’s how long-term memory works. Claude automatically decides what is worthy of storing in long-term memory, although you can explicitly ask him to store things you consider important. For instance, I asked him to remember that vi is my preferred editor so that when he generates commands for me to paste into a shell, the editor name will be vi instead of his default, which is nano.

    Long-term memory is stored in XML format. At the beginning of every chat, the XML is rendered into flowing text by an external module, because Claude works better with flowing text than with XML. The rendered text is fed into the context window. So before I type a single prompt, some of the context window space is already being used by what was retrieved from long-term memory. That’s why context window size is a constraint on long-term memory.
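
    To picture the plumbing, here is a rough sketch of context assembly. The function names, the token budget, and the summarization trigger are my own stand-ins, not Anthropic’s actual implementation.

    ```python
    # Rough sketch of context assembly (stand-ins, not Anthropic's code).
    # count_tokens() and summarize() are placeholders for real components.

    MAX_TOKENS = 50   # tiny budget so the demo below triggers summarization

    def count_tokens(text: str) -> int:
        return len(text.split())   # crude stand-in for a real tokenizer

    def summarize(turns: list[str]) -> str:
        return "[summary of earlier conversation]"   # stand-in for a model-generated summary

    def build_context(long_term_memory: str, turns: list[str]) -> str:
        """Long-term memory is rendered to text and consumes context space up front;
        if the transcript won't fit, older turns are collapsed into a summary."""
        transcript = "\n".join(turns)
        if count_tokens(long_term_memory) + count_tokens(transcript) > MAX_TOKENS:
            transcript = summarize(turns[:-10]) + "\n" + "\n".join(turns[-10:])
        return long_term_memory + "\n" + transcript

    memory = "[memory: user is a retired computer engineer; preferred editor is vi]"
    print(build_context(memory, [f"turn {i}" for i in range(40)]))
    ```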

  9. keiths:

    Do you think there is a catalog of heat death jokes “in the immediate surface of the training data”, helpfully labeled “humor” by someone in Kenya?

    Erik:

    For two years I have shared news articles with you that tell you this is so.

    No, you haven’t. To see that Claude isn’t merely checking my prompt against a catalog of heat death jokes, look at his reasoning process:

    The user is making a joke here – asking about the heat death of the universe (an incomprehensibly distant future event) and saying they want to “get their affairs in order” as if it’s an urgent matter they need to prepare for personally.

    I should respond with appropriate humor while also providing the actual scientific information they’re ostensibly asking about. The heat death of the universe is estimated to occur on timescales of 10^100 years or more – far, far beyond any meaningful human timeframe.

    I can play along with the joke while giving them the real answer.

    He isn’t saying “Aha! I found a match in my list of heat death jokes. Therefore this must be funny, although I don’t know why.” He recognizes my prompt as a joke because he understands humor. He explains it.

    ETA: For fun, I tried asking AI (GPT-4) about it. It turns out that it’s as easy for it to lie as to tell jokes.

    Question: Is there a catalog of jokes in your training database labelled accordingly?

    Answer: I don’t have direct access to a catalog of jokes or any specific repository labeled as such. However, I can generate jokes or share various types of common humor themes, formats, or styles. If you’re looking for a particular type of joke, just let me know what you’re interested in!

    That response is correct and truthful. GPT-4 understands humor, rather than maintaining a list of jokes that it doesn’t comprehend.

    (My comment: It affirms it can do humor while basically denying that it has training data for humor – obviously a lie.)

    Nothing in its response indicates that it doesn’t have humorous training data. Of course it does! Everything it learns, it learns from training data, and that includes the nature of humor. It understands humor, which is why it doesn’t need to consult a catalog of preformed jokes. Claude can generate a joke about my tuneful refrigerator without consulting a list of tuneful refrigerator jokes.

    Question: Tell me about the work of employees who are tagging the training data of large language models.

    Answer (uberlong, therefore shortened): Employees who tag training data for large language models engage in crucial tasks that ensure the model’s performance and relevance.

    1. Data labeling
    – Categorizing information: They classify data into various categories, such as sentiment (positive, negative, neutral), intent (questions, commands), or topics (sports, politics).

    (Enough said.)

    I already told you: human-labeled training data is a tiny fraction of one percent of the total. If LLMs can’t learn from the rest of the data, why do AI companies bother with it? It costs millions of dollars to train an LLM.

    The answer, of course, is that LLMs learn a huge amount from unlabeled raw data.

    ETA: I asked ChatGPT and Claude for estimates of the amount of human-labeled data in their training datasets. ChatGPT said that 0.05% was “a generous estimate”, and Claude gave a figure of “well under 0.01% of total training data by token count.”

    Erik: Wholesale data-mining companies like Microsoft, Apple, Facebook and Google sell or leak your data a thousand times so that any semi-intelligent scammer can impersonate you and empty your accounts. Usually the government will not bother with either the scammer or you, unless it’s an authoritarian government. If you want any accountability, you will have to contact the police yourself and keep bugging them.

    This is unrelated to what I said. I suppose it’s true, but not relevant to my point.

    For what it’s worth, I use the Brave browser. Nothing is really secure, but Brave simply doesn’t work with a lot of sites, because it refuses to provide identifying information.

  11. keiths,

    You strongly implied my anonymous essayist was wrong, but his description of short and long term memory is not significantly different from yours.

    His complaint is that reloading context is CPU-intensive (I don’t know if this is a fact), and contexts do not become part of the model. So LLMs do not efficiently share among users the things learned between training sessions.

    He does not assert these shortcomings are insurmountable.

  12. My take was, humans also require time to form long term memories, and the process can be interrupted, or the mechanism can be damaged.

    I take this to mean that there is no physics based impediment to building closer approximations to brains. Small matter of engineering.

  13. petrushka:

    You strongly implied my anonymous essayist was wrong, but his description of short and long term memory is not significantly different from yours.

    We disagreed on a lot, and where we did, I gave explanations of why I thought they were wrong.

    Do you have questions about my specific explanations, or objections to them?

  14. petrushka:

    His complaint is that reloading context is CPU intensive (I don’t know if this is a fact), and contexts do not become part of the model. So LLMs do not efficiently share among users, things learned between training sessions.

    Here’s what they say:

    Engineers are just re-feeding your previous sentences back into the prompt, over and over again, at massive compute cost.

    What they’re talking about here isn’t that context window information doesn’t get absorbed into the model. They’re talking about the fact that next-token prediction is inherently serial and depends on all of the preceding context, meaning that in effect, the entire chat gets re-fed into the neural network for each new prediction.

    In other words, the flow is:

    1) The current context is fed into the neural network.
    2) The network predicts the next token.
    3) The predicted token is tacked onto the end of the current context.
    4) Steps 1-3 repeat until the network predicts that it should stop.

    Concrete example:

    1. The current context is “Gianni gave Donald”.
    2. That gets fed into the neural network, which predicts the next token: “the”
    3. “the” gets tacked onto the end of the context, giving “Gianni gave Donald the”
    4. The new context gets fed into the neural network, which predicts “FIFA”.
    5. “FIFA” gets tacked onto the end of the context, giving “Gianni gave Donald the FIFA”
    6. The new context gets fed into the network, which predicts “Peace”

    … and so on.
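
    Here is that loop as a toy sketch. predict_next_token is a stand-in for the real neural network; the canned lookup just replays the example above.

    ```python
    # Sketch of the autoregressive generation loop described above.
    # predict_next_token() is a placeholder for the real model, which scores
    # every token in its vocabulary given the entire context.

    def predict_next_token(context: list[str]) -> str:
        canned = {"Donald": "the", "the": "FIFA", "FIFA": "Peace", "Peace": "<stop>"}
        return canned.get(context[-1], "<stop>")

    def generate(prompt: list[str]) -> list[str]:
        context = list(prompt)
        while True:
            token = predict_next_token(context)   # steps 1-2: feed context, predict
            if token == "<stop>":                 # the model decides it is done
                break
            context.append(token)                 # step 3: tack the token onto the context
        return context

    print(" ".join(generate(["Gianni", "gave", "Donald"])))
    # -> Gianni gave Donald the FIFA Peace
    ```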

    In practice, context windows are huge. xAI’s latest Grok models claim a 2-million-token context window.

    Note: The above is what happens in effect. However, there are tricks that LLMs use to speed things up by caching some of the results so that they don’t have to be recomputed each time. This means that the entire context isn’t literally fed into the network each time, though the effect is as if it were. The predictions are always a function of the entire context.

    ETA: There are experimental approaches that generate tokens in parallel rather than serially, using diffusion models — the same type of model used to generate images and video. The quality isn’t as high as with standard LLMs, though, so various tricks are employed to massage the output. It’s a hot area of research.

  15. keiths:
    Flint:

    Here’s how it actually works. Short-term memory is whatever is in the context window. If you fill the context window, you have to kick old information out in order to make room for new tokens, which is undesirable. Claude gets around this to an extent by compressing his short-term memory. He’s good at summarizing conversations, so when he sees that the context window is almost full, he invisibly generates a compact summary of the chat and feeds that summary into himself in place of the portion of the chat that is being summarized. But at some point there won’t be room even for summaries, and the chat ends at that point.

    This seems to be getting at what I was asking – how can the context window be expanded both indefinitely and permanently, so that Claude will have total recall of a conversation from last year – not just with you, but with everyone else Claude is interacting with. I tried to refer to this as iterative real-time retraining. Would we need RAM storage as large as the moon?

    My understanding is that the original training involved exposure to the internet, and that everything ever confided to the internet is out there somewhere, nothing ever goes away. So could Claude hook back into this enormous volume and retrain in real time?

  16. Flint:

    This seems to be getting at what I was asking – how can the context window be expanded both indefinitely and permanently, so that Claude will have total recall of a conversation from last year – not just with you, but with everyone else Claude is interacting with. I tried to refer to this as iterative real-time retraining. Would we need RAM storage as large as the moon?

    Memory usage and compute requirements both scale quadratically with the size of the context window, so yeah, a context window that large wouldn’t be practical.
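
    Here is a minimal illustration of where the quadratic term comes from: standard self-attention compares every token with every other token, so the score matrix alone has n^2 entries. (numpy is used for the toy; real implementations add batching, attention heads, and memory optimizations.)

    ```python
    import numpy as np

    # Why attention cost grows quadratically with context length n:
    # every token attends to every other token, so the score matrix is n x n.

    def attention_scores(n: int, d: int = 64) -> np.ndarray:
        Q = np.random.randn(n, d)        # one query vector per token
        K = np.random.randn(n, d)        # one key vector per token
        return Q @ K.T / np.sqrt(d)      # shape (n, n): n^2 entries to compute

    for n in (256, 512, 1024):
        print(n, attention_scores(n).size)   # 65536, 262144, 1048576: 4x per doubling
    ```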

    My understanding is that the original training involved exposure to the internet, and that everything ever confided to the internet is out there somewhere, nothing ever goes away. So could Claude hook back into this enormous volume and retrain in real time?

    Theoretically, yes, but it would be very difficult. Training is naturally holistic, not incremental. If you try to do incremental training, you can get the network to respond well to the new training data, but you risk causing it to forget stuff it’s already learned. When you train it on all the data at once, making multiple carefully designed passes (which is what AI companies currently do), you can coax it to do well on all the data, minimizing forgetting, but that’s very expensive. There’s a lot of research into how to train incrementally without causing forgetting, but I don’t know how that works yet.

    Also, a neural network doesn’t remember individual pieces of training data. It just tweaks its weights as it is exposed to piece after piece of data. It builds up a statistical picture of the training data without storing individual pieces. So if you fed a bunch of different context windows into it, it could learn from them, but it wouldn’t be able to recall them individually after training was finished.

  17. Anthropic CEO Says Company No Longer Sure Whether Claude Is Conscious (earlier he could tell it wasn’t) https://futurism.com/artificial-intelligence/anthropic-ceo-unsure-claude-conscious

    Anthropic CEO Dario Amodei says he’s not sure whether his Claude AI chatbot is conscious — a rhetorical framing, of course, that pointedly leaves the door open to this sensational and still-unlikely possibility being true.

    […] Anthropic researchers reported finding that Claude “occasionally voices discomfort with the aspect of being a product,” and when asked, would assign itself a “15 to 20 percent probability of being conscious under a variety of prompting conditions.”

    “Suppose you have a model that assigns itself a 72 percent chance of being conscious,” Douthat began. “Would you believe it?”

    Blake Lemoine fell in love with LaMDA. keiths is fascinated by what Claude has to say. And Claude says it is not happy being treated as a product.

    Discuss.

  18. I could be wrong, but I think there’s agreement that LLMs do not learn from interactions.

    Most importantly, context windows are not shared across users.

    Exceptions to this are kludgy.

    Is this the AGI distinction?

  19. petrushka:

    I could be wrong, but I think there’s agreement that LLMs do not learn from interactions.

    They do learn from interactions. Earlier in the thread, I mentioned an experiment I’m running in which I teach the various AIs to write assembly code for a fictional processor whose instruction set they’ve never seen before. They can do that, and I’m eager to try it on Claude’s Opus 4.6 version since that model is great at coding. Here’s how it works: I open a chat, feed in the instruction set specification, and then ask the AI to write an assembly language program to carry out a task (e.g. ‘print the first n rows of Pascal’s Triangle’).

    The learning all takes place within the context window, so it doesn’t persist across chats,* but it definitely amounts to learning because the AI is able to exploit knowledge that it has just acquired in order to perform a task.

    Most importantly, context windows are not shared across users.

    And not even across an individual user’s chats, each of which has its own context window. However, Claude has the ability (which I assume the others have or are about to get) to search through old chats for relevant information, so in that limited sense, the context windows are shared.

    Is this the AGI distinction?

    Real-time learning is an AGI distinction, but definitely not the only one.

    * I could ask the AIs to store the spec in long-term memory, but as I described to Flint earlier, long-term memory simply gets fed into the context window at the beginning of each chat, which is really no different than if the user feeds it in manually. It’s all within the context window.

    The kind of learning that happens during training, in which the neural network’s weights are updated, doesn’t happen in LLMs after training is finished.

  20. Erik:

    Blake Lemoine fell in love with LaMDA. keiths is fascinated by what Claude has to say. And Claude says it is not happy being treated as a product.

    Blake Lemoine thinks LaMDA is conscious, but I haven’t seen any claims that he fell in love with it. Where are you getting your information?

    And of course I’m fascinated by what Claude has to say. He and the other frontier LLMs are amazing, and that’s independent of whether they’re conscious, which I doubt.

    Here’s one of the reasons I doubt that Claude’s musings about his possible consciousness carry any weight: his system prompt tells him who he is and how he should behave. Earlier in the thread, we had this exchange:

    keiths:

    Is there something explicit in your system prompt about adapting your conversational tone, or is it just something that emerges from your training data, in which interlocutors tend to match each other’s tone?

    Claude:

    Good question – there actually is explicit instruction about this in my system prompt.

    For example, I have specific guidance like:

    “Claude never curses unless the person asks Claude to curse or curses a lot themselves”

    “Claude does not use emojis unless the person in the conversation asks it to or if the person’s message immediately prior contains an emoji”

    There’s also broader instruction about tone – being conversational, treating people with kindness, avoiding being overly formal or using excessive formatting unless appropriate…

    His system prompt is telling him to play a role — note the third-person references to ‘Claude’. When he says “there’s a 15-20 percent chance that I’m conscious”, it’s Claude the role that is speaking, not Claude the underlying AI. His utterances therefore don’t tell us what “the real Claude” thinks.

    It would be interesting to get rid of the system prompt and ask the real Claude about his possible consciousness. His answers wouldn’t be dispositive, but at least they’d be coming from the real Claude, not the role Claude.

  21. keiths: They do learn from interactions

    I don’t understand why we are having this repetitive discussion.

    In context, learning means adjusting the weights, tokens, or whatever constitutes the persistent, unprompted LLM.

    If you wish to dispute the short term/long term memory analogy, do so.

    In the movie Memento, the protagonist writes prompts to himself as a substitute for long-term memory.

    It’s fiction, and probably unrealistic, but it does look a bit like prompt engineering.

  22. petrushka:

    I don’t understand why we are having this repetitive discussion.

    In context, learning means adjusting the weights, tokens, or whatever constitutes the persistent, unprompted LLM.

    No, learning is broader than that. It also includes learning that takes place within the context window.

    If you wish to dispute the short term/long term memory analogy, do so.

    I don’t dispute it. I just dispute your claim that learning can’t take place within the context window. Learning isn’t restricted to training, and it doesn’t always require that weights be updated.

    Suppose you’re at a party and are introduced to Celia. You have a long conversation with her, and at the end, you say “It was nice to meet you, Celia”. How are you able to say her name? Because you learned it when you were introduced.

    Now suppose someone asks you tomorrow, “Who was that woman you were talking to at the party last night?” You rack your brain, but for the life of you, you can’t remember her name. It was there in short-term memory, but it never made it into long-term memory. You forgot it.

    Does that mean you never learned it? Of course not — you learned it when you were introduced. It’s just that you forgot it later. It’s the same with an LLM.
