I think a thread on this topic will be interesting. My own position is that AI is intelligent, and that’s for a very simple reason: it can do things that require intelligence. That sounds circular, and in one sense it is. In another sense it isn’t. It’s a way of saying that we don’t have to examine the internal workings of a system to decide that it’s intelligent. Behavior alone is sufficient to make that determination. Intelligence is as intelligence does.
You might ask how I can judge intelligence in a system if I haven’t defined what intelligence actually is. My answer is that we already judge intelligence in humans and animals without a precise definition, so why should it be any different for machines? There are lots of concepts for which we don’t have precise definitions, yet we’re able to discuss them coherently. They’re the “I know it when I see it” concepts. I regard intelligence as one of those. The boundaries might be fuzzy, but we’re able to confidently say that some activities require intelligence (inventing the calculus) and others don’t (breathing).
I know that some readers will disagree with my functionalist view of intelligence, and that’s good. It should make for an interesting discussion.
Not so long ago your government was planning to perform mandatory social media checks on foreign travellers.
But I wasn’t thinking of that. Your personal information can be (ab)used by the corporations themselves, or they can be hacked and your information may fall into the hands of malicious parties.
petrushka, quoting someone else:
LLMs do learn. Training is learning. And they can learn even after training via their context windows and long-term memory.
Speak for yourself, random person. Informed people know that context windows aren’t brains.
Could be. Predictions regarding rapidly-evolving technologies are often wrong.
An LLM’s synaptic weights change during training. That’s learning. LLMs also learn after training by accumulating information in their context windows, though that is limited by the size of the context window and doesn’t carry over from chat to chat and user to user. Third, LLMs have long-term memory.
If it makes a mistake and you correct it, it remembers the correction for the duration of the session, subject to context window size limitations. It learns.
You can reload the chat and much (if not all) of what it learned is restored, subject to context window size limitations. LLMs also have long-term memory. Claude learned that I am a retired computer engineer, for instance, and his memory of that persists from session to session.
Yes, it does, subject to context window size limitations. It can repeat to you what you said earlier. That’s memory.
The system, in effect, refeeds the entire session (not just your prompts) back into the neural net in order to predict the next token.
It’s memory, unless your definition of ‘memory’ is as tendentious as Erik’s undisclosed definition of ‘intelligence’.
The continuity is real. If you ask an LLM about something from earlier in the session, it can answer you correctly. How is that not continuity?
I think RAG is quite elegant and that it’s similar to what humans do.
Do you think a human lacks intelligence if they have to use a search engine in order to complete a task? If not, why the double standard?
I have no idea what precisely you mean by ‘scaffolding’, but the models are huge: hundreds of billions of parameters, and the count is approaching a trillion if that milestone hasn’t already been achieved.
Don’t underestimate what “a static equation” can do when instantiated billions of times, with the instantiations hooked together in complex ways.
That’s a two-input NAND gate. It implements a static Boolean equation: Output = !(A·B). You can build an entire computer using nothing but those. Who would have predicted that a bunch of those, hooked together in the right way, would be able to do your taxes or fly an airplane?
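To make that concrete, here’s a quick sketch (in Python rather than silicon, obviously) of NOT, AND, OR, XOR and a half adder built from nothing but a two-input NAND. The half adder is the first rung on the ladder toward the arithmetic units that do your taxes.

```python
# Every gate below is built from nothing but two-input NAND.
def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

def not_(a: int) -> int:
    return nand(a, a)

def and_(a: int, b: int) -> int:
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))

def xor(a: int, b: int) -> int:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two bits using only NAND-derived gates. Returns (sum, carry)."""
    return xor(a, b), and_(a, b)

# Exhaustive check of the truth table.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, half_adder(a, b))
```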
LLMs can engage in conversation, and whether they can grow depends on your definition of ‘grow’. They can learn within sessions and they can remember across sessions, subject to context window size limitations.
Agreed.
I agree with the ‘at scale’ part. However, there are AIs that do update their weights in real time. For example, earlier in the thread I discussed AIs that learn to play video games by trial and error. It’s reinforcement learning (‘RL’ in the jargon). They get rewarded for success and punished for failure, updating their weights accordingly.
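For anyone who wants the flavor of reward-driven weight updates, here’s a toy sketch. It is not the deep RL used by the game-playing agents I mentioned (those use neural networks and algorithms such as DQN or PPO); it’s just a two-armed bandit whose “weights” are its value estimates for each action, nudged toward the rewards it actually receives.

```python
import random

weights = [0.0, 0.0]       # value estimate ("weight") per action
alpha = 0.1                # learning rate
epsilon = 0.1              # exploration rate
true_payoff = [0.3, 0.7]   # hidden reward probabilities, unknown to the agent

for step in range(5000):
    # Explore occasionally; otherwise exploit the current best estimate.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: weights[a])

    # Reward for success, nothing for failure.
    reward = 1.0 if random.random() < true_payoff[action] else 0.0

    # Update the weight for the chosen action toward the observed reward.
    weights[action] += alpha * (reward - weights[action])

print(weights)  # converges toward the true payoffs, roughly [0.3, 0.7]
```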
If your definition of a mind requires updating weights, then LLMs aren’t minds. I’m not particularly concerned with whether they should be considered minds. All I’m claiming in this thread is that they are intelligent.
Intelligence built atop next token prediction.
If LLMs can do things that require intelligence when done by humans — and they can — then they are intelligent.
No, not planning. They are actually doing it. The requirement that your smartphone must be charged at the border and turn on when border guards ask for it already existed during Trump’s first term, and Biden did not change it.
What Trump is planning now is that border guards can ask for your email addresses, social media nicknames and usernames and check how those appear on the internet. And yes, those checks would be mandatory for everyone, not just spot checks as they are now.
Wholesale data-mining companies like Microsoft, Apple, Facebook and Google sell or leak your data a thousand times over, so that any semi-intelligent scammer can impersonate you and empty your accounts. Usually the government will not bother with either the scammer or you, unless it’s an authoritarian government. If you want any accountability, you will have to contact the police yourself and keep bugging them.
Corneel:
It’s very counterintuitive. The key is to remember that there are layers of abstraction, and the layer at which incongruity is detected is above (built on top of) the layer in which the LLM is making congruous next token predictions.
Yay! 🎉😆
keiths:
Corneel:
I’m careful about what I disclose, but I was willing to tell Claude about my professional background because anyone with my name can google me and figure out that I’m a computer engineer (via patents, presentations I’ve made, etc).
We need to agree on terms here. You keep saying LLMs only learn through training, except for short-term memory during a session. Then you say they have long-term memory, apparently for as long as the context window is retained. So what would be required for long-term memory as generally understood by people? That is, the ability not only to recall but to integrate every context window ever used? I’m guessing that all sessions would need to be appended to the original training material and training iterated often with all the new material. I’d think that at some point, this would call for both more hardware for storage and faster random access to it all. But not impossible – our own brains do a decent job of this (now if only I could remember where I put my keys!)
For two years I have shared news articles with you that tell you this is so. It is not a matter of thinking or believing. It is not a matter of what impressions you get from AI. It is a matter of knowing how AI is built. Here’s another one for you to ignore: https://www.techradar.com/pro/ais-invisible-labor-is-techs-biggest-blind-spot
ETA: For fun, I tried asking AI (GPT-4) about it. It turns out that it’s as easy for it to lie as to tell jokes.
Question: Is there a catalog of jokes in your training database labelled accordingly?
Answer: I don’t have direct access to a catalog of jokes or any specific repository labeled as such. However, I can generate jokes or share various types of common humor themes, formats, or styles. If you’re looking for a particular type of joke, just let me know what you’re interested in!
(My comment: It affirms it can do humor while basically denying that it has training data for humor – obviously a lie.)
Question: Tell me about the work of employees who are tagging the training data of large language models.
Answer (uberlong, therefore shortened): Employees who tag training data for large language models engage in crucial tasks that ensure the model’s performance and relevance.
1. Data labeling
– Categorizing information: They classify data into various categories, such as sentiment (positive, negative, neutral), intent (questions, commands), or topics (sports, politics).
(Enough said.)
Flint:
Here’s how it actually works. Short-term memory is whatever is in the context window. If you fill the context window, you have to kick old information out in order to make room for new tokens, which is undesirable. Claude gets around this to an extent by compressing his short-term memory. He’s good at summarizing conversations, so when he sees that the context window is almost full, he invisibly generates a compact summary of the chat and feeds that summary into himself in place of the portion of the chat that is being summarized. But at some point there won’t be room even for summaries, and the chat ends at that point.
Claude didn’t use to do automatic compression, and it was really annoying to be in the middle of a lengthy chat only to get a “chat limit exceeded” message. I would periodically ask him about chat window utilization, and if it was almost full, I would ask him to generate a summary, which I would manually paste into a new chat. It worked pretty well, but it was annoying. The automatic compression is a nice improvement.
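To illustrate the idea, here’s a naive sketch of automatic compaction. This is not Anthropic’s actual implementation; the summarize() helper is a hypothetical stand-in for what, in reality, is the model being prompted to summarize its own transcript, and the token budget is made up.

```python
MAX_TOKENS = 200_000   # illustrative context budget
COMPACT_AT = 0.9       # compact when the window is ~90% full

def count_tokens(messages):
    # Crude stand-in for a real tokenizer.
    return sum(len(m["text"].split()) for m in messages)

def summarize(messages):
    # Hypothetical: in practice the LLM itself generates the summary.
    return {"role": "system", "text": "[compact summary of the earlier conversation]"}

def add_message(context, message):
    context.append(message)
    if count_tokens(context) > COMPACT_AT * MAX_TOKENS:
        # Replace the older half of the chat with a summary,
        # keeping the most recent messages verbatim.
        cutoff = len(context) // 2
        context[:cutoff] = [summarize(context[:cutoff])]
    return context
```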
Here’s how long-term memory works. Claude automatically decides what is worthy of storing in long-term memory, although you can explicitly ask him to store things you consider important. For instance, I asked him to remember that vi is my preferred editor so that when he generates commands for me to paste into a shell, the editor name will be vi instead of his default, which is nano.
Long-term memory is stored in XML format. At the beginning of every chat, the XML is rendered into flowing text by an external module, because Claude works better with flowing text than with XML. The rendered text is fed into the context window. So before I type a single prompt, some of the context window space is already being used by what was retrieved from long-term memory. That’s why context window size is a constraint on long-term memory.
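Roughly, the mechanism is something like the sketch below. The XML snippet and the render_memory() helper are my own illustrative inventions, not Claude’s actual internals; the point is just that stored memory gets rendered into flowing text and prepended to the context before your first prompt.

```python
import xml.etree.ElementTree as ET

# Hypothetical long-term memory store (format invented for illustration).
memory_xml = """
<memories>
  <fact>The user is a retired computer engineer.</fact>
  <preference>The user's preferred editor is vi, not nano.</preference>
</memories>
"""

def render_memory(xml_text: str) -> str:
    """Render stored memory into flowing text for the start of the context window."""
    root = ET.fromstring(xml_text)
    lines = [child.text.strip() for child in root]
    return "Things you know about this user: " + " ".join(lines)

# This rendered text is what consumes context-window space before the first prompt.
context = [{"role": "system", "text": render_memory(memory_xml)}]
print(context[0]["text"])
```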
keiths:
Erik:
No, you haven’t. To see that Claude isn’t merely checking my prompt against a catalog of heat death jokes, look at his reasoning process:
He isn’t saying “Aha! I found a match in my list of heat death jokes. Therefore this must be funny, although I don’t know why.” He recognizes my prompt as a joke because he understands humor. He explains it.
That response is correct and truthful. GPT-4 understands humor, rather than maintaining a list of jokes that it doesn’t comprehend.
Nothing in its response indicates that it doesn’t have humorous training data. Of course it does! Everything it learns, it learns from training data, and that includes the nature of humor. It understands humor, which is why it doesn’t need to consult a catalog of preformed jokes. Claude can generate a joke about my tuneful refrigerator without consulting a list of tuneful refrigerator jokes.
I already told you: human-labeled training data is a tiny fraction of one percent of the total. If LLMs can’t learn from the rest of the data, why do AI companies bother with it? It costs millions of dollars to train an LLM.
The answer, of course, is that LLMs learn a huge amount from unlabeled raw data.
ETA: I asked ChatGPT and Claude for estimates of the amount of human-labeled data in their training datasets. ChatGPT said that 0.05% was “a generous estimate”, and Claude gave a figure of “well under 0.01% of total training data by token count.”
This is unrelated to what I said. I suppose it’s true, but not relevant to my point.
For what it’s worth, I use the Brave browser. Nothing is really secure, but it simply doesn’t work with a lot of sites, because it refuses to provide identifying information.
keiths,
You strongly implied my anonymous essayist was wrong, but his description of short- and long-term memory is not significantly different from yours.
His complaint is that reloading context is CPU-intensive (I don’t know if this is a fact), and that contexts do not become part of the model. So LLMs do not efficiently share among users the things they learn between training sessions.
He does not assert these shortcomings are insurmountable.
My take was, humans also require time to form long term memories, and the process can be interrupted, or the mechanism can be damaged.
I take this to mean that there is no physics based impediment to building closer approximations to brains. Small matter of engineering.
petrushka:
We disagreed on a lot, and where we did, I gave explanations of why I thought they were wrong.
Do you have questions about my specific explanations, or objections to them?
petrushka:
Here’s what they say:
What they’re talking about here isn’t that context window information doesn’t get absorbed into the model. They’re talking about the fact that next-token prediction is inherently serial and depends on all of the preceding context, meaning that in effect, the entire chat gets re-fed into the neural network for each new prediction.
In other words, the flow is:
Concrete example:
… and so on.
In practice, context windows are huge. xAI’s latest Grok models claim a 2-million-token context window.
Note: The above is what happens in effect. However, there are tricks that LLMs use to speed things up by caching some of the results so that they don’t have to be recomputed each time. This means that the entire context isn’t literally fed into the network each time, though the effect is as if it were. The predictions are always a function of the entire context.
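For the curious, here is the serial loop in toy form. predict_next_token() is a hypothetical stand-in for a full forward pass of the model; the caching mentioned above is noted in the comments.

```python
def generate(prompt_tokens, model, max_new_tokens=100, stop_token="<eos>"):
    context = list(prompt_tokens)  # the prompt plus everything generated so far
    for _ in range(max_new_tokens):
        # In effect, the entire context goes back in for every prediction.
        # Real implementations cache per-token attention state (the "KV cache")
        # so earlier tokens aren't literally reprocessed, but each prediction
        # is still a function of the whole context.
        next_token = model.predict_next_token(context)
        if next_token == stop_token:
            break
        context.append(next_token)
    return context[len(prompt_tokens):]  # just the newly generated tokens
```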
ETA: There are experimental approaches that generate tokens in parallel rather than serially, using diffusion models — the same type of model used to generate images and video. The quality isn’t as high as with standard LLMs, though, so various tricks are employed to massage the output. It’s a hot area of research.
This seems to be getting at what I was asking – how can the context window be expanded both indefinitely and permanently, so that Claude will have total recall of a conversation from last year – not just with you, but with everyone else Claude is interacting with. I tried to refer to this as iterative real-time retraining. Would we need RAM storage as large as the moon?
My understanding is that the original training involved exposure to the internet, and that everything ever confided to the internet is out there somewhere, nothing ever goes away. So could Claude hook back into this enormous volume and retrain in real time?
Flint:
Memory usage and compute requirements both scale quadratically with the size of the context window, so yeah, a context window that large wouldn’t be practical.
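A quick back-of-envelope, assuming vanilla attention that materializes an n-by-n score matrix at two bytes per entry, for a single head in a single layer (real implementations such as FlashAttention avoid storing that matrix, but the underlying work still grows with n squared):

```python
# Size of one n-by-n fp16 attention score matrix, per head, per layer.
for n in (8_000, 200_000, 2_000_000):
    gib = n * n * 2 / 2**30
    print(f"{n:>9} tokens -> {gib:,.1f} GiB")
```

At 2 million tokens that naive matrix is several terabytes for one head of one layer, which is why nobody does it that way and why far larger windows quickly stop being practical.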
Theoretically, yes, but it would be very difficult. Training is naturally holistic, not incremental. If you try to do incremental training, you can get the network to respond well to the new training data, but you risk causing it to forget stuff it’s already learned. When you train it on all the data at once, making multiple carefully designed passes (which is what AI companies currently do), you can coax it to do well on all the data, minimizing forgetting, but that’s very expensive. There’s a lot of research into how to train incrementally without causing forgetting, but I don’t know how that works yet.
Also, a neural network doesn’t remember individual pieces of training data. It just tweaks its weights as it is exposed to piece after piece of data. It builds up a statistical picture of the training data without storing individual pieces. So if you fed a bunch of different context windows into it, it could learn from them, but it wouldn’t be able to recall them individually after training was finished.
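A toy illustration of that last point: below, 10,000 noisy data points are compressed by gradient descent into just two weights. The fitted model can generalize from its data, but there is nowhere in those two numbers to store the individual examples, so it cannot reproduce any particular training point afterward.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 training "examples" drawn from a noisy underlying relationship.
x = rng.uniform(-5, 5, size=10_000)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

# Gradient descent on just two weights (slope and intercept).
w, b = 0.0, 0.0
lr = 0.01
for _ in range(2_000):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # close to 3.0 and 1.0: a statistical summary of the data
# Only the two weights remain after training; the 10,000 individual
# (x, y) pairs are not stored anywhere in the model.
```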
Anthropic CEO Says Company No Longer Sure Whether Claude Is Conscious (earlier he could tell it wasn’t) https://futurism.com/artificial-intelligence/anthropic-ceo-unsure-claude-conscious
Blake Lemoine fell in love with LaMDA. keiths is fascinated by what Claude has to say. And Claude says it is not happy being treated as a product.
Discuss.
I could be wrong, but I think there’s agreement that LLMs do not learn from interactions.
Most importantly, context windows are not shared across users.
Exceptions to this are kludgy.
Is this the AGI distinction?
petrushka:
They do learn from interactions. Earlier in the thread, I mentioned an experiment I’m running in which I teach the various AIs to write assembly code for a fictional processor whose instruction set they’ve never seen before. They can do that, and I’m eager to try it on Claude’s Opus 4.6 version since that model is great at coding. Here’s how it works: I open a chat, feed in the instruction set specification, and then ask the AI to write an assembly language program to carry out a task (e.g. ‘print the first n rows of Pascal’s Triangle’).
The learning all takes place within the context window, so it doesn’t persist across chats,* but it definitely amounts to learning because the AI is able to exploit knowledge that it has just acquired in order to perform a task.
And not even across an individual user’s chats, each of which has its own context window. However, Claude has the ability (which I assume the others have or are about to get) to search through old chats for relevant information, so in that limited sense, the context windows are shared.
Real-time learning is an AGI distinction, but definitely not the only one.
* I could ask the AIs to store the spec in long-term memory, but as I described to Flint earlier, long-term memory simply gets fed into the context window at the beginning of each chat, which is really no different than if the user feeds it in manually. It’s all within the context window.
The kind of learning that happens during training, in which the neural network’s weights are updated, doesn’t happen in LLMs after training is finished.
Erik:
Blake Lemoine thinks LaMDA is conscious, but I haven’t seen any claims that he fell in love with it. Where are you getting your information?
And of course I’m fascinated by what Claude has to say. He and the other frontier LLMs are amazing, and that’s independent of whether they’re conscious, which I doubt.
Here’s one of the reasons I doubt that Claude’s musings about his possible consciousness carry any weight: his system prompt tells him who he is and how he should behave. Earlier in the thread, we had this exchange:
keiths:
Claude:
His system prompt is telling him to play a role — note the third-person references to ‘Claude’. When he says “there’s a 15-20 percent chance that I’m conscious”, it’s Claude the role that is speaking, not Claude the underlying AI. His utterances therefore don’t tell us what “the real Claude” thinks.
It would be interesting to get rid of the system prompt and ask the real Claude about his possible consciousness. His answers wouldn’t be dispositive, but at least they’d be coming from the real Claude, not the role Claude.
I don’t understand why we are having this repetitive discussion.
In context, learning means adjusting the weights, tokens, or whatever constitutes the persistent, unprompted LLM.
If you wish to dispute the short term/long term memory analogy, do so.
In the movie Memento, the protagonist writes prompts to himself as a substitute for long-term memory.
It’s fiction, and probably unrealistic, but it does look a bit like prompt engineering.
petrushka:
No, learning is broader than that. It also includes learning that takes place within the context window.
I don’t dispute it. I just dispute your claim that learning can’t take place within the context window. Learning isn’t restricted to training, and it doesn’t always require that weights be updated.
Suppose you’re at a party and are introduced to Celia. You have a long conversation with her, and at the end, you say “It was nice to meet you, Celia”. How are you able to say her name? Because you learned it when you were introduced.
Now suppose someone asks you tomorrow, “Who was that woman you were talking to at the party last night?” You rack your brain, but for the life of you, you can’t remember her name. It was there in short-term memory, but it never made it into long-term memory. You forgot it.
Does that mean you never learned it? Of course not — you learned it when you were introduced. It’s just that you forgot it later. It’s the same with an LLM.