I suggested "Jevons' paradox" by email, but then Danny said:
> Well, there are competing conventions. I strongly prefer the one that doesn't introduce ambiguity. Like are there two people named JEVON (no s) and together they have a paradox? That's when I would write "Jevons' paradox". I especially care about this since Reeve and Reeves are both common names.
I think in that case, if we strive to eliminate all ambiguity(*), I would still prefer "the Jevons paradox", for aesthetic reasons. "Jevons's paradox" just looks a bit inelegant—but grammatically, it would be perfectly sound as far as I can tell. (English isn't my native language.)
(*) I understand where that's coming from, but I personally don't think that unambiguousness should be one of the top 3 priorities of language. MAYBE it should be in the top 5? Definitely somewhere in the top 10.
But I did study literature, so obviously I'm a big fan of ambiguity in language. Not to mention that it will eventually develop on its own, because language and meaning and usage etc. are never static.
All fair points! But we say, for example, "Murphy's law" rather than "the Murphy law". So "Jevons's paradox" feels more consistent to me.
Fair enough—there are, of course, exceptions to almost every language rule. Though Ross's paradox and Braess's paradox seem to be on your side as well: https://en.wikipedia.org/wiki/Imperative_logic#Ross's_paradox / https://en.wikipedia.org/wiki/Braess%27s_paradox
(I quickly browsed the general list on Wikipedia [https://en.wikipedia.org/wiki/List_of_paradoxes]. At first glance, it seemed that one-name paradoxes always take 's [Cantor's paradox, Fitch's paradox], two-name paradoxes are always just the names [Grelling–Nelson paradox, Downs–Thomson paradox], and then there are the others where no names are involved, which never seem to have an 's either, of course [Potato paradox, Metabasis paradox].
However, then I discovered the Taeuber Paradox [https://en.wikipedia.org/wiki/Taeuber_Paradox], and I'm afraid your consistency is already broken.)
> Random tip: If you ask an LLM to, say, extract a bunch of headers from a document for you, you can’t really trust it not to have missed any. But if you say “can you write a Python program to do that” then, gobsmackingly, it will, and the result will be pretty trustworthy. It’s starting to feel like anyone not using these tools on a daily basis is… unserious.
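(For concreteness, the kind of throwaway script the tip has in mind might look roughly like the sketch below; the Markdown-style `#` headers and the default file path are my assumptions, not anything from the tip itself.)

```python
# Rough sketch of the sort of script the quoted tip describes: pull every
# header out of a document. Markdown-style "#" headers and the "doc.md"
# default path are assumptions made for the example.
import re
import sys

def extract_headers(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        return [line.rstrip() for line in f if re.match(r"#{1,6}\s", line)]

if __name__ == "__main__":
    for header in extract_headers(sys.argv[1] if len(sys.argv) > 1 else "doc.md"):
        print(header)
```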
This frustrates me, because this approach (probably?) can’t work as well on arbitrary tokens like language or poetry, where you’re judging things like sentence structure or tone compatibility.
However, every LLM manufacturer must by now know quite precisely the relative accuracy and hallucination rates that result from running a given non-chain-of-thought or chain-of-thought model with a given context window.
If I’m not mistaken, current chain-of-thought models attempt to fit as much as possible into their larger context windows, resulting in progressively slower, more expensive, AND less accurate output as the tokens are spat out.
I’m sure SOMEONE has already done this, like most everything I mention in my comments on this blog, but why don’t chain-of-thought models spin up cheap non-chain-of-thought models, each capped at a context length known to still give accurate results, for mission-critical outputs like headers, and treat the main “thread” as a dispatcher for the smaller threads?
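Just to make the dispatcher idea concrete, a minimal sketch (the model name, the token budget, and `call_model` are all placeholders I'm assuming, not any vendor's actual API):

```python
# Hypothetical dispatcher pattern: a big chain-of-thought model plans the
# work, then farms each small, bounded subtask out to a cheap model whose
# context stays below a length assumed to still be reliable for that task.
# The model name, token budget, and call_model interface are placeholders.

CHEAP_MODEL_SAFE_TOKENS = 4_000  # assumed "still accurate" context budget


def chunks(text: str, budget: int):
    # Split the document into pieces the cheap model can handle reliably.
    # Word count is a crude stand-in for a real tokenizer.
    words = text.split()
    for i in range(0, len(words), budget):
        yield " ".join(words[i : i + budget])


def extract_headers(document: str, call_model) -> list[str]:
    """call_model(model_name, prompt) -> str is whatever completion API is in use."""
    headers: list[str] = []
    # Dispatcher loop: each chunk is a one-off job for the cheap model,
    # so no single call ever outgrows its reliable context window.
    for piece in chunks(document, CHEAP_MODEL_SAFE_TOKENS):
        answer = call_model(
            "cheap-non-cot-model",
            "List every header in this text, one per line:\n\n" + piece,
        )
        headers.extend(line for line in answer.splitlines() if line.strip())
    return headers
```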
Even better, when a “sub calculation” would clearly fare better in a non-LLM context like R, why not spin up an R script using a plugin to process the data?
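That handoff could be as simple as the orchestrating code shelling out to Rscript; a minimal sketch, assuming Rscript is installed and the data is just a list of numbers (neither detail comes from the comment above):

```python
# Hypothetical "sub calculation" handoff: instead of having the LLM do the
# arithmetic, run a tiny R script over the data and feed the result back.
# Assumes Rscript is installed and on the PATH.
import subprocess

R_SNIPPET = 'x <- scan("stdin", quiet = TRUE); cat(mean(x), sd(x), sep = "\\n")'

def summarize_with_r(values: list[float]) -> tuple[float, float]:
    proc = subprocess.run(
        ["Rscript", "-e", R_SNIPPET],
        input="\n".join(str(v) for v in values),
        capture_output=True,
        text=True,
        check=True,
    )
    mean_str, sd_str = proc.stdout.split()
    return float(mean_str), float(sd_str)

# The dispatcher would splice this result back into the LLM's context, e.g.:
# print(summarize_with_r([1.0, 2.0, 3.0, 4.0]))
```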
Lastly, I don’t see why (there must be a good reason) many chain-of-thought models just allow themselves to keep getting slower for users by allowing such giant context windows, instead of making a judgement call on whether a given set of tokens actually needs the bigger window. Think how many compute cycles could be saved by reserving chain-of-thought models for one-off “calculations” and having cheaper models be the ones that actually talk to end users, seeing only the slice of context that pertains to the prompt plus whatever the chain-of-thought model spat out.
Wait, wait, wait, isn’t this how deep research models already work? I haven’t looked into their architectures; I never bothered, figuring they’re as black-box as everything else.
Your "sub calculation" idea is happening. Often GPT-o3 will run code of its own accord as part of answering questions. I was kinda disappointed today to have to tell it explicitly to do that, but at least it proceeded to do so when asked.
Totally agreed about being smarter about switching between models. Codebuff, for one, is doing that.