kenjackson · 2026-04-07T21:24:13 1775597053

So private companies shouldn’t get to determine who they provide services to? Assuming no extremely malicious intent, I’d be fine if they said it was only going to McDonalds because the founders like Big Macs.

pizlonator · 2026-04-08T00:26:36 1775607996

McDonalds isn't a public benefit corporation.

kenjackson · 2026-04-07T13:17:59 1775567879

I think it’s also that contrarianism generates an argument they can follow - it’s often much more simplistic along some axis. For example, flat earthers superficially have a really simple model. Throw a ball up, of course it comes down. You look straight ahead and it looks flat. Ask them how GPS works and they can’t follow the math anyways.

kenjackson · 2026-04-07T12:59:37 1775566777

While I agree with the sentiment, using AI to write the final draft of the article isn’t cheating. People may not like it, but it’s more a stylistic preference.

TylerE · 2026-04-07T19:32:18 1775590338

Using AI and a human byline is 100% cheating.

josephg · 2026-04-08T03:00:53 1775617253

Yeah I agree. Don't tell me you authored something when claude did the majority of the writing. Use claude if you want, but don't pretend you wrote the content when you didn't.

I also hate this style of plastic, pre-digested prose. It's soulless and uninteresting. Maybe I've just read too much AI slop. I associate this writing style with low-quality, uninteresting junk.

kenjackson · 2026-04-02T17:41:20 1775151680

Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress.

It would be interesting to actively track how far along each progressive model gets...

revachol · 2026-04-02T18:19:00 1775153940

I just tried it in ChatGPT "Auto" and it didn't work

> Yes — ((((()))))) is balanced.

> It has 6 opening ( and 6 closing ), and they’re properly nested.

Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this.

> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).

> A balanced version would be: ((((()))))

Testing a couple of different models without a harness such that no tool calls are possible would be interesting
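
For reference, the check itself is only a few lines. This is a minimal sketch of a depth-counter approach, not the program the model actually wrote (I didn't save that), and the name `is_balanced` is just my own choice:

```python
def is_balanced(s: str) -> bool:
    """Return True if every ')' closes an earlier '(' and none are left open."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with nothing left to close
                return False
    return depth == 0


# The string from the transcripts above: 5 opening and 6 closing parentheses.
test = "(" * 5 + ")" * 6
print(test.count("("), test.count(")"), is_balanced(test))

# The "balanced version" suggested in the second answer: 5 and 5.
print(is_balanced("(" * 5 + ")" * 5))
```

Building the strings with `"(" * 5` instead of typing them out avoids the same miscounting problem the models have.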

kenjackson · 2026-04-02T18:30:11 1775154611

Weird. I tried in ChatGPT auto and it worked perfectly. I tried like 10 variations. I also did the letters in words. Got all of them right.

The one thing I did trip it up on was "Is there the sh sound in the word transportation". It said no. And then realized I asked for "sound" not letters. It then subsequently got the rest of the "sounds-like" tests I did.

Clearly, my ChatGPT is just better than yours.

revachol · 2026-04-02T18:41:15 1775155275

heh, interesting that. I just tried it twice more with ChatGPT "Instant" (disabling "Auto-switch to Thinking") and it got it wrong both times. Does yours get it right without thinking or tool calls? If so, maybe it does like you better than me.

kenjackson · 2026-04-02T20:22:27 1775161347

OK, I didn't think to disable switch to thinking (didn't know this was a mode). When I did that then it did get it wrong -- oddly it took about the same amount of time, so thinking mode wasn't taking longer, but it was more accurate.

revachol · 2026-04-02T22:45:58 1775169958

Right, though I didn't explicitly disable thinking for my first attempt either. I'd guess my prompt was less detailed than yours and so ChatGPT (in "Auto" mode) decided to allocate thinking tokens for your questions but not mine.

coldtea · 2026-04-02T17:47:30 1775152050

Even more interesting to track how many of those are just ad-hoc patched.

raincole · 2026-04-02T18:10:23 1775153423

Probably zero. At the end of the day people pay for LLMs that write better code or summarize PDFs of hundreds of pages faster, not the ones that can count the letter r's better.

When LLMs can't count r's: see? LLMs can't think. Hoax!

When LLMs count r's: see? They patched and benchmark-maxxed. Hoax!

You just can't reason with the anti-LLM group.

toraway · 2026-04-02T18:28:46 1775154526

Whenever an "LLM fail" goes viral like the car wash question, you can observe the exact same wording of the question get "fixed" within a week or so. With slight variations in phrasing still able to replicate the problem.

Followed by lots of "works perfectly for me, why are people even talking about this?"

I can't say what exactly they're doing behind the scenes but it's a consistent pattern among the big SOTA model providers. With obvious incentive to "fix" the problem so users will then organically "debunk" the meme as they try it themselves and share their experiences.

simianwords · 2026-04-02T18:49:39 1775155779

You are misremembering. There’s no patch. All these examples used the instant model.

coldtea · 2026-04-02T19:30:04 1775158204

The same non-argument could be said for all kinds of cheating on benchmarks by tech companies and yet we have tons of documented examples of them caught with their pants down.

>You just can't reason with the anti-LLM group.

On the contrary, the reasoning is simple and consistent:

That LLMs can't count r's shows that LLMs don't actually think the way we understand thought (since nobody with the kind of high skills they have in other areas would fail that). And because of that, there are (likely) patches for commonly reported cases, since it's a race to IPO and benchmark-maxxing is very much conceivable.

moffkalast · 2026-04-02T17:50:09 1775152209

Yeah well I presume at this point they have an agent download new LLM related papers as they come out and add all edge cases to their training set asap.

Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.

azakai · 2026-04-02T18:29:51 1775154591

You are trying it on a production model. The paper is using models with tool calls disabled.

simianwords · 2026-04-02T19:58:15 1775159895

It worked for you because the paper does the experiment without allowing the model to use any reasoning tokens - something that is grossly misleading.

wg0 · 2026-04-02T17:55:11 1775152511

Actually almost all LLMs when they write numbered sections in a markdown have the counting wrong. They miss the numbers in between and such.

So yes.

And the valuations. Trillion dollar grifter industry.

kenjackson · 2026-03-24T04:24:35 1774326275

I learn a lot from code I read, but don’t write. Did the author not read the code and simply throw it over the fence?

kenjackson · 2026-03-23T15:14:09 1774278849

This is all valid (except probably the last sentence), but it also describes so many attempted changes right until they become darn near the default.

This sounds like why I heard Redfin wouldn’t work, or Netflix, or Amazon, or Uber, or PayPal, etc…. There are always these business complexities that make it seem like these spaces have too much friction, but if there’s enough money - if it can be done then people will figure it out.

wavemode · 2026-03-23T15:26:39 1774279599

tbh this sounds revisionist... I don't recall anyone saying that any of those services "wouldn't work". Uber I suppose is one where people thought they might run into regulatory problems, and with some of those companies people were concerned about profitability. But for none of those companies have I ever heard that the product itself was not going to work or be useful. (Nor, indeed, that the product was tested at large scale and performed 3x worse than the incumbent...)

kenjackson · 2026-03-23T17:18:59 1774286339

Not revisionist at all.

> or Netflix, or Amazon, or Uber, or PayPal, etc…

Netflix and Amazon both were competing against brick-and-mortar stores that were everywhere. Blockbuster was in every town, usually in every major neighborhood. The thought was that on Friday night people wanted to get a movie they wanted, not just happen to have the movie that was shipped to them. And then with streaming it was "the content on Netflix is old and dated, who would want this?" They slowly ate from below. Blockbuster scrambled with their own mailed disc offering. And died before it even had a chance to confront streaming.

Repeat this story with B&N where people said that you had to browse the books physically. You couldn't just blindly order online and wait two weeks to get the book (remember they got big before "2 Day Prime").

With PayPal it was about "they don't understand banking or payment -- and it wants to be both?!".

For this OpenAI experience, it doesn't sound great. I have accounts with these places I buy things from. I want to make sure I get my Prime shipping and digital discount via using the Amazon app. But if you could find a way to integrate my accounts all into ChatGPT things might be different. In the same way I used to never use Apple Wallet, but now it really is my go to place for everything I have a card for. I don't have to worry about having my grocery loyalty card or my football season tickets with me or my car insurance card. It's all in wallet. The Apple Wallet sucked until it was suddenly great.

mosdl · 2026-03-23T18:32:51 1774290771

Sorry that is revisionist. The idea of getting a movie mailed or streamed always sounded better than shitty blockbuster with limited selection and late fees.

The growth was fast for netflix/amazon/paypal/etc and people saw how it was an improvement from the get go.

djeastm · 2026-03-23T19:12:37 1774293157

I seem to recall a lot more hype for these companies than people saying it won't work. You seem to be cherry picking from the naysayers of the time, but not the broad consensus.

kenjackson · 2026-03-21T01:50:28 1774057828

I’ve read your posts for the past 25 years - originally on slashdot (not literally you). As you proposed, I think you’re fundamentally wrong. I got my MIL a Chromebook and it was probably the single worst technical support decision I ever made. For some, it will always be the year of Linux on the desktop. But rather the reality is the desktop will run its course before Linux has a foothold there.

kenjackson · 2026-03-19T06:16:17 1773900977

Code is usually over-specified. I recently used AI to build an app for some HS kids. It built what I spec’d and it was great. Is it what I would’ve coded? Definitely not. In code I have to make a bunch of decisions that I don’t care about. And some of the decisions will seem important to some, but not to others. For example, it built a web page whereas I would’ve built a native app. I didn’t care either way and it doesn’t matter either way. But those sorts of things matter when coding and often don’t matter at all for the goal of the implementation.

kenjackson · 2026-03-17T19:50:34 1773777034

Well I think there are some people that disagree.

kenjackson · 2026-03-14T17:59:56 1773511196

It may be unreliable to you. I see the life of most people around me getting better. Even people that are somewhat poor (not dirt poor, but free lunch poor) have homes, three squares and snacks, PS5, mobile phones with cellular data, and cable tv. The biggest life issues I see are usually strongly related to substance abuse and mental health.

Transplanting to even just the 80s would be a culture shock for most people.

throwawa1 · 2026-03-17T14:33:42 1773758022

Really curious. How is life getting better for most people? No one has kids, they can't afford families, medical care is unreachable, housing is unaffordable. I guess if your life is scrolling on an iPhone it's MUCH better, but if you want to live, educate a family, retire, live in a safe community, or have healthcare, it's worse in every measure.

hackable_sand · 2026-03-14T20:33:10 1773520390

Still waiting on your evidence for peoples' lives getting better

HEmanZ · 2026-03-14T18:30:12 1773513012

What the US trend looks like from the ground varies hugely by region, age, income level, industry, and demographic.

I think if you’re a professional class baby boomer the trajectory has looked fantastic through your life.

If you’re a 35-year-old on a middle income living on the coasts (where at least 100 million Americans live) you may have watched affordability collapse and QOL decrease significantly over the last decade.