Nobody actually answered your question helpfully, so here goes:
When you write `SELECT DISTINCT a, b` you are asking for every unique pair (a, b) in the result. In particular, the same value of `a` can appear multiple times in the result, paired with different values of `b`.
The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
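That visit-and-skip strategy is easy to picture. A toy Python sketch (purely illustrative, not any engine's actual code):

```python
def distinct(rows):
    """Yield each (a, b) pair once, in first-seen order.

    Mirrors the hash-based strategy: every row of the intermediate
    result must be visited, and the 'seen' set can grow as large as
    the number of distinct pairs.
    """
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

rows = [(1, "x"), (1, "y"), (1, "x"), (2, "x")]
# (1, "x") is emitted once; (1, "y") is kept: same a, different b.
print(list(distinct(rows)))  # [(1, 'x'), (1, 'y'), (2, 'x')]
```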
As a human, we look at this and think "Oh, but `a` is the primary key (in this case CustomerID) and `b` is a column in the same table (CustomerName) so we only need to be distinct over `a`, which means the query optimizer should be able to "push down" the DISTINCT clause into the scan of the joined tables". The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
At this level you really start getting into the nitty gritty of what optimizations a particular database has implemented though: the solution presented in the article may be the best way to write this for SQL Server, but this "correlated subquery" form could be catastrophically slow in other database engines...
> The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
This isn't quite right; there are two general ways. One is a hash table as you say, the other one is sorting. The second is quite relevant once you have a reasonable framework for interesting orders (e.g., scanning along an index, or doing sortahead on a smaller table with a foreign key that guarantees uniqueness).
> The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
There are absolutely query optimizers that know this (try such a query in MySQL, for instance).
Hum... If you know `a` is a key of table `X` and `b` is a key of table `Y`, you can replace a `select distinct` with a `join ... on exists ()` with complete certainty.
And in practice this solves almost all of the problematic cases. If the thing you are querying isn't a key, you will naturally think a bit further and not write the distinct.
I guess that kind of optimization just goes counter to the philosophy of SQL optimizers. I know that I would be deeply surprised if I found a database doing it, and not in a good way.
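For the record, the rewrite described above is easy to try out. A sketch against SQLite (table and column names invented for illustration), showing the `DISTINCT` join and the key-based `EXISTS` form returning the same rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);  -- Ada has two orders
""")

# The DISTINCT form: join first, then deduplicate the (id, name) pairs.
distinct_form = con.execute("""
    SELECT DISTINCT c.CustomerID, c.CustomerName
    FROM Customers c JOIN Orders o ON o.CustomerID = c.CustomerID
""").fetchall()

# The rewrite: no join output to deduplicate, just an existence probe.
exists_form = con.execute("""
    SELECT c.CustomerID, c.CustomerName
    FROM Customers c
    WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerID = c.CustomerID)
""").fetchall()

print(sorted(distinct_form) == sorted(exists_form))  # True
```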
You're overthinking it - the optimization is absolutely possible as you've described. The reason databases haven't done it is not some question of design philosophy; it's simply that nobody bothered to implement it.
> I know that I would be deeply surprised if I found a database doing it, and not in a good way.
SQL engines do way more surprising optimizations than this. SQL is a "4th gen" language after all, meaning it's absolutely within the spirit that the database engine does whatever it wants to get the result.
There's no good reason why you couldn't rewrite DISTINCT to GROUP BY, but remember that it wouldn't work in all cases; you could have both, e.g. `SELECT DISTINCT COUNT(*) FROM t1 GROUP BY a;`. Similarly, HAVING and window functions happen after GROUP BY but before DISTINCT, so a simple move won't always be correct and you'd have to account for that.
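To make the DISTINCT-on-top-of-GROUP-BY case concrete, a quick SQLite check (any engine should agree on the semantics here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (a INTEGER);
    INSERT INTO t1 VALUES (1), (1), (2), (2), (3);
""")

# GROUP BY a yields three groups with counts 2, 2, 1;
# DISTINCT then runs on top of those counts, collapsing the two 2s.
counts = con.execute(
    "SELECT DISTINCT COUNT(*) FROM t1 GROUP BY a"
).fetchall()
print(sorted(counts))  # [(1,), (2,)]
```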
Tangentially related: IN and ANY can produce wildly different results [0] due to type casting. I have personally seen this, and at least for my data, the fastest query / fewest buffers scanned was to shift the matching ids out to a CTE with VALUES, and join with IN. It wasn't a minor speed up, either; it was literally twice as fast.
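The pattern being described, sketched against SQLite with an invented schema (whether it's actually faster obviously depends on your engine and data; this just shows the two forms agree):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO items VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
""")

# The inline-list form:
inline = con.execute(
    "SELECT label FROM items WHERE id IN (1, 3)"
).fetchall()

# The ids pulled out into a VALUES CTE, then matched with IN:
cte = con.execute("""
    WITH wanted(id) AS (VALUES (1), (3))
    SELECT label FROM items WHERE id IN (SELECT id FROM wanted)
""").fetchall()

print(sorted(inline) == sorted(cte))  # True
```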
Has nothing to do with the language. 1 (in its isolated presentation) exists outside of my set of data and that's not something that gives me comfortable feelings. It forces me to think (less think more good) why some random value needs to float around my data and maybe needs to be appended or summed or whatever.. because reasons. As far as I'm concerned you might as well make it 200 or 1648194. And why would those ever be needed in any context around my data?
No need to be a dismissive dickhole about it, friend.
Maybe I'm not wording it right then? Like I said in other posts, I copy paste SQL statements all the time. If I were to copy/paste that statement then all I'd get is a bunch of ones and that's useless to me. The SQL itself is data to me in the same way that when I view a lazy list comprehension, I view it (and SQL) as something that's just waiting to be run. Maybe not now, and maybe not in the current SQL, but there's a non-zero chance that I'll copy/paste it. So in that context, a bunch of ones is useless to me and IS data because the ones are literally the output of the SQL statement. Better to generalize my code writing process in a way so that I can copy/paste a "select *" or "select rownum" because those are more useful down the line.
You've seriously missed the point if this is what you're still saying. Take a step back, breathe, and try to consider that you've missed something. Whether that's a point I've made, or a lack of perspective, I don't know. For example, we almost definitely work in different fields with different practices and reasons for doing things differently. And that's fine. But your continued dismissiveness isn't helpful. Like another post said, "select *" is helpful in data analytics work. If it's not helpful in your field, that's also fine. But for me and my colleagues, it is. And I promise you're wrong in your thinking of why I copy paste SQL. What a weird fucking conversation.
Also, calling someone obtuse is name calling, so pot kettle black and all that, ya obtuse weirdo.
Yes, and all those constants would be treated the same, just like in many languages things are conventions - you don't need to use _ to name a variable that you throw away, but you do, because conventions.
There are so many people who want to complain about the conventions of SQL but praise the conventions of other languages, and I think it's often just "which did you see first?"
You're missing the point that I don't really care about how they're treated. I want to glance at it and know what it does. For me, the 1 gets in the way. Maybe it doesn't get in the way for you because you don't use it the same way as I do, as a practice.
...but... this is literally how you write an EXISTS clause in SQL.
> you might as well make it 200 or 1648194
Yes, you can put anything there.
There's lots of funky inconsistent syntax in SQL because it evolved organically over literally the past half century. It has everything to do with the language.
Would it make you feel better about it to think of it as (or write it as) "select true" ? Because that's essentially what it's communicating, that there is something to return and that's what we're checking for.
Nope, for the same exact reason. Select * makes the most sense because * is in the context of my data, not something I'm working into it. Pretty sure it's the same speed.
Also mind you, I use a lot of CTEs, so this would look weird in that context -- hence why using row number sometimes makes more sense and achieves the same thing.
Filter out everything, Project only what you need, Transform it as lightly as possible.
In any context I can think of, a row number would never "make sense" if a constant 1 would give the same output; it would be a lot more code that does... nothing?
Any code using select * just breaks in the future with any new columns being added, no thanks.
Select * is pretty standard for wide table queries like they are used in Data Analytics teams regularly. Because you really want ALL columns, even new ones without going back to fix all 300+ dashboards.
you're getting downvoted because it's not really a preference, it's pretty widely known to be bad practice and unhygienic in production queries. select * would get your pr rejected and chewed out by dba at every place i've ever worked at. so you kinda just look like you don't know what you're talking about.
In an EXISTS query select * is harmless; select 1 and select * result in the same execution plan, at least in MS SQL.
In a query that returns result rows it could break the query as soon as you add columns with names that already exist in other tables you joined in the query.
yes, it's harmless in this position but it provides no additional benefits to the select 1 idiom and is suggestive of poor query discipline. it's far easier to say just don't ever use select * in queries.
1. Given that "select *" is considered something to avoid except when necessary in edge cases
2. And "select 1" will accomplish the same goal
Anyone reading the "select *" version of the code will have to stop and consider whether it is using "select *" for a reason, because "select 1" would be the normal choice. Using "select *" is assumed to be conveying some intent (that isn't there) _because_ it's not the expected way to do it.
Sure, they do the same thing... but you have to stop and look at the second one to make sure you're understanding what it does and if there's some reason its weird.
I haven't downvoted anyone, but have followed this argument with interest as an intermediate SQL user.
If I was to guess why someone would downvote you, it wouldn't be for disagreeing with you, but more because you've subtly shifted from quite a strong objective stance ("this is not readable") to a subjective one ("this is not how I prefer to write it"), without really conceding anyone else's points.
1 think my point makes more sense when you consider that I 1. Don't work with production code (more analysis, ad hoc code in an investigatory capacity) and 2. that when 1 mention someone is "new" what 1 mean is someone actively learning and not from a technical background. 1ME, folk like that have a difficult time with that floating 1. So while yes it's a standard that programmers are familiar with, it's not something that someone new will be very comfortable with. Lots of people 1 work with come from a pandas-only background.
Not really conceding because as far as 1 can see, everybody is coming from a position of familiarity.
That's totally fair. Perhaps the confusion could have been avoided by qualifying in your initial comment that you're referring to a specific situation i.e. not-too-technical analysts writing ad hoc code.
I didn't downvote you, but consider this - I work with SQL a lot, like a lot a lot. Something that's your code today is probably my code tomorrow.
So when you say "my flow is X" and your flow is inimical to maintaining it and extending it, people might get a bit irritated at the last dev that did the exact same thing.
Any situation where a new or elsewise unknown attribute breaks your code, binding is a big one.
* Say you are joining two tables and one now has a conflicting/duplicate column name: surprise, you now have broken code; it literally will not execute because of the ambiguous duplicate references.
* By the same token, downstream views can break for the same/similar reason.
* In some engines (SQL Server), views are not going to actually include your new columns until they are "refreshed", so one day, out of band of your deployment, your views will suddenly change.
* Say you have a report with specific headers - tada, it now has whatever people add to the table - and sourcing a column can be a pain because it's unclear where it comes from in the query, requiring a trip to the schema.
* Performance expectations can change if the data type is much larger, up to the point of actually breaking the client or consuming tens of billions of times more resources.
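The first bullet is easy to reproduce. A SQLite sketch with invented tables: an unqualified column reference works fine until someone adds a duplicate name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, name TEXT);
    CREATE TABLE customers (id INTEGER, city TEXT);
    INSERT INTO orders VALUES (1, 'widget');
    INSERT INTO customers VALUES (1, 'Paris');
""")

# Works today: 'name' exists in exactly one of the joined tables.
ok = con.execute(
    "SELECT name FROM orders JOIN customers ON orders.id = customers.id"
).fetchall()
print(ok)  # [('widget',)]

# Someone later adds customers.name; the very same query stops executing.
con.execute("ALTER TABLE customers ADD COLUMN name TEXT")
err = ""
try:
    con.execute(
        "SELECT name FROM orders JOIN customers ON orders.id = customers.id"
    )
except sqlite3.OperationalError as e:
    err = str(e)
print(err)  # ambiguous column name: name
```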
Usually code refers to columns by name and additional columns are just ignored; the only breaking case I know of is when you add a column to a joined table and the column name already exists in the other table, resulting in ambiguous column names.
I am talking about 'select *' at the top level; they can be harmless in exists, sub-selects and CTEs. The number of columns sent to the client (the program) will change when you add a column to the database. If you don't remember to change all of the places in your code where you used a 'select *', your program is likely going to fail or have unexpected results.
The relevant thing for exists is the set of the rows returned by the select being non/empty.
The value in the columns is irrelevant. Using * creates an incorrect impression that the value is relevant and the relevance tied to the data.
Sounds like what you want is "bang" from Max, which is kind of like the opposite of null (the billion dollar mistake) but more existential, or dang (the hn moderator) but more explosive:
User:
Explain Max/MSP's use of bang.
ChatGPT:
Max/MSP, often simply referred to as "Max", is a visual programming language for music and multimedia. In Max, the fundamental building blocks are "objects" that are connected with "patch cords". These objects can process, generate, or manipulate various types of data, including audio, MIDI, and other control messages.
One of the most basic and essential objects in Max is the "bang". Visually, a bang is represented by a small button in the Max patcher window. Conceptually, a bang is an event trigger. When a bang is received by an object, it tells that object to do something, often to execute its primary function.
Here's a breakdown of how the bang is used in Max:
Triggering Actions: Many objects in Max perform an action when they receive a bang. For example, the metro object (a metronome) sends out bangs at regular intervals. When connected to other objects, these bangs can trigger various actions in a patch.
Flow Control: Bangs can be used to control the order of operations in a patch. The trigger object, often abbreviated as t, is particularly useful for this. It can output a sequence of bangs (and other data types) in a specific order.
User Interaction: The bang object can also be used as an interactive button in a user interface. When you click on it, it sends out a bang.
Feedback: Bangs can be used to provide visual feedback. For instance, when a certain condition is met in your patch, you might use a bang to flash, indicating that the event has occurred.
Conversion: Sometimes, you might receive data, but you only want to know when the data has arrived and not necessarily what the data is. In such cases, you can convert any incoming data into a bang. For example, if you connect a MIDI note input to a bang, the bang will flash every time a note is played, regardless of the note's pitch or velocity.
In essence, the bang is a fundamental concept in Max/MSP, representing an event or a moment in time. It's a way to initiate actions, control the flow of operations, interact with the user, and more.
User:
How about a special value "dang" for propagating mistakes?
ChatGPT:
The concept of a special value like "dang" to propagate mistakes is an interesting and humorous idea. Let's explore this hypothetical scenario:
Purpose: "Dang" could be used in a system to indicate that a mistake or an error has occurred and has been propagated. Instead of failing silently or producing unpredictable results, the system would clearly mark the affected data or process with "dang".
Usage:
In a programming context, "dang" could be a special return value indicating that a function has encountered an error that wasn't caught but isn't catastrophic enough to halt execution.
In a database, "dang" could indicate corrupted or suspicious data that resulted from a known error but hasn't been cleaned or addressed.
Advantages:
Transparency: Instead of silent failures or cryptic error messages, "dang" provides a clear and human-readable indication of a problem.
Debugging: It could make debugging easier, as developers can quickly identify and trace back the source of "dang" values.
User Experience: For end-users, seeing a "dang" value could be a more friendly or humorous way of being informed that something went wrong, as opposed to a generic error message.
Challenges:
Overuse: If used too liberally, "dang" could become a catch-all for various issues, making it harder to pinpoint specific problems.
Misunderstanding: Without proper documentation or understanding, developers or users might be confused about what "dang" actually represents.
Standardization: For "dang" to be widely adopted, it would need to be standardized across platforms and tools, which is a significant challenge.
In essence, while "dang" as a concept introduces a fun and novel way to handle and propagate errors, its practical implementation would require careful consideration to ensure it adds value to the development and user experience.
Eh, anytime I need to read SQL that includes that I have to sit down and think about it. It's not glance-and-understand and you have to maintain that state in your head. Not the sort of SQL I want to rely on as a practice.
Mind you, most of what I write are ad hoc SQL queries where I might write 100-200 queries in a single day. So that sort of thing really adds up. And it's in a strange place where performance isn't dire, since I can work on other stuff while queries chug.
With most databases (PostgreSQL is an exception), SELECT must select something, even if you only care about the record existing in general, and you're not interested in reading anything from it. SELECT 1 means "select the first column", whatever it is. Specifying a column number also works with ORDER BY and other clauses. Of course any natural number works, as long as the record has at least that many fields.
SQL queries return not columns from table, but a result of relational operations with new columns specified by expressions. What a downvoted user itt cannot accept is that 1 is as meaningful as * or <colname> in general and is an idiom for “columns don’t matter here, only [the number of] rows”. It’s like [[1]].length vs [[id, …]].length. Saying “it’s not my data” is misunderstanding what that data is.
In SQL Server at least, no, it literally means select the integer 1. In the ORDER BY clause, it does mean to order by ordinal position, but that's not a great thing to glorify, since ordinal position is not necessarily stable. I think other dialects like MySQL might allow GROUP BY 1, but that's not a great thing to glorify either.
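Both points are easy to verify; SQLite agrees with SQL Server here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (a TEXT, b INTEGER);
    INSERT INTO t VALUES ('x', 3), ('y', 1), ('z', 2);
""")

# In the select list, 1 is just the integer literal 1, emitted once per row:
ones = con.execute("SELECT 1 FROM t").fetchall()
print(ones)  # [(1,), (1,), (1,)]

# In ORDER BY, 1 refers to the first item of the select list (here b),
# which is exactly the positional instability the comment warns about:
by_ordinal = con.execute("SELECT b, a FROM t ORDER BY 1").fetchall()
print(by_ordinal)  # [(1, 'y'), (2, 'z'), (3, 'x')]
```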
I'm pretty sure you still have to select something in postgres... or are you saying there's a different operator to do the same thing?
> SELECT 1 means "select the first column"
interesting.. in postgres this actually gives you a single-column value of 1, in an unnamed column... I haven't used other DBMSes much but expected similar behaviour
Indeed, I would be surprised if there exists any RDBMS where the content of the select list in an EXISTS subquery matters. Postgres's SQL dialect even lets you use an empty select list (`... WHERE EXISTS (SELECT FROM related_table WHERE related_id = id_from_outer_query)`). In T-SQL, however, a non-empty select list is required, and in my experience, developers writing T-SQL tend to prefer a constant value like `1` or `NULL` over `*` -- I suspect there's some superstition there related to both the common wisdom that `SELECT *` queries are to be avoided (which is true -- you really should only ever select the columns of interest) and a lack of truly understanding that EXISTS only reads enough of the table to ascertain whether rows exist that satisfy the subquery's predicate, and returns a boolean value, not a result set.
As a non-sql-expert (I've used an exists query approximately once in my life) I prefer `select 1` over `select column_name` or `select *` because it doesn't look like it might be meaningful.
Postgres's dialect seems like it made the right choice here.
It works fine for me.. when I'm thinking in sqlese. The bigger travesty here is using SELECT for the tasks which are not selecting (returning data) anything.
Both Oracle and SQL Server have the advice to use a constant value in their documentation. Postgres used to advise the use of `*`, but it looks like they improved their optimizer so it doesn't matter anymore.
It's not superstition. It's people that know deeply how a complex system works picking the option with the best set of side-effects.
You could. But `SELECT 1` is a very common pattern in SQL Server, intended exactly to declare intent to the reader that "this subquery doesn't return any real data" - in the event the `EXISTS` that surrounds it isn't enough of a clue. You could use `SELECT NULL` or `SELECT 0` or `SELECT *` or `SELECT 12/0` and the behavior would be identical, but most of those would likely raise even more questions to an inexperienced reader than `SELECT 1`.
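A quick sanity check that the select list inside EXISTS doesn't change the result (SQLite shown; `SELECT 12/0` is omitted because division-by-zero handling varies by engine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (parent_id INTEGER);
    INSERT INTO parent VALUES (1), (2);
    INSERT INTO child VALUES (1);
""")

# The same EXISTS query with four different select lists in the subquery:
results = [
    con.execute(f"""
        SELECT id FROM parent p
        WHERE EXISTS (SELECT {expr} FROM child c WHERE c.parent_id = p.id)
    """).fetchall()
    for expr in ("1", "NULL", "0", "*")
]
print(results)  # four identical results: only parent 1 has a child
```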
Just be sure to check the resulting plans. The DB we use usually has sub-optimal plans when using IN compared to EXISTS, often quite drastically so.
You always need to check the resulting plans if you care about execution performance, that's just how life is.
I'm offhand a bit surprised IN does worse than EXISTS; I can understand NOT IN being slow, because it has very surprising NULL handling that is hard to optimize for.
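That surprising NOT IN null handling, demonstrated in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (NULL);
""")

# 1 NOT IN (2, NULL) evaluates to NULL (unknown), not true,
# so NOT IN over a set containing NULL matches nothing:
not_in = con.execute(
    "SELECT x FROM a WHERE x NOT IN (SELECT x FROM b)"
).fetchall()
print(not_in)  # []

# NOT EXISTS sidesteps the NULL and behaves as most people expect:
not_exists = con.execute(
    "SELECT x FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x = a.x)"
).fetchall()
print(not_exists)  # [(1,)]
```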
Explicit semi- and antijoin operators would be a nice addition to the language, but do note that they present some additional challenges for the optimizer. In particular, when the join condition becomes explicit, it can take on more forms than before, which can limit the reorderings allowed or allow new ones. E.g. you now need to deal with this structure:
(a SEMIJOIN b ON a.x=b.y) JOIN c ON b.z=c.z
This was an impossible structure before, since IN and EXISTS both hide b's columns (all semi- and antijoins effectively come last), and your optimizer will now need to know whether e.g. this associative rewrite is allowed or not:
a SEMIJOIN (b JOIN c ON b.z=c.z) ON a.x=b.y
Also, you'll need to deal with LATERAL semijoins, which you didn't before…
None of this is impossible, but there's more to it than just a small syntactic change.
I don't get it. "SELECT 1" is significantly more informative than "SELECT red_herring". It's a self-documenting and automatically enforced comment telling you that the information requested by the SELECT statement is not used by the query.
I copy paste SQL all the time, especially with CTE after CTE. The row number is a simple thing to copy paste that fills multiple functions with zero modification.
And that's different from 1 in what way other than being more confusing?
Note that in the case you're advocating for, the row number is called "OrderId", which you might have trouble with if you insist on copying a query from somewhere else and using it without modification. Wouldn't you prefer "1"?
It's helpful within CTEs as a unique identifier. This whole conversation makes me wonder if not many people minmax this sort of meta programming thing.