Nobody actually answered your question helpfully, so here goes:
When you write `SELECT DISTINCT a, b` you are asking for every unique pair (a, b) in the result. In particular, the same value of `a` can appear multiple times in the result, paired with different values of `b`.
The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
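That visit-and-skip strategy is easy to picture. A toy Python sketch (purely illustrative, not any engine's actual code):

```python
def distinct(rows):
    """Yield each (a, b) pair once, in first-seen order.

    Mirrors the hash-based strategy: every row of the intermediate
    result must be visited, and the 'seen' set can grow as large as
    the number of distinct pairs.
    """
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

rows = [(1, "x"), (1, "y"), (1, "x"), (2, "x")]
# (1, "x") is emitted once; (1, "y") is kept: same a, different b.
print(list(distinct(rows)))  # [(1, 'x'), (1, 'y'), (2, 'x')]
```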
As a human, we look at this and think "Oh, but `a` is the primary key (in this case CustomerID) and `b` is a column in the same table (CustomerName) so we only need to be distinct over `a`, which means the query optimizer should be able to "push down" the DISTINCT clause into the scan of the joined tables". The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
At this level you really start getting into the nitty gritty of what optimizations a particular database has implemented though: the solution presented in the article may be the best way to write this for SQL Server, but this "correlated subquery" form could be catastrophically slow in other database engines...
> The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
This isn't quite right; there are two general ways. One is a hash table as you say, the other one is sorting. The second is quite relevant once you have a reasonable framework for interesting orders (e.g., scanning along an index, or doing sortahead on a smaller table with a foreign key that guarantees uniqueness).
> The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
There are absolutely query optimizers that know this (try such a query in MySQL, for instance).
Hum... If you know `a` is a key of table `X` and `b` is a key of table `Y`, you can replace a `select distinct` with a `join ... on exists ()` with complete certainty.
And in practice this solves almost all of the problematic cases. If the thing you are querying isn't a key, you will naturally think a bit further and not write the distinct.
I guess that kind of optimization just goes counter to the philosophy of SQL optimizers. I know that I would be deeply surprised if I found a database doing it, and not in a good way.
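For the record, the rewrite described above is easy to try out. A sketch against SQLite (table and column names invented for illustration), showing the `DISTINCT` join and the key-based `EXISTS` form returning the same rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);  -- Ada has two orders
""")

# The DISTINCT form: join first, then deduplicate the (id, name) pairs.
distinct_form = con.execute("""
    SELECT DISTINCT c.CustomerID, c.CustomerName
    FROM Customers c JOIN Orders o ON o.CustomerID = c.CustomerID
""").fetchall()

# The rewrite: no join output to deduplicate, just an existence probe.
exists_form = con.execute("""
    SELECT c.CustomerID, c.CustomerName
    FROM Customers c
    WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerID = c.CustomerID)
""").fetchall()

print(sorted(distinct_form) == sorted(exists_form))  # True
```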
You're overthinking it - the optimization is absolutely possible as you've described. The reason databases haven't done it is not some question of design philosophy; it's simply that nobody bothered to implement it.
> I know that I would be deeply surprised if I found a database doing it, and not in a good way.
SQL engines do way more surprising optimizations than this. SQL is a "4th gen" language after all, meaning it's absolutely within the spirit that the database engine does whatever it wants to get the result.
There's no good reason why you couldn't rewrite DISTINCT to GROUP BY, but remember that it wouldn't work in all cases; you could have both, e.g. `SELECT DISTINCT COUNT(*) FROM t1 GROUP BY a;`. Similarly, HAVING and window functions happen after GROUP BY but before DISTINCT, so a simple move won't always be correct and you'd have to account for that.
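To make the DISTINCT-on-top-of-GROUP-BY case concrete, a quick SQLite check (any engine should agree on the semantics here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (a INTEGER);
    INSERT INTO t1 VALUES (1), (1), (2), (2), (3);
""")

# GROUP BY a yields three groups with counts 2, 2, 1;
# DISTINCT then runs on top of those counts, collapsing the two 2s.
counts = con.execute(
    "SELECT DISTINCT COUNT(*) FROM t1 GROUP BY a"
).fetchall()
print(sorted(counts))  # [(1,), (2,)]
```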
Tangentially related: IN and ANY can produce wildly different results [0] due to type casting. I have personally seen this, and at least for my data, the fastest query / fewest buffers scanned was to shift the matching ids out to a CTE with VALUES, and join with IN. It wasn't a minor speed up, either; it was literally twice as fast.
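The pattern being described, sketched against SQLite with an invented schema (whether it's actually faster obviously depends on your engine and data; this just shows the two forms agree):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO items VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
""")

# The inline-list form:
inline = con.execute(
    "SELECT label FROM items WHERE id IN (1, 3)"
).fetchall()

# The ids pulled out into a VALUES CTE, then matched with IN:
cte = con.execute("""
    WITH wanted(id) AS (VALUES (1), (3))
    SELECT label FROM items WHERE id IN (SELECT id FROM wanted)
""").fetchall()

print(sorted(inline) == sorted(cte))  # True
```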
Has nothing to do with the language. 1 (in its isolated presentation) exists outside of my set of data and that's not something that gives me comfortable feelings. It forces me to think (less think more good) why some random value needs to float around my data and maybe needs to be appended or summed or whatever.. because reasons. As far as I'm concerned you might as well make it 200 or 1648194. And why would those ever be needed in any context around my data?
No need to be a dismissive dickhole about it, friend.
Maybe I'm not wording it right then? Like I said in other posts, I copy paste SQL statements all the time. If I were to copy/paste that statement then all I'd get is a bunch of ones and that's useless to me. The SQL itself is data to me in the same way that when I view a lazy list comprehension, I view it (and SQL) as something that's just waiting to be run. Maybe not now, and maybe not in the current SQL, but there's a non-zero chance that I'll copy/paste it. So in that context, a bunch of ones is useless to me and IS data because the ones are literally the output of the SQL statement. Better to generalize my code writing process in a way so that I can copy/paste a "select *" or "select rownum" because those are more useful down the line.
You've seriously missed the point if this is what you're still saying. Take a step back, breathe, and try to consider that you've missed something. Whether that's a point I've made, or a lack of perspective, I don't know. For example, we almost definitely work in different fields with different practices and reasons for doing things differently. And that's fine. But your continued dismissiveness isn't helpful. Like another post said, "select *" is helpful in data analytics work. If it's not helpful in your field, that's also fine. But for me and my colleagues, it is. And I promise you're wrong in your thinking of why I copy paste SQL. What a weird fucking conversation.
Also, calling someone obtuse is name calling, so pot kettle black and all that, ya obtuse weirdo.
Yes, and all those constants would be treated the same, just like in many languages things are conventions - you don't need to use _ to name a variable that you throw away, but you do, because conventions.
There are so many people who want to complain about the conventions of SQL but praise the conventions of other languages, and I think it's often just "which did you see first?"
You're missing the point that I don't really care about how they're treated. I want to glance at it and know what it does. For me, the 1 gets in the way. Maybe it doesn't get in the way for you because you don't use it the same way as I do, as a practice.
...but... this is literally how you write an EXISTS clause in SQL.
> you might as well make it 200 or 1648194
Yes, you can put anything there.
There's lots of funky inconsistent syntax in SQL because it evolved organically over literally the past half century. It has everything to do with the language.
Would it make you feel better about it to think of it as (or write it as) "select true" ? Because that's essentially what it's communicating, that there is something to return and that's what we're checking for.
Nope, for the same exact reason. Select * makes the most sense because * is in the context of my data, not something I'm working into it. Pretty sure it's the same speed.
Also mind you, I use a lot of CTEs, so this would look weird in that context -- hence why using row number sometimes makes more sense and achieves the same thing.
Filter out everything, Project only what you need, Transform it as lightly as possible.
In any context I can think of, a row number would never "make sense" if a constant 1 would give the same output; it would be a lot more code that does... nothing?
Any code using select * just breaks in the future with any new columns being added, no thanks.
Select * is pretty standard for wide table queries like they are used in Data Analytics teams regularly. Because you really want ALL columns, even new ones without going back to fix all 300+ dashboards.
you're getting downvoted because it's not really a preference, it's pretty widely known to be bad practice and unhygienic in production queries. select * would get your pr rejected and chewed out by dba at every place i've ever worked at. so you kinda just look like you don't know what you're talking about.
In an EXISTS query select * is harmless; select 1 and select * result in the same execution plan, at least in MS SQL.
In a query that returns result rows it could break the query as soon as you add columns with names that already exist in other tables you joined in the query.
yes, it's harmless in this position but it provides no additional benefits to the select 1 idiom and is suggestive of poor query discipline. it's far easier to say just don't ever use select * in queries.
1. Given that "select *" is considered something to avoid except when necessary in edge cases
2. And "select 1" will accomplish the same goal
Anyone reading the "select *" version of the code will have to stop and consider whether it is using "select *" for a reason, because "select 1" would be the normal choice. Using "select *" is assumed to be conveying some intent (that isn't there) _because_ it's not the expected way to do it.
Sure, they do the same thing... but you have to stop and look at the second one to make sure you're understanding what it does and if there's some reason its weird.
I haven't downvoted anyone, but have followed this argument with interest as an intermediate SQL user.
If I was to guess why someone would downvote you, it wouldn't be for disagreeing with you, but more because you've subtly shifted from quite a strong objective stance ("this is not readable") to a subjective one ("this is not how I prefer to write it"), without really conceding anyone else's points.
1 think my point makes more sense when you consider that I 1. Don't work with production code (more analysis, ad hoc code in an investigatory capacity) and 2. that when 1 mention someone is "new" what 1 mean is someone actively learning and not from a technical background. 1ME, folk like that have a difficult time with that floating 1. So while yes it's a standard that programmers are familiar with, it's not something that someone new will be very comfortable with. Lots of people 1 work with come from a pandas-only background.
Not really conceding because as far as 1 can see, everybody is coming from a position of familiarity.
That's totally fair. Perhaps the confusion could have been avoided by qualifying in your initial comment that you're referring to a specific situation i.e. not-too-technical analysts writing ad hoc code.
I didn't downvote you, but consider this - I work with SQL a lot, like a lot a lot. Something that's your code today is probably my code tomorrow.
So when you say "my flow is X" and your flow is inimical to maintaining it and extending it, people might get a bit irritated at the last dev that did the exact same thing.
Any situation where a new or elsewise unknown attribute breaks your code, binding is a big one.
* Say you are joining two tables and one now has a conflicting/duplicate column name: surprise, you now have broken code; it literally will not execute because of the ambiguous duplicate references.
* By the same token, downstream views can break for the same/similar reason.
* In some engines (SQL Server), views are not going to actually include your new columns until they are "refreshed", so one day, out of band of your deployment, your views will suddenly change.
* Say you have a report with specific headers - tada, it now has whatever people add to the table - and sourcing a column can be a pain because it's unclear where it comes from in the query, requiring a trip to the schema.
* Performance expectations can change if the data type is much larger, up to the point of actually breaking the client or consuming tens of billions of times more resources.
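The first bullet is easy to reproduce. A SQLite sketch with invented tables: an unqualified column reference works fine until someone adds a duplicate name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, name TEXT);
    CREATE TABLE customers (id INTEGER, city TEXT);
    INSERT INTO orders VALUES (1, 'widget');
    INSERT INTO customers VALUES (1, 'Paris');
""")

# Works today: 'name' exists in exactly one of the joined tables.
ok = con.execute(
    "SELECT name FROM orders JOIN customers ON orders.id = customers.id"
).fetchall()
print(ok)  # [('widget',)]

# Someone later adds customers.name; the very same query stops executing.
con.execute("ALTER TABLE customers ADD COLUMN name TEXT")
err = ""
try:
    con.execute(
        "SELECT name FROM orders JOIN customers ON orders.id = customers.id"
    )
except sqlite3.OperationalError as e:
    err = str(e)
print(err)  # ambiguous column name: name
```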
Usually code refers to columns by name and additional columns are just ignored; the only breaking case I know of is when you add a column to a joined table and the column name already exists in the other table, resulting in ambiguous column names.
I am talking about 'select *' at the top level; they can be harmless in exists, sub-selects and CTEs. The number of columns sent to the client (the program) will change when you add a column to the database. If you don't remember to change all of the places in your code where you used a 'select *', your program is likely going to fail or have unexpected results.
The relevant thing for exists is the set of the rows returned by the select being non/empty.
The value in the columns is irrelevant. Using * creates an incorrect impression that the value is relevant and the relevance tied to the data.
Sounds like what you want is "bang" from Max, which is kind of like the opposite of null (the billion dollar mistake) but more existential, or dang (the hn moderator) but more explosive:
User:
Explain Max/MSP's use of bang.
ChatGPT:
Max/MSP, often simply referred to as "Max", is a visual programming language for music and multimedia. In Max, the fundamental building blocks are "objects" that are connected with "patch cords". These objects can process, generate, or manipulate various types of data, including audio, MIDI, and other control messages.
One of the most basic and essential objects in Max is the "bang". Visually, a bang is represented by a small button in the Max patcher window. Conceptually, a bang is an event trigger. When a bang is received by an object, it tells that object to do something, often to execute its primary function.
Here's a breakdown of how the bang is used in Max:
Triggering Actions: Many objects in Max perform an action when they receive a bang. For example, the metro object (a metronome) sends out bangs at regular intervals. When connected to other objects, these bangs can trigger various actions in a patch.
Flow Control: Bangs can be used to control the order of operations in a patch. The trigger object, often abbreviated as t, is particularly useful for this. It can output a sequence of bangs (and other data types) in a specific order.
User Interaction: The bang object can also be used as an interactive button in a user interface. When you click on it, it sends out a bang.
Feedback: Bangs can be used to provide visual feedback. For instance, when a certain condition is met in your patch, you might use a bang to flash, indicating that the event has occurred.
Conversion: Sometimes, you might receive data, but you only want to know when the data has arrived and not necessarily what the data is. In such cases, you can convert any incoming data into a bang. For example, if you connect a MIDI note input to a bang, the bang will flash every time a note is played, regardless of the note's pitch or velocity.
In essence, the bang is a fundamental concept in Max/MSP, representing an event or a moment in time. It's a way to initiate actions, control the flow of operations, interact with the user, and more.
User:
How about a special value "dang" for propagating mistakes?
ChatGPT:
The concept of a special value like "dang" to propagate mistakes is an interesting and humorous idea. Let's explore this hypothetical scenario:
Purpose: "Dang" could be used in a system to indicate that a mistake or an error has occurred and has been propagated. Instead of failing silently or producing unpredictable results, the system would clearly mark the affected data or process with "dang".
Usage:
In a programming context, "dang" could be a special return value indicating that a function has encountered an error that wasn't caught but isn't catastrophic enough to halt execution.
In a database, "dang" could indicate corrupted or suspicious data that resulted from a known error but hasn't been cleaned or addressed.
Advantages:
Transparency: Instead of silent failures or cryptic error messages, "dang" provides a clear and human-readable indication of a problem.
Debugging: It could make debugging easier, as developers can quickly identify and trace back the source of "dang" values.
User Experience: For end-users, seeing a "dang" value could be a more friendly or humorous way of being informed that something went wrong, as opposed to a generic error message.
Challenges:
Overuse: If used too liberally, "dang" could become a catch-all for various issues, making it harder to pinpoint specific problems.
Misunderstanding: Without proper documentation or understanding, developers or users might be confused about what "dang" actually represents.
Standardization: For "dang" to be widely adopted, it would need to be standardized across platforms and tools, which is a significant challenge.
In essence, while "dang" as a concept introduces a fun and novel way to handle and propagate errors, its practical implementation would require careful consideration to ensure it adds value to the development and user experience.
Eh, anytime I need to read SQL that includes that I have to sit down and think about it. It's not glance-and-understand and you have to maintain that state in your head. Not the sort of SQL I want to rely on as a practice.
Mind you, most of what I write are ad hoc SQL queries where I might write 100-200 queries in a single day. So that sort of thing really adds up. And it's in a strange place where performance isn't dire, since I can work on other stuff while queries chug.
With most databases (PostgreSQL is an exception), SELECT must select something, even if you only care about the record existing in general, and you're not interested in reading anything from it. SELECT 1 means "select the first column", whatever it is. Specifying a column number also works with ORDER BY and other clauses. Of course any natural number works, as long as the record has at least that many fields.
SQL queries return not columns from table, but a result of relational operations with new columns specified by expressions. What a downvoted user itt cannot accept is that 1 is as meaningful as * or <colname> in general and is an idiom for “columns don’t matter here, only [the number of] rows”. It’s like [[1]].length vs [[id, …]].length. Saying “it’s not my data” is misunderstanding what that data is.
In SQL Server at least, no, it literally means select the integer 1. In the ORDER BY clause, it does mean to order by ordinal position, but that's not a great thing to glorify, since ordinal position is not necessarily stable. I think other dialects like MySQL might allow GROUP BY 1, but that's not a great thing to glorify either.
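Both points are easy to verify; SQLite agrees with SQL Server here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (a TEXT, b INTEGER);
    INSERT INTO t VALUES ('x', 3), ('y', 1), ('z', 2);
""")

# In the select list, 1 is just the integer literal 1, emitted once per row:
ones = con.execute("SELECT 1 FROM t").fetchall()
print(ones)  # [(1,), (1,), (1,)]

# In ORDER BY, 1 refers to the first item of the select list (here b),
# which is exactly the positional instability the comment warns about:
by_ordinal = con.execute("SELECT b, a FROM t ORDER BY 1").fetchall()
print(by_ordinal)  # [(1, 'y'), (2, 'z'), (3, 'x')]
```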
I'm pretty sure you still have to select something in postgres... or are you saying there's a different operator to do the same thing?
> SELECT 1 means "select the first column"
interesting.. in postgres this actually gives you a single-column value of 1, in an unnamed column... I haven't used other DBMSes much but expected similar behaviour
Indeed, I would be surprised if there exists any RDBMS where the content of the select list in an EXISTS subquery matters. Postgres's SQL dialect even lets you use an empty select list (`... WHERE EXISTS (SELECT FROM related_table WHERE related_id = id_from_outer_query)`). In T-SQL, however, a non-empty select list is required, and in my experience, developers writing T-SQL tend to prefer a constant value like `1` or `NULL` over `*` -- I suspect there's some superstition there related to both the common wisdom that `SELECT *` queries are to be avoided (which is true -- you really should only ever select the columns of interest) and a lack of truly understanding that EXISTS only reads enough of the table to ascertain whether rows exist that satisfy the subquery's predicate, and returns a boolean value, not a result set.
As a non-sql-expert (I've used an exists query approximately once in my life) I prefer `select 1` over `select column_name` or `select *` because it doesn't look like it might be meaningful.
Postgres's dialect seems like it made the right choice here.
It works fine for me.. when I'm thinking in sqlese. The bigger travesty here is using SELECT for the tasks which are not selecting (returning data) anything.
Both Oracle and SQL Server have the advice to use a constant value in their documentation. Postgres used to advise the use of `*`, but it looks like they improved their optimizer so it doesn't matter anymore.
It's not superstition. It's people that know deeply how a complex system works picking the option with the best set of side-effects.
You could. But `SELECT 1` is a very common pattern in SQL Server, intended exactly to declare intent to the reader that "this subquery doesn't return any real data" - in the event the `EXISTS` that surrounds it isn't enough of a clue. You could use `SELECT NULL` or `SELECT 0` or `SELECT *` or `SELECT 12/0` and the behavior would be identical, but most of those would likely raise even more questions to an inexperienced reader than `SELECT 1`.
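A quick sanity check that the select list inside EXISTS doesn't change the result (SQLite shown; `SELECT 12/0` is omitted because division-by-zero handling varies by engine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (parent_id INTEGER);
    INSERT INTO parent VALUES (1), (2);
    INSERT INTO child VALUES (1);
""")

# The same EXISTS query with four different select lists in the subquery:
results = [
    con.execute(f"""
        SELECT id FROM parent p
        WHERE EXISTS (SELECT {expr} FROM child c WHERE c.parent_id = p.id)
    """).fetchall()
    for expr in ("1", "NULL", "0", "*")
]
print(results)  # four identical results: only parent 1 has a child
```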
Just be sure to check the resulting plans. The DB we use usually has sub-optimal plans when using IN compared to EXISTS, often quite drastically so.
You always need to check the resulting plans if you care about execution performance, that's just how life is.
I'm offhand a bit surprised IN does worse than EXISTS; I can understand NOT IN being slow, because it has very surprising NULL handling that is hard to optimize for.
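That surprising NOT IN null handling, demonstrated in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (NULL);
""")

# 1 NOT IN (2, NULL) evaluates to NULL (unknown), not true,
# so NOT IN over a set containing NULL matches nothing:
not_in = con.execute(
    "SELECT x FROM a WHERE x NOT IN (SELECT x FROM b)"
).fetchall()
print(not_in)  # []

# NOT EXISTS sidesteps the NULL and behaves as most people expect:
not_exists = con.execute(
    "SELECT x FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x = a.x)"
).fetchall()
print(not_exists)  # [(1,)]
```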
Explicit semi- and antijoin operators would be a nice addition to the language, but do note that they present some additional challenges for the optimizer. In particular, when the join condition becomes explicit, it can take on more forms than before, which can limit the reorderings allowed or allow new ones. E.g. you now need to deal with this structure:
(a SEMIJOIN b ON a.x=b.y) JOIN c ON b.z=c.z
This was an impossible structure before, since IN and EXISTS both hide b's columns (all semi- and antijoins effectively come last), and your optimizer will now need to know whether e.g. this associative rewrite is allowed or not:
a SEMIJOIN (b JOIN c ON b.z=c.z) ON a.x=b.y
Also, you'll need to deal with LATERAL semijoins, which you didn't before…
None of this is impossible, but there's more to it than just a small syntactic change.
I don't get it. "SELECT 1" is significantly more informative than "SELECT red_herring". It's a self-documenting and automatically enforced comment telling you that the information requested by the SELECT statement is not used by the query.
I copy paste SQL all the time, especially with CTE after CTE. The row number is a simple thing to copy paste that fills multiple functions with zero modification.
And that's different from 1 in what way other than being more confusing?
Note that in the case you're advocating for, the row number is called "OrderId", which you might have trouble with if you insist on copying a query from somewhere else and using it without modification. Wouldn't you prefer "1"?
It's helpful within CTEs as a unique identifier. This whole conversation makes me wonder if not many people minmax this sort of meta programming thing.