Master SQL joins by understanding precisely how tables connect and reveal hidden insights within large datasets. Gain clarity on identifying sales trends, user behaviors, and unsold products through effective join strategies.
Key Insights
- Understand your dataset thoroughly to effectively perform SQL joins; knowing table structure and relationships is essential for accurate querying.
- Identify outliers such as unsold products by using outer joins; for example, joining the products and line items tables with a left join reveals products without associated orders through null values.
- Recognize that joins simply match equivalent values across tables and do not necessarily require primary or foreign keys; matching song IDs from unrelated music platforms illustrates how joins can occur even without formal table relationships.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
This is a lesson preview only. For the full lesson, purchase the course here.
So let's talk through this solutions challenge question here. So just as a recap of one of the things that I mentioned yesterday, but just to really drive it home until you understand the data set that you're working with. It's kind of hard to understand all the ways in which you can use things.
So your first thing, whenever you start working with a new database is to look through it, but also to understand the intent of the data. And whenever you're working for a company, if you don't understand how the data works, if you don't understand what it means, talk to somebody else at the company to try to understand what this stuff means. And, you know, some of it you could just figure out maybe on your own, but sometimes there might be certain names that are confusing.
So first you want to make sure you understand what's going on. And so just to explain our database in this case, the idea here is if I open up the tables here and look at the ER diagram, is that, zoom in here, make all this bigger here. Is it users place orders? Those orders know who placed the order, when it was made, who it was shipped to, which could be different than where that person lives.
It does not know any status. It doesn't know what was ordered. Those are broken up by line items.
So associated with a certain order, we know the product that was purchased, the price, the quantity, and the status, you know, was it returned? Was it shipped? That sort of stuff. But we also have this list of products where we have the title of the products, tags for searching when the products were created, what their normal price is. It's possible that somebody paid a different price.
Maybe they got a discount or something. But when we think about products, we may or may not have sold all of these products. We've listed them for sale on the website or app, but we might not have sold all of those.
So let's say we define a new product that goes into the products table. Is there automatically a line item for that product that we just added? No, because somebody needs to purchase or purchase, right? They need to place an order. So the only products that will be in line items are products that have been purchased.
What if a product has not been sold? It will be in products, but it will not be in line items. We're trying to, when we're searching for things, we need to figure out what tables do I need to look at? What do I need to involve to find what I'm looking for? Because if I look in products, I only know my list of products. I don't know whether they've been sold or not.
Line items tells me what's been sold, but it doesn't tell me what hasn't been sold. I need to see the connection between my list of all the products and the list of sold products to figure out those outliers of products that haven't sold. Kind of like if I was thinking about users and what orders they placed, if I just want to see the users who placed an order, that would be a match.
But what if I want to see users who didn't place an order? Those would be outliers. And I'd have to consult the orders to know whether they place an order or not. We have to think about the tables that we can connect to give us this information.
Is there's created at in these, but they're not connected to each other. So let's think about what these are and the way that joins work. So why is user ID connected to user ID? Because these are unique things.
So for example, a user, a user has a unique number. So user number one, user number two. In orders, we need to know who placed that order.
And so when we think about user number one, place this order. Theoretically, could it be possible that two orders were placed at the same exact time? Or two users could create their account at exactly the same time. It would probably be pretty hard, but in theory, if you're getting enough users, even though they track it down to the millisecond, in theory, two people could create their account at exactly the same time.
So what, when the user created their accounts, be a good way to identify a unique individual person. That wouldn't be a good way to identify a person because in theory, two people could create their account at the same time. Now, this is when the user created their accounts.
This is when the user placed their order. Those two pieces of information have nothing to do with each other. As far as connecting tables.
If I look up when an order was placed, does that help me to know who placed that order? No, the only way I can know who placed the order is to look it up by their user ID. When we're doing joins, we're looking for two columns that have the same data that is equivalent to each other. User number one, user number one, place that order.
User number two, user number two, place that order. We're looking for something unique and repeated. That we have the same data in both.
Let's just say we have the same time here and the same time here. So a person created their account and placed an order at exactly the same millisecond? Probably not, right? They're going to have to create their account and then place an order. Those two things don't connect to each other because they're completely different type of data.
They're not like data. I know they're the same kind of data in the sense that they're both time data, but they have nothing to do in terms of building a connection between those two tables. Does that make sense? Also, just to be clear here, when I select user ID and it highlights user ID over here, I really like that DBeaver makes that connection for me.
Not all database apps will do that. Sometimes you just have to look to see one will be labeled a primary key. One will be labeled a foreign key, but also primary and foreign keys are not required to make joins.
They are often used to make joins, but all a join needs is to say that there's some column in one table and some column in another table that has the same data that will sometimes match. For example, there was a company I was training that works in the music industry, and they get database dumps of song play counts. At the end of every day, let's say they get a data dump from Amazon, Spotify, Apple Music.
They get this database dump. They get it once a day at the very end of the day of play counts for songs. And they aggregate all of that data together to say, this is how many song plays across all of these services they got.
So they need some way to connect those databases, which they didn't create that data. They get that database information dumped into their database from these different services. When they create that database there, they have one for Amazon, one for Apple, one for Spotify, let's say.
And think about products, like how products have a UPC code, which is a universal product code. So think about songs, if songs have a unique number. In the Apple, Amazon, and Spotify tables that are just three unrelated tables, they all have a song identifier because there needs to be some universal song identifier that all these services use so that they know that that's a certain song.
All three tables, they never get updated for each other. The whole idea of a primary and a foreign key is that in the database, that if something ever got changed, it would update the other one. They're literally connected.
But since they're just getting these database dumps coming in, they're not modifying data or anything. They don't have primary and foreign keys set up. They're just all just three separate unrelated tables.
But we can find a common thread that binds them all together in that song identifier. So all three tables have a song identifier. And it wouldn't necessarily light up like this because they're not primary and foreign keys.
Because which one defines that? None of them define the, they don't say, I'm the creator of this song ID. They just all share the same song ID. So they're just three unrelated tables.
So here there would be no connection between those three tables. They're just three separate tables, but they have a common column. They might be all called the same thing, like song ID.
But because they're from different companies, they might not have chosen the same name. But yet it's the same kind of data. If one called it Apple song ID, if another one called it music ID, the data that's in there is still the same data to make a match.
The name doesn't matter. They can be different names or they can be the same name. And it doesn't matter whether they light up or not as far as primary and foreign key.
The only thing that's required for a join is to say, I know these two columns that the data inside matches. As long as that data matches, we can join on that data. Right.
If you say, well, an order was for user number two and you change the two to a three, you reassign that order to a different person because now they have to go to three to say user number three. That's a different person. Now, if it was accidentally attributed to the wrong person, you could change that here, but you're never changing it in the place that defines it because user number two will always be user number two.
Right. Over here, you could say, oh, wait, actually, oh, sorry, that wasn't for this person. It was for another person.
Let me change it there to point to that other person. But then three will point to three. Now we know which user.
So there's always got to be some matches where the data is the same. Many times it's a number just because we have unique numbers for things, but it doesn't have to be a number. It could be something other that is unique, like some sort of like model number or something.
Like there could be letters and characters in there. Doesn't have to be a number. We're just saying this equals this.
And that's what a join does. When we say join on, you're saying these are my two columns and they equal each other. That when they equal, that's when we're joining them.
Right. So joins are very simple. When you actually look at what they do, they simply say, let me compare values in two columns.
This does this equal this. That's why we join on two columns. We're saying when this equals this, that's a match.
We know it's a match. I'll put that row into the result set. They're just looking for matches.
And then if we do an outer join, we're saying, oh, don't just give me the matches where they do equal. Also give me all the non-matches as well. The outliers.
I know joins can be kind of confusing in the beginning, but when you look at what they actually do, they're just trying to match rows in two different tables. And they're looking for equal values. That's all they do.
They say, does this equal this? And if they don't find a match, an outer join would include it, but an inner join wouldn't. Right. So they should go row by row.
Think of how much work a join is. It's amazing how fast it works. I'm amazed.
Because they go row by row and they compare every user to every order and say match, no. Match, yes. Include the row.
So there's so many times they don't include the row, and there's only a few times they do include the row. But they do a ton of matching. And actually, there's a little animated GIF.
SQL joins animated GIF on this data school website. If this helps you to kind of understand what it does. So here they're looking through and they're saying, hey, does Matt equal Matt? Does Lisa equal Lisa? Does Lisa equal Lisa? When it says red, it says, hey, that's not a match.
Right. So it's like, Matt, is Matt? Yes. Include the row.
Matt is not Lisa. Matt is not Sarah. Matt is not Louis.
Is Lisa all of those? And so it literally goes through and is Lisa all of those things? Is Jeff all of those things? Is Sarah all of those things? And it only includes the row when it finds a match. But it compares every single row, row by row by row. That's a match.
Let's go to Lisa. Not a match. Match.
Not a match. Not a match. Jeff.
See, so Jeff got forgotten about because Jeff had no matches. So he was left off because it said not, not, not, not. Don't include him.
It's a lot of work to do a join, yet they happen so quickly. It's kind of amazing that they can process it that fast. But they're just looking for matches.
That's all joins do. And then if you do an outer join, it also includes the outer. See, this is just a regular join.
If it were a left join where it included this, then Jeff would have been output into the result set. So the result set of a join starts empty, and only when matches are made does it include it in the result set. Or if it's a case of an outer join, it also includes it even if there is no match.
But the question is, what does it put on the right? If there's a match, it puts the stuff on the right from the match. If there is no match, it just puts nulls. Because if it matches with nothing, it puts nothing.
It can't make up data. Okay. If the goal here is to find products that have no orders.
Products. And I can know what was ordered by looking at line items. Because these are the products that were sold.
And there's a product ID here. That if I find that product ID in line items, I know it was sold. But I want to see sold or not sold.
Because I want to see all the products, whether or not they've been sold. If I just want to see products that have been sold, I would do an inner join, just the matches. But I want to see products, whether or not they have been sold.
Give me the sold ones and the not sold. So I want every product, no matter what. When I do this, I care about products.
Because I want to find products that have no orders. So I'm going to start with products. And I'm going to, right now I'm going to just type in join.
But I'll talk about which kind of join I'm going to do properly in just a moment. I'm going to join in with line items. And I'm going to join on the column that has common data, which is product ID.
That connects the two. So in one table, let's say I start with products. It's got product ID.
And then equals in the other table, the same product ID. And it's because the columns have the same name. That's why I must specify which table they come from.
If the columns had different names, I would not have to specify the table they come from because it would be obvious. It's only when you have two columns that are the same name that it's confusing or ambiguous. Now, this would only give me products that sold.
But if I want all the products, I want a left join. Because I want all products, whether they've sold or not. Join would only give me the matches.
Left join says, I care about this left table. Give me all rows from that left table. So you'll get every product, no matter what, whether there's a match or not.
Which means that on the right-hand side in the line items, If there's a match with a line item, it'll fill in the actual line item. But if there's not a line item, what will it fill in on that side? Nulls. So that gives me the join.
And you might be able to browse the data to find nulls. Which if I do, all the way down here at the bottom, there's some nulls down here. I say you might be able to browse the data.
This is intentionally a small training database. So it helps you to visualize the data. But in real life data, you're not going to deal with 100 users and like 500 orders.
You might deal with hundreds of thousands or millions of users. And deal with thousands or millions or even billions of orders. So you can't always just browse your data to start to make sense of it.
Sometimes you can, but when you're looking through millions of stuff, It could take you a really long time. So that's why we learn on a small training database. So you can start to understand these things.
Because when you scale up to a full-size database, Some of these things you just have to know, not just browse for the stuff. Like you'll have to type in a search. Which is, hey, if I want to find those, instead of browsing for those, Because it might take me way too long to browse to find those.
How do I find those products that didn't sell? Well, I'm looking somewhere in my line items table. And if you think about the columns that are here. Are you going to have an order ID for a product you didn't sell? No.
Are you going to have a price, a quantity, a status in the line item side? Really, none of this data would be filled in. Like it would all be empty. So technically you could search in any of those columns.
I would personally recommend the line item ID. Because that's the one that's the primary key. And in a primary key, you must have a value for a primary key.
That defines the line items identifier. You must have that if you have a line item. So there's no way the database could ever have that be known.
If I look for rows where that line item ID is, no. Those are the products that have no line item associated with them. But technically speaking, you could also look for order ID.
That would work. Really all of these things, status. Really, you could look at any of those things.
Because all of those would really be required if you actually had a line item. So I personally would look for the primary key. But you could look in any of those columns.
And in this particular case, it would be fine. You don't have to show all of these in the final result set. If you don't want to see them all.
If you're just looking for some of this information. You could just type in those columns that you want to see. Now, if you don't want to have all this line item stuff.
If you just want to have the stuff from products. You could say, from the products table, show me all the columns. And that will just show you the columns that are in the products table.
So that way, you still needed to do the join. Because you needed that information for the where to work. But you don't have to show people that empty stuff that you don't want people to see.
When I first got started with SQL, I thought the asterisk was so simple. It was asterisk or something else. And I'm like, wow, you can actually do a lot with that asterisk.
But the asterisk simply means all columns. If you have it by itself, it's all the columns for everything. But if you put it inside of a table, it's all the columns just in that table.
Just like if you have this table with this column. Here you have this table with all the columns. Pretty cool.
Cool things you can do with it. So that was the point of doing this join. The next step now is to look for those outliers and find them based on their no.
This was just to show these. If I left it on the asterisk, see how we see all of these extra columns that I don't really need? But if I just want to find products that have no orders, why do I want to show people the line item stuff being empty? As far as making a nice result for somebody else. If they just want to see a list of the products, I just want to show them the products information.
So it's just to hide those extra columns that I don't want. And I don't want to list out 1, 2, 3, 4, 5 columns by writing out the names. Because that takes a long time to write out all those column names.
If you use asterisk, you're saying all the columns from all of the tables. And I have the products table here on the left. And I have the line item table over here on the right.
So this means all from both. But I could say just show me the line item columns, which would be completely pointless in this case. Or just show me the product table columns, which is actually what I'm looking for.
I don't have to show all of those columns. Which only works when you have a join, when you have columns from multiple tables. And that's one of the reasons why we define the alias.
So I can use that shorthand there instead of having to write out products. How do you identify which is the primary key? When you are looking at, let's say I double click on tables here. And you look in the ER diagram.
The primary keys are put up here at the top. See how those are up here? So those are the primary keys. So that's one way that you can identify a primary key.
Because the stuff that's on this top line, those are your primary keys. And notice how a primary key connects into another column. So this would be a foreign key.
But all the primary keys are sitting at the top of those tables. Notice that not all of these tables have a primary key. So notice how some of these don't have anything listed up there in that top part.
So these do. If you see something above this line here, those are primary keys. The other way that you can also do it is if you look at a particular table.
So let's say you're outputting it. There's a little orange key. And I know it's small, but there's a little orange key.
And that tells you that it's a primary key. It's a tiny little orange key right there. That's the other way you can see it in the results set.
So let's say we want to find the names of people who ordered something in a quantity of 5 or greater. People who placed an order. Do we care about people who didn't place an order? No.
So do we want an outer join or an inner join? Inner join. We don't care about those outliers, the people who didn't place an order. We just care about people who did.
We'll just do a regular plain inner join. Keep in mind, it's called join for a reason. It is the most common kind of join.
I know we learned outer joins after inner joins. But most of the time, inner joins are the way you go. Most of the time, you're just looking for matches.
The vast majority of times, that's your default. Your assumption should be in most cases, you just do join. That's why they call it the normal join, just join.
Because it is the most common join. Every once in a while, you do want those outer joins. But that's not the norm, or else they would have called that the normal kind of join.
So just your regular inner join, just called plain join. That is your normal join that you do most of the time. When we're looking for people who did place orders in a quantity of five or greater, we're just going to do a regular join.
We want the names of people, and we want only people who purchased a quantity of five or greater. So on any individual line item we're looking for, is the quantity five or more on an individual line item? Because that's where the quantity is. Quantity is over here in line items.
So I want the quantity from line items, and I want the name from users, but is there a way for me to connect just those two tables? Just those two. Not directly. Because there's nothing common between just those two.
Yes, I want data from here, and I want data from here, but there's no way to just connect them directly. I can go through orders, but I have to go through orders because I can know who placed an order by user ID, and then I can know what this order is associated with, right? This line item is associated with this order. So I can kind of go from this line item to this order to this user.
So I have to go through orders because you might say, well, I only need stuff from here and here. Why do I have to go through orders? Because that's the flow of data. To get from Manhattan to New Jersey, you've got to go through a bridge or tunnel.
You can't just go any other way. You've got to go through one of those things. We've got to go through orders.
So I can either go line items to orders to users, or I can go users to orders to line items. I can go in either direction. I don't care which table you start with.
The one you start with, though, we read left to right. So the results set that we create going across your first table will be on the left. Your next table will go to the right.
Your next table will go to the right of that. So start with the table that you want to start seeing. So it's up to you which way you want to go.
I would probably start with users because I want to find users. So I'd probably start with those, but it doesn't really matter. It's just ordering my columns in this case.
If I start with my users and then I join because I just want to do a regular join, it's the most common. Just want people who did place an order and I join in my orders. Every join requires two columns to say I'm joining on the equality of these two.
So what columns am I joining on? Users. User ID. Those two things are equal to each other.
When the users table has a user ID, user ID, that when that equals in the orders table, it's also called user ID. For me, I just go and say, OK, well, I want one column of this and I want one column of this. So just for me, for mentally, I just think, let me find a column in the first table.
And then that equals another column in the second table. It doesn't matter the order of these because two equals two and two equals two, no matter which side they're on. So the order of those don't matter.
But just for my mental health, I just think I'll do one from the first table and then I'll do one from the second table. Because I've got to have a column in each table to say this equals this or this equals this. The order doesn't matter.
They're equal to each other. So that gets me my user table over here on the left. And then as I go across, it gives me the order data over here.
So this is all the order data. Now I know that, you know, this user here, Elma, she placed this order at this time, shipped it to this person. But what I don't have is the line items that comes next.
Each join, you just do one join at a time. Now we want to join in again. And I just want when orders match up with the lineup.
Not looking for outliers here. And now I want to join from orders to line items on what? Order ID. When order ID equals order ID, that's what I want to join on.
So I want to join in line items. Line items on. And just for my mental health, I just think, well, I need one column from here.
So there's called order ID. I need one column from the other table. And it's also called order ID there.
So I've got one column in each table. Because I have one column in each table. When those two are equivalent, I know there's a match between that order and that line item ID.
They match. They go together. Now I have even more columns.
This user data is on the left. And as I scroll to the right, I now will see order data, which comes next. And then as I keep scrolling over, eventually I get to line item data, which is on the right.
So each new table just gets more and more columns on the right. I just keep getting a bigger and bigger table, not just columns, but also rows. Because I'm matching line items to orders and there might be multiple line items per order.
So I'm just adding not only columns, I'm adding rows as well. Now in any given row on an individual row, I know who placed what order for which items all on any given individual line. Now whenever I look at a line, I just want to see people who have a quantity of five or greater.
And so I've got my quantity over here. I can say only show me rows where that quantity is greater than or equal to five. Now I'm hiding a bunch of rows.
Now, as far as columns, I'm really only interested in the user data. Maybe just their name and email. If that's all I want to see in my list of columns, I can say, just show me name and email.
And I know that those are the people who purchased a quantity of five or greater in any individual line item. And here I'm saying to display name, email and quantity. So let's also add quantity here so that I know how much they purchased.
If I want to put the highest quantities at the top, I could say order by quantity. Now that would put the lowest at the top. If I want to flip it, I have to add descending because the default is ascending.
So that puts the highest at the top. Secondarily, I'd like these people to also be put alphabetically. So for all the people within 10, I want to alphabetize them by their name.
So I'm going to put a comma and then say name. So first in the list, it'll organize, it'll sort by quantity of 10. After that, within all the 10s, it'll sort alphabetically by name.
Within the fives, it'll alphabetize those within the groups of five. So it kind of breaks them into groups. First, it does it by quantity.
Secondly, it alphabetizes by name. This default is ascending for name. Although if I wanted to do it reverse, I could add descending to there.
And that would do a reverse alphabetical order. So I would do Z to A rather than A-to-Z for the name. I want to be A-to-Z. So each column, you can say which order you want to be.
One to three or three to one, A-to-Z or Z to A. Your limit is how many columns you have. You can only sort by columns you have. That's your only limit.
So you can say the first thing you want to sort by, the second thing you want to sort by. But also at some point when you get to like three or four, do you really have groups that you can sort within? Because all the tens will be sorted alphabetically. But do you really have these? Like there's no sense in sorting the email like that won't change sort at this point.
Also, I see some rows that are a duplicate of other rows. If I want to get rid of those, how would I get rid of those? Select distinct, yes. Distinct eliminates duplicate rows.
Now anybody that did it more than once is going to be gone. I only need to see them once. And so you take things one step at a time.
And in the end, you've gotten a big query, but you've done it one step at a time, adding piece by piece. You keep filtering and filtering and filtering down to what you want. You start with lots of data and you whittle it down to only showing what you want.
Once you're done, if you want to export this and share it with somebody, down here, you can right click or control click if you're on a Mac, and you can export this data to give it to somebody. Let's say they want to get it in Excel or something. When you right click anywhere in this result set and export your data, you want to choose CSV, which is a comma separated value file.
You can open those up in Excel or Apple's numbers or something. You can import those into data visualization software. If you want to take it into Tableau or Power BI or something like that to visualize the data, you could do that.
So CSV is what you want. You click next, next, next. And when you get to the directory, this is where it's going to put the file.
So you just want to hit the browse button and choose the folder that you want to put it in. Let's say I choose my desktop. And then I hit proceed.
And so it just made a CSV file on my desktop right here. And I could open that up in Excel and or put into my data visualization software and do something else. So that's how you get it out and be able to give it to somebody.