Dive into SQL joins and null value handling to streamline your data analysis. Learn the differences between inner, left, and full joins, and how to effectively pinpoint missing or incomplete data in your tables.
Key Insights
- Understand SQL joins thoroughly, distinguishing between inner joins (only matches), left joins (all rows from the left table plus matches), and full joins (all rows from both tables), and learn how each affects your query results.
- Recognize the importance of clearly identifying null values in databases using keywords such as "IS NULL" and "IS NOT NULL," instead of the equals sign, to accurately find empty or missing entries.
- Explore practical techniques demonstrated in Noble Desktop's training session, including querying for empty strings versus null values, utilizing table aliases for readability, and handling scenarios where table joins reveal data outliers.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
So let's see this in action here. Close this up here. All right, so I'm going to open up a file here from our class files.
Today we're going to be working in level two. I'm going to go through, I'm going to demonstrate some code, show you some warm-ups here. During this time, no need to follow along.
Afterwards, you're going to have some time where you can do this yourself. So I've got these warm-ups in reference here. So I don't expect you to remember all this code off the top of your head.
So here's some code that you can see. We're going to type that code in and we're going to go through and understand it. And then down below, there's going to be some challenge questions where you can use what you've learned up above and code yourself.
And then once we're done with that, we'll come back and we'll review the answers to these challenge questions here. But let me go through and explain and show you some examples of how this stuff works. And explain a few more things as well.
Before I get started, up here at the top, I do not have an active data source because this particular file on my computer has never been associated with my server. If I come over here to the little platters, three little platters there, and I click the NA for not available, I can choose my server. I renamed my data source.
I renamed it Noble Desktop Server for anybody here in level two that's new. Yours would probably have been named Company Data, but I renamed mine to the server because technically, I know they name your connection the same thing as your database that you connect into. But technically, the database, there's actually multiple databases here on my server.
So I like to rename my connection and call it Noble Desktop Server because that's technically the server. And then I have my database sitting in that server. So the three little platters there, if you click on that and you choose your server, it then by default connects into the Company Data database, which is the database that I want to connect into.
Later, we'll talk about connecting into other databases on the same server. But for right now, we're in Company Data and we got these tables here. So for employees, I want to show you employees here.
And we run this by hitting the little play button here or hitting Command-Return on the Mac or Control-Enter on Windows. So here is all of the rows, all the columns, everything here for the employees table. And I'll make that a little bit bigger.
All right. When we look here, we've got employees. We've got the departments they work in.
And before I even get into joins, since this is the first time we're looking at nulls, notice how some of these say null. If I want to find nulls, because that's one of the things we do when we do outer joins is we search for nulls. We haven't looked for nulls before.
If I just want to, without even doing a join, just find these nulls, I could say where, so I can filter out rows and only show rows where the department ID, which is this column right here, is null. Is null will find the null values. You cannot say equals null.
A lot of people think that would work. No, that does not work. This is more like plain English.
They try to read nicely. Equals is used for strings. Equals is used for numbers.
But equals cannot be used for null. You must say is null. You must write that out.
Notice null is not a string. It's not a number. It's a special thing.
So that's a special keyword that we use to just find those empty values. Again, it's not a string. It's just an empty value.
If I just look here, I find people that don't work in the department. If I look in the manager ID, if I look in that, that would tell me people that don't have a manager. That's Bruce.
Bruce is the CEO. So he has no manager. He has no department ID.
So sometimes you just have nulls in your data. That's how you find those nulls. Now, what if you want to find people that do have a department ID? So here are the people that don't have a department ID.
I could also say is not null. These would only be the people who do have an employee. I'm actually hiding the people that are null.
See, if I show the whole employees table, that will show all of them. Then if I put the filter to filter out rows where it's not null, I will only see the people that do have a value. So that's a quick way.
If you want to hide the people of a certain type that are null, you can just hide those people by saying is not null. Also, moving on to a similar kind of related feature here that we need to understand is when people are creating the data in our database. You are not in control of that.
If you are just querying the database, somebody else is responsible maybe for creating the data in here. And in here, in the users table, there's a bunch of passwords that are encrypted. But notice how this one is empty, but it doesn't say the word null.
That one is also empty, but it doesn't say the word null. And there is one down here that says null. So clearly there's a difference between an empty cell and a null.
If I say where the password is null, notice there's only one of those. What about the other ones? Well, these contain strings. This little A-to-Z here tells me that that's a string.
I can look also in the users and look in the column and look at that password and see it's varchar, which is short for variable length of characters. 32 tells me that these passwords can be up to 32 characters maximum, but not beyond 32. They can be less than 32, but they can't be more than 32 characters.
That's the maximum. If that's a string, it could be an empty string. It could have nothing in it.
Now, this is not something that a join would ever create. Joins would only create nulls, but just so you know, if you're ever looking for these things in your database, not related to joins, but just because we're talking about empty stuff here, you might find that there's an empty string. Now, when we do strings, we use single quotes.
So here I'm going to put two single quotes, nothing in between, just two single quotes. And when I look for that, that is looking for an empty string, a string that has nothing inside of it. Sometimes programmers will do that instead of making it null, like they probably should.
So I can't tell you how your databases will be when you get to them, but just know that if somebody's creating this data, if users are creating it or programmers are creating it, they might do either nulls or empty strings, depending on how they programmed it. Now, when you're doing joins, joins will always create nulls. Just so you know, the join itself always makes nulls, but this is just a related topic here.
By the way, if you ever wanted to find both, you could say, show me the empty strings. So where password is an empty string, or also include rows where the password is no. If either one of those is true, you should include those rows.
Include rows where the password is empty or include a row if it is null. Now you see both. When you're typing in password.
Okay. So technically password is, so if you think about, think about a database and you're defining things in a database. When people connect, like when we connected to the database, did we have a username and a password? So there is technically a password keyword in SQL because you can define a password for users.
So what you're seeing here is there's a keyword for password, but then there's also in our case, we're using a column name that's called password. So that's why it's also, it was capitalizing it here. When I typed in password, it was capitalizing it because it knows that there's actually a keyword called password.
Now, because I'm not using it in the correct situation to define a password for a user, like I'm not defining users in the database. It doesn't get confused, but that's why, because I happen to be using a SQL keyword as a column name. So once you're seeing it as a column name and once you're seeing it as the keyword called password to define that for users.
We're only doing it from a query standpoint. We're not getting into the database administration side where you can define users and passwords and those sorts of things, but that's what that's for. That's why you're seeing it twice.
And it's perfectly fine because in this situation, it understands that this is looking for a column with that name. So even though it uppercases it, it's still fine. It still works, but that's why you see it twice.
Okay, so all of those things are just for us to kind of file in the back of our mind. Now, when we do joins, when we do the outer joins, it's going to include nulls in certain cases. We could then look for those nulls.
So in review of inner joins, where we get matches, if I look at my employees table, this is my employees table with departments by their ID. If I separately look at my departments table, here is a list of departments. This creates the IDs and sets the name of those.
If I want to see, let's say employees and the name of the departments they work in, I can't see that by just looking at one of these tables. I've got the employee information here and I've got the number of the department, but I don't have the name unless I look in the departments table. But here I don't have the employees that work in them.
So I have data in two tables and I want them in one query. So I can see the name of the person and the name of the department all in one query. So I must join them together.
Now, if you're only interested in people who work in a department and you want to ignore the people who don't work in departments, then you only want matches. In that case, you want to join the two. We are joining these two.
I want to call employees E for short, and then I'm going to join departments as D. Now from now on, I could just call it E and D for employees and departments. Those are table aliases so that we can write less code. I'm using just a regular join.
I know you can write out inner join, but I don't bother because it's so common. That's why they call it as a regular join, but that is an inner join that will only give us matches. And we have to define for the join, what are we joining on? Because joins require us to join on two columns that match.
So what two columns have matching data? As we saw, the department ID. Now, because they're both called department ID, fundamentally, I am saying department ID equals department ID. I'm saying in the one table, it's called department ID.
In the other table, it's called department ID. And if they were different names, this code would work if they were different names. But because they're the same name, if I try to run this, it's going to say, that is ambiguous, meaning I'm confused.
I don't know where it comes from because it's the same name. But if they were unique names, that would work fine. Since they are the same name, I have to say that one comes from one of the tables.
So this one can come from employees. And the other one has to come from the other table, which is D for departments. And now it understands that one is in one table and the other one is in the other table.
And it says those two match. When there's a match, include those rows. We only get people who work in a department.
Notice that people who don't work in a department are gone. So in the original table, we had Chad and Bruce. But in the join table, we no longer have them.
But now we have all of the department data for the people who work in a department. So I can see Amy works in sales, Elma works in sales and so on. And when I write this, notice that this is on the left, this is on the right, but it doesn't matter in a regular join because we don't care.
We're just looking for matches. We're not also including the non-matches. So I don't care.
The order of this does not matter. When you're doing a regular plain join. When it does start to matter is when I take that same code and I say, let's just do a left join.
And when you do a left join, it does also include all of the rows from the left table. See, I put this on the left and I put this one on the right. Since I put this on the left, when I say the left join, I'm saying, give me every row from this table.
I might not get every row from this table, but I'm going to get every row from this table because it's a left join and I care about the left table. So I will get every employee, every row, no matter what. I could do the same exact thing as a right join.
All I'd have to do is say, let's switch these. Cause if I want every employee and I say to do that for the right table, this would need to be on the right. I would need to reorder these.
And data wise, that will be exactly the same data entirely. The only difference is the order of the columns is a little different because now I put departments first. So they're on the left of my result set.
And these, since I put that second, those are on the right of my result set. But if I just reordered those columns, it would be the same exact results for both of these joints. Because when you say to do a right join, you're saying, give me every row of the right table.
Or when you do a left join, you're saying, give me every row of the left table. So you determine the order of your tables. Therefore you determine which side you care about.
I just think it makes more sense to do a left join. If you care about employees, wouldn't you put them on the left? Wouldn't you start with the thing you care about? So you'd put them on the left. Why would you end with the thing you care about? I think that's a little weird to care about employees, but to put them on the right, why would you start with the thing you don't care about as much? So since you get to reorder it, just don't do right joints.
Just choose to do a left join. Start with the thing that you care about. It's just simpler, just more, more logical.
When you put employees on the left, this refers to left here and it's on the left. And then when you put something on the right, that's also going to be on the right here. So the two impacts of you ordering your code this way are that not only does it change for whether you're doing a left or right join, but it does impact down here.
If you put employees first, it's on the left. And if you put this on the right, that's on the right. So like here, for example, this puts departments on the left and then employees is right.
So that puts that on the right, but that's just the default placement of things. And you can reorder columns if you want to. So for example, asterisk means all columns, but here's a crazy thing that I didn't realize for a while is it just means all columns, but you can put it into a context.
For example, if you can say that this column is in this table, you could also say all the columns in a table, comma, all the columns in a table. So I could say, give me all the columns of the employees table, then give me all the columns of the departments table. So I could actually reorder my results set so that this query up here and this query down here would end up exactly the same.
Just to prove that these results are the same. The only difference was the default ordering of the columns. But I could reorder those columns if I wanted to.
It's just when you say asterisk, you're saying include all the columns in their default order. But of course I could reorder those columns if I want to. So the order that you choose is left and right, which affects the order in the result set, but you could always change that order if you want to.
Does that answer your question or am I still misunderstanding? Yeah. Thank you. It's clear.
I got it. Okay, good. So a regular join only gives you matches, only employees who work in a department.
So that means you might not have all the employees. When I say left, I'm saying do that, do the matches, but also include the non-matches, right? So the regular join is going to give you the least amount of rows because you're only getting matches. When you say left, because I care about all the rows from the employees table, which is on the left, I'm saying also include the non-matches.
So you see how we get more rows. So regular join doesn't include those matches, but the left join does. It also includes the non-matches.
So you get more rows. If you want even more, if you say full, now we also include all of the rows from employees and all the rows from departments. Full says left and right join, meaning give me all the rows from the left table and give me all the rows from the right table.
So another way to think of it is this. Okay. And that's just, there's really only three fundamental joins.
Inner, left and full. Okay. So just think of those three inner joins, give you matches only.
So that means you could lose some rows from the left or the right tables. When you don't want to lose those rows, those non-matches, then you go to the left or the full. And do you want all the rows from the left table, but you might be missing some from the rights or do you want all the rows from both tables? That's the full.
Full gives you the fullness of your data. You don't lose anything. That's the simplest way to explain the three types of joins.
So outer joins will give you more data than inner, but do you want those non-matches? For example, if you were counting people who placed an order, if you were counting rows, I know we haven't gotten to counting yet, but let's say you counted. If you do an inner join where you just get people who placed an order, you're only getting a list of users who placed an order. You could then count those people.
If you do, let's say a left or full join, you're going to also include people who didn't place an order. Now you're counting all the users, not just users who placed an order. We need to be careful about including those outliers when we don't want to include them in our result set.
So then the reason to do something like a full join, let me go back to the full join, is because you want all of the stuff, all of the outliers, right? And so let's say you have departments that have no person in them. You're going to see that employee ID is empty, but employee ID is not empty. It's a primary key.
There's a little key right there. That's a primary key. Primary keys can never be empty.
The only way that a null could actually exist there must be because the join created it. Because if there was an employee, they would have an ID. If I search for a null in that employee ID column, so if I say where that employee ID is null, notice I get down to just the departments that have no employees.
Now, I don't need to show all the columns. I could just say, hey, just show me the department name because I just want to see the name of departments. And those are the names of departments that have no employees in them.
The only way I could know that is by joining those two tables together. And I'm literally looking for the outliers. Think about joins as saying people who do work, people who did make a purchase.
But what about the people who didn't, right? The did nots. It's those are the outliers. Those are why we turn to those outer joins.
Those are the did nots. That's the outer joins. The dids, those are the inner joins.
People who did place an order. People who do work in a department. It's the nots.
Those are the outliers. Those require outer joins. And so often you do joins to search for those nulls to try to find those outliers because those are people you're looking for.
Okay, so let's all open up from level two. So today we're in the level two folder. Let's open up this 1.0 outer join and null.
So open this. I'm going to put this name in the chat here. Open this file.
Go through those warmups at the top and then use that knowledge that you've learned to then go and do these challenge exercises here. And we'll come back and we'll discuss the solutions here in a little bit.