Data Analytics

Imp. Concepts of SQL with Examples. Ch-1

*Topics are:

1.COUNT( ) vs. COUNT(*)

2.GROUP BY vs. PARTITION BY

3.ROW_NUMBER vs. RANK vs. DENSE_RANK

---- > First, let us create a table in MySQL and visualize it.

----------------------------------------------------------------------------

CREATE TABLE EMPLOYEE ( employee_id int primary key,

employee_name text , dept text ,salary int, gender text);

INSERT INTO EMPLOYEE ( employee_id, employee_name, dept, salary, gender)

VALUES (0001, 'Clark', 'Accounting', 140000, 'm'),

(0002, 'Dave', 'Accounting', 230000, 'm'),

(0003, 'Eva', 'Marketing', null, 'f'),

(0004, 'Joy', 'Marketing', 160000, 'm'),

(0005, 'Danial', 'Accounting', 150000, 'm'),

(0006, 'Alexa', 'Accounting', 150000, 'f');

Select * From EMPLOYEE;

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

employee_id	employee_name	dept	salary	gender
1	Clark	Accounting	140000	m
2	Dave	Accounting	230000	m
3	Eva	Marketing	NULL	f
4	Joy	Marketing	160000	m
5	Danial	Accounting	150000	m
6	Alexa	Accounting	150000	f

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

**1.COUNT( ) vs. COUNT(*)**

● COUNT( ) :

(1) It counts the data from the columns. The value which we put in the bracket, 'COUNT' will take it as a parameter(column_name) and return the data.

---------------------------

Select Count(employee_name) as total_employee
From EMPLOYEE ;

---------------------------

Total_employee
6

(2) It will not consider ‘null’ values while counting.

------------------------

Select Count(salary) as salary_count
From EMPLOYEE ;

---------------------------

salary_count
5

*Note: For employee_id = 3, Salary is not specified or null. So it wont be counted in total count.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

**● COUNT(*) :**

(1) In this, ' * ' is given as a parameter, which means 'All records' in table.

(2) It will count out each and every row in table.

---------------------------

Select Count(*) as all_rows

From EMPLOYEE

---------------------------

all_rows
6

Note: generally count() will count all rows irrespective of any entry being 'null'.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

2.GROUP BY vs. PARTITION BY

● GROUP BY:

(1) As it suggests, it will make groups for the same type of records in a column.

(2) Generally it helps in data fetching operations for aggregating functions like SUM, AVERAGE,COUNT. In result table, it will give only one row/record per group.

---------------------------

Select dept, sum(salary) as dept_wise_total_salary
from EMPLOYEE
group by dept;

---------------------------

dept	dept_wise_total_salary
Accounting	670000
Marketing	160000

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←------------------------------------------------------>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

(3) In some cases, In select clause, apart from aggregated column value; if any other column is mentioned which is not part of Groupby clause, result will show error.

---------------------------

Select employee_name, dept , sum(salary) as dept_wise_total_salary
from EMPLOYEE
group by dept;

---------------------------

*Throwing an error*

*Note: Here three column were asked to fetch. ‘dept’ column is part of GroupBy clause and ‘salary’ column is used with aggregator function SUM. Here, it will throw error because column 'employee_name' is not a part of any of these.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

● PARTITION BY:

(1) It gives aggregated columns with each record in the given table.

(2)It keeps original values while also allowing to produce aggregated values. It is used with OVER clause.

---------------------------

select dept,

Count(employee_name) Over (partition by dept) as dep_wise_employee

from EMPLOYEE;

---------------------------

dept	dep_wise_employee
Accounting	4
Accounting	4
Accounting	4
Accounting	4
Marketing	2
Marketing	2

*Note: Here partition by is with each dept and asked to count employee_name for same. So record is shown for every row.

(3) Apart from asked column and aggregated column if any other column is specified in Select query then it will also give it as an output.

---------------------------

Select employee_name,salary, gender,

Count(employee_name) Over (partition by gender) as gen_wise_employee

from EMPLOYEE;

---------------------------

employee_name	salary	gender	gen_wise_employee
Eva	NULL	f	2
Alexa	150000	f	2
Clark	140000	m	4
Dave	230000	m	4
Joy	160000	m	4
Danial	150000	m	4

*Note: Here column ‘employee_name’ is associated with count aggregator function and partition is happening for different gender. The ‘salary’ column which is not related to any of these can also be fetched as shown above.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←--------------------------------------------------→>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

3. Row_number, Rank & Dense_Rank

● Row_number ():

(1)It gives the serial no. to every row.

(2)The ranking is based on given column partition with Over Clause.

---------------------------

select *,
row_number() over (partition by gender) as number
from EMPLOYEE;

---------------------------

employee_id	employee_name	dept	salary	gender	number
3	Eva	Marketing	NULL	f	1
6	Alexa	Accounting	150000	f	2
1	Clark	Accounting	140000	m	1
2	Dave	Accounting	230000	m	2
4	Joy	Marketing	160000	m	3
5	Danial	Accounting	150000	m	4

(3) When results are fetched in either ascending or descending order, the row number is keep increasing by 1 unit, keeping 1 as a start value. It does not care about duplicate values.

---------------------------

select *,
row_number() over (partition by dept order by salary) as number
from EMPLOYEE;

---------------------------

employee_id	employee_name	dept	salary	gender	number
1	Clark	Accounting	140000	m	1
5	Danial	Accounting	150000	m	2
6	Alexa	Accounting	150000	f	3
2	Dave	Accounting	230000	m	4
3	Eva	Marketing	NULL	f	1
4	Joy	Marketing	160000	m	2

*Note: Here column ‘dept' is used for partition and with order by column 'Salary' in ascending manner. Here salary is same for employee_id =5 and 6 , which is 15000. In such cases the duplication is ignored and given upcoming regular value to record.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←------------------------------------------------------>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

● Rank ():

(1)It is used to specify rank for each row in result set.

(2) In clause for Partitionby, if it find same values; it will give them all ‘1’ as a result.

---------------------------

select *,
rank() over (partition by gender order by salary) as rnk_1
from EMPLOYEE;

---------------------------

employee_id	employee_name	dept	salary	gender	rnk_1
3	Eva	Marketing	NULL	f	1
6	Alexa	Accounting	150000	f	2
1	Clark	Accounting	140000	m	1
5	Danial	Accounting	150000	m	2
4	Joy	Marketing	160000	m	3
2	Dave	Accounting	230000	m	4

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<------------------------------------------------------>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

(3)It gives the same rank (n) to duplicate records and does not give (n+1) rank to next record. Instead, it gives (n+2) rank to next record.

---------------------------

select *,
rank() over (partition by dept order by salary) as rank_no.
from EMPLOYEE;

---------------------------

employee_id	employee_name	dept	salary	gender	rank_no.
1	Clark	Accounting	140000	m	1
5	Danial	Accounting	150000	m	2
6	Alexa	Accounting	150000	f	2
2	Dave	Accounting	230000	m	4
3	Eva	Marketing	NULL	f	1
4	Joy	Marketing	160000	m	2

*Note : Here is data is fetched for partition with dept and then ascending order with salary, since employee_id = 5 and 6 has same salary being in same department , RANK=2 will be given to both of them. For the next record, it will skip the upcoming value and give next to next (n+2) [here 2+2 = 4] value.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←------------------------------------------------------>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←------------------------------------------------------>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

● Dense_rank() :

(1) It is also used to specify rank for each row in result set.

(2) Same as rank here also ranking is based on any Column Partition with Over Clause.

---------------------------

select *,
rank() over (partition by gender order by salary) as d_rank
from EMPLOYEE;

---------------------------

employee_id	employee_name	dept	salary	gender	d_rank
3	Eva	Marketing	NULL	f	1
6	Alexa	Accounting	150000	f	2
1	Clark	Accounting	140000	m	1
5	Danial	Accounting	150000	m	2
4	Joy	Marketing	160000	m	3
2	Dave	Accounting	230000	m	4

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<←------------------------------------------------------>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

(3)The difference is ,it gives the same rank (n) to duplicate records and does not skip next value as in case of RANK. it will follow regular order after finding duplicates.

---------------------------

select *,

dense_rank() over(partition by dept order by salary) as d_rank

from EMPLOYEE;

---------------------------

employee_id	employee_name	dept	salary	gender	d_rank
1	Clark	Accounting	140000	m	1
5	Danial	Accounting	150000	m	2
6	Alexa	Accounting	150000	f	2
2	Dave	Accounting	230000	m	3
3	Eva	Marketing	NULL	f	1
4	Joy	Marketing	160000	m	2

*Note : Here is data is fetched for partition with dept and then ascending order with salary, since employee_id = 5 and 6 has same salary being in same department , RANK will give them same value. For next record it will skip the upcoming value and give next to next (n+2) value to record.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<------------------------------------------------------>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

-------------------------- Few concepts discussed here and next will be covered in second part of this blog ---------------------------------

Harsh Mehta
Oct, 17 2022