sql Join 이란?

2012. 5. 21. 14:18

7~8년전에는 관계형 DB(오라클, MS-SQl, mysql)를 잘썼는데.. 오랫동안 안쓰니 다 까먹었다.

요즘 android에서 sqlite를 이용해서 join 하나 해도 버벅데니..

여기저기 뒤져보다가 괜찮은걸 발견했다.. 아래는 퍼온글.. 이미지가 깨져서 내가 생각데로 그림1, 그림2 만 다시 그린다. 그리고 sqlite 홈페이지에 있는 joins 관련 글도 적어본다.

JOIN 이란 무엇일까???

- 논리적 관계를 기준으로 둘 이상의 테이블에서 데이터를 검색하기 위한 방법 (BOL 인용)

이라고 정의해 보죠.

좀더 쉽게 구체적으로 설명하자면...

회원 관리 테이블 : tblUser[userId, userName],

주문 테이블 : tblOrder[orderNumber, userId, productName]

이 있다고 가정하고...([]안의 것을 필드명입니다.)

주문 내역을 보여주고 싶은데 orderNumber, productName, userName을 한줄로 뿌리고 싶다..

고 가정했을 때...

보시다시피 tblOrder안에는 userName이 없고 userId밖에 없습니다. 대신 tblUser와 tblOrder에 공통적으로 userId가 있습니다.

이럴 경우 userId를 매개로 해서 두 테이블을 조인하면 원하는 결과를 얻을 수 있습니다.

이런것처럼 관계를 갖고 있는 테이블들의 정보를 같이 검색할 때 사용 하는 방법이지요.

이런 조인(Join)은 관계형 데이터베이스(이하 데이터베이스)의 큰 장점이자 특징을 나타내는 것이 아닐까 생각합니다.

그러나 데이터베이스에 대한 이해가 부족한 분들은 데이터베이스를 단순히 데이터를 저장하는 것이고 검색이 편리하다... 정도로만 이해하고 사용하는 분들이 많습니다.

쩝... 그럼 그 얘기가 틀린걸까요?

물론 아니죠. 맞습니다. 데이터베이스는 데이터를 저장하는 곳이죠. 그런데 문제는 데이터베이스의 장점을 살리지 못한체 이용하는 사람들이 많다는 것이죠.

JOIN에 대해서 설명하는데 왜 이리 주저리 주저리 설명하는지 이해가 잘 안가시겠지만...

좀 아시는 분들은 아실껍니다.

많은 사람들이 데이터베이스 설계를 할 때, 하나의 테이블에 거의 모든 정보를 다 담아두고 사용하는 경우가 많습니다. 여기저기 데이터의 중복이 발생하는 경우도 많구요.

이런 경우 사실 JOIN이 필요 없습니다. 그래서 주저리주저리 설명을 드렸던 것이구요.

그럼 다시 본론으로 들어와서...

여러 가지 종류의 조인이 있지만 많이 쓰이는 내부, 외부 조인에 대해서만 간단히 설명드리겠습니다.

그림 1과 같은 테이블이 있다고 가정하고 설명드리겠습니다.

그림 1. 예제 테이블

내부 조인 (INNER JOIN)

일반적으로 가장 많이 사용하는 JOIN으로...

쉽게 설명하면, 두 테이블에 공통적으로 존재하는 것을 찾을 때 사용합니다.

SELECT *

FROM tblA INNER JOIN tblB ON tblA.code = tblB.code

SELECT *

FROM tblA JOIN tblB ON tblA.code = tblB.code

--> INNER는 생략 가능합니다.

code를 기준으로 INNER JOIN을 할 경우 결과는 그림 2와 같습니다.

두 테이블에 공통으로 들어가 있는 A만 검색이 됐습니다.

그림 2. INNER JOIN 결과

외부 조인 (OUTER JOIN) - LEFT OUTER JOIN

LEFT OUTER JOIN은 조인 이름에서 알 수 있듯이... 왼쪽 테이블을 기준으로 검색을 하는 것입니다. 즉, 왼쪽 테이블의 값들은 모두 출력하고 오른쪽 테이블에서는 조인 조건에 맞는 값들은 출력이 되고 왼쪽 테이블을 기준으로 오른쪽 테이블에 존재 하지 않는 값은 NULL로 검색이 됩니다.

SELECT *

FROM tblA LEFT OUTER JOIN tblB ON tblA.code = tblB.code

SELECT *

FROM tblA LEFT JOIN tblB ON tblA.code = tblB.code

--> OUTER는 생략 가능합니다.

그림 3과 같이 왼쪽 테이블 즉 tblA의 모든 값들이 출력이 되고, 오른쪽 테이블의 값은 조인 조건에 해당하는 경우 그 값이 출력되고 조건에 맞는 값이 없는 경우는 NULL로 출력된 것을 볼 수 있습니다.

그림 3. LEFT OUTER JOIN 결과

외부 조인 (OUTER JOIN) - RIGHT OUTER JOIN

RIGHT OUTER JOIN의 경우는 LEFT와 반대겠지요. 오른쪽 테이블의 값들은 모두 출력하고 왼쪽 테이블에서는 조인 조건에 맞는 값들은 출력이 되고 조인 조건에 맞는 값이 없는 경우는 NULL로 검색이 됩니다.

SELECT *

FROM tblA RIGHT OUTER JOIN tblB ON tblA.code = tblB.code

그림 4. RIGHT OUTER JOIN 결과

그럼... LEFT OUTER JOIN나 RIGHT OUTER JOIN을 사용할 때 FROM절의 테이블 위치가 중요하겠네요?

맞습니다.

SELECT *

FROM tblA LEFT OUTER JOIN tblB ON tblA.code = tblB.code

와

SELECT *

FROM tblB RIGHT OUTER JOIN tblA ON tblA.code = tblB.code

는 동일합니다.

직접 한번 해 보시죠~

외부 조인 (OUTER JOIN) - FULL OUTER JOIN

마지막으로... 짐작이 되셨을것이라 생각됩니다.

LEFT와 RIGHT의 결과를 합쳐 놓은 것입니다.

바로 예제를 보죠

SELECT *

FROM tblA FULL OUTER JOIN tblB ON tblA.code = tblB.code

그림 5와 같이... 두 테이블의 결과가 모두 검색이 되는데, 단 조인 조건에 맞는 경우 한 row에 두 테이블의 값이 모두 출력되고, 나머지는 LEFT, RIGHT OUTER JOIN의 결과와 동일하게 출력됩니다.

그림 5. FULL OUTER JOIN 결과

사용 빈도를 살펴보면

INNER JOIN > LEFT, RIGHT OUTER JOIN > FULL OUTER JOIN 입니다.

그리고.. sqlite 홈페이지에 Joins 관련해서 좀 설명이 되어있다.

5.0 Joins (sqlite 홈페이지에서)

The ON and USING clauses of an inner join are converted into additional terms of the WHERE clause prior to WHERE clause analysis described above in paragraph 1.0. Thus with SQLite, there is no computational advantage to use the newer SQL92 join syntax over the older SQL89 comma-join syntax. They both end up accomplishing exactly the same thing on inner joins.

For a LEFT OUTER JOIN the situation is more complex. The following two queries are not equivalent:

SELECT * FROM tab1 LEFT JOIN tab2 ON tab1.x=tab2.y;

SELECT * FROM tab1 LEFT JOIN tab2 WHERE tab1.x=tab2.y;

For an inner join, the two queries above would be identical. But special processing applies to the ON and USING clauses of an OUTER join: specifically, the constraints in an ON or USING clause do not apply if the right table of the join is on a null row, but the constraints do apply in the WHERE clause. The net effect is that putting the ON or USING clause expressions for a LEFT JOIN in the WHERE clause effectively converts the query to an ordinary INNER JOIN - albeit an inner join that runs more slowly.

5.1 Order of tables in a join

The current implementation of SQLite uses only loop joins. That is to say, joins are implemented as nested loops.

The default order of the nested loops in a join is for the left-most table in the FROM clause to form the outer loop and the right-most table to form the inner loop. However, SQLite will nest the loops in a different order if doing so will help it to select better indices.

Inner joins can be freely reordered. However a left outer join is neither commutative nor associative and hence will not be reordered. Inner joins to the left and right of the outer join might be reordered if the optimizer thinks that is advantageous but the outer joins are always evaluated in the order in which they occur.

When selecting the order of tables in a join, SQLite uses a greedy algorithm that runs in polynomial (O(N²)) time. Because of this, SQLite is able to efficiently plan queries with 50- or 60-way joins.

Join reordering is automatic and usually works well enough that programmers do not have to think about it, especially if ANALYZE has been used to gather statistics about the available indices. But occasionally some hints from the programmer are needed. Consider, for example, the following schema:

CREATE TABLE node(

id INTEGER PRIMARY KEY,

name TEXT

);

CREATE INDEX node_idx ON node(name);

CREATE TABLE edge(

orig INTEGER REFERENCES node,

dest INTEGER REFERENCES node,

PRIMARY KEY(orig, dest)

);

CREATE INDEX edge_idx ON edge(dest,orig);

The schema above defines a directed graph with the ability to store a name at each node. Now consider a query against this schema:

SELECT *

FROM edge AS e,

node AS n1,

node AS n2

WHERE n1.name = 'alice'

AND n2.name = 'bob'

AND e.orig = n1.id

AND e.dest = n2.id;

This query asks for is all information about edges that go from nodes labeled "alice" to nodes labeled "bob". The query optimizer in SQLite has basically two choices on how to implement this query. (There are actually six different choices, but we will only consider two of them here.) Pseudocode below demonstrating these two choices.

Option 1:

foreach n1 where n1.name='alice' do:

foreach n2 where n2.name='bob' do:

foreach e where e.orig=n1.id and e.dest=n2.id

return n1.*, n2.*, e.*

end

Option 2:

foreach n1 where n1.name='alice' do:

foreach e where e.orig=n1.id do:

foreach n2 where n2.id=e.dest and n2.name='bob' do:

return n1.*, n2.*, e.*

end

The same indices are used to speed up every loop in both implementation options. The only difference in these two query plans is the order in which the loops are nested.

So which query plan is better? It turns out that the answer depends on what kind of data is found in the node and edge tables.

Let the number of alice nodes be M and the number of bob nodes be N. Consider two scenarios. In the first scenario, M and N are both 2 but there are thousands of edges on each node. In this case, option 1 is preferred. With option 1, the inner loop checks for the existence of an edge between a pair of nodes and outputs the result if found. But because there are only 2 alice and bob nodes each, the inner loop only has to run 4 times and the query is very quick. Option 2 would take much longer here. The outer loop of option 2 only executes twice, but because there are a large number of edges leaving each alice node, the middle loop has to iterate many thousands of times. It will be much slower. So in the first scenario, we prefer to use option 1.

Now consider the case where M and N are both 3500. Alice nodes are abundant. But suppose each of these nodes is connected by only one or two edges. In this case, option 2 is preferred. With option 2, the outer loop still has to run 3500 times, but the middle loop only runs once or twice for each outer loop and the inner loop will only run once for each middle loop, if at all. So the total number of iterations of the inner loop is around 7000. Option 1, on the other hand, has to run both its outer loop and its middle loop 3500 times each, resulting in 12 million iterations of the middle loop. Thus in the second scenario, option 2 is nearly 2000 times faster than option 1.

So you can see that depending on how the data is structured in the table, either query plan 1 or query plan 2 might be better. Which plan does SQLite choose by default? As of version 3.6.18, without running ANALYZE, SQLite will choose option 2. But if the ANALYZE command is run in order to gather statistics, a different choice might be made if the statistics indicate that the alternative is likely to run faster.

5.2 Manual Control Of Query Plans

SQLite provides the ability for advanced programmers to exercise control over the query plan chosen by the optimizer. One method for doing this is to fudge the ANALYZE results in the sqlite_stat1 and sqlite_stat2 tables. That approach is not recommended except for the one scenario described in the following paragraph.

For a program that uses an SQLite database as its application file format, when a new database instances is first created the ANALYZE command is ineffective because the database contain no data from which to gather statistics. In that case, one could construct a large prototype database containing typical data during development and run the ANALYZE command on this prototype database to gather statistics, then save the prototype statistics as part of the application. After deployment, when the application goes to create a new database file, it can run the ANALYZE command in order to create the sqlite_stat1 and sqlite_stat2 tables, then copy the precomputed statistics obtained from the prototype database into these new statistics tables. In that way, statistics from large working data sets can be preloaded into newly created application files.

If you really must take manual control of join loop nesting order, the preferred method is to use some peculiar (though valid) SQL syntax to specify the join. If you use the keyword CROSS in a join, then the two tables connected by that join will not be reordered. So in the query, the optimizer is free to reorder the tables of the FROM clause anyway it sees fit:

SELECT *

FROM node AS n1,

edge AS e,

node AS n2

WHERE n1.name = 'alice'

AND n2.name = 'bob'

AND e.orig = n1.id

AND e.dest = n2.id;

But in the following logically equivalent formulation of the query, the substitution of "CROSS JOIN" for the "," means that the order of tables must be N1, E, N2.

SELECT *

FROM node AS n1 CROSS JOIN

edge AS e CROSS JOIN

node AS n2

WHERE n1.name = 'alice'

AND n2.name = 'bob'

AND e.orig = n1.id

AND e.dest = n2.id;

Hence, in the second form, the query plan must be option 2. Note that you must use the keyword CROSS in order to disable the table reordering optimization; INNER JOIN, NATURAL JOIN, JOIN, and other similar combinations work just like a comma join in that the optimizer is free to reorder tables as it sees fit. (Table reordering is also disabled on an outer join, but that is because outer joins are not associative or commutative. Reordering tables in outer joins changes the result.)

참고한 자료 : http://flashcafe.org/?mid=database_study&listStyle=webzine&document_srl=4383

http://www.sqlite.org/optoverview.html

http://blogger.pe.kr/231 데이터가 없을 경우 0으로 표시하는 쿼리, 이미지 있음.

'개발 일반' 카테고리의 다른 글

git 사용법 (0)	2013.04.10
애자일 방법론 자료 (0)	2013.04.09
구글 short URL (0)	2012.05.03
DBViewer Plugin for Eclipse (이클립스용 DB 연동 플러그인) (0)	2012.04.05
HTML 파싱을 위한 이클립스, 웹서버 세팅 방법(jericho 사용시 세팅 방법) (0)	2012.02.15

안드로이 스토리

sql Join 이란?

'개발 일반' 카테고리의 다른 글

+ Recent posts

티스토리툴바