Artificial Intelligence Notes [Main Part 7] :: Problems of Deep Neural Networks?

๊นŠ์–ด์ง„ ์ธ๊ณต์‹ ๊ฒฝ๋ง์˜ ๋ฌธ์ œ์ 

Generalization

: performance on data that was not used for training

Data set

  1. Training set

    the data set used for training

  2. Validation set

    a data set held out from the given data, used to check performance during training

  3. Test set

    a data set the model has never seen at all

 

 ํ•™์Šต์ด training set์œผ๋กœ ์ง„ํ–‰๋˜๊ธฐ ๋•Œ๋ฌธ์—, ํ•™์Šต์„ ๋ฐ˜๋ณตํ•  ์ˆ˜๋ก training set์— ๋Œ€ํ•œ ์ •ํ™•๋„๋Š” ๋†’์•„์ง€๊ณ  ์˜ค๋ฅ˜์œจ์€ ๋‚ฎ์•„์ง„๋‹ค. ํ•˜์ง€๋งŒ validation set์— ๋Œ€ํ•œ ์˜ค๋ฅ˜์œจ์€ ๋‚ฎ์•„์ง€๋‹ค๊ฐ€ ๋†’์•„์ง€๋Š” ํ˜„์ƒ์„ ๋„๋Š”๋ฐ, ์ด๋Š” ํ•™์Šต์ด ๋„ˆ๋ฌด training set์—๋งŒ ์ ํ•ฉํ•˜๊ฒŒ ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด๋ฅผ training set์— overfitting(๊ณผ์ ํ•ฉ)๋˜์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

 To build a neural network that performs reliably not only on the given data but also on data it has never learned from, the model's capacity and the amount of training data must be balanced appropriately.

 To borrow an example our professor gave: hand a bright student (a high-capacity model) a modest number of problems together with the answer sheet (the training data), and the student can simply memorize the answers and ace that problem set. But given different problems, the student will fail (overfitting). If instead you give the student more problems and answers than can possibly be memorized, memorization is no longer an option and the student is forced to learn how to solve them. Then the student can solve new problems as well.

 ๊ฒฐ๊ตญ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์— ๋น„ํ•ด์„œ ์ฃผ์–ด์ง„ ํ•™์Šต ๋ฐ์ดํ„ฐ ์ˆ˜๊ฐ€ ์ ์œผ๋ฉด ์ด ๋ชจ๋ธ์€ ๋„ˆ๋ฌด ์‰ฌ์›Œ์„œ ๋‹ค ์™ธ์›Œ๋ฒ„๋ฆฐ๋‹ค. (Overfitting ๋˜์–ด๋ฒ„๋ฆฐ๋‹ค.) ๋”ฐ๋ผ์„œ ์„ฑ๋Šฅ์— ๋น„ํ•ด ๋” ๋งŽ์€ ํ•™์Šต ๋ฐ์ดํ„ฐ ์ˆ˜๋ฅผ ์ฃผ์–ด์•ผ ์ ์ ˆํ•œ ํ•™์Šต์„ ์ด๋ฃฐ ์ˆ˜ ์žˆ๋‹ค.

 

deep neural network์˜ ๋ฌธ์ œ์ 

์šฐ๋ฆฌ๋Š” ์ด์ œ ๊นŠ์€ ์ธ๊ณต์‹ ๊ฒฝ๋ง์—์„œ๋„ back propagation์„ ํ†ตํ•ด weight๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค.

  1. ๋ฐฉ๊ธˆ ์„ค๋ช…ํ•œ Overfitting ๋ฌธ์ œ

    data๋ฅผ ๋งŽ์ด ๋ชจ์•„๋„ ๋ถ€์กฑํ•˜๋‹ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผํ• ๊นŒ?

    ์„ฑ๋Šฅ์„ ์ค„์—ฌ์•ผํ•œ๋‹ค. (weight ์ˆ˜ ์ค„์ด๊ธฐ)

  2. The vanishing gradient problem

    While computing derivatives to train the weights, the derivative of the activation function is multiplied in at every layer. The derivative of the sigmoid function lies between 0 and 1 (its maximum is only 0.25), so small numbers are multiplied together over and over until the result falls below the representable range and becomes 0. This is the vanishing gradient problem.

    Use an activation function whose gradient is 1 or more, such as ReLU, whose derivative is exactly 1 for positive inputs.

  3. The local minimum problem

    When searching for the minimum of the error, the search can fall into a local minimum and stop moving; we encountered this in an earlier part when explaining with graphs.

    This problem can be mitigated by splitting the given data into groups and training on the groups in turn.
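Training on groups of the data in turn is mini-batch stochastic gradient descent; the sketch below shows the mechanics on a hypothetical toy problem (learn the weight w in y = 2x). Each group gives a slightly different, noisy gradient, and that noise is what can jolt the search out of a shallow local minimum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy problem: learn w in y = 2x with squared error.
x = rng.normal(size=100)
y = 2.0 * x

w = 0.0
lr = 0.1
batch_size = 10
for epoch in range(20):
    idx = rng.permutation(len(x))          # reshuffle into new groups each epoch
    for start in range(0, len(x), batch_size):
        b = idx[start:start + batch_size]  # one group (mini-batch)
        grad = np.mean(2 * (w * x[b] - y[b]) * x[b])
        w -= lr * grad                     # each noisy step perturbs the search
```

After 20 epochs w has converged to the true value of 2.0.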

 

The vanishing gradient problem

Solutions

  1. Bottom-up layerwise unsupervised pre-training

    train the network layer by layer, bottom-up

  2. ReLU

    change the activation function

  3. Gradient flow

    build a path that carries the gradient at the output straight to an earlier layer, so it does not have to pass through every intermediate step

Bottom-up layerwise unsupervised pre-training

unsupervised pre-training

In many learning settings where labeled training data is scarce, the model is first pre-trained on unlabeled data (unsupervised pre-training) and afterwards fine-tuned on the labeled data (supervised fine-tuning).

 

Greedy layer-wise training

Train small parts in sequence and then fine-tune; operates on an AE (AutoEncoder).

Conventionally, supervised learning backpropagates the difference between the expected output and the actual output; an AE instead uses the difference between the input and the actual output (the backpropagation itself is similar).

When there are several hidden layers, each layer is trained greedily (Greedy), one at a time.
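One stage of this scheme can be sketched as a linear autoencoder trained with plain gradient descent; the data, dimensions, and variable names are hypothetical. The key point from the text is visible in the update: the error compared against is the input itself, so no labels are needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 4-D points that actually lie on a 2-D subspace.
basis = rng.normal(size=(2, 4))
x = rng.normal(size=(200, 2)) @ basis

# One autoencoder stage: encode 4 -> 2, decode 2 -> 4.
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))

lr = 0.01
for step in range(2000):
    h = x @ W_enc                  # hidden code (the layer being pre-trained)
    x_hat = h @ W_dec              # reconstruction
    err = x_hat - x                # compare output with the INPUT, not a label
    g_dec = 2 * h.T @ err / len(x)
    g_enc = 2 * x.T @ (err @ W_dec.T) / len(x)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

mse = np.mean((x @ W_enc @ W_dec - x) ** 2)
```

Because the data is exactly 2-dimensional, the reconstruction error drops near zero. In greedy layer-wise training, the learned hidden code h would then serve as the input for training the next layer's autoencoder.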

Reference: https://m.blog.naver.com/PostView.nhn?blogId=laonple&logNo=220884698923&proxyReferer=https%3A%2F%2Fwww.google.com%2F

ReLU (Rectified Linear Unit)
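A minimal sketch of ReLU and its derivative, mirroring the sigmoid example above over a hypothetical 20-layer network: on an active path (positive pre-activation) each layer contributes a factor of exactly 1, so the gradient product does not shrink.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)         # passes positive values through unchanged

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)  # derivative is exactly 1 for z > 0

# Backprop multiplies one activation derivative per layer. Unlike sigmoid
# (whose derivative never exceeds 0.25), ReLU contributes a factor of
# exactly 1 on every active path, so the product stays at 1.
grad = 1.0
for layer in range(20):
    grad *= relu_grad(2.0)            # any positive pre-activation

print(grad)  # 1.0
```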

Residual net

Creates a path for gradient flow, making deep networks feasible.

The gradients flowing along the two paths are added together.

The network is trained so that each block outputs the input (x) together with the layer's transformation (f(x)), i.e., f(x)+x.

Because the layers can be considered independently, the network works properly as long as some of the layers work properly.

๋ฐ˜์‘ํ˜•