AI Notes [Main 8] :: Considerations When Designing an Artificial Neural Network!

์ธ๊ณต์‹ ๊ฒฝ๋ง ์„ค๊ณ„ ์‹œ ๊ณ ๋ ค์‚ฌํ•ญ

  1. Network topology

    The shape of the network (feedforward vs. feedback/recurrent)

  2. Activation function

    The form of the output

  3. Objectives

    Classification? Regression?

    Can be expressed as a loss function (error)

  4. Optimizers

    How the weights are updated

  5. Generalization

    Preventing overfitting

 

2. activation function

Determines the form of the output

1. one-hot vector

Only one of the several output values is active at a time

ex_ digit recognition

 

2. softmax function

Expresses each output as the probability that that output occurs

 

 

3. objective function

 

 

๊ธฐํƒ€ ๋ชฉ์ ํ•จ์ˆ˜

  • Mean absolute error / mae
  • Mean absolute percentage error / mape
  • Mean squared logarithmic error / msle
  • Hinge = max(0, 1 − t·f(x))
  • Squared hinge
  • Sparse categorical cross entropy
  • Kullback leibler divergence / kld
  • Poisson: mean of (predictions - targets * log(predictions))
  • Cosine proximity: the opposite (negative) of the mean cosine proximity between predictions and targets.
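A rough NumPy sketch of a few of the losses above (MAE, MSLE, hinge), written from their textbook definitions rather than any particular library's implementation:

```python
import numpy as np

def mae(y_true, y_pred):
    # mean absolute error
    return np.mean(np.abs(y_true - y_pred))

def msle(y_true, y_pred):
    # mean squared logarithmic error; log1p avoids log(0)
    return np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)

def hinge(t, fx):
    # t in {-1, +1}; per-sample loss is max(0, 1 - t*f(x))
    return np.mean(np.maximum(0.0, 1.0 - t * fx))
```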

 

๋ฌธ์ œ์˜ ์œ ํ˜•

Regression

The output takes continuous values; the goal is to find a function that reproduces the training data well

Linear activation function + MSE Objective function

ex_ estimating sales revenue, estimating the defect rate as a function of production speed
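As a sketch with made-up data, the pairing above is just a linear output scored by MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # hypothetical training inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)    # continuous targets

w = np.zeros(3)
pred = X @ w                    # linear activation: output = w^T x
mse = np.mean((pred - y) ** 2)  # MSE objective to be minimized
print(mse)
```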

Binary classification

์ถœ๋ ฅ์„ 0,1 ๋‘ ๊ฐ€์ง€๋กœ ํŒ๋‹จ (์ถœ๋ ฅ์˜ ํ˜•ํƒœ๊ฐ€ one-hot vector)

d = 0 or 1

d=1์ด๋ฉด ํ™•๋ฅ ์€

 

 

0์ด๋ฉด

 

 

ํ™•๋ฅ ๋“ค์˜ ํ•ฉ์„ ๊ฐ€์žฅ ํฌ๊ฒŒ ๋งŒ๋“œ๋Š” (์˜ค๋ฅ˜๋ฅผ ์ž‘๊ฒŒ ๋งŒ๋“œ๋Š”) w๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ด ๋ชฉ์ 

๋ชฉ์ ํ•จ์ˆ˜ (Loss function)

๊ณฑ์…ˆ์„ ๋งŽ์ดํ•˜๋ฉด underflow๊ฐ€ ์ƒ๊ธฐ๋ฏ€๋กœ log ๋ถ™์—ฌ์ฃผ๊ธฐ

์ตœ๋Œ€ํ™”๋ฌธ์ œ => ์ตœ์†Œํ™”๋ฌธ์ œ

Logistic activation function + Binary cross Entropy Objective function
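A small NumPy sketch of that derivation: the per-sample likelihood y^d(1−y)^(1−d) becomes a sum of logs, and negating it gives the binary cross entropy we minimize:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = np.array([1, 0, 1, 1])            # labels (hypothetical)
z = np.array([2.0, -1.0, 0.5, 3.0])   # network pre-activations
y = sigmoid(z)                        # logistic activation -> P(d=1)

# log-likelihood: d*log(y) + (1-d)*log(1-y); negate and average to minimize
bce = -np.mean(d * np.log(y) + (1 - d) * np.log(1 - y))
print(bce)
```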

 

multi-class classification

๊ฐ ํด๋ž˜์Šค์— ํ•ด๋‹นํ•˜๋Š” ํ™•๋ฅ ์„ ์ถœ๋ ฅ

softmax activation function + categorical cross entropy Objective function

The probability for the k-th output of the n-th test case

In effect, the binary case written out over all k outputs

ex_ handwritten digit recognition

 

4. Optimizers

: how the weights are updated

  • Stochastic gradient descent

    The methods below are all variations of SGD

  • Momentum

  • Learning rate decay

  • Rmsprop

  • Adagrad

  • Adadelta

  • Adam

  • Adamax

  • Nadam

  • Nesterov

 

weight ์—…๋ฐ์ดํŠธ

  1. oneline

    ํ•™์Šต ์‹œ weight๋ฅผ ์ƒ˜ํ”Œ ํ•˜๋‚˜๋งˆ๋‹ค Error์— ๋Œ€ํ•ด ๋ถ„์„ํ•ด์„œ updateํ•˜๋Š” ๋ฐฉ์‹

  2. batch or deterministic gradient methods

    ๋ชจ๋“  ์ƒ˜ํ”Œ์— ๋Œ€ํ•ด ๊ณ„์‚ฐ๋˜๋Š” error function์„ ์‚ฌ์šฉ

    ๋ชจ๋“  input์— ๋Œ€ํ•ด์„œ weight์˜ ๋ณ€ํ™” ๊ฐ’์„ ๊ตฌํ•ด์„œ ํ‰๊ท ์„ ์‚ฌ์šฉ

    ์‹œ๊ฐ„์ด ๋งŽ์ด ๊ฑธ๋ฆผ

  3. minibatch or minibatch stochastic(ํ™•๋ฅ ๋ก ์ ) methods

    ์ƒ˜ํ”Œ์˜ ์ผ๋ถ€๋งŒ์„ ์ด์šฉํ•ด์„œ ํŒŒ๋ผ๋ฏธํ„ฐ update

    ๋ณดํ†ต 50-100๊ฐœ

Stochastic gradient descent

m๊ฐœ์˜ ์ƒ˜ํ”Œ์„ ๊ฐ€์ง€๊ณ  training ์‹œํ‚ด

=> ๊ธฐ์šธ๊ธฐ์˜ ํ‰๊ท ์„ ํƒํ•ด์„œ

=> learning rate๋ฅผ ๊ณฑํ•˜๊ณ  weight update
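A sketch of one minibatch SGD step for a linear model with an MSE loss (the model and loss are assumptions for illustration); the gradient is averaged over the m samples in the batch and scaled by the learning rate:

```python
import numpy as np

def sgd_step(w, X, y, lr=0.01, batch_size=64):
    # draw a random minibatch (the "stochastic" part)
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    pred = Xb @ w
    grad = 2 * Xb.T @ (pred - yb) / batch_size   # average gradient of MSE
    return w - lr * grad                         # scale by learning rate and update
```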

Advantages
  1. Learning is fast when the data is similar (when there is redundancy)

  2. The chance of getting stuck in a local minimum is reduced

    because a new mini-batch is drawn at random each time

  3. Updates proceed in small steps

    which makes the course of learning easy to observe

  4. Online learning

    When data keeps arriving, it is easy to collect data and optimize at the same time

Principle

Because the gradient vector is not computed exactly, the search may avoid falling into local minima

Drawback

Because the raw weight gradient is applied as-is, the path can oscillate and learning can take a long time

=> Remedy: think of it physically, as a ball placed in a bowl

=> Momentum

 

Momentum

๊ฐ€์†๋„ ๊ฐœ๋…์„ ์ ์šฉํ•ด์„œ ์ด์ „ weight์˜ ๋ณ€ํ™”๋ฅผ ์–ด๋Š์ •๋„ ์œ ์ง€ํ•จ

๋ณ€ํ™”๋Ÿ‰์€ ์ด์ „ ๋ณ€ํ™”๋Ÿ‰์— ๊ฐ€์†๋„ ์ƒ์ˆ˜๋ฅผ ๊ณฑํ•œ ํ›„, ํ˜„์žฌ ๋ณ€ํ™”๋Ÿ‰์„ ๋นผ์ค€ ๊ฒƒ

SGD์— ๋Œ€๋น„ํ•ด์„œ Momentum์œผ๋กœ ์ˆ˜์ •๋œ ๋ชจ์Šต

 

Nesterov momentum

momentum์˜ ์‘์šฉ๋ฒ„์ „.

๊ธฐ์กด Momentum ๋ฐฉ๋ฒ•๊ณผ์˜ ์ฐจ์ด์ ์€ ๊ธฐ์šธ๊ธฐ๋ฅผ ๋ฏธ๋ฆฌ ์—…๋ฐ์ดํŠธ ํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

weight์— ์ด๋ฏธ ์ด๋™ํ•œ ์ •๋„(av)๋งŒํผ์„ ์ด๋™ํ–ˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ ์ƒํƒœ๋กœ learning ์ง„ํ–‰

์ด ๋ถ€๋ถ„์€ Momentum๊ณผ ๋‹ค๋ฆ„

learning ํ›„ av๋งŒํผ์„ ๋˜ ๋”ํ•ด์„œ weight update

์ด ๋ถ€๋ถ„์€ Momentum๊ณผ ๊ฐ™์Œ

๊ธฐ์กด Momentum ๋ฐฉ๋ฒ•์—์„œ correction factor๋ฅผ ๋”ํ•œ ๋ฐฉ์‹์ด๋ผ๊ณ  ์ผ์ปฌ์–ด์ง

correction factor: equation์„ ์„ฑ๋ฆฝ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๊ณฑํ•ด์ง€๋Š” ์š”์ธ
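A sketch of the difference, assuming a hypothetical grad_fn callback that returns the gradient at a given point: the gradient is taken at the look-ahead position w + αv, while the weight update itself matches plain momentum:

```python
def nesterov_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    lookahead = w + alpha * v   # assume we have already moved by alpha*v
    g = grad_fn(lookahead)      # gradient evaluated at the look-ahead point
    v = alpha * v - lr * g      # same velocity update as plain momentum
    return w + v, v
```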

Reference: An overview of gradient descent optimization algorithms (https://ruder.io/optimizing-gradient-descent/)

 

AdaGrad

์ง€๊ธˆ๊นŒ์ง€ ์—…๋ฐ์ดํŠธํ•œ ๊ธฐ์šธ๊ธฐ๋ฅผ ์ €์žฅํ•ด๋‘์—ˆ๋‹ค๊ฐ€, ๋งŽ์ด ๋‚˜์˜จ ๊ธฐ์šธ๊ธฐ ์„ฑ๋ถ„์„ ์ค„์ด๊ธฐ

๋“œ๋ฌผ๊ฒŒ ๋‚˜ํƒ€๋‚˜๋Š” ๊ธฐ์šธ๊ธฐ ์„ฑ๋ถ„์„ ๋” ์ค‘์‹œ => learning rate๋ฅผ ํฌ๊ฒŒ

t'=1 ~t

์ฒ˜์Œ ์‹œ์ž‘ ๋ถ€ํ„ฐ ํ˜„์žฌ๊นŒ์ง€

g^2(t',i)

wi๊ฐ€ ์ด๋™ํ•œ ๊ธฐ์šธ๊ธฐ ๋ฒกํ„ฐ (๊ธฐ์šธ๊ธฐ ๋ฐฉํ–ฅ)

์ž์ฃผ ๋‚˜์˜ค๋Š” ๊ธฐ์šธ๊ธฐ ๋ฒกํ„ฐ๋“ค์€ ํฐ ์ˆ˜๊ฐ€ ๋˜์–ด learning rate๋ฅผ ๋‚˜๋ˆ„๊ฒŒ ๋˜๋ฏ€๋กœ learning rate๊ฐ€ ์ž‘์•„์ง

Problem

The target point can be lost

 

RMSprop (Root Mean Square Propagation)

Compared with AdaGrad, only the gradient-accumulation rule changes

Rather than counting how often a gradient occurs, it weighs past against present: it decides how much weight the recent gradient and the past gradient should each receive

r is the accumulated past gradient and g is the current gradient vector, so ρ (rho) controls how much of the past is remembered: r ← ρr + (1 − ρ)g²

Problem

The amount of each update becomes smaller as time goes on

 

The Adam algorithm

Adam = Adagrad + momentum

Too difficult ... will study this again later

Starred part: at the beginning, to make learning fast, the constant that would scale g down on the first steps is removed (the bias correction)

Adam = RMSprop + momentum

Advantage

Performance does not vary much with the constants you choose
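A hedged sketch of Adam read as "RMSprop + momentum": a momentum-like first moment, an RMSprop-like second moment, and the bias correction that keeps the very first steps from being scaled down:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # momentum-like first moment
    v = b2 * v + (1 - b2) * grad ** 2   # RMSprop-like second moment
    m_hat = m / (1 - b1 ** t)           # bias correction (t starts at 1):
    v_hat = v / (1 - b2 ** t)           # undoes the shrink at the first steps
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```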

 

5. Generalization

: training ์— ์‚ฌ์šฉ๋˜์ง€ ์•Š์€ data (Test set)์— ๋Œ€ํ•œ ์„ฑ๋Šฅ

Overfitting (์ฃผ์–ด์ง„ data์— ๋Œ€ํ•ด์„œ๋งŒ ๊ณผ๋„ํ•˜๊ฒŒ ์ ํ•ฉ) ๋  ์ˆ˜๋„ ์žˆ์Œ

Overfitting

Cause

Occurs when there is too little training data relative to the number of weights (model parameters)

The number of weights and the amount of training data should scale together

A smart model needs a lot of data

Remedies
  1. Reduce the number of weights (sacrificing some capacity)

    Regularization

  2. Increase the amount of training data

Question raised in class

If data is scarce, wouldn't it be more meaningful to put the data into the training set rather than the test set?

There is a method called N-fold (cross-validation) that splits the data set into N parts and keeps rotating which parts serve as train/valid/test

Regularization

Overfitting์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋ชฉ์ ํ•จ์ˆ˜(loss function)์— complexity penalty(regularization ๋ชฉ์ ํ•จ์ˆ˜) ์ถ”๊ฐ€

์—๋Ÿฌ๋„ ์ค„์ด๊ณ  regularization๋„ ํ•ด์•ผํ•จ

  • L2 Regularization

    J(w) = MSE + λwᵀw

    wᵀw : the inner product of the w vector with itself

    The constant λ multiplied in front expresses how strongly small weights are preferred

    The larger the weights, the higher the objective value => harder to satisfy

  • L1 Regularization

    J(w) = MSE + λ‖w‖₁

    (‖w‖₁ = Σ_{j=1}^{D} |w_j|)

๊ทธ๋ƒฅ ๋”ํ•ด์„œ ์‚ฌ์šฉํ•˜๋ฉด ์Œ์ˆ˜๊ฐ€ ๋‚˜์™€์„œ ์ œ๊ณฑ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ๋‚˜์Œ

=> L2๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Œ
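A sketch of both penalties added on top of an MSE value, with lam standing in for λ:

```python
import numpy as np

def l2_objective(w, mse, lam=0.01):
    return mse + lam * np.dot(w, w)        # lambda * w^T w

def l1_objective(w, mse, lam=0.01):
    return mse + lam * np.sum(np.abs(w))   # lambda * sum_j |w_j|
```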

๋ฐ˜์‘ํ˜•