[STBDA] 07wk: Logistic regression

Author

김보람

Published

June 26, 2023

These lecture notes are based on Professor 최규빈's STBDA2022 course materials at Jeonbuk National University.

imports

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.experimental.numpy as tnp
tnp.experimental_enable_numpy_behavior()
import graphviz
def gv(s): return graphviz.Source('digraph G{ rankdir="LR"'+s + '; }')

piece-wise linear regression

model: \(y_i=\begin{cases} x_i +0.3\epsilon_i & x_i\leq 0 \\ 3.5x_i +0.3\epsilon_i & x_i>0 \end{cases}\), where \(\epsilon_i \sim N(0,1)\)

np.random.seed(43052)
N=100
x = np.linspace(-1,1,N)
lamb = lambda x: x*1+np.random.normal()*0.3 if x<0 else x*3.5+np.random.normal()*0.3
y= np.array(list(map(lamb,x)))
y
array([-0.88497385, -0.65454563, -0.61676249, -0.84702584, -0.84785569,
       -0.79220455, -1.3777105 , -1.27341781, -1.41643729, -1.26404671,
       -0.79590224, -0.78824395, -0.86064773, -0.52468679, -1.18247354,
       -0.29327295, -0.69373049, -0.90561768, -1.07554911, -0.7225404 ,
       -0.69867774, -0.34811037,  0.11188474, -1.05046296, -0.03840085,
       -0.38356861, -0.24299798, -0.58403161, -0.20344022, -0.13872303,
       -0.529586  , -0.27814478, -0.10852781, -0.38294596,  0.02669763,
       -0.23042603, -0.77720364, -0.34287396, -0.04512022, -0.30180793,
       -0.26711438, -0.51880349, -0.53939672, -0.32052379, -0.32080763,
        0.28917092,  0.18175206, -0.48988124, -0.08084459,  0.37706178,
        0.14478908,  0.07621827, -0.071864  ,  0.05143365,  0.33932009,
       -0.35071776,  0.87742867,  0.51370399,  0.34863976,  0.55855514,
        1.14196717,  0.86421076,  0.72957843,  0.57342304,  1.54803332,
        0.98840018,  1.11129366,  1.42410801,  1.44322465,  1.25926455,
        1.12940772,  1.46516829,  1.16365096,  1.45560853,  1.9530553 ,
        2.45940445,  1.52921129,  1.8606463 ,  1.86406718,  1.5866523 ,
        1.49033473,  2.35242686,  2.12246412,  2.41951931,  2.43615052,
        1.96024441,  2.65843789,  2.46854394,  2.76381882,  2.78547462,
        2.56568465,  3.15212157,  3.11482949,  3.17901774,  3.31268904,
        3.60977818,  3.40949166,  3.30306495,  3.74590922,  3.85610433])
plt.plot(x,y,'.')

Solution 1: a simple linear regression model

x= x.reshape(N,1)
y= y.reshape(N,1)
net = tf.keras.Sequential()
net.add(tf.keras.layers.Dense(1))
net.compile(optimizer=tf.optimizers.SGD(0.1),loss='mse')
net.fit(x,y,batch_size=N,epochs=1000,verbose=0) # plain numpy arrays also work here
<keras.callbacks.History at 0x7fa2800be6a0>
net.weights
[<tf.Variable 'dense/kernel:0' shape=(1, 1) dtype=float32, numpy=array([[2.2616348]], dtype=float32)>,
 <tf.Variable 'dense/bias:0' shape=(1,) dtype=float32, numpy=array([0.6069048], dtype=float32)>]
yhat = x * 2.2616348 + 0.6069048
yhat = net.predict(x)
4/4 [==============================] - 0s 502us/step
plt.plot(x,y,'.')
plt.plot(x,yhat,'--')

- Failure: this model would keep failing even if trained for a billion epochs. - Why? The architecture itself is wrong. - A single straight line simply lacks the expressive power to represent the kink → an underfitting problem.
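As a quick sanity check of the underfitting claim, the fitted line's MSE can be compared with the irreducible noise level of the data-generating model. A minimal sketch (it reuses net, x, y from above; the 0.09 figure comes from the noise term 0.3*eps having variance 0.3**2):

resid_line = y - net.predict(x, verbose=0)      # residuals of the straight-line fit
resid_true = y - np.where(x < 0, x, 3.5*x)      # residuals around the true piecewise mean
print(np.mean(resid_line**2))   # noticeably larger than 0.09 -> underfit
print(np.mean(resid_true**2))   # close to the irreducible 0.09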

Solution 2: introducing a nonlinear activation function

- Here the nonlinear activation function is ReLU.

- Let us modify the network as follows.

(before the change; hats omitted)

#collapse
gv('''
"x" -> "x*w,    bias=True"[label="*w"] ;
"x*w,    bias=True" -> "y"[label="indentity"] ''')

(after the change; hats omitted)

#collapse
gv('''
"x" -> "x*w,    bias=True"[label="*w"] ;
"x*w,    bias=True" -> "y"[label="relu"] ''')

  • The only structural change is that ReLU is applied at the end instead of the identity \(f(x)=x\).
  • That is, the activation function changes from identity to ReLU.

- What is the ReLU function?

_x = np.linspace(-1,1,100)
tf.nn.relu(_x)
<tf.Tensor: shape=(100,), dtype=float64, numpy=
array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.01010101, 0.03030303, 0.05050505, 0.07070707, 0.09090909,
       0.11111111, 0.13131313, 0.15151515, 0.17171717, 0.19191919,
       0.21212121, 0.23232323, 0.25252525, 0.27272727, 0.29292929,
       0.31313131, 0.33333333, 0.35353535, 0.37373737, 0.39393939,
       0.41414141, 0.43434343, 0.45454545, 0.47474747, 0.49494949,
       0.51515152, 0.53535354, 0.55555556, 0.57575758, 0.5959596 ,
       0.61616162, 0.63636364, 0.65656566, 0.67676768, 0.6969697 ,
       0.71717172, 0.73737374, 0.75757576, 0.77777778, 0.7979798 ,
       0.81818182, 0.83838384, 0.85858586, 0.87878788, 0.8989899 ,
       0.91919192, 0.93939394, 0.95959596, 0.97979798, 1.        ])>
plt.plot(_x,_x)
plt.plot(_x,tf.nn.relu(_x))

  • The ReLU function turns the blue line into the orange line.
  • \(f(x)=\max(0,x)=\begin{cases} 0 & x\leq 0 \\ x & x>0 \end{cases}\) (checked in the one-liner below)
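A one-line check of this definition, using the _x grid defined above (a minimal sketch):

np.allclose(tf.nn.relu(_x), np.maximum(0, _x))   # True: relu is just an elementwise max(0, x)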

- Architecture: \(\hat{y}_i=relu(\hat{w}_0+\hat{w}_1x_i)\), where \(relu(x)=\max(0,x)\)

- Start of the solution

Step 1

net2 = tf.keras.Sequential()

Step 2

tf.random.set_seed(43053)
l1 = tf.keras.layers.Dense(1, input_shape=(1,))
a1 = tf.keras.layers.Activation(tf.nn.relu)
net2.add(l1)
net2.layers
[<keras.layers.core.dense.Dense at 0x7fa264620d90>]
net2.add(a1)
net2.layers
[<keras.layers.core.dense.Dense at 0x7fa264620d90>,
 <keras.layers.core.activation.Activation at 0x7fa337f9caf0>]
l1.get_weights()
[array([[1.6202813]], dtype=float32), array([0.], dtype=float32)]
net2.get_weights()
[array([[1.6202813]], dtype=float32), array([0.], dtype=float32)]

(Checking the state of the network)

u1= l1(x)
#u1= x@l1.weights[0] + l1.weights[1]
v1= a1(u1)
#v1= tf.nn.relu(u1)
plt.plot(x,x)
plt.plot(x,u1,'--r')
plt.plot(x,v1,'--b')

Step 3

net2.compile(optimizer=tf.optimizers.SGD(0.1),loss='mse')

Step 4

net2.fit(x,y,epochs=1000,verbose=0,batch_size=N)
<keras.callbacks.History at 0x7fa337ecbd60>

- result

yhat = tf.nn.relu(x@l1.weights[0] + l1.weights[1])
yhat = net2.predict(x)
yhat = net2(x)
yhat = a1(l1(x))
yhat = net2.layers[1](net2.layers[0](x))
4/4 [==============================] - 0s 519us/step
  • The five lines above all compute the same thing (see the check below).
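The equivalence is easy to verify numerically; a minimal sketch (tiny float32 differences aside, every expression gives the same values):

yhat_manual = tf.nn.relu(x @ l1.weights[0] + l1.weights[1])         # by hand, from the weights
print(np.allclose(yhat_manual, net2(x), atol=1e-5))                 # True
print(np.allclose(net2.predict(x, verbose=0), net2(x), atol=1e-5))  # True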
plt.plot(x,y,'.')
plt.plot(x,yhat,'--')

- Discussion: - even after tens of billions of epochs this model cannot fit any better than this → its expressive power is still too low. - Remedy: what if there were two orange dashed lines?

Solution 3: more nodes + another layer

Goal: create two orange dashed lines.

Step 1

net3 = tf.keras.Sequential()

Step 2

tf.random.set_seed(43053)
l1 = tf.keras.layers.Dense(2,input_shape=(1,)) # with 2 units we get two lines
a1 = tf.keras.layers.Activation(tf.nn.relu)
net3.add(l1)
net3.add(a1)

(Checking the state of the network)

l1(x).shape
# l1(x) : (100,1) -> (100,2)
TensorShape([100, 2])
plt.plot(x,x)
plt.plot(x,l1(x),'--')

plt.plot(x,x)
plt.plot(x,a1(l1(x)),'--')

- In this state we cannot get a proper yhat. Why? - The dimensions do not match: a1(l1(x)) has shape (N,2), while the final yhat must have shape (N,1). - And even if the dimensions somehow matched, everything that passes through ReLU satisfies yhat ≥ 0, so the negative values of y could only ever be fit by 0.

- Remedy: stack another layer on top of a1(l1(x)) (Sequentially!) so that (N,2) -> (N,1): - yhat = bias + weight1 * a1(l1(x))[:,0] + weight2 * a1(l1(x))[:,1]

- That is, treat a1(l1(x)) as a new input and run it through one more linear model that produces the output. - Input dimension: 2 - Output dimension: 1

net3.layers
[<keras.layers.core.dense.Dense at 0x7fa337e5f6a0>,
 <keras.layers.core.activation.Activation at 0x7fa337f9cac0>]
tf.random.set_seed(43053)
l2 = tf.keras.layers.Dense(1, input_shape=(2,))
net3.add(l2)
net3.layers
[<keras.layers.core.dense.Dense at 0x7fa337e5f6a0>,
 <keras.layers.core.activation.Activation at 0x7fa337f9cac0>,
 <keras.layers.core.dense.Dense at 0x7fa337c887f0>]
net3.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_2 (Dense)             (None, 2)                 4         
                                                                 
 activation_1 (Activation)   (None, 2)                 0         
                                                                 
 dense_3 (Dense)             (None, 1)                 3         
                                                                 
=================================================================
Total params: 7
Trainable params: 7
Non-trainable params: 0
_________________________________________________________________

- The number of parameters to estimate is reported as 4, 0, and 3 per layer.

- In formulas: \(X \to X@W^{(1)}+b^{(1)} \to relu(X@W^{(1)}+b^{(1)}) \to relu(X@W^{(1)}+b^{(1)})@W^{(2)}+b^{(2)}=yhat\)

  • \(X\): (N,1)
  • \(W^{(1)}\): (1,2) ==> 2 parameters to estimate
  • \(b^{(1)}\): (2,) ==> 2 more parameters // 4 parameters so far
  • \(W^{(2)}\): (2,1) ==> 2 parameters to estimate
  • \(b^{(2)}\): (1,) ==> 1 more parameter // hence 3 here

- Note: many parameters to estimate = a complex model. - Hyperscale AI: a model with an enormous number of parameters to estimate.

net3.weights
[<tf.Variable 'dense_2/kernel:0' shape=(1, 2) dtype=float32, numpy=array([[0.98630846, 0.59210145]], dtype=float32)>,
 <tf.Variable 'dense_2/bias:0' shape=(2,) dtype=float32, numpy=array([0., 0.], dtype=float32)>,
 <tf.Variable 'dense_3/kernel:0' shape=(2, 1) dtype=float32, numpy=
 array([[0.52757335],
        [0.33660662]], dtype=float32)>,
 <tf.Variable 'dense_3/bias:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>]
l1.weights
[<tf.Variable 'dense_2/kernel:0' shape=(1, 2) dtype=float32, numpy=array([[0.98630846, 0.59210145]], dtype=float32)>,
 <tf.Variable 'dense_2/bias:0' shape=(2,) dtype=float32, numpy=array([0., 0.], dtype=float32)>]
l2.weights
[<tf.Variable 'dense_3/kernel:0' shape=(2, 1) dtype=float32, numpy=
 array([[0.52757335],
        [0.33660662]], dtype=float32)>,
 <tf.Variable 'dense_3/bias:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>]

- A more compact way to write it: \(X \to (u_1 \to v_1) \to (u_2 \to v_2) = yhat\) - \(u_1= X@W^{(1)}+b^{(1)}\) - \(v_1= relu(u_1)\) - \(u_2= v_1@W^{(2)}+b^{(2)}\) - \(v_2= identity(u_2):=yhat\)
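This forward pass can also be written out by hand from the weight matrices; a minimal sketch (it reuses net3 and x from above and should match net3(x) up to float precision):

W1, b1 = net3.layers[0].get_weights()   # shapes (1,2) and (2,)
W2, b2 = net3.layers[2].get_weights()   # shapes (2,1) and (1,)
u1 = x @ W1 + b1            # (N,2)
v1 = np.maximum(0, u1)      # relu
u2 = v1 @ W2 + b2           # (N,1) = yhat
print(np.allclose(u2, net3(x), atol=1e-5))   # True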

#collapse
gv('''
subgraph cluster_1{
    style=filled;
    color=lightgrey;
    "X"
    label = "Layer 0"
}
subgraph cluster_2{
    style=filled;
    color=lightgrey;
    "X" -> "u1[:,0]"[label="*W1[0,0]"]
    "X" -> "u1[:,1]"[label="*W1[0,1]"]
    "u1[:,0]" -> "v1[:,0]"[label="relu"]
    "u1[:,1]" -> "v1[:,1]"[label="relu"]
    label = "Layer 1"
}
subgraph cluster_3{
    style=filled;
    color=lightgrey;
    "v1[:,0]" -> "yhat"[label="*W2[0,0]"]
    "v1[:,1]" -> "yhat"[label="*W2[1,0]"]
    label = "Layer 2"
}
''')

#collapse
gv('''
subgraph cluster_1{
    style=filled;
    color=lightgrey;
    "X"
    label = "Layer 0"
}
subgraph cluster_2{
    style=filled;
    color=lightgrey;
    "X" -> "node1"
    "X" -> "node2"
    label = "Layer 1: relu"
}
subgraph cluster_3{
    style=filled;
    color=lightgrey;
    "node1" -> "yhat"
    "node2" -> "yhat"
    label = "Layer 2"
}
''')

Step 3

net3.compile(loss='mse',optimizer=tf.optimizers.SGD(0.1))

Step 4

net3.fit(x,y,epochs=1000,verbose=0, batch_size=N)
<keras.callbacks.History at 0x7fa337ca7610>

- Checking the result

net3.weights
[<tf.Variable 'dense_2/kernel:0' shape=(1, 2) dtype=float32, numpy=array([[1.7125574 , 0.96457523]], dtype=float32)>,
 <tf.Variable 'dense_2/bias:0' shape=(2,) dtype=float32, numpy=array([-0.10849824,  0.80890274], dtype=float32)>,
 <tf.Variable 'dense_3/kernel:0' shape=(2, 1) dtype=float32, numpy=
 array([[1.5033181],
        [1.1611973]], dtype=float32)>,
 <tf.Variable 'dense_3/bias:0' shape=(1,) dtype=float32, numpy=array([-0.9116387], dtype=float32)>]
plt.plot(x,y,'.')
plt.plot(x,net3(x),'--')

- Analysis

plt.plot(x,y,'.')
plt.plot(x,l1(x),'--')

plt.plot(x,y,'.')
plt.plot(x,a1(l1(x)),'--')

plt.plot(x,y,'.')
plt.plot(x,l2(a1(l1(x))),'--')

- Analyzing the last two figures

l2.weights
[<tf.Variable 'dense_3/kernel:0' shape=(2, 1) dtype=float32, numpy=
 array([[1.5033181],
        [1.1611973]], dtype=float32)>,
 <tf.Variable 'dense_3/bias:0' shape=(1,) dtype=float32, numpy=array([-0.9116387], dtype=float32)>]
fig, (ax1,ax2,ax3) = plt.subplots(1,3)
fig.set_figwidth(12)
ax1.plot(x,y,'.')
ax1.plot(x,a1(l1(x))[:,0],'--r')
ax1.plot(x,a1(l1(x))[:,1],'--b')
ax2.plot(x,y,'.')
ax2.plot(x,a1(l1(x))[:,0]*1.5033181,'--r')
ax2.plot(x,a1(l1(x))[:,1]*(1.1611973)-0.9116387,'--b')
ax3.plot(x,y,'.')
ax3.plot(x,a1(l1(x))[:,0]*1.5033181+a1(l1(x))[:,1]*(1.1611973)-0.9116387,'--')

When Solution 3 fails

tf.random.set_seed(43054)
## Step 1
net3 = tf.keras.Sequential()
## Step 2
net3.add(tf.keras.layers.Dense(2))
net3.add(tf.keras.layers.Activation('relu'))
net3.add(tf.keras.layers.Dense(1))
## Step 3
net3.compile(optimizer=tf.optimizers.SGD(0.1),loss='mse')
## Step 4
net3.fit(x,y,epochs=1000,verbose=0,batch_size=N)
<keras.callbacks.History at 0x7fa305a1a310>
plt.plot(x,y,'.')
plt.plot(x,net3(x),'--')

- Huh? Maybe it just needs more epochs?

net3.fit(x,y,epochs=10000,verbose=0,batch_size=N)
plt.plot(x,y,'.')
plt.plot(x,net3(x),'--')

- Analyzing the failure

l1,a1,l2 = net3.layers
l2.weights
[<tf.Variable 'dense_7/kernel:0' shape=(2, 1) dtype=float32, numpy=
 array([[ 1.7770029],
        [-0.7268499]], dtype=float32)>,
 <tf.Variable 'dense_7/bias:0' shape=(1,) dtype=float32, numpy=array([-0.60076195], dtype=float32)>]
fig, (ax1,ax2,ax3,ax4) = plt.subplots(1,4)
fig.set_figwidth(16)
ax1.plot(x,y,'.')
ax1.plot(x,l1(x)[:,0],'--r')
ax1.plot(x,l1(x)[:,1],'--b')
ax2.plot(x,y,'.')
ax2.plot(x,a1(l1(x))[:,0],'--r')
ax2.plot(x,a1(l1(x))[:,1],'--b')
ax3.plot(x,y,'.')
ax3.plot(x,a1(l1(x))[:,0]*1.7770029,'--r')
ax3.plot(x,a1(l1(x))[:,1]*(-0.7268499)+(-0.60076195),'--b')
ax4.plot(x,y,'.')
ax4.plot(x,a1(l1(x))[:,0]*1.7770029+a1(l1(x))[:,1]*(-0.7268499)+(-0.60076195),'--')

  • Looking at the plots, the blue line is not doing anything.
  • On reflection, though, there is not much the blue line can do in this situation.
  • Why? The current fit is already, in its own way, optimized through the red line → if the red line tried to change anything, this optimized state would break (the loss would increase).
  • In other words, this situation is itself a kind of optimum. This phenomenon is described as "getting stuck in a local minimum instead of finding the global minimum."

Check:

net3.weights
[<tf.Variable 'dense_6/kernel:0' shape=(1, 2) dtype=float32, numpy=array([[1.9579618 , 0.46560898]], dtype=float32)>,
 <tf.Variable 'dense_6/bias:0' shape=(2,) dtype=float32, numpy=array([ 0.34100613, -0.4658857 ], dtype=float32)>,
 <tf.Variable 'dense_7/kernel:0' shape=(2, 1) dtype=float32, numpy=
 array([[ 1.7770029],
        [-0.7268499]], dtype=float32)>,
 <tf.Variable 'dense_7/bias:0' shape=(1,) dtype=float32, numpy=array([-0.60076195], dtype=float32)>]
W1= tf.Variable(tnp.array([[1.9579618,  0.46560898 ]]))
b1= tf.Variable(tnp.array([0.34100613,  -0.4658857 ]))
W2= tf.Variable(tnp.array([[1.7770029],[-0.7268499 ]]))
b2= tf.Variable(tnp.array([-0.60076195]))
with tf.GradientTape() as tape:
    u = tf.constant(x) @ W1 + b1
    v = tf.nn.relu(u)
    yhat = v@W2 + b2
    loss = tf.losses.mse(y,yhat)
tape.gradient(loss,[W1,b1,W2,b2])
[<tf.Tensor: shape=(1, 2), dtype=float64, numpy=array([[-6.01630956e-05,  0.00000000e+00]])>,
 <tf.Tensor: shape=(2,), dtype=float64, numpy=array([-1.22677221e-05,  0.00000000e+00])>,
 <tf.Tensor: shape=(2, 1), dtype=float64, numpy=
 array([[-6.86439011e-05],
        [ 0.00000000e+00]])>,
 <tf.Tensor: shape=(1,), dtype=float64, numpy=array([-3.2899797e-05])>]

As expected, the gradient values are almost all zero.
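One pragmatic (though not fundamental) way out is simply to re-initialize with a different seed and train again; a hedged sketch with a hypothetical net3b (whether it escapes the local minimum depends on the seed; the earlier run with seed 43053 did fit both segments):

tf.random.set_seed(43053)   # a different initialization than the failing run above
net3b = tf.keras.Sequential()
net3b.add(tf.keras.layers.Dense(2, activation='relu'))
net3b.add(tf.keras.layers.Dense(1))
net3b.compile(optimizer=tf.optimizers.SGD(0.1), loss='mse')
net3b.fit(x, y, epochs=1000, verbose=0, batch_size=N)
plt.plot(x, y, '.')
plt.plot(x, net3b(x), '--')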

Solution 4: what if we add more nodes?

- What happens if we add even more nodes? (What if there were many more orange dashed lines?)

#collapse
gv('''
subgraph cluster_1{
    style=filled;
    color=lightgrey;
    "X"
    label = "Layer 0"
}
subgraph cluster_2{
    style=filled;
    color=lightgrey;
    "X" -> "node1"
    "X" -> "node2"
    "X" -> "..."
    "X" -> "node512"
    label = "Layer 1: relu"
}
subgraph cluster_3{
    style=filled;
    color=lightgrey;
    "node1" -> "yhat"
    "node2" -> "yhat"
    "..." -> "yhat"
    "node512" -> "yhat"
    label = "Layer 2"
}
''')

tf.random.set_seed(43056)
net4= tf.keras.Sequential()
net4.add(tf.keras.layers.Dense(512,activation='relu')) # specifying the activation here also works
net4.add(tf.keras.layers.Dense(1))
net4.compile(loss='mse',optimizer=tf.optimizers.SGD(0.1))
net4.fit(x,y,epochs=1000,verbose=0,batch_size=N)
<keras.callbacks.History at 0x7fa30336f880>
plt.plot(x,y,'.')
plt.plot(x,net4(x),'--')

  • It works well.
  • Even if one or two nodes fail to do their job, the other nodes seem to compensate nicely (see the quick check below).
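A quick way to peek at this is to count how many of the 512 hidden units are ever active; a minimal sketch (net4.layers[0] is the 512-unit ReLU layer):

h = net4.layers[0](x).numpy()            # shape (100, 512), values after ReLU
print((h.max(axis=0) > 0).sum(), "of 512 units are active for at least one x")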

- So is more nodes always better? -> Most of the time it is not bad. But sometimes the model also fits what it should not fit... (overfitting)

np.random.seed(43052)
N=100
_x = np.linspace(0,1,N).reshape(N,1)
_y = np.random.normal(loc=0,scale=0.001,size=(N,1))
plt.plot(_x,_y)

tf.random.set_seed(43052)
net4 = tf.keras.Sequential()
net4.add(tf.keras.layers.Dense(512,activation='relu'))
net4.add(tf.keras.layers.Dense(1))
net4.compile(loss='mse',optimizer=tf.optimizers.SGD(0.5))
net4.fit(_x,_y,epochs=1000,verbose=0,batch_size=N)
<keras.callbacks.History at 0x7fa324f1d040>
plt.plot(_x,_y)
plt.plot(_x,net4(_x),'--')

  • We will study this example again later.

Logistic regression

Motivation

- This comes up a lot in practice: - as \(x\) increases (or decreases), the probability of success increases.

- Such a model can be designed as follows ← memorize this!! - \(y_i \sim Ber(\pi_i)\), where \(\pi_i=\frac{\exp(w_0+w_1x_i)}{1+\exp(w_0+w_1x_i)}\)

  • \(\hat{y}_i =\frac{\exp(\hat{w}_0+\hat{w}_1x_i)}{1+\exp(\hat{w}_0+\hat{w}_1x_i)}=\frac{1}{1+\exp(-\hat{w}_0-\hat{w}_1x_i)}\)

  • \(loss=-\frac{1}{n}\sum_{i=1}^{n}\big(y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\big)\)

- A loss function of this form is called the BCE loss (BCE stands for Binary Cross Entropy).
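For concreteness, a minimal sketch of this formula on toy values (y_toy and yhat_toy are hypothetical, made up just for illustration); it matches Keras's built-in binary cross entropy:

y_toy    = np.array([0., 1., 1., 0.])      # hypothetical labels
yhat_toy = np.array([0.1, 0.8, 0.6, 0.3])  # hypothetical predicted probabilities
bce_manual = -np.mean(y_toy*np.log(yhat_toy) + (1-y_toy)*np.log(1-yhat_toy))
bce_keras  = tf.keras.losses.binary_crossentropy(y_toy, yhat_toy).numpy()
print(bce_manual, bce_keras)   # both approximately 0.299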

Example

N = 2000
x = tnp.linspace(-1,1,N).reshape(N,1)
w0 = -1
w1 = 5
u = w0 + x*w1
#v = tf.constant(np.exp(u)/(1+np.exp(u))) # v=πi
v = tf.nn.sigmoid(u)
y = tf.constant(np.random.binomial(1,v),dtype=tf.float64)
plt.plot(x,y,'.',alpha=0.02)
plt.plot(x,v,'--r')

- This architecture (the process that produces yhat) can be drawn as the diagram below.

#collapse
gv('''
subgraph cluster_1{
    style=filled;
    color=lightgrey;
    "x"
    label = "Layer 0"
}
subgraph cluster_2{
    style=filled;
    color=lightgrey;
    "x" -> "x*w, bias=True"[label="*w"]
    "x*w, bias=True" -> "yhat"[label="sigmoid"]
    label = "Layer 1"
}
''')

- Or it can be drawn more simply as below.

#collapse
gv('''
subgraph cluster_1{
    style=filled;
    color=lightgrey;
    x
    label = "Layer 0"
}
subgraph cluster_2{
    style=filled;
    color=lightgrey;
    x -> "node1=yhat"
    label = "Layer 1: sigmoid"
}
''')

- Fitting it with Keras:

  • \(loss=-\frac{1}{n}\sum_{i=1}^{n}\big(y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\big)\)
tf.random.set_seed(43052)
net = tf.keras.Sequential()
net.add(tf.keras.layers.Dense(1,activation='sigmoid'))
bceloss_fn = lambda y,yhat: -tf.reduce_mean(y*tnp.log(yhat) + (1-y)*tnp.log(1-yhat))
net.compile(loss=bceloss_fn, optimizer=tf.optimizers.SGD(0.1))
net.fit(x,y,epochs=1000,verbose=0,batch_size=N)
<keras.callbacks.History at 0x7fa324cddbb0>
net.weights
[<tf.Variable 'dense_12/kernel:0' shape=(1, 1) dtype=float32, numpy=array([[4.307486]], dtype=float32)>,
 <tf.Variable 'dense_12/bias:0' shape=(1,) dtype=float32, numpy=array([-0.82411796], dtype=float32)>]
plt.plot(x,y,'.',alpha=0.1)
plt.plot(x,v,'--r')
plt.plot(x,net(x),'--b')

MSE loss?

- Why does the MSE loss not work here?

tf.random.set_seed(43052)
net = tf.keras.Sequential()
net.add(tf.keras.layers.Dense(1,activation='sigmoid'))
mseloss_fn = lambda y,yhat: tf.reduce_mean((y-yhat)**2)
net.compile(loss=mseloss_fn, optimizer=tf.optimizers.SGD(0.1))
net.fit(x,y,epochs=1000,verbose=0,batch_size=N)
<keras.callbacks.History at 0x7fa325764a00>
plt.plot(x,y,'.',alpha=0.1)
plt.plot(x,v,'--r')
plt.plot(x,net(x),'--b')

  • Compared with the BCE loss run, the fit is clearly worse for the same initial values and the same number of epochs.

MSE loss vs BCE loss

- Visualizing the MSE loss and the BCE loss

w0, w1 = np.meshgrid(np.arange(-10,3,0.2), np.arange(-1,10,0.2), indexing='ij')
w0, w1 = w0.reshape(-1), w1.reshape(-1)

def loss_fn1(w0,w1):
    u = w0+w1*x
    yhat = np.exp(u)/(np.exp(u)+1)
    return mseloss_fn(y,yhat)

def loss_fn2(w0,w1):
    u = w0+w1*x
    yhat = np.exp(u)/(np.exp(u)+1)
    return bceloss_fn(y,yhat)

loss1 = list(map(loss_fn1,w0,w1))
loss2 = list(map(loss_fn2,w0,w1))
fig = plt.figure()
fig.set_figwidth(9)
fig.set_figheight(9)
ax1=fig.add_subplot(1,2,1,projection='3d')
ax2=fig.add_subplot(1,2,2,projection='3d')
ax1.elev=15
ax2.elev=15
ax1.azim=75
ax2.azim=75
ax1.scatter(w0,w1,loss1,s=0.1)
ax2.scatter(w0,w1,loss2,s=0.1)
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7fa3254a7cd0>

  • The right surface (BCE loss) is much better behaved than the left one (MSE loss) -> learning should go better on the right surface.

Training-process visualization: example 1

- Visualizing the parameter learning process // optimizer: SGD, initial values: (w0,w1) = (-3.0,-1.0)

  1. Prepare the data
X = tf.concat([tf.ones(N,dtype=tf.float64).reshape(N,1),x],axis=1)
X
<tf.Tensor: shape=(2000, 2), dtype=float64, numpy=
array([[ 1.       , -1.       ],
       [ 1.       , -0.9989995],
       [ 1.       , -0.997999 ],
       ...,
       [ 1.       ,  0.997999 ],
       [ 1.       ,  0.9989995],
       [ 1.       ,  1.       ]])>
  2. Run one iteration first
net_mse = tf.keras.Sequential()
net_mse.add(tf.keras.layers.Dense(1,use_bias=False,activation='sigmoid'))
net_mse.compile(optimizer=tf.optimizers.SGD(0.1),loss=mseloss_fn)
net_mse.fit(X,y,epochs=1,batch_size=N)
1/1 [==============================] - 0s 85ms/step - loss: 0.2281
<keras.callbacks.History at 0x7fa3253f1fd0>
net_bce = tf.keras.Sequential()
net_bce.add(tf.keras.layers.Dense(1,use_bias=False,activation='sigmoid'))
net_bce.compile(optimizer=tf.optimizers.SGD(0.1),loss=bceloss_fn)
net_bce.fit(X,y,epochs=1,batch_size=N)
1/1 [==============================] - 0s 94ms/step - loss: 0.7711
<keras.callbacks.History at 0x7fa32546da30>
net_mse.get_weights(), net_bce.get_weights()
([array([[0.19067296],
         [0.35189584]], dtype=float32)],
 [array([[-1.0962652 ],
         [-0.14414385]], dtype=float32)])
net_mse.set_weights([tnp.array([[-3.0 ],[ -1.0]],dtype=tf.float32)])
net_bce.set_weights([tnp.array([[-3.0 ],[ -1.0]],dtype=tf.float32)])
net_mse.get_weights(), net_bce.get_weights()
([array([[-3.],
         [-1.]], dtype=float32)],
 [array([[-3.],
         [-1.]], dtype=float32)])
  3. Record the training process: one snapshot every 15 epochs
What_mse = tnp.array([[-3.0 ],[ -1.0]],dtype=tf.float32)
What_bce = tnp.array([[-3.0 ],[ -1.0]],dtype=tf.float32)
for k in range(29):
    net_mse.fit(X,y,epochs=15,batch_size=N,verbose=0)
    net_bce.fit(X,y,epochs=15,batch_size=N,verbose=0)
    What_mse = tf.concat([What_mse,net_mse.weights[0]],axis=1)
    What_bce = tf.concat([What_bce,net_bce.weights[0]],axis=1)
  4. Visualization
from matplotlib import animation
plt.rcParams["animation.html"] = "jshtml"
fig = plt.figure()
fig.set_figwidth(6)
fig.set_figheight(6)
fig.suptitle("SGD, Winit=(-3,-1)")
ax1=fig.add_subplot(2,2,1,projection='3d')
ax2=fig.add_subplot(2,2,2,projection='3d')
ax1.elev=15;ax2.elev=15;ax1.azim=75;ax2.azim=75
ax3=fig.add_subplot(2,2,3)
ax4=fig.add_subplot(2,2,4)

ax1.scatter(w0,w1,loss1,s=0.1);ax1.scatter(-1,5,loss_fn1(-1,5),color='red',marker='*',s=200)
ax2.scatter(w0,w1,loss2,s=0.1);ax2.scatter(-1,5,loss_fn2(-1,5),color='red',marker='*',s=200)

ax3.plot(x,y,','); ax3.plot(x,v,'--r');
line3, = ax3.plot(x,1/(1+np.exp(-X@What_mse[:,0])),'--b')
ax4.plot(x,y,','); ax4.plot(x,v,'--r')
line4, = ax4.plot(x,1/(1+np.exp(-X@What_bce[:,0])),'--b')

def animate(i):
    _w0_mse,_w1_mse = What_mse[:,i]
    _w0_bce,_w1_bce = What_bce[:,i]
    ax1.scatter(_w0_mse, _w1_mse, loss_fn1(_w0_mse, _w1_mse),color='gray')
    ax2.scatter(_w0_bce, _w1_bce, loss_fn2(_w0_bce, _w1_bce),color='gray')
    line3.set_ydata(1/(1+np.exp(-X@What_mse[:,i])))
    line4.set_ydata(1/(1+np.exp(-X@What_bce[:,i])))

ani = animation.FuncAnimation(fig, animate, frames=30)
plt.close()
ani

Training-process visualization: example 2

- Visualizing the parameter learning process // optimizer: Adam, initial values: (w0,w1) = (-3.0,-1.0)

  1. Prepare the data
X = tf.concat([tf.ones(N,dtype=tf.float64).reshape(N,1),x],axis=1)
X
<tf.Tensor: shape=(2000, 2), dtype=float64, numpy=
array([[ 1.       , -1.       ],
       [ 1.       , -0.9989995],
       [ 1.       , -0.997999 ],
       ...,
       [ 1.       ,  0.997999 ],
       [ 1.       ,  0.9989995],
       [ 1.       ,  1.       ]])>
  2. Run one iteration first
net_mse = tf.keras.Sequential()
net_mse.add(tf.keras.layers.Dense(1,use_bias=False,activation='sigmoid'))
net_mse.compile(optimizer=tf.optimizers.Adam(0.1),loss=mseloss_fn)
net_mse.fit(X,y,epochs=1,batch_size=N)
1/1 [==============================] - 0s 102ms/step - loss: 0.3403
<keras.callbacks.History at 0x7fa32518ef10>
net_bce = tf.keras.Sequential()
net_bce.add(tf.keras.layers.Dense(1,use_bias=False,activation='sigmoid'))
net_bce.compile(optimizer=tf.optimizers.Adam(0.1),loss=bceloss_fn)
net_bce.fit(X,y,epochs=1,batch_size=N)
1/1 [==============================] - 0s 106ms/step - loss: 0.8690
<keras.callbacks.History at 0x7fa324bbaa00>
net_mse.get_weights(), net_bce.get_weights()
([array([[1.2018752 ],
         [0.73809683]], dtype=float32)],
 [array([[-0.9399656],
         [-0.5219858]], dtype=float32)])
net_mse.set_weights([tnp.array([[-3.0 ],[ -1.0]],dtype=tf.float32)])
net_bce.set_weights([tnp.array([[-3.0 ],[ -1.0]],dtype=tf.float32)])
net_mse.get_weights(), net_bce.get_weights()
([array([[-3.],
         [-1.]], dtype=float32)],
 [array([[-3.],
         [-1.]], dtype=float32)])
  3. Record the training process: one snapshot every 15 epochs
What_mse = tnp.array([[-3.0 ],[ -1.0]],dtype=tf.float32)
What_bce = tnp.array([[-3.0 ],[ -1.0]],dtype=tf.float32)
for k in range(29):
    net_mse.fit(X,y,epochs=15,batch_size=N,verbose=0)
    net_bce.fit(X,y,epochs=15,batch_size=N,verbose=0)
    What_mse = tf.concat([What_mse,net_mse.weights[0]],axis=1)
    What_bce = tf.concat([What_bce,net_bce.weights[0]],axis=1)
  4. Visualization
from matplotlib import animation
plt.rcParams["animation.html"] = "jshtml"
fig = plt.figure()
fig.set_figwidth(6)
fig.set_figheight(6)
fig.suptitle("Adam, Winit=(-3,-1)")
ax1=fig.add_subplot(2,2,1,projection='3d')
ax2=fig.add_subplot(2,2,2,projection='3d')
ax1.elev=15;ax2.elev=15;ax1.azim=75;ax2.azim=75
ax3=fig.add_subplot(2,2,3)
ax4=fig.add_subplot(2,2,4)

ax1.scatter(w0,w1,loss1,s=0.1);ax1.scatter(-1,5,loss_fn1(-1,5),color='red',marker='*',s=200)
ax2.scatter(w0,w1,loss2,s=0.1);ax2.scatter(-1,5,loss_fn2(-1,5),color='red',marker='*',s=200)

ax3.plot(x,y,','); ax3.plot(x,v,'--r');
line3, = ax3.plot(x,1/(1+np.exp(-X@What_mse[:,0])),'--b')
ax4.plot(x,y,','); ax4.plot(x,v,'--r')
line4, = ax4.plot(x,1/(1+np.exp(-X@What_bce[:,0])),'--b')

def animate(i):
    _w0_mse,_w1_mse = What_mse[:,i]
    _w0_bce,_w1_bce = What_bce[:,i]
    ax1.scatter(_w0_mse, _w1_mse, loss_fn1(_w0_mse, _w1_mse),color='gray')
    ax2.scatter(_w0_bce, _w1_bce, loss_fn2(_w0_bce, _w1_bce),color='gray')
    line3.set_ydata(1/(1+np.exp(-X@What_mse[:,i])))
    line4.set_ydata(1/(1+np.exp(-X@What_bce[:,i])))

ani = animation.FuncAnimation(fig, animate, frames=30)
plt.close()
ani

Training-process visualization: example 3

- Visualizing the parameter learning process // optimizer: Adam, initial values: (w0,w1) = (-10.0,-1.0)

  1. Prepare the data
X = tf.concat([tf.ones(N,dtype=tf.float64).reshape(N,1),x],axis=1)
X
<tf.Tensor: shape=(2000, 2), dtype=float64, numpy=
array([[ 1.       , -1.       ],
       [ 1.       , -0.9989995],
       [ 1.       , -0.997999 ],
       ...,
       [ 1.       ,  0.997999 ],
       [ 1.       ,  0.9989995],
       [ 1.       ,  1.       ]])>
  2. Run one iteration first
net_mse = tf.keras.Sequential()
net_mse.add(tf.keras.layers.Dense(1,use_bias=False,activation='sigmoid'))
net_mse.compile(optimizer=tf.optimizers.Adam(0.1),loss=mseloss_fn)
net_mse.fit(X,y,epochs=1,batch_size=N)
1/1 [==============================] - 0s 100ms/step - loss: 0.4499
<keras.callbacks.History at 0x7fa324a3d0d0>
net_bce = tf.keras.Sequential()
net_bce.add(tf.keras.layers.Dense(1,use_bias=False,activation='sigmoid'))
net_bce.compile(optimizer=tf.optimizers.Adam(0.1),loss=bceloss_fn)
net_bce.fit(X,y,epochs=1,batch_size=N)
1/1 [==============================] - 0s 114ms/step - loss: 1.0827
<keras.callbacks.History at 0x7fa303477850>
net_mse.get_weights(), net_bce.get_weights()
([array([[ 0.61489564],
         [-1.2362169 ]], dtype=float32)],
 [array([[ 0.89960575],
         [-0.63551056]], dtype=float32)])
net_mse.set_weights([tnp.array([[-10.0 ],[ -1.0]],dtype=tf.float32)])
net_bce.set_weights([tnp.array([[-10.0 ],[ -1.0]],dtype=tf.float32)])
net_mse.get_weights(), net_bce.get_weights()
([array([[-10.],
         [ -1.]], dtype=float32)],
 [array([[-10.],
         [ -1.]], dtype=float32)])
  3. Record the training process: one snapshot every 15 epochs
What_mse = tnp.array([[-10.0 ],[ -1.0]],dtype=tf.float32)
What_bce = tnp.array([[-10.0 ],[ -1.0]],dtype=tf.float32)
for k in range(29):
    net_mse.fit(X,y,epochs=15,batch_size=N,verbose=0)
    net_bce.fit(X,y,epochs=15,batch_size=N,verbose=0)
    What_mse = tf.concat([What_mse,net_mse.weights[0]],axis=1)
    What_bce = tf.concat([What_bce,net_bce.weights[0]],axis=1)
  4. Visualization
from matplotlib import animation
plt.rcParams["animation.html"] = "jshtml"
fig = plt.figure()
fig.set_figwidth(6)
fig.set_figheight(6)
fig.suptitle("Adam, Winit=(-10,-1)")
ax1=fig.add_subplot(2,2,1,projection='3d')
ax2=fig.add_subplot(2,2,2,projection='3d')
ax1.elev=15;ax2.elev=15;ax1.azim=75;ax2.azim=75
ax3=fig.add_subplot(2,2,3)
ax4=fig.add_subplot(2,2,4)

ax1.scatter(w0,w1,loss1,s=0.1);ax1.scatter(-1,5,loss_fn1(-1,5),color='red',marker='*',s=200)
ax2.scatter(w0,w1,loss2,s=0.1);ax2.scatter(-1,5,loss_fn2(-1,5),color='red',marker='*',s=200)

ax3.plot(x,y,','); ax3.plot(x,v,'--r');
line3, = ax3.plot(x,1/(1+np.exp(-X@What_mse[:,0])),'--b')
ax4.plot(x,y,','); ax4.plot(x,v,'--r')
line4, = ax4.plot(x,1/(1+np.exp(-X@What_bce[:,0])),'--b')

def animate(i):
    _w0_mse,_w1_mse = What_mse[:,i]
    _w0_bce,_w1_bce = What_bce[:,i]
    ax1.scatter(_w0_mse, _w1_mse, loss_fn1(_w0_mse, _w1_mse),color='gray')
    ax2.scatter(_w0_bce, _w1_bce, loss_fn2(_w0_bce, _w1_bce),color='gray')
    line3.set_ydata(1/(1+np.exp(-X@What_mse[:,i])))
    line4.set_ydata(1/(1+np.exp(-X@What_bce[:,i])))

ani = animation.FuncAnimation(fig, animate, frames=30)
plt.close()
ani
  • Even Adam cannot cope with this.

- Discussion

  • Depending on the situation, the MSE loss can converge extremely slowly.

  • The fundamental problem: with the MSE loss, the loss surface is not well shaped (in technical terms, it is not convex).

  • A good optimizer can speed up convergence even with the MSE loss (visualization example 2), but that is not a fundamental fix (visualization example 3).

- Summary: why should the MSE loss not be used for logistic regression?

  • Because with the MSE loss the loss function is not convex!

  • And because with the BCE loss the loss function is convex!
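A short sketch of why this holds (a standard argument, not worked out in the original note): write \(u_i=w_0+w_1x_i\) and \(\hat{y}_i=\sigma(u_i)=\frac{1}{1+\exp(-u_i)}\).

  • BCE loss: \(\frac{\partial}{\partial u_i}\big(-y_i\log\hat{y}_i-(1-y_i)\log(1-\hat{y}_i)\big)=\hat{y}_i-y_i\) and \(\frac{\partial^2}{\partial u_i^2}=\hat{y}_i(1-\hat{y}_i)\geq 0\), so the Hessian with respect to \((w_0,w_1)\) is \(\frac{1}{n}\sum_i \hat{y}_i(1-\hat{y}_i)\tilde{x}_i\tilde{x}_i^\top\) with \(\tilde{x}_i=(1,x_i)^\top\), which is positive semidefinite → convex.
  • MSE loss: \(\frac{\partial^2}{\partial u_i^2}(y_i-\hat{y}_i)^2 = 2\hat{y}_i(1-\hat{y}_i)\big(\hat{y}_i(1-\hat{y}_i)-(y_i-\hat{y}_i)(1-2\hat{y}_i)\big)\), which is negative, for example, when \(y_i=1\) and \(\hat{y}_i\) is near 0 → the surface is not convex, exactly the flat region seen in the left panel above.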