Visualization

郭耀仁

The Power of Visualization

He rose to international celebrity status after producing a Ted Talk in which he promoted the use of data to explore development issues. Hans Rosling - Wikipedia

The Power of Visualization (2)

  • 200 years
  • 200+ countries
  • 120,000+ rows of data
  • 4 minutes
  • 1 visualization

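The Power of Visualization (3)

  • Common uses of visualization in data science:
    • Communicating more efficiently (Business Intelligence, infographics)
    • Helping to understand models and algorithms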

Matplotlib

  • The foundational plotting package for Python
# Renders figures inline in the Notebook; not always required
%matplotlib inline
import matplotlib.pyplot as plt

Matplotlib (2)

  • Saying hi to Matplotlib
  • Figures and subplots
  • Different plot types

Saying Hi to Matplotlib

  • Using the defaults
In [1]:
import numpy as np
import matplotlib.pyplot as plt

# Create the data
x = np.linspace(-np.pi, np.pi, 100)
sin_x = np.sin(x)

# Plot
plt.plot(x, sin_x)

# Display
plt.show()

Saying Hi to Matplotlib (2)

  • Try changing the line width and color
In [2]:
import numpy as np
import matplotlib.pyplot as plt

# Create the data
x = np.linspace(-np.pi, np.pi, 100)
sin_x = np.sin(x)

# Plot
plt.plot(x, sin_x, color = "blue", linewidth = 3, linestyle = "-")

# Display
plt.show()

Saying Hi to Matplotlib (3)

  • Try changing the ranges of the X and Y axes
In [3]:
import numpy as np
import matplotlib.pyplot as plt

# Create the data
x = np.linspace(-np.pi, np.pi, 100)
sin_x = np.sin(x)

# Plot, leaving a 20% margin beyond the data on each axis
plt.plot(x, sin_x, color = "blue", linewidth = 3, linestyle = "-")
plt.xlim(x.min()*1.2, x.max()*1.2)
plt.ylim(sin_x.min()*1.2, sin_x.max()*1.2)

# Display
plt.show()

Saying Hi to Matplotlib (4)

  • Try changing the axis ticks
In [4]:
import numpy as np
import matplotlib.pyplot as plt

# Create the data
x = np.linspace(-np.pi, np.pi, 100)
sin_x = np.sin(x)

# Plot
plt.plot(x, sin_x, color = "blue", linewidth = 3, linestyle = "-")
plt.xlim(x.min()*1.2, x.max()*1.2)
plt.ylim(sin_x.min()*1.2, sin_x.max()*1.2)
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
plt.yticks([-1, 0, 1])

# Display
plt.show()

Saying Hi to Matplotlib (5)

  • Try changing the labels shown at the axis ticks
In [5]:
import numpy as np
import matplotlib.pyplot as plt

# Create the data
x = np.linspace(-np.pi, np.pi, 100)
sin_x = np.sin(x)

# Plot
plt.plot(x, sin_x, color = "blue", linewidth = 3, linestyle = "-")
plt.xlim(x.min()*1.2, x.max()*1.2)
plt.ylim(sin_x.min()*1.2, sin_x.max()*1.2)
# Raw strings (r"...") keep the LaTeX backslashes from being read as escape sequences
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi], [r"$-\pi$", r"$-\pi/2$", r"$0$", r"$\pi/2$", r"$\pi$"])
plt.yticks([-1, 0, 1], [r"$-1$", r"$0$", r"$+1$"])

# Display
plt.show()

Saying Hi to Matplotlib (6)

  • Try moving the figure's borders (spines)
    • Hide one horizontal and one vertical spine
    • Move the remaining two to the origin
In [6]:
import numpy as np
import matplotlib.pyplot as plt

# Create the data
x = np.linspace(-np.pi, np.pi, 100)
sin_x = np.sin(x)

# Plot
plt.plot(x, sin_x, color = "blue", linewidth = 3, linestyle = "-")
plt.xlim(x.min()*1.2, x.max()*1.2)
plt.ylim(sin_x.min()*1.2, sin_x.max()*1.2)
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi], [r"$-\pi$", r"$-\pi/2$", r"$0$", r"$\pi/2$", r"$\pi$"])
plt.yticks([-1, 0, 1], [r"$-1$", r"$0$", r"$+1$"])
ax = plt.gca() # get current axes
# Hide the right and top spines
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
# Move the bottom and left spines to the data origin
#ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
#ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
In [7]:
# Display
plt.show()

Saying Hi to Matplotlib (7)

  • Add a legend
In [8]:
import numpy as np
import matplotlib.pyplot as plt

# Create the data
x = np.linspace(-np.pi, np.pi, 100)
sin_x = np.sin(x)

# Plot; the label argument supplies the legend text
plt.plot(x, sin_x, color = "blue", linewidth = 3, linestyle = "-", label = "sin(x)")
plt.xlim(x.min()*1.2, x.max()*1.2)
plt.ylim(sin_x.min()*1.2, sin_x.max()*1.2)
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi], [r"$-\pi$", r"$-\pi/2$", r"$0$", r"$\pi/2$", r"$\pi$"])
plt.yticks([-1, 0, 1], [r"$-1$", r"$0$", r"$+1$"])
ax = plt.gca() # get current axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
#ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
#ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
plt.legend(loc = 'upper left', frameon = False)
Out[8]:
<matplotlib.legend.Legend at 0x11ab5ae80>
In [9]:
# Display
plt.show()

Saying Hi to Matplotlib (8)

  • In-class exercise: reproduce the figure below
In [11]:
plt.show()

Saying Hi to Matplotlib (9)

  • Line plots are most often used for data indexed by date and time
In [12]:
import pandas as pd

url = "https://storage.googleapis.com/py_ml_datasets/aapl.csv"
data = pd.read_csv(url, index_col = 0, parse_dates = True)
print(data.index)
DatetimeIndex(['2017-06-28', '2017-06-29', '2017-06-30', '2017-07-03',
               '2017-07-05', '2017-07-06', '2017-07-07', '2017-07-10',
               '2017-07-11', '2017-07-12', '2017-07-13', '2017-07-14',
               '2017-07-17', '2017-07-18', '2017-07-19', '2017-07-20',
               '2017-07-21', '2017-07-24', '2017-07-25', '2017-07-26',
               '2017-07-27'],
              dtype='datetime64[ns]', name='Date', freq=None)
In [13]:
# plt.plot_date()
# Note: plot_date() is deprecated in recent Matplotlib releases;
# plt.plot() also accepts datetime x-values.
import matplotlib.pyplot as plt

plt.plot_date(x = data.index, y = data['Close'].values, fmt = "r-")
plt.show()
In [14]:
url_m = "https://storage.googleapis.com/py_ml_datasets/msft.csv"
msft = pd.read_csv(url_m, index_col = 0, parse_dates = True)
plt.plot_date(x = data.index, y = data['Close'].values, fmt = "r-", label = "AAPL")
plt.plot_date(x = msft.index, y = msft['Close'].values, fmt = "b-", label = "MSFT")
plt.legend(loc = 'center right', frameon = False)
plt.show()

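Saying Hi to Matplotlib (10)

  • Pandas and Matplotlib both used to offer APIs for Yahoo Finance and Google Finance, but both have since been shut down
  • Quandl is the API I used to fetch the data for the preceding examples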
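If the hosted CSV files cannot be reached, the same kind of date-indexed line plot can be sketched with made-up data. The snippet below is only an illustration: the prices are random numbers, not real market data, and the starting value of 150 and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a closing-price series (NOT real market data)
rng = np.random.default_rng(42)
dates = pd.date_range(start="2017-06-28", periods=21, freq="B")  # business days
close = pd.Series(150 + rng.normal(0, 1, size=len(dates)).cumsum(), index=dates)

fig = plt.figure()
# plt.plot() accepts a DatetimeIndex on the x-axis directly
plt.plot(close.index, close.values, "r-", label="synthetic")
plt.legend(loc="center right", frameon=False)
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
# Call plt.show() to display the figure outside a Notebook
```

The point of the sketch is that a pandas `DatetimeIndex` can be passed as the x-values, so the tick labels are formatted as dates.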

Figures and Subplots

  • plt.figure()
    • figsize is the figure's width and height (in inches)
    • dpi is the figure's resolution (dots per inch)
In [15]:
import numpy as np
import matplotlib.pyplot as plt

# Create the data
x = np.linspace(-np.pi, np.pi, 100)
cos_x, sin_x = np.cos(x), np.sin(x)

plt.figure() # try adding the figsize and dpi arguments here
plt.plot(x, cos_x)
plt.plot(x, sin_x)
plt.show()
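As a rough sketch of how the two arguments interact: figsize is given in inches and dpi in dots per inch, so multiplying them gives the size of the rendered image in pixels. The values 8 x 4 inches and 100 dpi below are illustrative assumptions, not recommendations.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 100)

figsize = (8, 4)  # (width, height) in inches; illustrative values
dpi = 100         # dots per inch; illustrative value
fig = plt.figure(figsize=figsize, dpi=dpi)
plt.plot(x, np.cos(x))
plt.plot(x, np.sin(x))

# Pixel size of the rendered image = inches * dots per inch
width_px, height_px = figsize[0] * dpi, figsize[1] * dpi
print(width_px, height_px)  # 800 400
# Call plt.show() to display the figure outside a Notebook
```

Doubling dpi while keeping figsize fixed gives a sharper image of the same physical size; doubling figsize instead gives a larger canvas in which text and lines look relatively smaller.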

Figures and Subplots (2)

  • plt.subplot()
    • Number of rows
    • Number of columns
    • Which subplot to draw in
  • Note that plt.subplot() indices start at 1!
In [16]:
import matplotlib.pyplot as plt

plt.subplot(2,1,1)
plt.subplot(2,1,2)

plt.show()
In [17]:
import matplotlib.pyplot as plt

plt.subplot(2, 1, 1)
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 1', ha = 'center',va = 'center', size = 16)
plt.subplot(2, 1, 2)
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 2', ha = 'center',va = 'center', size = 16)

plt.show()
In [18]:
import matplotlib.pyplot as plt

plt.subplot(2, 1, 1)
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 1', ha = 'center',va = 'center', size = 16)
plt.subplot(2, 3, 4)
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 2', ha = 'center',va = 'center', size = 16)
plt.subplot(2, 3, 5)
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 3', ha = 'center',va = 'center', size = 16)
plt.subplot(2, 3, 6)
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 4', ha = 'center',va = 'center', size = 16)
plt.show()
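The last cell mixes a 2 x 1 grid with a 2 x 3 grid. That works because the subplot index counts cells row by row starting at 1, so indices 4 to 6 of a 2 x 3 grid are exactly its bottom row. The sketch below labels every cell of a 2 x 3 grid with its index and its zero-based row and column; the `positions` dictionary is a hypothetical helper used only for illustration.

```python
import matplotlib.pyplot as plt

nrows, ncols = 2, 3
fig = plt.figure()
positions = {}  # hypothetical helper: subplot index -> (row, column)
for index in range(1, nrows * ncols + 1):
    # Indices run left to right, then top to bottom, starting at 1
    row, col = divmod(index - 1, ncols)
    positions[index] = (row, col)
    plt.subplot(nrows, ncols, index)
    plt.xticks([])
    plt.yticks([])
    plt.text(0.5, 0.5, f"index {index}\nrow {row}, col {col}",
             ha="center", va="center")
# Call plt.show() to display the figure outside a Notebook
```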

Figures and Subplots (3)

  • Use gridspec for more advanced subplot layouts
In [19]:
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

G = gridspec.GridSpec(3, 3)

plt.subplot(G[0, :])
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 1', ha = 'center',va = 'center', size = 18)

plt.subplot(G[1, :2])
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 2', ha = 'center',va = 'center', size = 18)

plt.subplot(G[1:, 2])
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 3', ha = 'center',va = 'center', size = 18)

plt.subplot(G[2, 0])
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 4', ha = 'center',va = 'center', size = 18)

plt.subplot(G[2, 1])
plt.xticks([]), plt.yticks([])
plt.text(0.5, 0.5, 'subplot 5', ha = 'center',va = 'center', size = 18)
Out[19]:
<matplotlib.text.Text at 0x11d0b89e8>
In [20]:
plt.show()

Different Plot Types

Plotting function    Plot type
plt.scatter()        Scatter plot
plt.bar()            Bar plot
plt.hist()           Histogram
plt.contour()        Contour plot
Axes3D()             3D surface plot
In [21]:
# plt.scatter()

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-5, 6)
y = x**2
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("$y = x^2$") # supports LaTeX math syntax
plt.show()
In [22]:
# plt.bar()
import numpy as np
import matplotlib.pyplot as plt

N = 5
ice_cream_sales = np.random.randint(50, 100, size = N)
ind = np.arange(1, N + 1)
width = 0.5
plt.bar(ind, ice_cream_sales, width, facecolor = 'g')
plt.title('Ice cream sales by flavor')
plt.xticks(ind, ['Matcha', 'Chocolate', 'Vanilla', 'Coco', 'Strawberry'])
for x, y in zip(ind, ice_cream_sales):
    plt.text(x, y + 2, '%i' % y, ha = 'center', va = 'bottom')
plt.ylim(0, ice_cream_sales.max() * 1.2)
plt.yticks([])
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.spines['left'].set_color('none')
In [23]:
plt.show()
In [24]:
# plt.hist()

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(10000)
plt.hist(data)
plt.title('Normal Distribution')
plt.ylabel("Freq")
plt.show()
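With default settings plt.hist() groups the data into 10 bins. A minimal sketch of choosing the number of bins and inspecting what the call returns is shown below; the value of 30 bins and the seed are arbitrary assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.standard_normal(10000)

fig = plt.figure()
# plt.hist() returns the bin counts, the bin edges, and the drawn bars
counts, edges, patches = plt.hist(data, bins=30)
print(len(edges))         # 31: there is one more edge than there are bins
print(int(counts.sum()))  # 10000: every sample falls into exactly one bin
# Call plt.show() to display the figure outside a Notebook
```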

Contour Plots

  • numpy.meshgrid() takes one-dimensional arrays and repeats them along the horizontal and vertical directions to form matrices
In [25]:
import numpy as np

u = np.arange(start = -5, stop = 5, step = 0.1)
x, y = np.meshgrid(u, u)
print(x)
print("\n")
print(y)
[[-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 ..., 
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]]


[[-5.  -5.  -5.  ..., -5.  -5.  -5. ]
 [-4.9 -4.9 -4.9 ..., -4.9 -4.9 -4.9]
 [-4.8 -4.8 -4.8 ..., -4.8 -4.8 -4.8]
 ..., 
 [ 4.7  4.7  4.7 ...,  4.7  4.7  4.7]
 [ 4.8  4.8  4.8 ...,  4.8  4.8  4.8]
 [ 4.9  4.9  4.9 ...,  4.9  4.9  4.9]]
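The pattern is easier to see on a tiny input. In this sketch the arrays `u` and `v` are made-up values of different lengths, chosen so that the two directions can be told apart.

```python
import numpy as np

u = np.array([0, 1, 2])  # 3 values along the horizontal direction
v = np.array([10, 20])   # 2 values along the vertical direction
x, y = np.meshgrid(u, v)

print(x.shape)  # (2, 3): len(v) rows by len(u) columns
print(x)        # each row is a copy of u
print(y)        # each column is a copy of v
```

Pairing `x[i, j]` with `y[i, j]` therefore enumerates every combination of a value from `u` with a value from `v`, which is what a function of two variables needs for evaluation on a grid.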

Contour Plots (2)

  • A contour plot needs a third dimension, z, to represent height (shown as color)
In [26]:
import numpy as np

u = np.arange(start = -5, stop = 5, step = 0.1)
x, y = np.meshgrid(u, u)
z = x**2 + y**2
print(z)
[[ 50.    49.01  48.04 ...,  47.09  48.04  49.01]
 [ 49.01  48.02  47.05 ...,  46.1   47.05  48.02]
 [ 48.04  47.05  46.08 ...,  45.13  46.08  47.05]
 ..., 
 [ 47.09  46.1   45.13 ...,  44.18  45.13  46.1 ]
 [ 48.04  47.05  46.08 ...,  45.13  46.08  47.05]
 [ 49.01  48.02  47.05 ...,  46.1   47.05  48.02]]

Contour Plots (3)

  • contour() draws unfilled contour lines
In [27]:
import numpy as np
import matplotlib.pyplot as plt

u = np.arange(start = -5, stop = 5, step = 0.1)
x, y = np.meshgrid(u, u)
z = x**2 + y**2
plt.contour(x, y, z)
plt.colorbar()
plt.show()

Contour Plots (4)

  • contourf() draws filled contours
In [28]:
import numpy as np
import matplotlib.pyplot as plt

u = np.arange(start = -5, stop = 5, step = 0.1)
x, y = np.meshgrid(u, u)
z = x**2 + y**2
cf = plt.contourf(x, y, z)
plt.colorbar(cf)
plt.show()
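Both functions pick the contour heights automatically unless told otherwise. As a hedged sketch, the heights can be passed through the levels argument and written onto the lines with plt.clabel(). The levels 1, 4, 9 and 16 are illustrative choices: for z = x² + y² they correspond to circles of radius 1, 2, 3 and 4.

```python
import numpy as np
import matplotlib.pyplot as plt

u = np.arange(start=-5, stop=5, step=0.1)
x, y = np.meshgrid(u, u)
z = x**2 + y**2

fig = plt.figure()
levels = [1, 4, 9, 16]  # illustrative heights: circles of radius 1, 2, 3, 4
cs = plt.contour(x, y, z, levels=levels)
plt.clabel(cs, inline=True)     # write each height on its contour line
plt.gca().set_aspect("equal")   # so the circles are not drawn as ellipses
# Call plt.show() to display the figure outside a Notebook
```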

Contour Plots (5)

  • Why do we need to know how to draw a contour plot?

3D Surface Plots

  • from mpl_toolkits.mplot3d import Axes3D
In [29]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

u = np.arange(start = -5, stop = 5, step = 0.1)
x, y = np.meshgrid(u, u)
z = x**2 + y**2
surf = ax.plot_surface(x, y, z, cmap = plt.cm.coolwarm)
fig.colorbar(surf, shrink = 0.5)
plt.show()

Image Visualization

  • Converting a PNG to an ndarray
  • Displaying an ndarray

Image Visualization (2)

  • Converting a PNG to an ndarray
In [30]:
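import io
import urllib.request
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np

url = "https://storage.googleapis.com/py_ml_images/luffy_wanted.png"
# Recent Matplotlib releases no longer accept a URL in imread(),
# so download the bytes first and pass a file-like object instead
with urllib.request.urlopen(url) as response:
    img = mpimg.imread(io.BytesIO(response.read()), format = "png")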
print(type(img))
print(img.shape)
<class 'numpy.ndarray'>
(560, 384, 4)
In [31]:
plt.imshow(img)
plt.show()
In [32]:
plt.figure(figsize = (10, 6), dpi = 80)
for i in range(img.shape[-1]):
    plt.subplot(1, img.shape[-1], i + 1)
    lum_img = img[:, :, i]
    plt.imshow(lum_img)
plt.tight_layout()
plt.show()
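The last axis of the array holds the color channels; a size of 4 there corresponds to red, green, blue and alpha (opacity). A small sketch with a hand-built image makes that concrete without a download. The 2 x 2 array below is a made-up example, and the channels are drawn with the "gray" colormap so that 0 appears black and 1 appears white.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 2 x 2 RGBA image with float values in [0, 1]
img = np.zeros((2, 2, 4))
img[0, 0, 0] = 1.0  # top-left pixel: red
img[0, 1, 1] = 1.0  # top-right pixel: green
img[1, 0, 2] = 1.0  # bottom-left pixel: blue
img[:, :, 3] = 1.0  # alpha channel: fully opaque everywhere

fig = plt.figure(figsize=(10, 2))
plt.subplot(1, 5, 1)
plt.imshow(img)
plt.title("RGBA")
plt.xticks([])
plt.yticks([])
for i, name in enumerate(["R", "G", "B", "A"]):
    plt.subplot(1, 5, i + 2)
    # A single channel is a 2-D array, so a colormap decides its colors
    plt.imshow(img[:, :, i], cmap="gray", vmin=0, vmax=1)
    plt.title(name)
    plt.xticks([])
    plt.yticks([])
# Call plt.show() to display the figure outside a Notebook
```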

Image Visualization (3)

  • Displaying an ndarray
In [33]:
from sklearn import datasets

digits = datasets.load_digits()
print(digits.images.shape)
print(digits.images[0].shape)
(1797, 8, 8)
(8, 8)

Image Visualization (4)

In [34]:
# Visualizing pixels (2)
# Using the imshow() function

from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()
plt.imshow(digits.images[0], cmap = "binary")
plt.xticks([]), plt.yticks([])
plt.show()
In [35]:
# Visualizing pixels (3)
# Show 10 images

from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()
for idx in range(10):
    plt.subplot(2, 5, idx + 1) # subplot indices start at 1
    plt.imshow(digits.images[idx], cmap = "binary")
    plt.xticks([]), plt.yticks([])
plt.tight_layout()
plt.show()
In [36]:
# Visualizing pixels (4)
# Show 10 images
# Switch cmap to "gray"

from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()
for idx in range(10):
    plt.subplot(2, 5, idx + 1) # subplot indices start at 1
    plt.imshow(digits.images[idx], cmap = "gray")
    plt.xticks([]), plt.yticks([])
plt.tight_layout()
plt.show()
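The two colormaps are mirror images of each other: "binary" draws the lowest value in white and the highest in black, while "gray" does the opposite. The sketch below shows both on a made-up ramp of values from 0 to 1 and looks up the color each one assigns to the lowest value.

```python
import numpy as np
import matplotlib.pyplot as plt

ramp = np.linspace(0, 1, 8).reshape(1, -1)  # values rising from 0 to 1

fig = plt.figure(figsize=(6, 2))
for i, name in enumerate(["binary", "gray"]):
    plt.subplot(2, 1, i + 1)
    plt.imshow(ramp, cmap=name, aspect="auto")
    plt.title(name)
    plt.xticks([])
    plt.yticks([])
plt.tight_layout()

# RGBA color that each colormap assigns to the lowest value
lowest_binary = plt.get_cmap("binary")(0.0)  # white
lowest_gray = plt.get_cmap("gray")(0.0)      # black
# Call plt.show() to display the figure outside a Notebook
```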

In-class Exercise

  • Create a (1, 2) grid of subplots
  • Plot the S (sigmoid) function and the sign function, one in each subplot
$$S(x) = \frac{1}{1 + e^{-x}}$$
$$\operatorname{sign}(x) = \begin{cases} -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
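One possible way to approach the exercise is sketched below; try it yourself first. The helper name `sigmoid`, the x-range of -10 to 10 and the figure size are assumptions made for illustration, and np.sign() is NumPy's built-in version of the piecewise definition above.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """S(x) = 1 / (1 + e^(-x))"""
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 201)  # illustrative range

fig = plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)  # 1 row, 2 columns, first subplot
plt.plot(x, sigmoid(x))
plt.title("S(x)")
plt.subplot(1, 2, 2)  # 1 row, 2 columns, second subplot
plt.plot(x, np.sign(x))
plt.title("sign(x)")
# Call plt.show() to display the figure outside a Notebook
```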

Future of Visualization Tools

Quoted from Wes McKinney, author of pandas:

The Future of Visualization Tools?

Visualizations built on web technologies (that is, JavaScript-based) appear to be the inevitable future. Doubtlessly you have used many different kinds of static or interactive visualizations built in Flash or JavaScript over the years. New toolkits (such as d3.js and its numerous off-shoot projects) for building such displays are appearing all the time. In contrast, development in non web-based visualization has slowed significantly in recent years. This holds true of Python as well as other data analysis and statistical computing environments like R.