かねてから挑戦して見たかったKaggleに挑戦する（つもりだった...）
そのコンペは終了してしまったため、Submitはできないのだが、せっかく着手していたのでやりきりたい。

対象のコンペは、Recruit Restaurant Visitor Forecasting。
飲食店の来客数を予想する。

Recruit Restaurant Visitor Forecasting | Kaggle

これから数回に分けて、基礎分析から機械学習を用いた予測まで、一連の作業を進めていく。

最新のソースはGitHubで公開、随時更新していく。

今回は与えられたデータの確認１回目。
与えられたcsvデータどうしを結合せずに分析していきます。

０．sample_submission.csv [Submitサンプル]

store_idに、アンダースコア、訪問日を付与する必要がある
2017-04-23〜2017-05-31が予測対象期間
visitorsに予測値を出力する

import numpy as np
import pandas as pd
from pandas import DataFrame
from datetime import datetime, timedelta
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# .csvをDataFrame化
sample = pd.read_csv('Data/csv/sample_submission.csv')
sample.head(3)

	id	visitors
0	air_00a91d42b08b08d9_2017-04-23	0
1	air_00a91d42b08b08d9_2017-04-24	0
2	air_00a91d42b08b08d9_2017-04-25	0

# 他のデータセットと形式を合わせておく
sample_edit = sample.copy()
sample_edit['air_store_id'] = sample_edit.id.map(lambda x: '_'.join(x.split('_')[:-1]))
sample_edit['calendar_date'] = sample_edit.id.map(lambda x: x.split('_')[2])
sample_edit['calendar_date'] = sample_edit.calendar_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
sample_edit.drop('id',axis=1)
sample_edit = sample_edit.loc[:,['air_store_id','calendar_date','visitors']]
sample_edit.head(3)

	air_store_id	calendar_date
0	air_00a91d42b08b08d9	2017-04-23
1	air_00a91d42b08b08d9	2017-04-24
2	air_00a91d42b08b08d9	2017-04-25

# データ型を確認
sample_edit.dtypes

air_store_id             object
calendar_date    datetime64[ns]
visitors                  int64
dtype: object

# 統計量を確認
sample_edit.describe(include='all')

	air_store_id	calendar_date	visitors
count	32019	32019	32019.0
unique	821	39	NaN
top	air_b88192b35ac03c24	2017-05-26 00:00:00	NaN
freq	39	821	NaN
first	NaN	2017-04-23 00:00:00	NaN
last	NaN	2017-05-31 00:00:00	NaN
mean	NaN	NaN	0.0
std	NaN	NaN	0.0
min	NaN	NaN	0.0
25%	NaN	NaN	0.0
50%	NaN	NaN	0.0
75%	NaN	NaN	0.0
max	NaN	NaN	0.0

１．air_reserve.csv [Air リザーブの予約データ]

92,378レコード
314 /829店舗分の予約データ
訪問予定は、2016-01-01 19:00:00 〜 2017-05-31 21:00:00
予約時刻は、2016-01-01 01:00:00 〜 2017-04-22 23:00:00

# .csvをDataFrame化
df1 = pd.read_csv('Data/csv/air_reserve.csv')
df1.head(3)

	air_store_id	visit_datetime	reserve_datetime	reserve_visitors
0	air_877f79706adbfb06	2016-01-01 19:00:00	2016-01-01 16:00:00	1
1	air_db4b38ebe7a7ceff	2016-01-01 19:00:00	2016-01-01 19:00:00	3
2	air_db4b38ebe7a7ceff	2016-01-01 19:00:00	2016-01-01 19:00:00	6

# データ型を確認
df1.dtypes

air_store_id        object
visit_datetime      object
reserve_datetime    object
reserve_visitors     int64
dtype: object

# データ型を修正
df1['visit_datetime'] = df1.visit_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df1['reserve_datetime'] = df1.reserve_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df1.dtypes

air_store_id                object
visit_datetime      datetime64[ns]
reserve_datetime    datetime64[ns]
reserve_visitors             int64
dtype: object

# 統計量を確認
df1.describe(include='all')

	air_store_id	visit_datetime	reserve_datetime	reserve_visitors
count	92378	92378	92378	92378.000000
unique	314	4975	7513	NaN
top	air_8093d0b565e9dbdf	2016-12-24 19:00:00	2016-11-24 18:00:00	NaN
freq	2263	255	106	NaN
first	NaN	2016-01-01 19:00:00	2016-01-01 01:00:00	NaN
last	NaN	2017-05-31 21:00:00	2017-04-22 23:00:00	NaN
mean	NaN	NaN	NaN	4.481749
std	NaN	NaN	NaN	4.919669
min	NaN	NaN	NaN	1.000000
25%	NaN	NaN	NaN	2.000000
50%	NaN	NaN	NaN	3.000000
75%	NaN	NaN	NaN	5.000000
max	NaN	NaN	NaN	100.000000

１−１．月間予約数

#  予約数を月次で集計
df1_monthly = df1.copy()
df1_monthly['reserve_datetime'] = df1_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['air_store_id','visit_datetime']
df1_monthly = df1_monthly.rename(columns={'reserve_datetime': 'reserve_month'}).drop(drop_col,axis=1)
df1_monthly = df1_monthly.groupby(['reserve_month']).count().reset_index().rename(columns={'reserve_visitors':'reserve'})
df1_monthly.head(3)

	reserve_month	reserve
0	2016-01	2936
1	2016-02	2999
2	2016-03	3643

# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df1_monthly.reserve.as_matrix(), tick_label=df1_monthly.reserve_month, align="center", color='deepskyblue')
plt.title("Monthly_air_reserve")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve")
plt.grid(True)

f:id:select_from_where:20180217221028p:plain

１−２．月別訪問予定者数

#  訪問予定者数を月次で集計
df1_monthly = df1.copy()
df1_monthly['visit_datetime'] = df1_monthly.visit_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['air_store_id','reserve_datetime']
df1_monthly = df1_monthly.rename(columns={'visit_datetime': 'visit_month'}).drop(drop_col,axis=1)
df1_monthly = df1_monthly.groupby(['visit_month'])['reserve_visitors'].sum().reset_index()
df1_monthly.head(3)

	visit_month	reserve_visitors
0	2016-01	10335
1	2016-02	11579
2	2016-03	17071

# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,18), df1_monthly.reserve_visitors.as_matrix(), tick_label=df1_monthly.visit_month, align="center", color='deepskyblue')
plt.title("Monthly_air_reserve_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve_visitors")
plt.grid(True)

f:id:select_from_where:20180217221023p:plain

１−３．予約日から訪問予定日までの日数

#  予約日から訪問予定日までの日数を追加
df1_monthly_diff = df1.copy()
df1_monthly_diff['rsrv_2_vst'] = df1_monthly_diff.visit_datetime - df1_monthly_diff.reserve_datetime
df1_monthly_diff['rsrv_2_vst'] = df1_monthly_diff.rsrv_2_vst.map(lambda x: x.days)

df1_monthly_diff['reserve_datetime'] = df1_monthly_diff.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
df1_monthly_diff = df1_monthly_diff.rename(columns={'reserve_datetime': 'reserve_month'})

df1_monthly_diff.tail(3)

	air_store_id	visit_datetime	reserve_month	reserve_visitors	rsrv_2_vst
92375	air_877f79706adbfb06	2017-05-31 20:00:00	2017-04	3	48
92376	air_900d755ebd2f7bbd	2017-05-31 20:00:00	2017-04	10	42
92377	air_3cad29d1a23209d2	2017-05-31 21:00:00	2017-04	3	39

plt.figure(figsize=(14, 8))
sns.boxplot(x="reserve_month", y="rsrv_2_vst", data=df1_monthly_diff)

f:id:select_from_where:20180217221016p:plain

１−４．気づき

2016年9月以前の訪問者数がそれ以降に比べて少ない

2016年7月まで先行展開、2016年10月から本格展開の可能性がある
データがサンプリングされている可能性がある

忘年会、納会シーズンの予約が多い一方、歓迎会シーズンはそれほどでもないように見える

休日祝日の日数による影響がある可能性がある

５月以降の訪問予定になっている予約データはおそらく、意図的に除外されている

１年前、10か月前から予約が可能な人気店がある
５月以降の訪問予定になっている予約データはおそらく、意図的に除外されている（除外されていないレコードも5,000弱存在はする）

２．air_visit_data.csv [Air リザーブの訪問データ]

252,108レコード
829店舗全ての履歴データがある
訪問日時は、2016-01-01 〜 2017-04-22

# .csvをDataFrame化
df3 = pd.read_csv('Data/csv/air_visit_data.csv')
df3.head(3)

	air_store_id	visit_date	visitors
0	air_ba937bf13d40fb24	2016-01-13	25
1	air_ba937bf13d40fb24	2016-01-14	32
2	air_ba937bf13d40fb24	2016-01-15	29

# データ型を確認
df3.dtypes

air_store_id    object
visit_date      object
visitors         int64
dtype: object

# データ型を修正
df3['visit_date'] = df3.visit_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
df3.dtypes

air_store_id            object
visit_date      datetime64[ns]
visitors                 int64
dtype: object

# 統計量を確認
df3.describe(include='all')

	air_store_id	visit_date	visitors
count	252108	252108	252108.000000
unique	829	478	NaN
top	air_5c817ef28f236bdf	2017-03-17 00:00:00	NaN
freq	477	799	NaN
first	NaN	2016-01-01 00:00:00	NaN
last	NaN	2017-04-22 00:00:00	NaN
mean	NaN	NaN	20.973761
std	NaN	NaN	16.757007
min	NaN	NaN	1.000000
25%	NaN	NaN	9.000000
50%	NaN	NaN	17.000000
75%	NaN	NaN	29.000000
max	NaN	NaN	877.000000

２−１．月間訪問者数

#  訪問データを月次に加工
df3_monthly = df3.copy()
df3_monthly['visit_date'] = df3_monthly.visit_date.map(lambda x: x.strftime('%Y-%m'))
df3_monthly = df3_monthly.rename(columns={'visit_date': 'visit_month'})
df3_monthly = df3_monthly.groupby(['visit_month'])['visitors'].sum().reset_index()
df3_monthly.head(3)

	visit_month	visitors
0	2016-01	152924
1	2016-02	159934
2	2016-03	189292

# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df3_monthly.visitors.as_matrix(), tick_label=df3_monthly.visit_month, align="center", color='navy')
plt.title("Monthly_air_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("visitors")
plt.grid(True)

f:id:select_from_where:20180217221013p:plain

２−２．一日の訪問者数

#  訪問予定月を追加
df3_monthly_diff = df3.copy()
df3_monthly_diff['visit_month'] = df3_monthly_diff.visit_date.map(lambda x: x.strftime('%Y-%m'))

df3_monthly_diff.tail(3)

	air_store_id	visit_date	visitors	visit_month
252105	air_24e8414b9b07decb	2017-04-20	7	2017-04
252106	air_24e8414b9b07decb	2017-04-21	8	2017-04
252107	air_24e8414b9b07decb	2017-04-22	5	2017-04

plt.figure(figsize=(14, 8))
sns.boxplot(x="visit_month", y="visitors", data=df3_monthly_diff)

f:id:select_from_where:20180217221010p:plain

２−３．予約データと並べて予約率を見てみる

※予約と実訪問日をキーに結合した値ではないため、参考程度に

# 棒グラフと折れ線グラフで描画する
height1 = df1_monthly.reserve_visitors.as_matrix()
height2 = df3_monthly.visitors.as_matrix()
line_height = (df1_monthly.reserve_visitors / df3_monthly.visitors * 100).as_matrix()
w1 = 0.4
w2 = w1 / 2
left1 = np.arange(len(height1))
left2 = np.arange(len(height2))

fig, ax1 = plt.subplots(figsize=(14, 8))
p1 = ax1.bar(left1, height1, align='center', width=w1, tick_label=df1_monthly.visit_month, color='deepskyblue')
p2 = ax1.bar(left2 + w1, height2, align='center', width=w1, tick_label=df3_monthly.visit_month, color='navy')
ax1.set_ylabel('Num of people')

ax2 = ax1.twinx()
p3 = ax2.plot(left1 + w2, line_height, color="red")
ax2.set_ylabel('Reserve Rate')
ax2.set_ylim(0, np.nanmax(line_height)*1.75)

plt.legend((p1[0], p2[0], p3[0]), ("reserve", "visit", "reserve_rate"), loc='upper right', bbox_to_anchor=(0.7, 0.5, 0.5, 0.5))

f:id:select_from_where:20180217221129p:plain

２–４．気づき

大口の予約が混ざっている

100名を超えるような大口の予約も混ざっている

予約データと訪問データで充実している期間が異なる

予約データは2016年11月以降、訪問データは2016年7月以降で充実している
- 2016年7月まで先行展開、2016年10月から本格展開の可能性は否定すべきか
- 予約機能と決済機能の展開時期が異なっていたと考えることもできる

３．air_store_info.csv [Air リザーブのレストラン情報]

829店舗
14ジャンル
103エリア

# .csvをDataFrame化
df2 = pd.read_csv('Data/csv/air_store_info.csv')
df2.head(3)

	air_store_id	air_genre_name	air_area_name	latitude	longitude
0	air_0f0cdeee6c9bf3d7	Italian/French	Hyōgo-ken Kōbe-shi Kumoidōri	34.695124	135.197852
1	air_7cc17a324ae5c7dc	Italian/French	Hyōgo-ken Kōbe-shi Kumoidōri	34.695124	135.197852
2	air_fee8dcf4d619598e	Italian/French	Hyōgo-ken Kōbe-shi Kumoidōri	34.695124	135.197852

# データ型を確認
df2.dtypes

air_store_id       object
air_genre_name     object
air_area_name      object
latitude          float64
longitude         float64
dtype: object

# 統計量を確認
df2.describe(include='all')

	air_store_id	air_genre_name	air_area_name	latitude	longitude
count	829	829	829	829.000000	829.000000
unique	829	14	103	NaN	NaN
top	air_eb5788dba285e725	Izakaya	Fukuoka-ken Fukuoka-shi Daimyō	NaN	NaN
freq	1	197	64	NaN	NaN
mean	NaN	NaN	NaN	35.647042	137.415311
std	NaN	NaN	NaN	2.084659	3.650971
min	NaN	NaN	NaN	33.211967	130.195555
25%	NaN	NaN	NaN	34.695124	135.341564
50%	NaN	NaN	NaN	35.658068	139.685474
75%	NaN	NaN	NaN	35.694003	139.751599
max	NaN	NaN	NaN	44.020632	144.273398

３−１．ジャンルごとの店舗数

# ジャンルの内訳と頻度
df2.groupby('air_genre_name').size().sort_values(ascending=False).describe()

count     14.000000
mean      59.214286
std       66.013027
min        2.000000
25%       13.250000
50%       25.000000
75%       96.250000
max      197.000000
dtype: float64

# 棒グラフで描画
df2_air_genre_name = df2.groupby('air_genre_name').size().sort_values()
plt.barh(np.arange(1,15), df2_air_genre_name.values, tick_label=df2_air_genre_name.index, align="center", color='dodgerblue')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221127p:plain

３−２．エリアごとの店舗数

# エリアの内訳と頻度
df2.groupby('air_area_name').size().sort_values(ascending=False).describe()

count    103.000000
mean       8.048544
std       11.332219
min        2.000000
25%        2.000000
50%        4.000000
75%        7.500000
max       64.000000
dtype: float64

３−２−１．市区町村ごとの店舗数

# エリアを市区町村レベルに変更、plot時の文字化け対策で置換
df2_city = df2.copy()
df2_city['air_area_name'] = df2_city.air_area_name.map(lambda x: ' '.join(x.split(' ')[:2]).translate(str.maketrans('Ōōū', 'Oou')))
df2_city.head(3)

	air_store_id	air_genre_name	air_area_name	latitude	longitude
0	air_0f0cdeee6c9bf3d7	Italian/French	Hyogo-ken Kobe-shi	34.695124	135.197852
1	air_7cc17a324ae5c7dc	Italian/French	Hyogo-ken Kobe-shi	34.695124	135.197852
2	air_fee8dcf4d619598e	Italian/French	Hyogo-ken Kobe-shi	34.695124	135.197852

# 棒グラフで描画
df2_air_area_name = df2_city.groupby('air_area_name').size().sort_values()
plt.figure(figsize=(12, 14))
plt.barh(np.arange(1,56), df2_air_area_name.values, tick_label=df2_air_area_name.index, align="center", color='aqua')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221122p:plain

３−２−２．都道府県ごとの店舗数

# エリアを都道府県レベルに変更
df2_pref = df2_city.copy()
df2_pref['air_area_name'] = df2_pref.air_area_name.map(lambda x: ' '.join(x.split(' ')[:1]))
df2_pref.head(3)

	air_store_id	air_genre_name	air_area_name	latitude	longitude
0	air_0f0cdeee6c9bf3d7	Italian/French	Hyogo-ken	34.695124	135.197852
1	air_7cc17a324ae5c7dc	Italian/French	Hyogo-ken	34.695124	135.197852
2	air_fee8dcf4d619598e	Italian/French	Hyogo-ken	34.695124	135.197852

# 棒グラフで描画
df2_air_area_name = df2_pref.groupby('air_area_name').size().sort_values()
plt.barh(np.arange(1,10), df2_air_area_name.values, tick_label=df2_air_area_name.index, align="center", color='aqua')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221117p:plain

３−３．エリア x ジャンルの分布

# データフレームをピボット
df2_pref_heat = pd.pivot_table(data=df2_pref, values='air_store_id', 
                                  columns='air_area_name', index='air_genre_name', aggfunc='count').sort_values(by='Tokyo-to',ascending=False)
df2_pref_heat = df2_pref_heat.T.sort_values(by='Izakaya',ascending=False)
df2_pref_heat.head(3)

air_genre_name	Cafe/Sweets	Izakaya	Dining bar	Italian/French	Bar/Cocktail	Japanese food	Other	Western food	Creative cuisine	Okonomiyaki/Monja/Teppanyaki	Yakiniku/Korean food	Asian	International cuisine	Karaoke/Party
air_area_name
Tokyo-to	87.0	87.0	65.0	65.0	54.0	38.0	16.0	14.0	5.0	5.0	4.0	2.0	2.0	NaN
Fukuoka-ken	45.0	27.0	14.0	11.0	9.0	11.0	5.0	NaN	2.0	NaN	3.0	NaN	NaN	NaN
Hyogo-ken	10.0	24.0	4.0	9.0	NaN	2.0	NaN	NaN	4.0	2.0	2.0	NaN	NaN	NaN

# ヒートマップを描画
plt.figure(figsize=(8, 4))
sns.heatmap(df2_pref_heat, annot=True, center=30, fmt='g', cmap='Blues')

f:id:select_from_where:20180217221113p:plain

３–４．気づき

ジャンル x エリアの分布に規則性が見られる

何らかのルールでサンプリングされていそう

４．hpg_reserve.csv [ホットペッパーグルメの予約データ]

2,000,320レコード
13,325 /4,690店舗の予約データ???
訪問予定は、2016-01-01 11:00:00 〜 2017-05-31 23:00:00
予約時刻は、2016-01-01 00:00:00 〜 2017-04-22 23:00:00

# .csvをDataFrame化
df4 = pd.read_csv('Data/csv/hpg_reserve.csv')
df4.head(3)

	hpg_store_id	visit_datetime	reserve_datetime	reserve_visitors
0	hpg_c63f6f42e088e50f	2016-01-01 11:00:00	2016-01-01 09:00:00	1
1	hpg_dac72789163a3f47	2016-01-01 13:00:00	2016-01-01 06:00:00	3
2	hpg_c8e24dcf51ca1eb5	2016-01-01 16:00:00	2016-01-01 14:00:00	2

# データ型を確認
df4.dtypes

hpg_store_id        object
visit_datetime      object
reserve_datetime    object
reserve_visitors     int64
dtype: object

# データ型を修正
df4['visit_datetime'] = df4.visit_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df4['reserve_datetime'] = df4.reserve_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df4.dtypes

hpg_store_id                object
visit_datetime      datetime64[ns]
reserve_datetime    datetime64[ns]
reserve_visitors             int64
dtype: object

# 統計量を確認
df4.describe(include='all')

	hpg_store_id	visit_datetime	reserve_datetime	reserve_visitors
count	2000320	2000320	2000320	2.000320e+06
unique	13325	9847	11450	NaN
top	hpg_2afd5b187409eeb4	2016-12-16 19:00:00	2016-12-12 21:00:00	NaN
freq	1155	10528	907	NaN
first	NaN	2016-01-01 11:00:00	2016-01-01 00:00:00	NaN
last	NaN	2017-05-31 23:00:00	2017-04-22 23:00:00	NaN
mean	NaN	NaN	NaN	5.073785e+00
std	NaN	NaN	NaN	5.416172e+00
min	NaN	NaN	NaN	1.000000e+00
25%	NaN	NaN	NaN	2.000000e+00
50%	NaN	NaN	NaN	3.000000e+00
75%	NaN	NaN	NaN	6.000000e+00
max	NaN	NaN	NaN	1.000000e+02

４−１．月間予約数

#  予約数を月次で集計
df4_monthly = df4.copy()
df4_monthly['reserve_datetime'] = df4_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['hpg_store_id','visit_datetime']
df4_monthly = df4_monthly.rename(columns={'reserve_datetime': 'reserve_month'}).drop(drop_col,axis=1)
df4_monthly = df4_monthly.groupby(['reserve_month']).count().reset_index().rename(columns={'reserve_visitors':'reserve'})
df4_monthly.head(3)

	reserve_month	reserve
0	2016-01	80905
1	2016-02	79934
2	2016-03	98276

# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df4_monthly.reserve.as_matrix(), tick_label=df4_monthly.reserve_month, align="center", color='pink')
plt.title("Monthly_hpg_reserve")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve")
plt.grid(True)

f:id:select_from_where:20180217221226p:plain

４−２．月別訪問予定者数

#  訪問予定者数を月次で集計
df4_monthly = df4.copy()
df4_monthly['visit_datetime'] = df4_monthly.visit_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['hpg_store_id','reserve_datetime']
df4_monthly = df4_monthly.rename(columns={'visit_datetime': 'visit_month'}).drop(drop_col,axis=1)
df4_monthly = df4_monthly.groupby(['visit_month'])['reserve_visitors'].sum().reset_index()
df4_monthly.head(3)

	visit_month	reserve_visitors
0	2016-01	313180
1	2016-02	359597
2	2016-03	546433

# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,18), df4_monthly.reserve_visitors.as_matrix(), tick_label=df4_monthly.visit_month, align="center", color='pink')
plt.title("Monthly_hpg_reserve_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve_visitors")
plt.grid(True)

f:id:select_from_where:20180217221223p:plain

４−３．予約日から訪問予定日までの日数

#  予約日から訪問予定日までの日数を追加
df4_monthly = df4.copy()
df4_monthly['rsrv_2_vst'] = df4_monthly.visit_datetime - df4_monthly.reserve_datetime
df4_monthly['rsrv_2_vst'] = df4_monthly.rsrv_2_vst.map(lambda x: x.days)

df4_monthly['reserve_datetime'] = df4_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
df4_monthly = df4_monthly.rename(columns={'reserve_datetime': 'reserve_month'})

df4_monthly.tail(3)

	hpg_store_id	visit_datetime	reserve_month	reserve_visitors	rsrv_2_vst
2000317	hpg_e9151de687b93aa3	2017-05-31 21:00:00	2017-04	49	47
2000318	hpg_fae6c96189b4a122	2017-05-31 21:00:00	2017-04	9	42
2000319	hpg_0b70de808b55ad1e	2017-05-31 23:00:00	2017-04	6	40

plt.figure(figsize=(14, 8))
sns.boxplot(x="reserve_month", y="rsrv_2_vst", data=df4_monthly)

f:id:select_from_where:20180217221223p:plain

４−４．気づき

2016年10月以降のデータ量が多い

純粋な増加なのか、データ収集基盤の改善か、サンプリングによるものかは不明

11月の予約から訪問までの日差は大きい

忘年会に備えた事前予約数が多い

５．hpg_store_info.csv [ホットペッパーグルメのレストラン情報]

4,690の店舗データ
34ジャンル
119エリア

# .csvをDataFrame化
df5 = pd.read_csv('Data/csv/hpg_store_info.csv')
df5.head(3)

	hpg_store_id	hpg_genre_name	hpg_area_name	latitude	longitude
0	hpg_6622b62385aec8bf	Japanese style	Tōkyō-to Setagaya-ku Taishidō	35.643675	139.668221
1	hpg_e9e068dd49c5fa00	Japanese style	Tōkyō-to Setagaya-ku Taishidō	35.643675	139.668221
2	hpg_2976f7acb4b3a3bc	Japanese style	Tōkyō-to Setagaya-ku Taishidō	35.643675	139.668221

# データ型を確認
df5.dtypes

hpg_store_id       object
hpg_genre_name     object
hpg_area_name      object
latitude          float64
longitude         float64
dtype: object

# 統計量を確認
df5.describe(include='all')

	hpg_store_id	hpg_genre_name	hpg_area_name	latitude	longitude
count	4690	4690	4690	4690.000000	4690.000000
unique	4690	34	119	NaN	NaN
top	hpg_b6fcb5aea05b867b	Japanese style	Tōkyō-to Shinjuku-ku None	NaN	NaN
freq	1	1750	257	NaN	NaN
mean	NaN	NaN	NaN	35.810261	137.675816
std	NaN	NaN	NaN	2.138755	3.197233
min	NaN	NaN	NaN	33.311645	130.339313
25%	NaN	NaN	NaN	34.692109	135.498859
50%	NaN	NaN	NaN	35.659214	139.495733
75%	NaN	NaN	NaN	35.703381	139.737998
max	NaN	NaN	NaN	43.774846	143.714585

５−１．ジャンルごとの店舗数

# ジャンルの内訳と頻度
df5.groupby('hpg_genre_name').size().sort_values(ascending=False).describe()

count      34.000000
mean      137.941176
std       323.163055
min         1.000000
25%         5.000000
50%        25.500000
75%        79.750000
max      1750.000000
dtype: float64

# 棒グラフで描画
df5_hpg_genre_name = df5.groupby('hpg_genre_name').size().sort_values()
plt.figure(figsize=(8, 10))
plt.barh(np.arange(1,35), df5_hpg_genre_name.values, tick_label=df5_hpg_genre_name.index, align="center", color='orange')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221218p:plain

５−２．エリアごとの店舗数

# エリアの内訳と頻度
df5.groupby('hpg_area_name').size().sort_values(ascending=False).describe()

count    119.000000
mean      39.411765
std       45.887004
min        1.000000
25%        8.000000
50%       20.000000
75%       58.000000
max      257.000000
dtype: float64

５−２−１．市区町村ごとの店舗数

# エリアを市区町村レベルに変更、plot時の文字化け対策で置換
df5_city = df5.copy()
df5_city['hpg_area_name'] = df5_city.hpg_area_name.map(lambda x: ' '.join(x.split(' ')[:2]).translate(str.maketrans('Ōōū', 'Oou')))
df5_city.head(3)

	hpg_store_id	hpg_genre_name	hpg_area_name	latitude	longitude
0	hpg_6622b62385aec8bf	Japanese style	Tokyo-to Setagaya-ku	35.643675	139.668221
1	hpg_e9e068dd49c5fa00	Japanese style	Tokyo-to Setagaya-ku	35.643675	139.668221
2	hpg_2976f7acb4b3a3bc	Japanese style	Tokyo-to Setagaya-ku	35.643675	139.668221

# 棒グラフで描画
df5_hpg_area_name = df5_city.groupby('hpg_area_name').size().sort_values()
plt.figure(figsize=(12, 22))
plt.barh(np.arange(1,74), df5_hpg_area_name.values, tick_label=df5_hpg_area_name.index, align="center", color='yellow')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221213p:plain

５−２−２．都道府県ごとの店舗数

# エリアを都道府県レベルに変更
df5_pref = df5_city.copy()
df5_pref['hpg_area_name'] = df5_pref.hpg_area_name.map(lambda x: ' '.join(x.split(' ')[:1]))
df5_pref.head(3)

	hpg_store_id	hpg_genre_name	hpg_area_name	latitude	longitude
0	hpg_6622b62385aec8bf	Japanese style	Tokyo-to	35.643675	139.668221
1	hpg_e9e068dd49c5fa00	Japanese style	Tokyo-to	35.643675	139.668221
2	hpg_2976f7acb4b3a3bc	Japanese style	Tokyo-to	35.643675	139.668221

# 棒グラフで描画
df5_hpg_area_name = df5_pref.groupby('hpg_area_name').size().sort_values()
plt.barh(np.arange(1,14), df5_hpg_area_name.values, tick_label=df5_hpg_area_name.index, align="center", color='yellow')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221347p:plain

５−３．エリア x ジャンルの分布

# データフレームをピボット
df5_pref_heat = pd.pivot_table(data=df5_pref, values='hpg_store_id', 
                                  columns='hpg_area_name', index='hpg_genre_name', aggfunc='count').sort_values(by='Tokyo-to',ascending=False)
df5_pref_heat = df5_pref_heat.T.sort_values(by='Japanese style',ascending=False)
df5_pref_heat.head()

hpg_genre_name	Japanese style	International cuisine	Creation	Seafood	Grilled meat	Italian	Spain Bar/Italian Bar	Chinese general	Karaoke	Korean cuisine	...	Dim Sum/Dumplings	Spain/Mediterranean cuisine	Sushi	Sweets	Taiwanese/Hong Kong cuisine	Western food	Amusement bar	Bar/Cocktail	Cantonese food	Udon/Soba
hpg_area_name
Tokyo-to	805.0	370.0	152.0	121.0	118.0	116.0	109.0	45.0	28.0	27.0	...	1.0	1.0	1.0	1.0	1.0	1.0	NaN	NaN	NaN	NaN
Osaka-fu	196.0	44.0	63.0	35.0	57.0	36.0	23.0	9.0	3.0	4.0	...	NaN	NaN	1.0	1.0	NaN	1.0	NaN	NaN	NaN	NaN
Fukuoka-ken	143.0	65.0	42.0	32.0	41.0	19.0	18.0	4.0	2.0	NaN	...	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Hyogo-ken	106.0	29.0	30.0	24.0	35.0	13.0	8.0	13.0	1.0	4.0	...	NaN	NaN	1.0	NaN	NaN	1.0	NaN	1.0	4.0	NaN
Shizuoka-ken	92.0	32.0	22.0	24.0	7.0	10.0	11.0	3.0	2.0	NaN	...	NaN	NaN	4.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 34 columns

# ヒートマップを描画
plt.figure(figsize=(18, 6))
sns.heatmap(df5_pref_heat, annot=True, fmt='g', center=300, cmap='Oranges')

f:id:select_from_where:20180217221343p:plain

５–４．気づき

サンプリングされているとして、均一に取られていそう

ランダムにサンプリングか

６．store_id_relation.csv [id横断データ]

150店舗、１対１

# .csvをDataFrame化
df7 = pd.read_csv('Data/csv/store_id_relation.csv')
df7.head(3)

	air_store_id	hpg_store_id
0	air_63b13c56b7201bd9	hpg_4bc649e72e2a239a
1	air_a24bf50c3e90d583	hpg_c34b496d0305a809
2	air_c7f78b4f3cba33ff	hpg_cd8ae0d9bbd58ff9

# データ型を確認
df7.dtypes

air_store_id    object
hpg_store_id    object
dtype: object

# 統計量を確認
df7.describe()

	air_store_id	hpg_store_id
count	150	150
unique	150	150
top	air_622375b4815cf5cb	hpg_eaa312b3100ae94b
freq	1	1

７．date_info.csv [カレンダー]

2016-01-01〜2017-05-31

# .csvをDataFrame化
df6 = pd.read_csv('Data/csv/date_info.csv')
df6.head(3)

	calendar_date	day_of_week	holiday_flg
0	2016-01-01	Friday	1
1	2016-01-02	Saturday	1
2	2016-01-03	Sunday	1

# データ型を確認
df6.dtypes

calendar_date    object
day_of_week      object
holiday_flg       int64
dtype: object

# データ型を修正
df6['calendar_date'] = df6.calendar_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
df6.dtypes

calendar_date    datetime64[ns]
day_of_week              object
holiday_flg               int64
dtype: object

# 統計量を確認
df6.describe(include='all')

	calendar_date	day_of_week	holiday_flg
count	517	517	517.000000
unique	517	7	NaN
top	2016-09-08 00:00:00	Sunday	NaN
freq	1	74	NaN
first	2016-01-01 00:00:00	NaN	NaN
last	2017-05-31 00:00:00	NaN	NaN
mean	NaN	NaN	0.067698
std	NaN	NaN	0.251471
min	NaN	NaN	0.000000
25%	NaN	NaN	0.000000
50%	NaN	NaN	0.000000
75%	NaN	NaN	0.000000
max	NaN	NaN	1.000000

続きはこちら

select-from-where.hatenablog.com