かねてから挑戦して見たかったKaggleに挑戦する(つもりだった...)
そのコンペは終了してしまったため、Submitはできないのだが、せっかく着手していたのでやりきりたい。
対象のコンペは、Recruit Restaurant Visitor Forecasting。
飲食店の来客数を予想する。
Recruit Restaurant Visitor Forecasting | Kaggle
これから数回に分けて、基礎分析から機械学習を用いた予測まで、一連の作業を進めていく。
最新のソースはGitHubで公開、随時更新していく。
github.com
今回は与えられたデータの確認1回目。
与えられたcsvデータどうしを結合せずに分析していきます。
0.sample_submission.csv [Submitサンプル]
- store_idに、アンダースコア、訪問日を付与する必要がある
- 2017-04-23〜2017-05-31が予測対象期間
- visitorsに予測値を出力する
import numpy as np
import pandas as pd
from pandas import DataFrame
from datetime import datetime, timedelta
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sample = pd.read_csv('Data/csv/sample_submission.csv')
sample.head(3)
|
id |
visitors |
0 |
air_00a91d42b08b08d9_2017-04-23 |
0 |
1 |
air_00a91d42b08b08d9_2017-04-24 |
0 |
2 |
air_00a91d42b08b08d9_2017-04-25 |
0 |
sample_edit = sample.copy()
sample_edit['air_store_id'] = sample_edit.id.map(lambda x: '_'.join(x.split('_')[:-1]))
sample_edit['calendar_date'] = sample_edit.id.map(lambda x: x.split('_')[2])
sample_edit['calendar_date'] = sample_edit.calendar_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
sample_edit.drop('id',axis=1)
sample_edit = sample_edit.loc[:,['air_store_id','calendar_date','visitors']]
sample_edit.head(3)
|
air_store_id |
calendar_date |
visitors |
0 |
air_00a91d42b08b08d9 |
2017-04-23 |
0 |
1 |
air_00a91d42b08b08d9 |
2017-04-24 |
0 |
2 |
air_00a91d42b08b08d9 |
2017-04-25 |
0 |
sample_edit.dtypes
air_store_id object
calendar_date datetime64[ns]
visitors int64
dtype: object
sample_edit.describe(include='all')
|
air_store_id |
calendar_date |
visitors |
count |
32019 |
32019 |
32019.0 |
unique |
821 |
39 |
NaN |
top |
air_b88192b35ac03c24 |
2017-05-26 00:00:00 |
NaN |
freq |
39 |
821 |
NaN |
first |
NaN |
2017-04-23 00:00:00 |
NaN |
last |
NaN |
2017-05-31 00:00:00 |
NaN |
mean |
NaN |
NaN |
0.0 |
std |
NaN |
NaN |
0.0 |
min |
NaN |
NaN |
0.0 |
25% |
NaN |
NaN |
0.0 |
50% |
NaN |
NaN |
0.0 |
75% |
NaN |
NaN |
0.0 |
max |
NaN |
NaN |
0.0 |
- 92,378レコード
- 314 /829店舗分の予約データ
- 訪問予定は、2016-01-01 19:00:00 〜 2017-05-31 21:00:00
- 予約時刻は、2016-01-01 01:00:00 〜 2017-04-22 23:00:00
df1 = pd.read_csv('Data/csv/air_reserve.csv')
df1.head(3)
|
air_store_id |
visit_datetime |
reserve_datetime |
reserve_visitors |
0 |
air_877f79706adbfb06 |
2016-01-01 19:00:00 |
2016-01-01 16:00:00 |
1 |
1 |
air_db4b38ebe7a7ceff |
2016-01-01 19:00:00 |
2016-01-01 19:00:00 |
3 |
2 |
air_db4b38ebe7a7ceff |
2016-01-01 19:00:00 |
2016-01-01 19:00:00 |
6 |
df1.dtypes
air_store_id object
visit_datetime object
reserve_datetime object
reserve_visitors int64
dtype: object
df1['visit_datetime'] = df1.visit_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df1['reserve_datetime'] = df1.reserve_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df1.dtypes
air_store_id object
visit_datetime datetime64[ns]
reserve_datetime datetime64[ns]
reserve_visitors int64
dtype: object
df1.describe(include='all')
|
air_store_id |
visit_datetime |
reserve_datetime |
reserve_visitors |
count |
92378 |
92378 |
92378 |
92378.000000 |
unique |
314 |
4975 |
7513 |
NaN |
top |
air_8093d0b565e9dbdf |
2016-12-24 19:00:00 |
2016-11-24 18:00:00 |
NaN |
freq |
2263 |
255 |
106 |
NaN |
first |
NaN |
2016-01-01 19:00:00 |
2016-01-01 01:00:00 |
NaN |
last |
NaN |
2017-05-31 21:00:00 |
2017-04-22 23:00:00 |
NaN |
mean |
NaN |
NaN |
NaN |
4.481749 |
std |
NaN |
NaN |
NaN |
4.919669 |
min |
NaN |
NaN |
NaN |
1.000000 |
25% |
NaN |
NaN |
NaN |
2.000000 |
50% |
NaN |
NaN |
NaN |
3.000000 |
75% |
NaN |
NaN |
NaN |
5.000000 |
max |
NaN |
NaN |
NaN |
100.000000 |
1−1.月間予約数
df1_monthly = df1.copy()
df1_monthly['reserve_datetime'] = df1_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['air_store_id','visit_datetime']
df1_monthly = df1_monthly.rename(columns={'reserve_datetime': 'reserve_month'}).drop(drop_col,axis=1)
df1_monthly = df1_monthly.groupby(['reserve_month']).count().reset_index().rename(columns={'reserve_visitors':'reserve'})
df1_monthly.head(3)
|
reserve_month |
reserve |
0 |
2016-01 |
2936 |
1 |
2016-02 |
2999 |
2 |
2016-03 |
3643 |
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df1_monthly.reserve.as_matrix(), tick_label=df1_monthly.reserve_month, align="center", color='deepskyblue')
plt.title("Monthly_air_reserve")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve")
plt.grid(True)

1−2.月別訪問予定者数
df1_monthly = df1.copy()
df1_monthly['visit_datetime'] = df1_monthly.visit_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['air_store_id','reserve_datetime']
df1_monthly = df1_monthly.rename(columns={'visit_datetime': 'visit_month'}).drop(drop_col,axis=1)
df1_monthly = df1_monthly.groupby(['visit_month'])['reserve_visitors'].sum().reset_index()
df1_monthly.head(3)
|
visit_month |
reserve_visitors |
0 |
2016-01 |
10335 |
1 |
2016-02 |
11579 |
2 |
2016-03 |
17071 |
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,18), df1_monthly.reserve_visitors.as_matrix(), tick_label=df1_monthly.visit_month, align="center", color='deepskyblue')
plt.title("Monthly_air_reserve_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve_visitors")
plt.grid(True)

1−3.予約日から訪問予定日までの日数
df1_monthly_diff = df1.copy()
df1_monthly_diff['rsrv_2_vst'] = df1_monthly_diff.visit_datetime - df1_monthly_diff.reserve_datetime
df1_monthly_diff['rsrv_2_vst'] = df1_monthly_diff.rsrv_2_vst.map(lambda x: x.days)
df1_monthly_diff['reserve_datetime'] = df1_monthly_diff.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
df1_monthly_diff = df1_monthly_diff.rename(columns={'reserve_datetime': 'reserve_month'})
df1_monthly_diff.tail(3)
|
air_store_id |
visit_datetime |
reserve_month |
reserve_visitors |
rsrv_2_vst |
92375 |
air_877f79706adbfb06 |
2017-05-31 20:00:00 |
2017-04 |
3 |
48 |
92376 |
air_900d755ebd2f7bbd |
2017-05-31 20:00:00 |
2017-04 |
10 |
42 |
92377 |
air_3cad29d1a23209d2 |
2017-05-31 21:00:00 |
2017-04 |
3 |
39 |
plt.figure(figsize=(14, 8))
sns.boxplot(x="reserve_month", y="rsrv_2_vst", data=df1_monthly_diff)

1−4.気づき
2016年9月以前の訪問者数がそれ以降に比べて少ない
- 2016年7月まで先行展開、2016年10月から本格展開の可能性がある
- データがサンプリングされている可能性がある
忘年会、納会シーズンの予約が多い一方、歓迎会シーズンはそれほどでもないように見える
5月以降の訪問予定になっている予約データはおそらく、意図的に除外されている
- 1年前、10か月前から予約が可能な人気店がある
- 5月以降の訪問予定になっている予約データはおそらく、意図的に除外されている(除外されていないレコードも5,000弱存在はする)
- 252,108レコード
- 829店舗全ての履歴データがある
- 訪問日時は、2016-01-01 〜 2017-04-22
df3 = pd.read_csv('Data/csv/air_visit_data.csv')
df3.head(3)
|
air_store_id |
visit_date |
visitors |
0 |
air_ba937bf13d40fb24 |
2016-01-13 |
25 |
1 |
air_ba937bf13d40fb24 |
2016-01-14 |
32 |
2 |
air_ba937bf13d40fb24 |
2016-01-15 |
29 |
df3.dtypes
air_store_id object
visit_date object
visitors int64
dtype: object
df3['visit_date'] = df3.visit_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
df3.dtypes
air_store_id object
visit_date datetime64[ns]
visitors int64
dtype: object
df3.describe(include='all')
|
air_store_id |
visit_date |
visitors |
count |
252108 |
252108 |
252108.000000 |
unique |
829 |
478 |
NaN |
top |
air_5c817ef28f236bdf |
2017-03-17 00:00:00 |
NaN |
freq |
477 |
799 |
NaN |
first |
NaN |
2016-01-01 00:00:00 |
NaN |
last |
NaN |
2017-04-22 00:00:00 |
NaN |
mean |
NaN |
NaN |
20.973761 |
std |
NaN |
NaN |
16.757007 |
min |
NaN |
NaN |
1.000000 |
25% |
NaN |
NaN |
9.000000 |
50% |
NaN |
NaN |
17.000000 |
75% |
NaN |
NaN |
29.000000 |
max |
NaN |
NaN |
877.000000 |
2−1.月間訪問者数
df3_monthly = df3.copy()
df3_monthly['visit_date'] = df3_monthly.visit_date.map(lambda x: x.strftime('%Y-%m'))
df3_monthly = df3_monthly.rename(columns={'visit_date': 'visit_month'})
df3_monthly = df3_monthly.groupby(['visit_month'])['visitors'].sum().reset_index()
df3_monthly.head(3)
|
visit_month |
visitors |
0 |
2016-01 |
152924 |
1 |
2016-02 |
159934 |
2 |
2016-03 |
189292 |
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df3_monthly.visitors.as_matrix(), tick_label=df3_monthly.visit_month, align="center", color='navy')
plt.title("Monthly_air_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("visitors")
plt.grid(True)

2−2.一日の訪問者数
df3_monthly_diff = df3.copy()
df3_monthly_diff['visit_month'] = df3_monthly_diff.visit_date.map(lambda x: x.strftime('%Y-%m'))
df3_monthly_diff.tail(3)
|
air_store_id |
visit_date |
visitors |
visit_month |
252105 |
air_24e8414b9b07decb |
2017-04-20 |
7 |
2017-04 |
252106 |
air_24e8414b9b07decb |
2017-04-21 |
8 |
2017-04 |
252107 |
air_24e8414b9b07decb |
2017-04-22 |
5 |
2017-04 |
plt.figure(figsize=(14, 8))
sns.boxplot(x="visit_month", y="visitors", data=df3_monthly_diff)

2−3.予約データと並べて予約率を見てみる
※予約と実訪問日をキーに結合した値ではないため、参考程度に
height1 = df1_monthly.reserve_visitors.as_matrix()
height2 = df3_monthly.visitors.as_matrix()
line_height = (df1_monthly.reserve_visitors / df3_monthly.visitors * 100).as_matrix()
w1 = 0.4
w2 = w1 / 2
left1 = np.arange(len(height1))
left2 = np.arange(len(height2))
fig, ax1 = plt.subplots(figsize=(14, 8))
p1 = ax1.bar(left1, height1, align='center', width=w1, tick_label=df1_monthly.visit_month, color='deepskyblue')
p2 = ax1.bar(left2 + w1, height2, align='center', width=w1, tick_label=df3_monthly.visit_month, color='navy')
ax1.set_ylabel('Num of people')
ax2 = ax1.twinx()
p3 = ax2.plot(left1 + w2, line_height, color="red")
ax2.set_ylabel('Reserve Rate')
ax2.set_ylim(0, np.nanmax(line_height)*1.75)
plt.legend((p1[0], p2[0], p3[0]), ("reserve", "visit", "reserve_rate"), loc='upper right', bbox_to_anchor=(0.7, 0.5, 0.5, 0.5))

2–4.気づき
大口の予約が混ざっている
予約データと訪問データで充実している期間が異なる
- 予約データは2016年11月以降、訪問データは2016年7月以降で充実している
- 2016年7月まで先行展開、2016年10月から本格展開の可能性は否定すべきか
- 予約機能と決済機能の展開時期が異なっていたと考えることもできる
df2 = pd.read_csv('Data/csv/air_store_info.csv')
df2.head(3)
|
air_store_id |
air_genre_name |
air_area_name |
latitude |
longitude |
0 |
air_0f0cdeee6c9bf3d7 |
Italian/French |
Hyōgo-ken Kōbe-shi Kumoidōri |
34.695124 |
135.197852 |
1 |
air_7cc17a324ae5c7dc |
Italian/French |
Hyōgo-ken Kōbe-shi Kumoidōri |
34.695124 |
135.197852 |
2 |
air_fee8dcf4d619598e |
Italian/French |
Hyōgo-ken Kōbe-shi Kumoidōri |
34.695124 |
135.197852 |
df2.dtypes
air_store_id object
air_genre_name object
air_area_name object
latitude float64
longitude float64
dtype: object
df2.describe(include='all')
|
air_store_id |
air_genre_name |
air_area_name |
latitude |
longitude |
count |
829 |
829 |
829 |
829.000000 |
829.000000 |
unique |
829 |
14 |
103 |
NaN |
NaN |
top |
air_eb5788dba285e725 |
Izakaya |
Fukuoka-ken Fukuoka-shi Daimyō |
NaN |
NaN |
freq |
1 |
197 |
64 |
NaN |
NaN |
mean |
NaN |
NaN |
NaN |
35.647042 |
137.415311 |
std |
NaN |
NaN |
NaN |
2.084659 |
3.650971 |
min |
NaN |
NaN |
NaN |
33.211967 |
130.195555 |
25% |
NaN |
NaN |
NaN |
34.695124 |
135.341564 |
50% |
NaN |
NaN |
NaN |
35.658068 |
139.685474 |
75% |
NaN |
NaN |
NaN |
35.694003 |
139.751599 |
max |
NaN |
NaN |
NaN |
44.020632 |
144.273398 |
3−1.ジャンルごとの店舗数
df2.groupby('air_genre_name').size().sort_values(ascending=False).describe()
count 14.000000
mean 59.214286
std 66.013027
min 2.000000
25% 13.250000
50% 25.000000
75% 96.250000
max 197.000000
dtype: float64
df2_air_genre_name = df2.groupby('air_genre_name').size().sort_values()
plt.barh(np.arange(1,15), df2_air_genre_name.values, tick_label=df2_air_genre_name.index, align="center", color='dodgerblue')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

3−2.エリアごとの店舗数
df2.groupby('air_area_name').size().sort_values(ascending=False).describe()
count 103.000000
mean 8.048544
std 11.332219
min 2.000000
25% 2.000000
50% 4.000000
75% 7.500000
max 64.000000
dtype: float64
3−2−1.市区町村ごとの店舗数
df2_city = df2.copy()
df2_city['air_area_name'] = df2_city.air_area_name.map(lambda x: ' '.join(x.split(' ')[:2]).translate(str.maketrans('Ōōū', 'Oou')))
df2_city.head(3)
|
air_store_id |
air_genre_name |
air_area_name |
latitude |
longitude |
0 |
air_0f0cdeee6c9bf3d7 |
Italian/French |
Hyogo-ken Kobe-shi |
34.695124 |
135.197852 |
1 |
air_7cc17a324ae5c7dc |
Italian/French |
Hyogo-ken Kobe-shi |
34.695124 |
135.197852 |
2 |
air_fee8dcf4d619598e |
Italian/French |
Hyogo-ken Kobe-shi |
34.695124 |
135.197852 |
df2_air_area_name = df2_city.groupby('air_area_name').size().sort_values()
plt.figure(figsize=(12, 14))
plt.barh(np.arange(1,56), df2_air_area_name.values, tick_label=df2_air_area_name.index, align="center", color='aqua')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

3−2−2.都道府県ごとの店舗数
df2_pref = df2_city.copy()
df2_pref['air_area_name'] = df2_pref.air_area_name.map(lambda x: ' '.join(x.split(' ')[:1]))
df2_pref.head(3)
|
air_store_id |
air_genre_name |
air_area_name |
latitude |
longitude |
0 |
air_0f0cdeee6c9bf3d7 |
Italian/French |
Hyogo-ken |
34.695124 |
135.197852 |
1 |
air_7cc17a324ae5c7dc |
Italian/French |
Hyogo-ken |
34.695124 |
135.197852 |
2 |
air_fee8dcf4d619598e |
Italian/French |
Hyogo-ken |
34.695124 |
135.197852 |
df2_air_area_name = df2_pref.groupby('air_area_name').size().sort_values()
plt.barh(np.arange(1,10), df2_air_area_name.values, tick_label=df2_air_area_name.index, align="center", color='aqua')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

3−3.エリア x ジャンルの分布
df2_pref_heat = pd.pivot_table(data=df2_pref, values='air_store_id',
columns='air_area_name', index='air_genre_name', aggfunc='count').sort_values(by='Tokyo-to',ascending=False)
df2_pref_heat = df2_pref_heat.T.sort_values(by='Izakaya',ascending=False)
df2_pref_heat.head(3)
air_genre_name |
Cafe/Sweets |
Izakaya |
Dining bar |
Italian/French |
Bar/Cocktail |
Japanese food |
Other |
Western food |
Creative cuisine |
Okonomiyaki/Monja/Teppanyaki |
Yakiniku/Korean food |
Asian |
International cuisine |
Karaoke/Party |
air_area_name |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Tokyo-to |
87.0 |
87.0 |
65.0 |
65.0 |
54.0 |
38.0 |
16.0 |
14.0 |
5.0 |
5.0 |
4.0 |
2.0 |
2.0 |
NaN |
Fukuoka-ken |
45.0 |
27.0 |
14.0 |
11.0 |
9.0 |
11.0 |
5.0 |
NaN |
2.0 |
NaN |
3.0 |
NaN |
NaN |
NaN |
Hyogo-ken |
10.0 |
24.0 |
4.0 |
9.0 |
NaN |
2.0 |
NaN |
NaN |
4.0 |
2.0 |
2.0 |
NaN |
NaN |
NaN |
plt.figure(figsize=(8, 4))
sns.heatmap(df2_pref_heat, annot=True, center=30, fmt='g', cmap='Blues')

3–4.気づき
ジャンル x エリアの分布に規則性が見られる
4.hpg_reserve.csv [ホットペッパーグルメの予約データ]
- 2,000,320レコード
- 13,325 /4,690店舗の予約データ???
- 訪問予定は、2016-01-01 11:00:00 〜 2017-05-31 23:00:00
- 予約時刻は、2016-01-01 00:00:00 〜 2017-04-22 23:00:00
df4 = pd.read_csv('Data/csv/hpg_reserve.csv')
df4.head(3)
|
hpg_store_id |
visit_datetime |
reserve_datetime |
reserve_visitors |
0 |
hpg_c63f6f42e088e50f |
2016-01-01 11:00:00 |
2016-01-01 09:00:00 |
1 |
1 |
hpg_dac72789163a3f47 |
2016-01-01 13:00:00 |
2016-01-01 06:00:00 |
3 |
2 |
hpg_c8e24dcf51ca1eb5 |
2016-01-01 16:00:00 |
2016-01-01 14:00:00 |
2 |
df4.dtypes
hpg_store_id object
visit_datetime object
reserve_datetime object
reserve_visitors int64
dtype: object
df4['visit_datetime'] = df4.visit_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df4['reserve_datetime'] = df4.reserve_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df4.dtypes
hpg_store_id object
visit_datetime datetime64[ns]
reserve_datetime datetime64[ns]
reserve_visitors int64
dtype: object
df4.describe(include='all')
|
hpg_store_id |
visit_datetime |
reserve_datetime |
reserve_visitors |
count |
2000320 |
2000320 |
2000320 |
2.000320e+06 |
unique |
13325 |
9847 |
11450 |
NaN |
top |
hpg_2afd5b187409eeb4 |
2016-12-16 19:00:00 |
2016-12-12 21:00:00 |
NaN |
freq |
1155 |
10528 |
907 |
NaN |
first |
NaN |
2016-01-01 11:00:00 |
2016-01-01 00:00:00 |
NaN |
last |
NaN |
2017-05-31 23:00:00 |
2017-04-22 23:00:00 |
NaN |
mean |
NaN |
NaN |
NaN |
5.073785e+00 |
std |
NaN |
NaN |
NaN |
5.416172e+00 |
min |
NaN |
NaN |
NaN |
1.000000e+00 |
25% |
NaN |
NaN |
NaN |
2.000000e+00 |
50% |
NaN |
NaN |
NaN |
3.000000e+00 |
75% |
NaN |
NaN |
NaN |
6.000000e+00 |
max |
NaN |
NaN |
NaN |
1.000000e+02 |
4−1.月間予約数
df4_monthly = df4.copy()
df4_monthly['reserve_datetime'] = df4_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['hpg_store_id','visit_datetime']
df4_monthly = df4_monthly.rename(columns={'reserve_datetime': 'reserve_month'}).drop(drop_col,axis=1)
df4_monthly = df4_monthly.groupby(['reserve_month']).count().reset_index().rename(columns={'reserve_visitors':'reserve'})
df4_monthly.head(3)
|
reserve_month |
reserve |
0 |
2016-01 |
80905 |
1 |
2016-02 |
79934 |
2 |
2016-03 |
98276 |
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df4_monthly.reserve.as_matrix(), tick_label=df4_monthly.reserve_month, align="center", color='pink')
plt.title("Monthly_hpg_reserve")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve")
plt.grid(True)

4−2.月別訪問予定者数
df4_monthly = df4.copy()
df4_monthly['visit_datetime'] = df4_monthly.visit_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['hpg_store_id','reserve_datetime']
df4_monthly = df4_monthly.rename(columns={'visit_datetime': 'visit_month'}).drop(drop_col,axis=1)
df4_monthly = df4_monthly.groupby(['visit_month'])['reserve_visitors'].sum().reset_index()
df4_monthly.head(3)
|
visit_month |
reserve_visitors |
0 |
2016-01 |
313180 |
1 |
2016-02 |
359597 |
2 |
2016-03 |
546433 |
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,18), df4_monthly.reserve_visitors.as_matrix(), tick_label=df4_monthly.visit_month, align="center", color='pink')
plt.title("Monthly_hpg_reserve_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve_visitors")
plt.grid(True)

4−3.予約日から訪問予定日までの日数
df4_monthly = df4.copy()
df4_monthly['rsrv_2_vst'] = df4_monthly.visit_datetime - df4_monthly.reserve_datetime
df4_monthly['rsrv_2_vst'] = df4_monthly.rsrv_2_vst.map(lambda x: x.days)
df4_monthly['reserve_datetime'] = df4_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
df4_monthly = df4_monthly.rename(columns={'reserve_datetime': 'reserve_month'})
df4_monthly.tail(3)
|
hpg_store_id |
visit_datetime |
reserve_month |
reserve_visitors |
rsrv_2_vst |
2000317 |
hpg_e9151de687b93aa3 |
2017-05-31 21:00:00 |
2017-04 |
49 |
47 |
2000318 |
hpg_fae6c96189b4a122 |
2017-05-31 21:00:00 |
2017-04 |
9 |
42 |
2000319 |
hpg_0b70de808b55ad1e |
2017-05-31 23:00:00 |
2017-04 |
6 |
40 |
plt.figure(figsize=(14, 8))
sns.boxplot(x="reserve_month", y="rsrv_2_vst", data=df4_monthly)

4−4.気づき
2016年10月以降のデータ量が多い
- 純粋な増加なのか、データ収集基盤の改善か、サンプリングによるものかは不明
11月の予約から訪問までの日差は大きい
5.hpg_store_info.csv [ホットペッパーグルメのレストラン情報]
- 4,690の店舗データ
- 34ジャンル
- 119エリア
df5 = pd.read_csv('Data/csv/hpg_store_info.csv')
df5.head(3)
|
hpg_store_id |
hpg_genre_name |
hpg_area_name |
latitude |
longitude |
0 |
hpg_6622b62385aec8bf |
Japanese style |
Tōkyō-to Setagaya-ku Taishidō |
35.643675 |
139.668221 |
1 |
hpg_e9e068dd49c5fa00 |
Japanese style |
Tōkyō-to Setagaya-ku Taishidō |
35.643675 |
139.668221 |
2 |
hpg_2976f7acb4b3a3bc |
Japanese style |
Tōkyō-to Setagaya-ku Taishidō |
35.643675 |
139.668221 |
df5.dtypes
hpg_store_id object
hpg_genre_name object
hpg_area_name object
latitude float64
longitude float64
dtype: object
df5.describe(include='all')
|
hpg_store_id |
hpg_genre_name |
hpg_area_name |
latitude |
longitude |
count |
4690 |
4690 |
4690 |
4690.000000 |
4690.000000 |
unique |
4690 |
34 |
119 |
NaN |
NaN |
top |
hpg_b6fcb5aea05b867b |
Japanese style |
Tōkyō-to Shinjuku-ku None |
NaN |
NaN |
freq |
1 |
1750 |
257 |
NaN |
NaN |
mean |
NaN |
NaN |
NaN |
35.810261 |
137.675816 |
std |
NaN |
NaN |
NaN |
2.138755 |
3.197233 |
min |
NaN |
NaN |
NaN |
33.311645 |
130.339313 |
25% |
NaN |
NaN |
NaN |
34.692109 |
135.498859 |
50% |
NaN |
NaN |
NaN |
35.659214 |
139.495733 |
75% |
NaN |
NaN |
NaN |
35.703381 |
139.737998 |
max |
NaN |
NaN |
NaN |
43.774846 |
143.714585 |
5−1.ジャンルごとの店舗数
df5.groupby('hpg_genre_name').size().sort_values(ascending=False).describe()
count 34.000000
mean 137.941176
std 323.163055
min 1.000000
25% 5.000000
50% 25.500000
75% 79.750000
max 1750.000000
dtype: float64
df5_hpg_genre_name = df5.groupby('hpg_genre_name').size().sort_values()
plt.figure(figsize=(8, 10))
plt.barh(np.arange(1,35), df5_hpg_genre_name.values, tick_label=df5_hpg_genre_name.index, align="center", color='orange')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

5−2.エリアごとの店舗数
df5.groupby('hpg_area_name').size().sort_values(ascending=False).describe()
count 119.000000
mean 39.411765
std 45.887004
min 1.000000
25% 8.000000
50% 20.000000
75% 58.000000
max 257.000000
dtype: float64
5−2−1.市区町村ごとの店舗数
df5_city = df5.copy()
df5_city['hpg_area_name'] = df5_city.hpg_area_name.map(lambda x: ' '.join(x.split(' ')[:2]).translate(str.maketrans('Ōōū', 'Oou')))
df5_city.head(3)
|
hpg_store_id |
hpg_genre_name |
hpg_area_name |
latitude |
longitude |
0 |
hpg_6622b62385aec8bf |
Japanese style |
Tokyo-to Setagaya-ku |
35.643675 |
139.668221 |
1 |
hpg_e9e068dd49c5fa00 |
Japanese style |
Tokyo-to Setagaya-ku |
35.643675 |
139.668221 |
2 |
hpg_2976f7acb4b3a3bc |
Japanese style |
Tokyo-to Setagaya-ku |
35.643675 |
139.668221 |
df5_hpg_area_name = df5_city.groupby('hpg_area_name').size().sort_values()
plt.figure(figsize=(12, 22))
plt.barh(np.arange(1,74), df5_hpg_area_name.values, tick_label=df5_hpg_area_name.index, align="center", color='yellow')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

5−2−2.都道府県ごとの店舗数
df5_pref = df5_city.copy()
df5_pref['hpg_area_name'] = df5_pref.hpg_area_name.map(lambda x: ' '.join(x.split(' ')[:1]))
df5_pref.head(3)
|
hpg_store_id |
hpg_genre_name |
hpg_area_name |
latitude |
longitude |
0 |
hpg_6622b62385aec8bf |
Japanese style |
Tokyo-to |
35.643675 |
139.668221 |
1 |
hpg_e9e068dd49c5fa00 |
Japanese style |
Tokyo-to |
35.643675 |
139.668221 |
2 |
hpg_2976f7acb4b3a3bc |
Japanese style |
Tokyo-to |
35.643675 |
139.668221 |
df5_hpg_area_name = df5_pref.groupby('hpg_area_name').size().sort_values()
plt.barh(np.arange(1,14), df5_hpg_area_name.values, tick_label=df5_hpg_area_name.index, align="center", color='yellow')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

5−3.エリア x ジャンルの分布
df5_pref_heat = pd.pivot_table(data=df5_pref, values='hpg_store_id',
columns='hpg_area_name', index='hpg_genre_name', aggfunc='count').sort_values(by='Tokyo-to',ascending=False)
df5_pref_heat = df5_pref_heat.T.sort_values(by='Japanese style',ascending=False)
df5_pref_heat.head()
hpg_genre_name |
Japanese style |
International cuisine |
Creation |
Seafood |
Grilled meat |
Italian |
Spain Bar/Italian Bar |
Chinese general |
Karaoke |
Korean cuisine |
... |
Dim Sum/Dumplings |
Spain/Mediterranean cuisine |
Sushi |
Sweets |
Taiwanese/Hong Kong cuisine |
Western food |
Amusement bar |
Bar/Cocktail |
Cantonese food |
Udon/Soba |
hpg_area_name |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Tokyo-to |
805.0 |
370.0 |
152.0 |
121.0 |
118.0 |
116.0 |
109.0 |
45.0 |
28.0 |
27.0 |
... |
1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
NaN |
NaN |
NaN |
NaN |
Osaka-fu |
196.0 |
44.0 |
63.0 |
35.0 |
57.0 |
36.0 |
23.0 |
9.0 |
3.0 |
4.0 |
... |
NaN |
NaN |
1.0 |
1.0 |
NaN |
1.0 |
NaN |
NaN |
NaN |
NaN |
Fukuoka-ken |
143.0 |
65.0 |
42.0 |
32.0 |
41.0 |
19.0 |
18.0 |
4.0 |
2.0 |
NaN |
... |
NaN |
NaN |
1.0 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
Hyogo-ken |
106.0 |
29.0 |
30.0 |
24.0 |
35.0 |
13.0 |
8.0 |
13.0 |
1.0 |
4.0 |
... |
NaN |
NaN |
1.0 |
NaN |
NaN |
1.0 |
NaN |
1.0 |
4.0 |
NaN |
Shizuoka-ken |
92.0 |
32.0 |
22.0 |
24.0 |
7.0 |
10.0 |
11.0 |
3.0 |
2.0 |
NaN |
... |
NaN |
NaN |
4.0 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
5 rows × 34 columns
plt.figure(figsize=(18, 6))
sns.heatmap(df5_pref_heat, annot=True, fmt='g', center=300, cmap='Oranges')

5–4.気づき
サンプリングされているとして、均一に取られていそう
6.store_id_relation.csv [id横断データ]
df7 = pd.read_csv('Data/csv/store_id_relation.csv')
df7.head(3)
|
air_store_id |
hpg_store_id |
0 |
air_63b13c56b7201bd9 |
hpg_4bc649e72e2a239a |
1 |
air_a24bf50c3e90d583 |
hpg_c34b496d0305a809 |
2 |
air_c7f78b4f3cba33ff |
hpg_cd8ae0d9bbd58ff9 |
df7.dtypes
air_store_id object
hpg_store_id object
dtype: object
df7.describe()
|
air_store_id |
hpg_store_id |
count |
150 |
150 |
unique |
150 |
150 |
top |
air_622375b4815cf5cb |
hpg_eaa312b3100ae94b |
freq |
1 |
1 |
7.date_info.csv [カレンダー]
df6 = pd.read_csv('Data/csv/date_info.csv')
df6.head(3)
|
calendar_date |
day_of_week |
holiday_flg |
0 |
2016-01-01 |
Friday |
1 |
1 |
2016-01-02 |
Saturday |
1 |
2 |
2016-01-03 |
Sunday |
1 |
df6.dtypes
calendar_date object
day_of_week object
holiday_flg int64
dtype: object
df6['calendar_date'] = df6.calendar_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
df6.dtypes
calendar_date datetime64[ns]
day_of_week object
holiday_flg int64
dtype: object
df6.describe(include='all')
|
calendar_date |
day_of_week |
holiday_flg |
count |
517 |
517 |
517.000000 |
unique |
517 |
7 |
NaN |
top |
2016-09-08 00:00:00 |
Sunday |
NaN |
freq |
1 |
74 |
NaN |
first |
2016-01-01 00:00:00 |
NaN |
NaN |
last |
2017-05-31 00:00:00 |
NaN |
NaN |
mean |
NaN |
NaN |
0.067698 |
std |
NaN |
NaN |
0.251471 |
min |
NaN |
NaN |
0.000000 |
25% |
NaN |
NaN |
0.000000 |
50% |
NaN |
NaN |
0.000000 |
75% |
NaN |
NaN |
0.000000 |
max |
NaN |
NaN |
1.000000 |
続きはこちら
select-from-where.hatenablog.com