一端の何かになれるか

一生懸命は眩しい

Kaggleに挑戦する - 与えられたデータの確認1

かねてから挑戦して見たかったKaggleに挑戦する(つもりだった...)
そのコンペは終了してしまったため、Submitはできないのだが、せっかく着手していたのでやりきりたい。

対象のコンペは、Recruit Restaurant Visitor Forecasting。
飲食店の来客数を予想する。

Recruit Restaurant Visitor Forecasting | Kaggle

これから数回に分けて、基礎分析から機械学習を用いた予測まで、一連の作業を進めていく。

最新のソースはGitHubで公開、随時更新していく。

github.com

今回は与えられたデータの確認1回目。
与えられたcsvデータどうしを結合せずに分析していきます。

0.sample_submission.csv [Submitサンプル]

  • store_idに、アンダースコア、訪問日を付与する必要がある
  • 2017-04-23〜2017-05-31が予測対象期間
  • visitorsに予測値を出力する
import numpy as np
import pandas as pd
from pandas import DataFrame
from datetime import datetime, timedelta
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# .csvをDataFrame化
sample = pd.read_csv('Data/csv/sample_submission.csv')
sample.head(3)
id visitors
0 air_00a91d42b08b08d9_2017-04-23 0
1 air_00a91d42b08b08d9_2017-04-24 0
2 air_00a91d42b08b08d9_2017-04-25 0
# 他のデータセットと形式を合わせておく
sample_edit = sample.copy()
sample_edit['air_store_id'] = sample_edit.id.map(lambda x: '_'.join(x.split('_')[:-1]))
sample_edit['calendar_date'] = sample_edit.id.map(lambda x: x.split('_')[2])
sample_edit['calendar_date'] = sample_edit.calendar_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
sample_edit.drop('id',axis=1)
sample_edit = sample_edit.loc[:,['air_store_id','calendar_date','visitors']]
sample_edit.head(3)
air_store_id calendar_date visitors
0 air_00a91d42b08b08d9 2017-04-23 0
1 air_00a91d42b08b08d9 2017-04-24 0
2 air_00a91d42b08b08d9 2017-04-25 0
# データ型を確認
sample_edit.dtypes
air_store_id             object
calendar_date    datetime64[ns]
visitors                  int64
dtype: object
# 統計量を確認
sample_edit.describe(include='all')
air_store_id calendar_date visitors
count 32019 32019 32019.0
unique 821 39 NaN
top air_b88192b35ac03c24 2017-05-26 00:00:00 NaN
freq 39 821 NaN
first NaN 2017-04-23 00:00:00 NaN
last NaN 2017-05-31 00:00:00 NaN
mean NaN NaN 0.0
std NaN NaN 0.0
min NaN NaN 0.0
25% NaN NaN 0.0
50% NaN NaN 0.0
75% NaN NaN 0.0
max NaN NaN 0.0

     

1.air_reserve.csv [Airリザーブの予約データ]

  • 92,378レコード
  • 314 /829店舗分の予約データ
  • 訪問予定は、2016-01-01 19:00:00 〜 2017-05-31 21:00:00
  • 予約時刻は、2016-01-01 01:00:00 〜 2017-04-22 23:00:00
# .csvをDataFrame化
df1 = pd.read_csv('Data/csv/air_reserve.csv')
df1.head(3)
air_store_id visit_datetime reserve_datetime reserve_visitors
0 air_877f79706adbfb06 2016-01-01 19:00:00 2016-01-01 16:00:00 1
1 air_db4b38ebe7a7ceff 2016-01-01 19:00:00 2016-01-01 19:00:00 3
2 air_db4b38ebe7a7ceff 2016-01-01 19:00:00 2016-01-01 19:00:00 6
# データ型を確認
df1.dtypes
air_store_id        object
visit_datetime      object
reserve_datetime    object
reserve_visitors     int64
dtype: object
# データ型を修正
df1['visit_datetime'] = df1.visit_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df1['reserve_datetime'] = df1.reserve_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df1.dtypes
air_store_id                object
visit_datetime      datetime64[ns]
reserve_datetime    datetime64[ns]
reserve_visitors             int64
dtype: object
# 統計量を確認
df1.describe(include='all')
air_store_id visit_datetime reserve_datetime reserve_visitors
count 92378 92378 92378 92378.000000
unique 314 4975 7513 NaN
top air_8093d0b565e9dbdf 2016-12-24 19:00:00 2016-11-24 18:00:00 NaN
freq 2263 255 106 NaN
first NaN 2016-01-01 19:00:00 2016-01-01 01:00:00 NaN
last NaN 2017-05-31 21:00:00 2017-04-22 23:00:00 NaN
mean NaN NaN NaN 4.481749
std NaN NaN NaN 4.919669
min NaN NaN NaN 1.000000
25% NaN NaN NaN 2.000000
50% NaN NaN NaN 3.000000
75% NaN NaN NaN 5.000000
max NaN NaN NaN 100.000000

1−1.月間予約数

#  予約数を月次で集計
df1_monthly = df1.copy()
df1_monthly['reserve_datetime'] = df1_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['air_store_id','visit_datetime']
df1_monthly = df1_monthly.rename(columns={'reserve_datetime': 'reserve_month'}).drop(drop_col,axis=1)
df1_monthly = df1_monthly.groupby(['reserve_month']).count().reset_index().rename(columns={'reserve_visitors':'reserve'})
df1_monthly.head(3)
reserve_month reserve
0 2016-01 2936
1 2016-02 2999
2 2016-03 3643
# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df1_monthly.reserve.as_matrix(), tick_label=df1_monthly.reserve_month, align="center", color='deepskyblue')
plt.title("Monthly_air_reserve")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve")
plt.grid(True)

f:id:select_from_where:20180217221028p:plain

1−2.月別訪問予定者数

#  訪問予定者数を月次で集計
df1_monthly = df1.copy()
df1_monthly['visit_datetime'] = df1_monthly.visit_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['air_store_id','reserve_datetime']
df1_monthly = df1_monthly.rename(columns={'visit_datetime': 'visit_month'}).drop(drop_col,axis=1)
df1_monthly = df1_monthly.groupby(['visit_month'])['reserve_visitors'].sum().reset_index()
df1_monthly.head(3)
visit_month reserve_visitors
0 2016-01 10335
1 2016-02 11579
2 2016-03 17071
# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,18), df1_monthly.reserve_visitors.as_matrix(), tick_label=df1_monthly.visit_month, align="center", color='deepskyblue')
plt.title("Monthly_air_reserve_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve_visitors")
plt.grid(True)

f:id:select_from_where:20180217221023p:plain

1−3.予約日から訪問予定日までの日数

#  予約日から訪問予定日までの日数を追加
df1_monthly_diff = df1.copy()
df1_monthly_diff['rsrv_2_vst'] = df1_monthly_diff.visit_datetime - df1_monthly_diff.reserve_datetime
df1_monthly_diff['rsrv_2_vst'] = df1_monthly_diff.rsrv_2_vst.map(lambda x: x.days)

df1_monthly_diff['reserve_datetime'] = df1_monthly_diff.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
df1_monthly_diff = df1_monthly_diff.rename(columns={'reserve_datetime': 'reserve_month'})

df1_monthly_diff.tail(3)
air_store_id visit_datetime reserve_month reserve_visitors rsrv_2_vst
92375 air_877f79706adbfb06 2017-05-31 20:00:00 2017-04 3 48
92376 air_900d755ebd2f7bbd 2017-05-31 20:00:00 2017-04 10 42
92377 air_3cad29d1a23209d2 2017-05-31 21:00:00 2017-04 3 39
plt.figure(figsize=(14, 8))
sns.boxplot(x="reserve_month", y="rsrv_2_vst", data=df1_monthly_diff)

f:id:select_from_where:20180217221016p:plain

1−4.気づき

2016年9月以前の訪問者数がそれ以降に比べて少ない

  • 2016年7月まで先行展開、2016年10月から本格展開の可能性がある
  • データがサンプリングされている可能性がある

忘年会、納会シーズンの予約が多い一方、歓迎会シーズンはそれほどでもないように見える

  • 休日祝日の日数による影響がある可能性がある

5月以降の訪問予定になっている予約データはおそらく、意図的に除外されている

  • 1年前、10か月前から予約が可能な人気店がある
  • 5月以降の訪問予定になっている予約データはおそらく、意図的に除外されている(除外されていないレコードも5,000弱存在はする)

     

2.air_visit_data.csv [Airリザーブの訪問データ]

  • 252,108レコード
  • 829店舗全ての履歴データがある
  • 訪問日時は、2016-01-01 〜 2017-04-22
# .csvをDataFrame化
df3 = pd.read_csv('Data/csv/air_visit_data.csv')
df3.head(3)
air_store_id visit_date visitors
0 air_ba937bf13d40fb24 2016-01-13 25
1 air_ba937bf13d40fb24 2016-01-14 32
2 air_ba937bf13d40fb24 2016-01-15 29
# データ型を確認
df3.dtypes
air_store_id    object
visit_date      object
visitors         int64
dtype: object
# データ型を修正
df3['visit_date'] = df3.visit_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
df3.dtypes
air_store_id            object
visit_date      datetime64[ns]
visitors                 int64
dtype: object
# 統計量を確認
df3.describe(include='all')
air_store_id visit_date visitors
count 252108 252108 252108.000000
unique 829 478 NaN
top air_5c817ef28f236bdf 2017-03-17 00:00:00 NaN
freq 477 799 NaN
first NaN 2016-01-01 00:00:00 NaN
last NaN 2017-04-22 00:00:00 NaN
mean NaN NaN 20.973761
std NaN NaN 16.757007
min NaN NaN 1.000000
25% NaN NaN 9.000000
50% NaN NaN 17.000000
75% NaN NaN 29.000000
max NaN NaN 877.000000

2−1.月間訪問者数

#  訪問データを月次に加工
df3_monthly = df3.copy()
df3_monthly['visit_date'] = df3_monthly.visit_date.map(lambda x: x.strftime('%Y-%m'))
df3_monthly = df3_monthly.rename(columns={'visit_date': 'visit_month'})
df3_monthly = df3_monthly.groupby(['visit_month'])['visitors'].sum().reset_index()
df3_monthly.head(3)
visit_month visitors
0 2016-01 152924
1 2016-02 159934
2 2016-03 189292
# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df3_monthly.visitors.as_matrix(), tick_label=df3_monthly.visit_month, align="center", color='navy')
plt.title("Monthly_air_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("visitors")
plt.grid(True)

f:id:select_from_where:20180217221013p:plain

2−2.一日の訪問者数

#  訪問予定月を追加
df3_monthly_diff = df3.copy()
df3_monthly_diff['visit_month'] = df3_monthly_diff.visit_date.map(lambda x: x.strftime('%Y-%m'))

df3_monthly_diff.tail(3)
air_store_id visit_date visitors visit_month
252105 air_24e8414b9b07decb 2017-04-20 7 2017-04
252106 air_24e8414b9b07decb 2017-04-21 8 2017-04
252107 air_24e8414b9b07decb 2017-04-22 5 2017-04
plt.figure(figsize=(14, 8))
sns.boxplot(x="visit_month", y="visitors", data=df3_monthly_diff)

f:id:select_from_where:20180217221010p:plain

2−3.予約データと並べて予約率を見てみる

※予約と実訪問日をキーに結合した値ではないため、参考程度に

# 棒グラフと折れ線グラフで描画する
height1 = df1_monthly.reserve_visitors.as_matrix()
height2 = df3_monthly.visitors.as_matrix()
line_height = (df1_monthly.reserve_visitors / df3_monthly.visitors * 100).as_matrix()
w1 = 0.4
w2 = w1 / 2
left1 = np.arange(len(height1))
left2 = np.arange(len(height2))

fig, ax1 = plt.subplots(figsize=(14, 8))
p1 = ax1.bar(left1, height1, align='center', width=w1, tick_label=df1_monthly.visit_month, color='deepskyblue')
p2 = ax1.bar(left2 + w1, height2, align='center', width=w1, tick_label=df3_monthly.visit_month, color='navy')
ax1.set_ylabel('Num of people')

ax2 = ax1.twinx()
p3 = ax2.plot(left1 + w2, line_height, color="red")
ax2.set_ylabel('Reserve Rate')
ax2.set_ylim(0, np.nanmax(line_height)*1.75)

plt.legend((p1[0], p2[0], p3[0]), ("reserve", "visit", "reserve_rate"), loc='upper right', bbox_to_anchor=(0.7, 0.5, 0.5, 0.5))

f:id:select_from_where:20180217221129p:plain

2–4.気づき

大口の予約が混ざっている

  • 100名を超えるような大口の予約も混ざっている

予約データと訪問データで充実している期間が異なる

  • 予約データは2016年11月以降、訪問データは2016年7月以降で充実している
    • 2016年7月まで先行展開、2016年10月から本格展開の可能性は否定すべきか
    • 予約機能と決済機能の展開時期が異なっていたと考えることもできる

     

3.air_store_info.csv [Airリザーブのレストラン情報]

  • 829店舗
  • 14ジャンル
  • 103エリア
# .csvをDataFrame化
df2 = pd.read_csv('Data/csv/air_store_info.csv')
df2.head(3)
air_store_id air_genre_name air_area_name latitude longitude
0 air_0f0cdeee6c9bf3d7 Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852
1 air_7cc17a324ae5c7dc Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852
2 air_fee8dcf4d619598e Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852
# データ型を確認
df2.dtypes
air_store_id       object
air_genre_name     object
air_area_name      object
latitude          float64
longitude         float64
dtype: object
# 統計量を確認
df2.describe(include='all')
air_store_id air_genre_name air_area_name latitude longitude
count 829 829 829 829.000000 829.000000
unique 829 14 103 NaN NaN
top air_eb5788dba285e725 Izakaya Fukuoka-ken Fukuoka-shi Daimyō NaN NaN
freq 1 197 64 NaN NaN
mean NaN NaN NaN 35.647042 137.415311
std NaN NaN NaN 2.084659 3.650971
min NaN NaN NaN 33.211967 130.195555
25% NaN NaN NaN 34.695124 135.341564
50% NaN NaN NaN 35.658068 139.685474
75% NaN NaN NaN 35.694003 139.751599
max NaN NaN NaN 44.020632 144.273398

3−1.ジャンルごとの店舗数

# ジャンルの内訳と頻度
df2.groupby('air_genre_name').size().sort_values(ascending=False).describe()
count     14.000000
mean      59.214286
std       66.013027
min        2.000000
25%       13.250000
50%       25.000000
75%       96.250000
max      197.000000
dtype: float64
# 棒グラフで描画
df2_air_genre_name = df2.groupby('air_genre_name').size().sort_values()
plt.barh(np.arange(1,15), df2_air_genre_name.values, tick_label=df2_air_genre_name.index, align="center", color='dodgerblue')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221127p:plain

3−2.エリアごとの店舗数

# エリアの内訳と頻度
df2.groupby('air_area_name').size().sort_values(ascending=False).describe()
count    103.000000
mean       8.048544
std       11.332219
min        2.000000
25%        2.000000
50%        4.000000
75%        7.500000
max       64.000000
dtype: float64

3−2−1.市区町村ごとの店舗数

# エリアを市区町村レベルに変更、plot時の文字化け対策で置換
df2_city = df2.copy()
df2_city['air_area_name'] = df2_city.air_area_name.map(lambda x: ' '.join(x.split(' ')[:2]).translate(str.maketrans('Ōōū', 'Oou')))
df2_city.head(3)
air_store_id air_genre_name air_area_name latitude longitude
0 air_0f0cdeee6c9bf3d7 Italian/French Hyogo-ken Kobe-shi 34.695124 135.197852
1 air_7cc17a324ae5c7dc Italian/French Hyogo-ken Kobe-shi 34.695124 135.197852
2 air_fee8dcf4d619598e Italian/French Hyogo-ken Kobe-shi 34.695124 135.197852
# 棒グラフで描画
df2_air_area_name = df2_city.groupby('air_area_name').size().sort_values()
plt.figure(figsize=(12, 14))
plt.barh(np.arange(1,56), df2_air_area_name.values, tick_label=df2_air_area_name.index, align="center", color='aqua')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221122p:plain

3−2−2.都道府県ごとの店舗数

# エリアを都道府県レベルに変更
df2_pref = df2_city.copy()
df2_pref['air_area_name'] = df2_pref.air_area_name.map(lambda x: ' '.join(x.split(' ')[:1]))
df2_pref.head(3)
air_store_id air_genre_name air_area_name latitude longitude
0 air_0f0cdeee6c9bf3d7 Italian/French Hyogo-ken 34.695124 135.197852
1 air_7cc17a324ae5c7dc Italian/French Hyogo-ken 34.695124 135.197852
2 air_fee8dcf4d619598e Italian/French Hyogo-ken 34.695124 135.197852
# 棒グラフで描画
df2_air_area_name = df2_pref.groupby('air_area_name').size().sort_values()
plt.barh(np.arange(1,10), df2_air_area_name.values, tick_label=df2_air_area_name.index, align="center", color='aqua')
plt.title("air_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221117p:plain

3−3.エリア x ジャンルの分布

# データフレームをピボット
df2_pref_heat = pd.pivot_table(data=df2_pref, values='air_store_id', 
                                  columns='air_area_name', index='air_genre_name', aggfunc='count').sort_values(by='Tokyo-to',ascending=False)
df2_pref_heat = df2_pref_heat.T.sort_values(by='Izakaya',ascending=False)
df2_pref_heat.head(3)
air_genre_name Cafe/Sweets Izakaya Dining bar Italian/French Bar/Cocktail Japanese food Other Western food Creative cuisine Okonomiyaki/Monja/Teppanyaki Yakiniku/Korean food Asian International cuisine Karaoke/Party
air_area_name
Tokyo-to 87.0 87.0 65.0 65.0 54.0 38.0 16.0 14.0 5.0 5.0 4.0 2.0 2.0 NaN
Fukuoka-ken 45.0 27.0 14.0 11.0 9.0 11.0 5.0 NaN 2.0 NaN 3.0 NaN NaN NaN
Hyogo-ken 10.0 24.0 4.0 9.0 NaN 2.0 NaN NaN 4.0 2.0 2.0 NaN NaN NaN
# ヒートマップを描画
plt.figure(figsize=(8, 4))
sns.heatmap(df2_pref_heat, annot=True, center=30, fmt='g', cmap='Blues')

f:id:select_from_where:20180217221113p:plain

3–4.気づき

ジャンル x エリアの分布に規則性が見られる

  • 何らかのルールでサンプリングされていそう

     

4.hpg_reserve.csv [ホットペッパーグルメの予約データ]

  • 2,000,320レコード
  • 13,325 /4,690店舗の予約データ???
  • 訪問予定は、2016-01-01 11:00:00 〜 2017-05-31 23:00:00
  • 予約時刻は、2016-01-01 00:00:00 〜 2017-04-22 23:00:00
# .csvをDataFrame化
df4 = pd.read_csv('Data/csv/hpg_reserve.csv')
df4.head(3)
hpg_store_id visit_datetime reserve_datetime reserve_visitors
0 hpg_c63f6f42e088e50f 2016-01-01 11:00:00 2016-01-01 09:00:00 1
1 hpg_dac72789163a3f47 2016-01-01 13:00:00 2016-01-01 06:00:00 3
2 hpg_c8e24dcf51ca1eb5 2016-01-01 16:00:00 2016-01-01 14:00:00 2
# データ型を確認
df4.dtypes
hpg_store_id        object
visit_datetime      object
reserve_datetime    object
reserve_visitors     int64
dtype: object
# データ型を修正
df4['visit_datetime'] = df4.visit_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df4['reserve_datetime'] = df4.reserve_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))
df4.dtypes
hpg_store_id                object
visit_datetime      datetime64[ns]
reserve_datetime    datetime64[ns]
reserve_visitors             int64
dtype: object
# 統計量を確認
df4.describe(include='all')
hpg_store_id visit_datetime reserve_datetime reserve_visitors
count 2000320 2000320 2000320 2.000320e+06
unique 13325 9847 11450 NaN
top hpg_2afd5b187409eeb4 2016-12-16 19:00:00 2016-12-12 21:00:00 NaN
freq 1155 10528 907 NaN
first NaN 2016-01-01 11:00:00 2016-01-01 00:00:00 NaN
last NaN 2017-05-31 23:00:00 2017-04-22 23:00:00 NaN
mean NaN NaN NaN 5.073785e+00
std NaN NaN NaN 5.416172e+00
min NaN NaN NaN 1.000000e+00
25% NaN NaN NaN 2.000000e+00
50% NaN NaN NaN 3.000000e+00
75% NaN NaN NaN 6.000000e+00
max NaN NaN NaN 1.000000e+02

4−1.月間予約数

#  予約数を月次で集計
df4_monthly = df4.copy()
df4_monthly['reserve_datetime'] = df4_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['hpg_store_id','visit_datetime']
df4_monthly = df4_monthly.rename(columns={'reserve_datetime': 'reserve_month'}).drop(drop_col,axis=1)
df4_monthly = df4_monthly.groupby(['reserve_month']).count().reset_index().rename(columns={'reserve_visitors':'reserve'})
df4_monthly.head(3)
reserve_month reserve
0 2016-01 80905
1 2016-02 79934
2 2016-03 98276
# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,17), df4_monthly.reserve.as_matrix(), tick_label=df4_monthly.reserve_month, align="center", color='pink')
plt.title("Monthly_hpg_reserve")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve")
plt.grid(True)

f:id:select_from_where:20180217221226p:plain

4−2.月別訪問予定者数

#  訪問予定者数を月次で集計
df4_monthly = df4.copy()
df4_monthly['visit_datetime'] = df4_monthly.visit_datetime.map(lambda x: x.strftime('%Y-%m'))
drop_col = ['hpg_store_id','reserve_datetime']
df4_monthly = df4_monthly.rename(columns={'visit_datetime': 'visit_month'}).drop(drop_col,axis=1)
df4_monthly = df4_monthly.groupby(['visit_month'])['reserve_visitors'].sum().reset_index()
df4_monthly.head(3)
visit_month reserve_visitors
0 2016-01 313180
1 2016-02 359597
2 2016-03 546433
# 棒グラフで描画
plt.figure(figsize=(14, 8))
plt.bar(np.arange(1,18), df4_monthly.reserve_visitors.as_matrix(), tick_label=df4_monthly.visit_month, align="center", color='pink')
plt.title("Monthly_hpg_reserve_visitors")
plt.xticks(rotation = 30)
plt.xlabel("month")
plt.ylabel("reserve_visitors")
plt.grid(True)

f:id:select_from_where:20180217221223p:plain

4−3.予約日から訪問予定日までの日数

#  予約日から訪問予定日までの日数を追加
df4_monthly = df4.copy()
df4_monthly['rsrv_2_vst'] = df4_monthly.visit_datetime - df4_monthly.reserve_datetime
df4_monthly['rsrv_2_vst'] = df4_monthly.rsrv_2_vst.map(lambda x: x.days)

df4_monthly['reserve_datetime'] = df4_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m'))
df4_monthly = df4_monthly.rename(columns={'reserve_datetime': 'reserve_month'})

df4_monthly.tail(3)
hpg_store_id visit_datetime reserve_month reserve_visitors rsrv_2_vst
2000317 hpg_e9151de687b93aa3 2017-05-31 21:00:00 2017-04 49 47
2000318 hpg_fae6c96189b4a122 2017-05-31 21:00:00 2017-04 9 42
2000319 hpg_0b70de808b55ad1e 2017-05-31 23:00:00 2017-04 6 40
plt.figure(figsize=(14, 8))
sns.boxplot(x="reserve_month", y="rsrv_2_vst", data=df4_monthly)

f:id:select_from_where:20180217221223p:plain

4−4.気づき

2016年10月以降のデータ量が多い

  • 純粋な増加なのか、データ収集基盤の改善か、サンプリングによるものかは不明

11月の予約から訪問までの日差は大きい

  • 忘年会に備えた事前予約数が多い

     

5.hpg_store_info.csv [ホットペッパーグルメのレストラン情報]

  • 4,690の店舗データ
  • 34ジャンル
  • 119エリア
# .csvをDataFrame化
df5 = pd.read_csv('Data/csv/hpg_store_info.csv')
df5.head(3)
hpg_store_id hpg_genre_name hpg_area_name latitude longitude
0 hpg_6622b62385aec8bf Japanese style Tōkyō-to Setagaya-ku Taishidō 35.643675 139.668221
1 hpg_e9e068dd49c5fa00 Japanese style Tōkyō-to Setagaya-ku Taishidō 35.643675 139.668221
2 hpg_2976f7acb4b3a3bc Japanese style Tōkyō-to Setagaya-ku Taishidō 35.643675 139.668221
# データ型を確認
df5.dtypes
hpg_store_id       object
hpg_genre_name     object
hpg_area_name      object
latitude          float64
longitude         float64
dtype: object
# 統計量を確認
df5.describe(include='all')
hpg_store_id hpg_genre_name hpg_area_name latitude longitude
count 4690 4690 4690 4690.000000 4690.000000
unique 4690 34 119 NaN NaN
top hpg_b6fcb5aea05b867b Japanese style Tōkyō-to Shinjuku-ku None NaN NaN
freq 1 1750 257 NaN NaN
mean NaN NaN NaN 35.810261 137.675816
std NaN NaN NaN 2.138755 3.197233
min NaN NaN NaN 33.311645 130.339313
25% NaN NaN NaN 34.692109 135.498859
50% NaN NaN NaN 35.659214 139.495733
75% NaN NaN NaN 35.703381 139.737998
max NaN NaN NaN 43.774846 143.714585

5−1.ジャンルごとの店舗数

# ジャンルの内訳と頻度
df5.groupby('hpg_genre_name').size().sort_values(ascending=False).describe()
count      34.000000
mean      137.941176
std       323.163055
min         1.000000
25%         5.000000
50%        25.500000
75%        79.750000
max      1750.000000
dtype: float64
# 棒グラフで描画
df5_hpg_genre_name = df5.groupby('hpg_genre_name').size().sort_values()
plt.figure(figsize=(8, 10))
plt.barh(np.arange(1,35), df5_hpg_genre_name.values, tick_label=df5_hpg_genre_name.index, align="center", color='orange')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221218p:plain

5−2.エリアごとの店舗数

# エリアの内訳と頻度
df5.groupby('hpg_area_name').size().sort_values(ascending=False).describe()
count    119.000000
mean      39.411765
std       45.887004
min        1.000000
25%        8.000000
50%       20.000000
75%       58.000000
max      257.000000
dtype: float64

5−2−1.市区町村ごとの店舗数

# エリアを市区町村レベルに変更、plot時の文字化け対策で置換
df5_city = df5.copy()
df5_city['hpg_area_name'] = df5_city.hpg_area_name.map(lambda x: ' '.join(x.split(' ')[:2]).translate(str.maketrans('Ōōū', 'Oou')))
df5_city.head(3)
hpg_store_id hpg_genre_name hpg_area_name latitude longitude
0 hpg_6622b62385aec8bf Japanese style Tokyo-to Setagaya-ku 35.643675 139.668221
1 hpg_e9e068dd49c5fa00 Japanese style Tokyo-to Setagaya-ku 35.643675 139.668221
2 hpg_2976f7acb4b3a3bc Japanese style Tokyo-to Setagaya-ku 35.643675 139.668221
# 棒グラフで描画
df5_hpg_area_name = df5_city.groupby('hpg_area_name').size().sort_values()
plt.figure(figsize=(12, 22))
plt.barh(np.arange(1,74), df5_hpg_area_name.values, tick_label=df5_hpg_area_name.index, align="center", color='yellow')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221213p:plain

5−2−2.都道府県ごとの店舗数

# エリアを都道府県レベルに変更
df5_pref = df5_city.copy()
df5_pref['hpg_area_name'] = df5_pref.hpg_area_name.map(lambda x: ' '.join(x.split(' ')[:1]))
df5_pref.head(3)
hpg_store_id hpg_genre_name hpg_area_name latitude longitude
0 hpg_6622b62385aec8bf Japanese style Tokyo-to 35.643675 139.668221
1 hpg_e9e068dd49c5fa00 Japanese style Tokyo-to 35.643675 139.668221
2 hpg_2976f7acb4b3a3bc Japanese style Tokyo-to 35.643675 139.668221
# 棒グラフで描画
df5_hpg_area_name = df5_pref.groupby('hpg_area_name').size().sort_values()
plt.barh(np.arange(1,14), df5_hpg_area_name.values, tick_label=df5_hpg_area_name.index, align="center", color='yellow')
plt.title("hpg_genre_freq")
plt.xlabel("frequency")
plt.ylabel("genre")
plt.grid(True)

f:id:select_from_where:20180217221347p:plain

5−3.エリア x ジャンルの分布

# データフレームをピボット
df5_pref_heat = pd.pivot_table(data=df5_pref, values='hpg_store_id', 
                                  columns='hpg_area_name', index='hpg_genre_name', aggfunc='count').sort_values(by='Tokyo-to',ascending=False)
df5_pref_heat = df5_pref_heat.T.sort_values(by='Japanese style',ascending=False)
df5_pref_heat.head()
hpg_genre_name Japanese style International cuisine Creation Seafood Grilled meat Italian Spain Bar/Italian Bar Chinese general Karaoke Korean cuisine ... Dim Sum/Dumplings Spain/Mediterranean cuisine Sushi Sweets Taiwanese/Hong Kong cuisine Western food Amusement bar Bar/Cocktail Cantonese food Udon/Soba
hpg_area_name
Tokyo-to 805.0 370.0 152.0 121.0 118.0 116.0 109.0 45.0 28.0 27.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
Osaka-fu 196.0 44.0 63.0 35.0 57.0 36.0 23.0 9.0 3.0 4.0 ... NaN NaN 1.0 1.0 NaN 1.0 NaN NaN NaN NaN
Fukuoka-ken 143.0 65.0 42.0 32.0 41.0 19.0 18.0 4.0 2.0 NaN ... NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
Hyogo-ken 106.0 29.0 30.0 24.0 35.0 13.0 8.0 13.0 1.0 4.0 ... NaN NaN 1.0 NaN NaN 1.0 NaN 1.0 4.0 NaN
Shizuoka-ken 92.0 32.0 22.0 24.0 7.0 10.0 11.0 3.0 2.0 NaN ... NaN NaN 4.0 NaN NaN NaN NaN NaN NaN NaN

5 rows × 34 columns

# ヒートマップを描画
plt.figure(figsize=(18, 6))
sns.heatmap(df5_pref_heat, annot=True, fmt='g', center=300, cmap='Oranges')

f:id:select_from_where:20180217221343p:plain

5–4.気づき

サンプリングされているとして、均一に取られていそう

  • ランダムにサンプリングか

     

6.store_id_relation.csv [id横断データ]

  • 150店舗、1対1
# .csvをDataFrame化
df7 = pd.read_csv('Data/csv/store_id_relation.csv')
df7.head(3)
air_store_id hpg_store_id
0 air_63b13c56b7201bd9 hpg_4bc649e72e2a239a
1 air_a24bf50c3e90d583 hpg_c34b496d0305a809
2 air_c7f78b4f3cba33ff hpg_cd8ae0d9bbd58ff9
# データ型を確認
df7.dtypes
air_store_id    object
hpg_store_id    object
dtype: object
# 統計量を確認
df7.describe()
air_store_id hpg_store_id
count 150 150
unique 150 150
top air_622375b4815cf5cb hpg_eaa312b3100ae94b
freq 1 1

     

7.date_info.csv [カレンダー]

  • 2016-01-01〜2017-05-31
# .csvをDataFrame化
df6 = pd.read_csv('Data/csv/date_info.csv')
df6.head(3)
calendar_date day_of_week holiday_flg
0 2016-01-01 Friday 1
1 2016-01-02 Saturday 1
2 2016-01-03 Sunday 1
# データ型を確認
df6.dtypes
calendar_date    object
day_of_week      object
holiday_flg       int64
dtype: object
# データ型を修正
df6['calendar_date'] = df6.calendar_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d'))
df6.dtypes
calendar_date    datetime64[ns]
day_of_week              object
holiday_flg               int64
dtype: object
# 統計量を確認
df6.describe(include='all')
calendar_date day_of_week holiday_flg
count 517 517 517.000000
unique 517 7 NaN
top 2016-09-08 00:00:00 Sunday NaN
freq 1 74 NaN
first 2016-01-01 00:00:00 NaN NaN
last 2017-05-31 00:00:00 NaN NaN
mean NaN NaN 0.067698
std NaN NaN 0.251471
min NaN NaN 0.000000
25% NaN NaN 0.000000
50% NaN NaN 0.000000
75% NaN NaN 0.000000
max NaN NaN 1.000000

続きはこちら

select-from-where.hatenablog.com