Kaggleに挑戦する - 与えられたデータの確認1
かねてから挑戦して見たかったKaggleに挑戦する(つもりだった...)
そのコンペは終了してしまったため、Submitはできないのだが、せっかく着手していたのでやりきりたい。
対象のコンペは、Recruit Restaurant Visitor Forecasting。
飲食店の来客数を予想する。
Recruit Restaurant Visitor Forecasting | Kaggle
これから数回に分けて、基礎分析から機械学習を用いた予測まで、一連の作業を進めていく。
最新のソースはGitHubで公開、随時更新していく。
今回は与えられたデータの確認1回目。
与えられたcsvデータどうしを結合せずに分析していきます。
0.sample_submission.csv [Submitサンプル]
- store_idに、アンダースコア、訪問日を付与する必要がある
- 2017-04-23〜2017-05-31が予測対象期間
- visitorsに予測値を出力する
import numpy as np import pandas as pd from pandas import DataFrame from datetime import datetime, timedelta import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline
# .csvをDataFrame化 sample = pd.read_csv('Data/csv/sample_submission.csv') sample.head(3)
id | visitors | |
---|---|---|
0 | air_00a91d42b08b08d9_2017-04-23 | 0 |
1 | air_00a91d42b08b08d9_2017-04-24 | 0 |
2 | air_00a91d42b08b08d9_2017-04-25 | 0 |
# 他のデータセットと形式を合わせておく sample_edit = sample.copy() sample_edit['air_store_id'] = sample_edit.id.map(lambda x: '_'.join(x.split('_')[:-1])) sample_edit['calendar_date'] = sample_edit.id.map(lambda x: x.split('_')[2]) sample_edit['calendar_date'] = sample_edit.calendar_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d')) sample_edit.drop('id',axis=1) sample_edit = sample_edit.loc[:,['air_store_id','calendar_date','visitors']] sample_edit.head(3)
air_store_id | calendar_date | visitors | |
---|---|---|---|
0 | air_00a91d42b08b08d9 | 2017-04-23 | 0 |
1 | air_00a91d42b08b08d9 | 2017-04-24 | 0 |
2 | air_00a91d42b08b08d9 | 2017-04-25 | 0 |
# データ型を確認
sample_edit.dtypes
air_store_id object
calendar_date datetime64[ns]
visitors int64
dtype: object
# 統計量を確認 sample_edit.describe(include='all')
air_store_id | calendar_date | visitors | |
---|---|---|---|
count | 32019 | 32019 | 32019.0 |
unique | 821 | 39 | NaN |
top | air_b88192b35ac03c24 | 2017-05-26 00:00:00 | NaN |
freq | 39 | 821 | NaN |
first | NaN | 2017-04-23 00:00:00 | NaN |
last | NaN | 2017-05-31 00:00:00 | NaN |
mean | NaN | NaN | 0.0 |
std | NaN | NaN | 0.0 |
min | NaN | NaN | 0.0 |
25% | NaN | NaN | 0.0 |
50% | NaN | NaN | 0.0 |
75% | NaN | NaN | 0.0 |
max | NaN | NaN | 0.0 |
1.air_reserve.csv [Airリザーブの予約データ]
- 92,378レコード
- 314 /829店舗分の予約データ
- 訪問予定は、2016-01-01 19:00:00 〜 2017-05-31 21:00:00
- 予約時刻は、2016-01-01 01:00:00 〜 2017-04-22 23:00:00
# .csvをDataFrame化 df1 = pd.read_csv('Data/csv/air_reserve.csv') df1.head(3)
air_store_id | visit_datetime | reserve_datetime | reserve_visitors | |
---|---|---|---|---|
0 | air_877f79706adbfb06 | 2016-01-01 19:00:00 | 2016-01-01 16:00:00 | 1 |
1 | air_db4b38ebe7a7ceff | 2016-01-01 19:00:00 | 2016-01-01 19:00:00 | 3 |
2 | air_db4b38ebe7a7ceff | 2016-01-01 19:00:00 | 2016-01-01 19:00:00 | 6 |
# データ型を確認
df1.dtypes
air_store_id object
visit_datetime object
reserve_datetime object
reserve_visitors int64
dtype: object
# データ型を修正 df1['visit_datetime'] = df1.visit_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S')) df1['reserve_datetime'] = df1.reserve_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S')) df1.dtypes
air_store_id object
visit_datetime datetime64[ns]
reserve_datetime datetime64[ns]
reserve_visitors int64
dtype: object
# 統計量を確認 df1.describe(include='all')
air_store_id | visit_datetime | reserve_datetime | reserve_visitors | |
---|---|---|---|---|
count | 92378 | 92378 | 92378 | 92378.000000 |
unique | 314 | 4975 | 7513 | NaN |
top | air_8093d0b565e9dbdf | 2016-12-24 19:00:00 | 2016-11-24 18:00:00 | NaN |
freq | 2263 | 255 | 106 | NaN |
first | NaN | 2016-01-01 19:00:00 | 2016-01-01 01:00:00 | NaN |
last | NaN | 2017-05-31 21:00:00 | 2017-04-22 23:00:00 | NaN |
mean | NaN | NaN | NaN | 4.481749 |
std | NaN | NaN | NaN | 4.919669 |
min | NaN | NaN | NaN | 1.000000 |
25% | NaN | NaN | NaN | 2.000000 |
50% | NaN | NaN | NaN | 3.000000 |
75% | NaN | NaN | NaN | 5.000000 |
max | NaN | NaN | NaN | 100.000000 |
1−1.月間予約数
# 予約数を月次で集計 df1_monthly = df1.copy() df1_monthly['reserve_datetime'] = df1_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m')) drop_col = ['air_store_id','visit_datetime'] df1_monthly = df1_monthly.rename(columns={'reserve_datetime': 'reserve_month'}).drop(drop_col,axis=1) df1_monthly = df1_monthly.groupby(['reserve_month']).count().reset_index().rename(columns={'reserve_visitors':'reserve'}) df1_monthly.head(3)
reserve_month | reserve | |
---|---|---|
0 | 2016-01 | 2936 |
1 | 2016-02 | 2999 |
2 | 2016-03 | 3643 |
# 棒グラフで描画 plt.figure(figsize=(14, 8)) plt.bar(np.arange(1,17), df1_monthly.reserve.as_matrix(), tick_label=df1_monthly.reserve_month, align="center", color='deepskyblue') plt.title("Monthly_air_reserve") plt.xticks(rotation = 30) plt.xlabel("month") plt.ylabel("reserve") plt.grid(True)
1−2.月別訪問予定者数
# 訪問予定者数を月次で集計 df1_monthly = df1.copy() df1_monthly['visit_datetime'] = df1_monthly.visit_datetime.map(lambda x: x.strftime('%Y-%m')) drop_col = ['air_store_id','reserve_datetime'] df1_monthly = df1_monthly.rename(columns={'visit_datetime': 'visit_month'}).drop(drop_col,axis=1) df1_monthly = df1_monthly.groupby(['visit_month'])['reserve_visitors'].sum().reset_index() df1_monthly.head(3)
visit_month | reserve_visitors | |
---|---|---|
0 | 2016-01 | 10335 |
1 | 2016-02 | 11579 |
2 | 2016-03 | 17071 |
# 棒グラフで描画 plt.figure(figsize=(14, 8)) plt.bar(np.arange(1,18), df1_monthly.reserve_visitors.as_matrix(), tick_label=df1_monthly.visit_month, align="center", color='deepskyblue') plt.title("Monthly_air_reserve_visitors") plt.xticks(rotation = 30) plt.xlabel("month") plt.ylabel("reserve_visitors") plt.grid(True)
1−3.予約日から訪問予定日までの日数
# 予約日から訪問予定日までの日数を追加 df1_monthly_diff = df1.copy() df1_monthly_diff['rsrv_2_vst'] = df1_monthly_diff.visit_datetime - df1_monthly_diff.reserve_datetime df1_monthly_diff['rsrv_2_vst'] = df1_monthly_diff.rsrv_2_vst.map(lambda x: x.days) df1_monthly_diff['reserve_datetime'] = df1_monthly_diff.reserve_datetime.map(lambda x: x.strftime('%Y-%m')) df1_monthly_diff = df1_monthly_diff.rename(columns={'reserve_datetime': 'reserve_month'}) df1_monthly_diff.tail(3)
air_store_id | visit_datetime | reserve_month | reserve_visitors | rsrv_2_vst | |
---|---|---|---|---|---|
92375 | air_877f79706adbfb06 | 2017-05-31 20:00:00 | 2017-04 | 3 | 48 |
92376 | air_900d755ebd2f7bbd | 2017-05-31 20:00:00 | 2017-04 | 10 | 42 |
92377 | air_3cad29d1a23209d2 | 2017-05-31 21:00:00 | 2017-04 | 3 | 39 |
plt.figure(figsize=(14, 8)) sns.boxplot(x="reserve_month", y="rsrv_2_vst", data=df1_monthly_diff)
1−4.気づき
2016年9月以前の訪問者数がそれ以降に比べて少ない
- 2016年7月まで先行展開、2016年10月から本格展開の可能性がある
- データがサンプリングされている可能性がある
忘年会、納会シーズンの予約が多い一方、歓迎会シーズンはそれほどでもないように見える
- 休日祝日の日数による影響がある可能性がある
5月以降の訪問予定になっている予約データはおそらく、意図的に除外されている
- 1年前、10か月前から予約が可能な人気店がある
- 5月以降の訪問予定になっている予約データはおそらく、意図的に除外されている(除外されていないレコードも5,000弱存在はする)
2.air_visit_data.csv [Airリザーブの訪問データ]
- 252,108レコード
- 829店舗全ての履歴データがある
- 訪問日時は、2016-01-01 〜 2017-04-22
# .csvをDataFrame化 df3 = pd.read_csv('Data/csv/air_visit_data.csv') df3.head(3)
air_store_id | visit_date | visitors | |
---|---|---|---|
0 | air_ba937bf13d40fb24 | 2016-01-13 | 25 |
1 | air_ba937bf13d40fb24 | 2016-01-14 | 32 |
2 | air_ba937bf13d40fb24 | 2016-01-15 | 29 |
# データ型を確認
df3.dtypes
air_store_id object
visit_date object
visitors int64
dtype: object
# データ型を修正 df3['visit_date'] = df3.visit_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d')) df3.dtypes
air_store_id object
visit_date datetime64[ns]
visitors int64
dtype: object
# 統計量を確認 df3.describe(include='all')
air_store_id | visit_date | visitors | |
---|---|---|---|
count | 252108 | 252108 | 252108.000000 |
unique | 829 | 478 | NaN |
top | air_5c817ef28f236bdf | 2017-03-17 00:00:00 | NaN |
freq | 477 | 799 | NaN |
first | NaN | 2016-01-01 00:00:00 | NaN |
last | NaN | 2017-04-22 00:00:00 | NaN |
mean | NaN | NaN | 20.973761 |
std | NaN | NaN | 16.757007 |
min | NaN | NaN | 1.000000 |
25% | NaN | NaN | 9.000000 |
50% | NaN | NaN | 17.000000 |
75% | NaN | NaN | 29.000000 |
max | NaN | NaN | 877.000000 |
2−1.月間訪問者数
# 訪問データを月次に加工 df3_monthly = df3.copy() df3_monthly['visit_date'] = df3_monthly.visit_date.map(lambda x: x.strftime('%Y-%m')) df3_monthly = df3_monthly.rename(columns={'visit_date': 'visit_month'}) df3_monthly = df3_monthly.groupby(['visit_month'])['visitors'].sum().reset_index() df3_monthly.head(3)
visit_month | visitors | |
---|---|---|
0 | 2016-01 | 152924 |
1 | 2016-02 | 159934 |
2 | 2016-03 | 189292 |
# 棒グラフで描画 plt.figure(figsize=(14, 8)) plt.bar(np.arange(1,17), df3_monthly.visitors.as_matrix(), tick_label=df3_monthly.visit_month, align="center", color='navy') plt.title("Monthly_air_visitors") plt.xticks(rotation = 30) plt.xlabel("month") plt.ylabel("visitors") plt.grid(True)
2−2.一日の訪問者数
# 訪問予定月を追加 df3_monthly_diff = df3.copy() df3_monthly_diff['visit_month'] = df3_monthly_diff.visit_date.map(lambda x: x.strftime('%Y-%m')) df3_monthly_diff.tail(3)
air_store_id | visit_date | visitors | visit_month | |
---|---|---|---|---|
252105 | air_24e8414b9b07decb | 2017-04-20 | 7 | 2017-04 |
252106 | air_24e8414b9b07decb | 2017-04-21 | 8 | 2017-04 |
252107 | air_24e8414b9b07decb | 2017-04-22 | 5 | 2017-04 |
plt.figure(figsize=(14, 8)) sns.boxplot(x="visit_month", y="visitors", data=df3_monthly_diff)
2−3.予約データと並べて予約率を見てみる
※予約と実訪問日をキーに結合した値ではないため、参考程度に
# 棒グラフと折れ線グラフで描画する height1 = df1_monthly.reserve_visitors.as_matrix() height2 = df3_monthly.visitors.as_matrix() line_height = (df1_monthly.reserve_visitors / df3_monthly.visitors * 100).as_matrix() w1 = 0.4 w2 = w1 / 2 left1 = np.arange(len(height1)) left2 = np.arange(len(height2)) fig, ax1 = plt.subplots(figsize=(14, 8)) p1 = ax1.bar(left1, height1, align='center', width=w1, tick_label=df1_monthly.visit_month, color='deepskyblue') p2 = ax1.bar(left2 + w1, height2, align='center', width=w1, tick_label=df3_monthly.visit_month, color='navy') ax1.set_ylabel('Num of people') ax2 = ax1.twinx() p3 = ax2.plot(left1 + w2, line_height, color="red") ax2.set_ylabel('Reserve Rate') ax2.set_ylim(0, np.nanmax(line_height)*1.75) plt.legend((p1[0], p2[0], p3[0]), ("reserve", "visit", "reserve_rate"), loc='upper right', bbox_to_anchor=(0.7, 0.5, 0.5, 0.5))
2–4.気づき
大口の予約が混ざっている
- 100名を超えるような大口の予約も混ざっている
予約データと訪問データで充実している期間が異なる
- 予約データは2016年11月以降、訪問データは2016年7月以降で充実している
- 2016年7月まで先行展開、2016年10月から本格展開の可能性は否定すべきか
- 予約機能と決済機能の展開時期が異なっていたと考えることもできる
3.air_store_info.csv [Airリザーブのレストラン情報]
- 829店舗
- 14ジャンル
- 103エリア
# .csvをDataFrame化 df2 = pd.read_csv('Data/csv/air_store_info.csv') df2.head(3)
air_store_id | air_genre_name | air_area_name | latitude | longitude | |
---|---|---|---|---|---|
0 | air_0f0cdeee6c9bf3d7 | Italian/French | Hyōgo-ken Kōbe-shi Kumoidōri | 34.695124 | 135.197852 |
1 | air_7cc17a324ae5c7dc | Italian/French | Hyōgo-ken Kōbe-shi Kumoidōri | 34.695124 | 135.197852 |
2 | air_fee8dcf4d619598e | Italian/French | Hyōgo-ken Kōbe-shi Kumoidōri | 34.695124 | 135.197852 |
# データ型を確認
df2.dtypes
air_store_id object
air_genre_name object
air_area_name object
latitude float64
longitude float64
dtype: object
# 統計量を確認 df2.describe(include='all')
air_store_id | air_genre_name | air_area_name | latitude | longitude | |
---|---|---|---|---|---|
count | 829 | 829 | 829 | 829.000000 | 829.000000 |
unique | 829 | 14 | 103 | NaN | NaN |
top | air_eb5788dba285e725 | Izakaya | Fukuoka-ken Fukuoka-shi Daimyō | NaN | NaN |
freq | 1 | 197 | 64 | NaN | NaN |
mean | NaN | NaN | NaN | 35.647042 | 137.415311 |
std | NaN | NaN | NaN | 2.084659 | 3.650971 |
min | NaN | NaN | NaN | 33.211967 | 130.195555 |
25% | NaN | NaN | NaN | 34.695124 | 135.341564 |
50% | NaN | NaN | NaN | 35.658068 | 139.685474 |
75% | NaN | NaN | NaN | 35.694003 | 139.751599 |
max | NaN | NaN | NaN | 44.020632 | 144.273398 |
3−1.ジャンルごとの店舗数
# ジャンルの内訳と頻度 df2.groupby('air_genre_name').size().sort_values(ascending=False).describe()
count 14.000000
mean 59.214286
std 66.013027
min 2.000000
25% 13.250000
50% 25.000000
75% 96.250000
max 197.000000
dtype: float64
# 棒グラフで描画 df2_air_genre_name = df2.groupby('air_genre_name').size().sort_values() plt.barh(np.arange(1,15), df2_air_genre_name.values, tick_label=df2_air_genre_name.index, align="center", color='dodgerblue') plt.title("air_genre_freq") plt.xlabel("frequency") plt.ylabel("genre") plt.grid(True)
3−2.エリアごとの店舗数
# エリアの内訳と頻度 df2.groupby('air_area_name').size().sort_values(ascending=False).describe()
count 103.000000
mean 8.048544
std 11.332219
min 2.000000
25% 2.000000
50% 4.000000
75% 7.500000
max 64.000000
dtype: float64
3−2−1.市区町村ごとの店舗数
# エリアを市区町村レベルに変更、plot時の文字化け対策で置換 df2_city = df2.copy() df2_city['air_area_name'] = df2_city.air_area_name.map(lambda x: ' '.join(x.split(' ')[:2]).translate(str.maketrans('Ōōū', 'Oou'))) df2_city.head(3)
air_store_id | air_genre_name | air_area_name | latitude | longitude | |
---|---|---|---|---|---|
0 | air_0f0cdeee6c9bf3d7 | Italian/French | Hyogo-ken Kobe-shi | 34.695124 | 135.197852 |
1 | air_7cc17a324ae5c7dc | Italian/French | Hyogo-ken Kobe-shi | 34.695124 | 135.197852 |
2 | air_fee8dcf4d619598e | Italian/French | Hyogo-ken Kobe-shi | 34.695124 | 135.197852 |
# 棒グラフで描画 df2_air_area_name = df2_city.groupby('air_area_name').size().sort_values() plt.figure(figsize=(12, 14)) plt.barh(np.arange(1,56), df2_air_area_name.values, tick_label=df2_air_area_name.index, align="center", color='aqua') plt.title("air_genre_freq") plt.xlabel("frequency") plt.ylabel("genre") plt.grid(True)
3−2−2.都道府県ごとの店舗数
# エリアを都道府県レベルに変更 df2_pref = df2_city.copy() df2_pref['air_area_name'] = df2_pref.air_area_name.map(lambda x: ' '.join(x.split(' ')[:1])) df2_pref.head(3)
air_store_id | air_genre_name | air_area_name | latitude | longitude | |
---|---|---|---|---|---|
0 | air_0f0cdeee6c9bf3d7 | Italian/French | Hyogo-ken | 34.695124 | 135.197852 |
1 | air_7cc17a324ae5c7dc | Italian/French | Hyogo-ken | 34.695124 | 135.197852 |
2 | air_fee8dcf4d619598e | Italian/French | Hyogo-ken | 34.695124 | 135.197852 |
# 棒グラフで描画 df2_air_area_name = df2_pref.groupby('air_area_name').size().sort_values() plt.barh(np.arange(1,10), df2_air_area_name.values, tick_label=df2_air_area_name.index, align="center", color='aqua') plt.title("air_genre_freq") plt.xlabel("frequency") plt.ylabel("genre") plt.grid(True)
3−3.エリア x ジャンルの分布
# データフレームをピボット df2_pref_heat = pd.pivot_table(data=df2_pref, values='air_store_id', columns='air_area_name', index='air_genre_name', aggfunc='count').sort_values(by='Tokyo-to',ascending=False) df2_pref_heat = df2_pref_heat.T.sort_values(by='Izakaya',ascending=False) df2_pref_heat.head(3)
air_genre_name | Cafe/Sweets | Izakaya | Dining bar | Italian/French | Bar/Cocktail | Japanese food | Other | Western food | Creative cuisine | Okonomiyaki/Monja/Teppanyaki | Yakiniku/Korean food | Asian | International cuisine | Karaoke/Party |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
air_area_name | ||||||||||||||
Tokyo-to | 87.0 | 87.0 | 65.0 | 65.0 | 54.0 | 38.0 | 16.0 | 14.0 | 5.0 | 5.0 | 4.0 | 2.0 | 2.0 | NaN |
Fukuoka-ken | 45.0 | 27.0 | 14.0 | 11.0 | 9.0 | 11.0 | 5.0 | NaN | 2.0 | NaN | 3.0 | NaN | NaN | NaN |
Hyogo-ken | 10.0 | 24.0 | 4.0 | 9.0 | NaN | 2.0 | NaN | NaN | 4.0 | 2.0 | 2.0 | NaN | NaN | NaN |
# ヒートマップを描画 plt.figure(figsize=(8, 4)) sns.heatmap(df2_pref_heat, annot=True, center=30, fmt='g', cmap='Blues')
3–4.気づき
ジャンル x エリアの分布に規則性が見られる
- 何らかのルールでサンプリングされていそう
4.hpg_reserve.csv [ホットペッパーグルメの予約データ]
- 2,000,320レコード
- 13,325 /4,690店舗の予約データ???
- 訪問予定は、2016-01-01 11:00:00 〜 2017-05-31 23:00:00
- 予約時刻は、2016-01-01 00:00:00 〜 2017-04-22 23:00:00
# .csvをDataFrame化 df4 = pd.read_csv('Data/csv/hpg_reserve.csv') df4.head(3)
hpg_store_id | visit_datetime | reserve_datetime | reserve_visitors | |
---|---|---|---|---|
0 | hpg_c63f6f42e088e50f | 2016-01-01 11:00:00 | 2016-01-01 09:00:00 | 1 |
1 | hpg_dac72789163a3f47 | 2016-01-01 13:00:00 | 2016-01-01 06:00:00 | 3 |
2 | hpg_c8e24dcf51ca1eb5 | 2016-01-01 16:00:00 | 2016-01-01 14:00:00 | 2 |
# データ型を確認
df4.dtypes
hpg_store_id object
visit_datetime object
reserve_datetime object
reserve_visitors int64
dtype: object
# データ型を修正 df4['visit_datetime'] = df4.visit_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S')) df4['reserve_datetime'] = df4.reserve_datetime.map(lambda x: datetime.strptime(x,'%Y-%m-%d %H:%M:%S')) df4.dtypes
hpg_store_id object
visit_datetime datetime64[ns]
reserve_datetime datetime64[ns]
reserve_visitors int64
dtype: object
# 統計量を確認 df4.describe(include='all')
hpg_store_id | visit_datetime | reserve_datetime | reserve_visitors | |
---|---|---|---|---|
count | 2000320 | 2000320 | 2000320 | 2.000320e+06 |
unique | 13325 | 9847 | 11450 | NaN |
top | hpg_2afd5b187409eeb4 | 2016-12-16 19:00:00 | 2016-12-12 21:00:00 | NaN |
freq | 1155 | 10528 | 907 | NaN |
first | NaN | 2016-01-01 11:00:00 | 2016-01-01 00:00:00 | NaN |
last | NaN | 2017-05-31 23:00:00 | 2017-04-22 23:00:00 | NaN |
mean | NaN | NaN | NaN | 5.073785e+00 |
std | NaN | NaN | NaN | 5.416172e+00 |
min | NaN | NaN | NaN | 1.000000e+00 |
25% | NaN | NaN | NaN | 2.000000e+00 |
50% | NaN | NaN | NaN | 3.000000e+00 |
75% | NaN | NaN | NaN | 6.000000e+00 |
max | NaN | NaN | NaN | 1.000000e+02 |
4−1.月間予約数
# 予約数を月次で集計 df4_monthly = df4.copy() df4_monthly['reserve_datetime'] = df4_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m')) drop_col = ['hpg_store_id','visit_datetime'] df4_monthly = df4_monthly.rename(columns={'reserve_datetime': 'reserve_month'}).drop(drop_col,axis=1) df4_monthly = df4_monthly.groupby(['reserve_month']).count().reset_index().rename(columns={'reserve_visitors':'reserve'}) df4_monthly.head(3)
reserve_month | reserve | |
---|---|---|
0 | 2016-01 | 80905 |
1 | 2016-02 | 79934 |
2 | 2016-03 | 98276 |
# 棒グラフで描画 plt.figure(figsize=(14, 8)) plt.bar(np.arange(1,17), df4_monthly.reserve.as_matrix(), tick_label=df4_monthly.reserve_month, align="center", color='pink') plt.title("Monthly_hpg_reserve") plt.xticks(rotation = 30) plt.xlabel("month") plt.ylabel("reserve") plt.grid(True)
4−2.月別訪問予定者数
# 訪問予定者数を月次で集計 df4_monthly = df4.copy() df4_monthly['visit_datetime'] = df4_monthly.visit_datetime.map(lambda x: x.strftime('%Y-%m')) drop_col = ['hpg_store_id','reserve_datetime'] df4_monthly = df4_monthly.rename(columns={'visit_datetime': 'visit_month'}).drop(drop_col,axis=1) df4_monthly = df4_monthly.groupby(['visit_month'])['reserve_visitors'].sum().reset_index() df4_monthly.head(3)
visit_month | reserve_visitors | |
---|---|---|
0 | 2016-01 | 313180 |
1 | 2016-02 | 359597 |
2 | 2016-03 | 546433 |
# 棒グラフで描画 plt.figure(figsize=(14, 8)) plt.bar(np.arange(1,18), df4_monthly.reserve_visitors.as_matrix(), tick_label=df4_monthly.visit_month, align="center", color='pink') plt.title("Monthly_hpg_reserve_visitors") plt.xticks(rotation = 30) plt.xlabel("month") plt.ylabel("reserve_visitors") plt.grid(True)
4−3.予約日から訪問予定日までの日数
# 予約日から訪問予定日までの日数を追加 df4_monthly = df4.copy() df4_monthly['rsrv_2_vst'] = df4_monthly.visit_datetime - df4_monthly.reserve_datetime df4_monthly['rsrv_2_vst'] = df4_monthly.rsrv_2_vst.map(lambda x: x.days) df4_monthly['reserve_datetime'] = df4_monthly.reserve_datetime.map(lambda x: x.strftime('%Y-%m')) df4_monthly = df4_monthly.rename(columns={'reserve_datetime': 'reserve_month'}) df4_monthly.tail(3)
hpg_store_id | visit_datetime | reserve_month | reserve_visitors | rsrv_2_vst | |
---|---|---|---|---|---|
2000317 | hpg_e9151de687b93aa3 | 2017-05-31 21:00:00 | 2017-04 | 49 | 47 |
2000318 | hpg_fae6c96189b4a122 | 2017-05-31 21:00:00 | 2017-04 | 9 | 42 |
2000319 | hpg_0b70de808b55ad1e | 2017-05-31 23:00:00 | 2017-04 | 6 | 40 |
plt.figure(figsize=(14, 8)) sns.boxplot(x="reserve_month", y="rsrv_2_vst", data=df4_monthly)
4−4.気づき
2016年10月以降のデータ量が多い
- 純粋な増加なのか、データ収集基盤の改善か、サンプリングによるものかは不明
11月の予約から訪問までの日差は大きい
- 忘年会に備えた事前予約数が多い
5.hpg_store_info.csv [ホットペッパーグルメのレストラン情報]
- 4,690の店舗データ
- 34ジャンル
- 119エリア
# .csvをDataFrame化 df5 = pd.read_csv('Data/csv/hpg_store_info.csv') df5.head(3)
hpg_store_id | hpg_genre_name | hpg_area_name | latitude | longitude | |
---|---|---|---|---|---|
0 | hpg_6622b62385aec8bf | Japanese style | Tōkyō-to Setagaya-ku Taishidō | 35.643675 | 139.668221 |
1 | hpg_e9e068dd49c5fa00 | Japanese style | Tōkyō-to Setagaya-ku Taishidō | 35.643675 | 139.668221 |
2 | hpg_2976f7acb4b3a3bc | Japanese style | Tōkyō-to Setagaya-ku Taishidō | 35.643675 | 139.668221 |
# データ型を確認
df5.dtypes
hpg_store_id object
hpg_genre_name object
hpg_area_name object
latitude float64
longitude float64
dtype: object
# 統計量を確認 df5.describe(include='all')
hpg_store_id | hpg_genre_name | hpg_area_name | latitude | longitude | |
---|---|---|---|---|---|
count | 4690 | 4690 | 4690 | 4690.000000 | 4690.000000 |
unique | 4690 | 34 | 119 | NaN | NaN |
top | hpg_b6fcb5aea05b867b | Japanese style | Tōkyō-to Shinjuku-ku None | NaN | NaN |
freq | 1 | 1750 | 257 | NaN | NaN |
mean | NaN | NaN | NaN | 35.810261 | 137.675816 |
std | NaN | NaN | NaN | 2.138755 | 3.197233 |
min | NaN | NaN | NaN | 33.311645 | 130.339313 |
25% | NaN | NaN | NaN | 34.692109 | 135.498859 |
50% | NaN | NaN | NaN | 35.659214 | 139.495733 |
75% | NaN | NaN | NaN | 35.703381 | 139.737998 |
max | NaN | NaN | NaN | 43.774846 | 143.714585 |
5−1.ジャンルごとの店舗数
# ジャンルの内訳と頻度 df5.groupby('hpg_genre_name').size().sort_values(ascending=False).describe()
count 34.000000
mean 137.941176
std 323.163055
min 1.000000
25% 5.000000
50% 25.500000
75% 79.750000
max 1750.000000
dtype: float64
# 棒グラフで描画 df5_hpg_genre_name = df5.groupby('hpg_genre_name').size().sort_values() plt.figure(figsize=(8, 10)) plt.barh(np.arange(1,35), df5_hpg_genre_name.values, tick_label=df5_hpg_genre_name.index, align="center", color='orange') plt.title("hpg_genre_freq") plt.xlabel("frequency") plt.ylabel("genre") plt.grid(True)
5−2.エリアごとの店舗数
# エリアの内訳と頻度 df5.groupby('hpg_area_name').size().sort_values(ascending=False).describe()
count 119.000000
mean 39.411765
std 45.887004
min 1.000000
25% 8.000000
50% 20.000000
75% 58.000000
max 257.000000
dtype: float64
5−2−1.市区町村ごとの店舗数
# エリアを市区町村レベルに変更、plot時の文字化け対策で置換 df5_city = df5.copy() df5_city['hpg_area_name'] = df5_city.hpg_area_name.map(lambda x: ' '.join(x.split(' ')[:2]).translate(str.maketrans('Ōōū', 'Oou'))) df5_city.head(3)
hpg_store_id | hpg_genre_name | hpg_area_name | latitude | longitude | |
---|---|---|---|---|---|
0 | hpg_6622b62385aec8bf | Japanese style | Tokyo-to Setagaya-ku | 35.643675 | 139.668221 |
1 | hpg_e9e068dd49c5fa00 | Japanese style | Tokyo-to Setagaya-ku | 35.643675 | 139.668221 |
2 | hpg_2976f7acb4b3a3bc | Japanese style | Tokyo-to Setagaya-ku | 35.643675 | 139.668221 |
# 棒グラフで描画 df5_hpg_area_name = df5_city.groupby('hpg_area_name').size().sort_values() plt.figure(figsize=(12, 22)) plt.barh(np.arange(1,74), df5_hpg_area_name.values, tick_label=df5_hpg_area_name.index, align="center", color='yellow') plt.title("hpg_genre_freq") plt.xlabel("frequency") plt.ylabel("genre") plt.grid(True)
5−2−2.都道府県ごとの店舗数
# エリアを都道府県レベルに変更 df5_pref = df5_city.copy() df5_pref['hpg_area_name'] = df5_pref.hpg_area_name.map(lambda x: ' '.join(x.split(' ')[:1])) df5_pref.head(3)
hpg_store_id | hpg_genre_name | hpg_area_name | latitude | longitude | |
---|---|---|---|---|---|
0 | hpg_6622b62385aec8bf | Japanese style | Tokyo-to | 35.643675 | 139.668221 |
1 | hpg_e9e068dd49c5fa00 | Japanese style | Tokyo-to | 35.643675 | 139.668221 |
2 | hpg_2976f7acb4b3a3bc | Japanese style | Tokyo-to | 35.643675 | 139.668221 |
# 棒グラフで描画 df5_hpg_area_name = df5_pref.groupby('hpg_area_name').size().sort_values() plt.barh(np.arange(1,14), df5_hpg_area_name.values, tick_label=df5_hpg_area_name.index, align="center", color='yellow') plt.title("hpg_genre_freq") plt.xlabel("frequency") plt.ylabel("genre") plt.grid(True)
5−3.エリア x ジャンルの分布
# データフレームをピボット df5_pref_heat = pd.pivot_table(data=df5_pref, values='hpg_store_id', columns='hpg_area_name', index='hpg_genre_name', aggfunc='count').sort_values(by='Tokyo-to',ascending=False) df5_pref_heat = df5_pref_heat.T.sort_values(by='Japanese style',ascending=False) df5_pref_heat.head()
hpg_genre_name | Japanese style | International cuisine | Creation | Seafood | Grilled meat | Italian | Spain Bar/Italian Bar | Chinese general | Karaoke | Korean cuisine | ... | Dim Sum/Dumplings | Spain/Mediterranean cuisine | Sushi | Sweets | Taiwanese/Hong Kong cuisine | Western food | Amusement bar | Bar/Cocktail | Cantonese food | Udon/Soba |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hpg_area_name | |||||||||||||||||||||
Tokyo-to | 805.0 | 370.0 | 152.0 | 121.0 | 118.0 | 116.0 | 109.0 | 45.0 | 28.0 | 27.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN |
Osaka-fu | 196.0 | 44.0 | 63.0 | 35.0 | 57.0 | 36.0 | 23.0 | 9.0 | 3.0 | 4.0 | ... | NaN | NaN | 1.0 | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN |
Fukuoka-ken | 143.0 | 65.0 | 42.0 | 32.0 | 41.0 | 19.0 | 18.0 | 4.0 | 2.0 | NaN | ... | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Hyogo-ken | 106.0 | 29.0 | 30.0 | 24.0 | 35.0 | 13.0 | 8.0 | 13.0 | 1.0 | 4.0 | ... | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | 1.0 | 4.0 | NaN |
Shizuoka-ken | 92.0 | 32.0 | 22.0 | 24.0 | 7.0 | 10.0 | 11.0 | 3.0 | 2.0 | NaN | ... | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 34 columns
# ヒートマップを描画 plt.figure(figsize=(18, 6)) sns.heatmap(df5_pref_heat, annot=True, fmt='g', center=300, cmap='Oranges')
5–4.気づき
サンプリングされているとして、均一に取られていそう
- ランダムにサンプリングか
6.store_id_relation.csv [id横断データ]
- 150店舗、1対1
# .csvをDataFrame化 df7 = pd.read_csv('Data/csv/store_id_relation.csv') df7.head(3)
air_store_id | hpg_store_id | |
---|---|---|
0 | air_63b13c56b7201bd9 | hpg_4bc649e72e2a239a |
1 | air_a24bf50c3e90d583 | hpg_c34b496d0305a809 |
2 | air_c7f78b4f3cba33ff | hpg_cd8ae0d9bbd58ff9 |
# データ型を確認
df7.dtypes
air_store_id object
hpg_store_id object
dtype: object
# 統計量を確認
df7.describe()
air_store_id | hpg_store_id | |
---|---|---|
count | 150 | 150 |
unique | 150 | 150 |
top | air_622375b4815cf5cb | hpg_eaa312b3100ae94b |
freq | 1 | 1 |
7.date_info.csv [カレンダー]
- 2016-01-01〜2017-05-31
# .csvをDataFrame化 df6 = pd.read_csv('Data/csv/date_info.csv') df6.head(3)
calendar_date | day_of_week | holiday_flg | |
---|---|---|---|
0 | 2016-01-01 | Friday | 1 |
1 | 2016-01-02 | Saturday | 1 |
2 | 2016-01-03 | Sunday | 1 |
# データ型を確認
df6.dtypes
calendar_date object
day_of_week object
holiday_flg int64
dtype: object
# データ型を修正 df6['calendar_date'] = df6.calendar_date.map(lambda x: datetime.strptime(x,'%Y-%m-%d')) df6.dtypes
calendar_date datetime64[ns]
day_of_week object
holiday_flg int64
dtype: object
# 統計量を確認 df6.describe(include='all')
calendar_date | day_of_week | holiday_flg | |
---|---|---|---|
count | 517 | 517 | 517.000000 |
unique | 517 | 7 | NaN |
top | 2016-09-08 00:00:00 | Sunday | NaN |
freq | 1 | 74 | NaN |
first | 2016-01-01 00:00:00 | NaN | NaN |
last | 2017-05-31 00:00:00 | NaN | NaN |
mean | NaN | NaN | 0.067698 |
std | NaN | NaN | 0.251471 |
min | NaN | NaN | 0.000000 |
25% | NaN | NaN | 0.000000 |
50% | NaN | NaN | 0.000000 |
75% | NaN | NaN | 0.000000 |
max | NaN | NaN | 1.000000 |