Alessio Petrozziello and Ivan Jordanov
Data Analytics, Recommender Systems, Missing Features, K-NN, Random Forests
Online travel agencies (OTAs) now provide the main service for booking holidays, business trips, accommodation, etc. As in any online service that involves users, items, and decisions, a Recommender System (RS) is needed to help users navigate catalogues and websites. For a travel RS, a pure collaborative filtering approach is not feasible because the user-item matrix is too sparse. This work therefore investigates content-based filtering, focusing on one of its main problems: missing features. An initial exploratory analysis is used to identify a class of poorly ranked properties (e.g., Vacation Rentals (VR)). To deal with the missingness in the data, several state-of-the-art imputation methods (K-NN, Random Forests, and Gradient-Boosted Trees) are investigated, and their performance is critically analysed and tested. These techniques are applied after dataset preprocessing that includes cleaning, feature scaling, and standardization. In addition, k-fold cross-validation is used to validate the imputation results and reduce the risk of overfitting. Three similarity measures (Jaccard, Weighted Hamming, and Fuzzy C-Means rankings) based on engineered non-historical features (amenities and geographical position) are analysed and used to determine the best proxy for unavailable features.
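To make the idea of a similarity-based proxy concrete, the following is a minimal sketch of two of the measures named above, computed on binary amenity vectors. It is an illustration under stated assumptions, not the implementation used in this work: the amenity list, the property names, the weights, and the helper `best_proxy` are all hypothetical, the weighted Hamming form shown (one minus the weighted share of mismatched amenities) is only one common definition, and geographical position is left out.

```python
from typing import Callable, Dict, Sequence

# Hypothetical amenity order: [pool, wifi, parking, kitchen]


def jaccard_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Shared amenities divided by amenities present in either property."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two properties with no amenities at all are treated as identical.
    return both / either if either else 1.0


def weighted_hamming_similarity(
    a: Sequence[int], b: Sequence[int], weights: Sequence[float]
) -> float:
    """One minus the weighted share of amenities on which a and b disagree."""
    total = sum(weights)
    mismatch = sum(w for x, y, w in zip(a, b, weights) if x != y)
    return 1.0 - mismatch / total


def best_proxy(
    target: Sequence[int],
    candidates: Dict[str, Sequence[int]],
    similarity: Callable[[Sequence[int], Sequence[int]], float],
) -> str:
    """Name of the candidate most similar to the target property."""
    return max(candidates, key=lambda name: similarity(target, candidates[name]))


# Illustrative data only: one vacation rental and two candidate proxies.
vacation_rental = [1, 1, 0, 1]
candidates = {"hotel_a": [1, 1, 1, 0], "hotel_b": [0, 1, 0, 1]}
weights = [3.0, 1.0, 1.0, 1.0]  # assumed: the pool matters three times as much

proxy_jaccard = best_proxy(vacation_rental, candidates, jaccard_similarity)
proxy_hamming = best_proxy(
    vacation_rental,
    candidates,
    lambda a, b: weighted_hamming_similarity(a, b, weights),
)
print(proxy_jaccard, proxy_hamming)
```

With these made-up values the two measures select different proxies, because the weighting penalises the missing pool more heavily than the two unweighted mismatches. This is the kind of disagreement that motivates comparing several measures before choosing one.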