FAQs
Exactly what data are displayed?
At the moment we offer basic data from tourist apartment ads (location, price, capacity, owner ID, ...). We hope to provide detailed occupancy data in the future.
We collect and show only the information in the ads that we find on the different platforms. The same apartment can have several ads (one for each room, for example), or it can be duplicated on the platform. There are many different cases, so we decided to display only the ads as found on the web.
What methodology do you use?
For data collection, a scraper automates queries to the platform APIs (airbnb, homeaway, housetrip and onlyapartments) and builds a directory of all the ads it finds available (let's call it index). This process usually takes several days, depending on the difficulty of the task (airbnb about 3-5 days, homeaway 1-3, others less). index mainly contains each ad's unique identifier and its associated URL.
Next, index goes through a second scraping process that saves more detailed information in the general data directory (let's call it warehouse), which contains all the ads found since we started (around October 2017). If an ad is detected as new during this process, a new entry is created in warehouse with the current date as its creation date (found). If, on the contrary, the ad already existed, we update its last-revision date (revised) in warehouse. This process usually takes longer (airbnb about 10 days, homeaway 5, others less). Once the process is finished, index is emptied.
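The update step described above can be sketched in a few lines of Python. This is only an illustration, not the actual DataHippo code: the function name update_warehouse, the dictionary layout and the example URLs are hypothetical, and setting revised on a brand-new ad is an assumption (the text only says that found is set).

```python
from datetime import date


def update_warehouse(warehouse, index, today):
    """Merge one scraping run (index) into warehouse, a dict keyed by ad id."""
    for ad_id, url in index.items():
        if ad_id not in warehouse:
            # New ad: record the date it was first seen.
            # Assumption: revised starts out equal to found.
            warehouse[ad_id] = {"url": url, "found": today, "revised": today}
        else:
            # Known ad: only refresh the date it was last seen.
            warehouse[ad_id]["revised"] = today
    # Once the run is finished, index is emptied.
    index.clear()
    return warehouse


# Hypothetical data: one ad already known, one new ad.
warehouse = {
    "a1": {
        "url": "https://example.com/a1",
        "found": date(2017, 10, 1),
        "revised": date(2017, 10, 1),
    }
}
index = {"a1": "https://example.com/a1", "b2": "https://example.com/b2"}
update_warehouse(warehouse, index, date(2017, 11, 1))
```

Note that nothing is ever removed from warehouse in this sketch: an ad that stops appearing simply keeps its old revised date.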
With each new scraping run warehouse gets bigger, since new ads are added and ads that are no longer found are not deleted. This produces a 'historical', incremental database, which is the one offered for download on the web page (apartments.csv). The software is written mainly in Python, except for some higher-level tasks that use bash.
This process must be taken into account when working with the data. To take a current 'snapshot', you need to filter out the old apartments and keep only the current ones, which is done by discarding those with an old revised date.
For the statistics of each region (geojson.json), warehouse is joined with a geographic database built from https://gadm.org/index.html and our own contributions. This database can be viewed in one of the charts on the web page itself.
Why not join data from all platforms?
An apartment is usually listed on one or several platforms, with the same or with different information. If you combine ads from different platforms, you are likely to count many apartments more than once, so the results will be distorted. That is why we offer the data in separate files.
Are you going to identify unique apartments on various platforms?
Identifying the same apartment across all platforms is a hard task that we are not planning to take on at the moment.
How can I find out which ads are currently published?
You have to find the most recent update dates (revised field). You'll see that they span several days (not just one): these are the days of the scraper's last run. Filter the data and keep only the ads revised on those dates.
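A minimal sketch of this filter in Python, using only the standard library. The revised field comes from the data; the function name current_ads, the sample records and the 10-day window are assumptions, so adjust the window to the range of dates you actually see in the file for your platform.

```python
from datetime import date, timedelta


def current_ads(ads, window_days=10):
    """Keep the ads revised during the most recent scraper run.

    Assumption: the last run fits in window_days days before the
    newest revised date found in the data.
    """
    latest = max(ad["revised"] for ad in ads)
    cutoff = latest - timedelta(days=window_days)
    return [ad for ad in ads if ad["revised"] >= cutoff]


# Hypothetical records; real rows would be read from apartments.csv.
ads = [
    {"id": "a1", "revised": date(2018, 8, 20)},
    {"id": "b2", "revised": date(2018, 8, 14)},
    {"id": "c3", "revised": date(2018, 3, 2)},
]
current_ids = [ad["id"] for ad in current_ads(ads)]
```

Here the ads a1 and b2 are kept because they were revised within the window, while c3 is dropped as an old ad.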
How can I find out which ads were published on a specific date?
For each ad we show two dates: the date of the first scraping in which it was found (found) and the date of the last scraping in which it was found (revised). With these dates it is possible to determine which apartments were advertised on a specific date. For example, to determine the ads active on August 14, filter the ads whose 'found' date is before August 14 and whose 'revised' date is after August 14.
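The date filter can be sketched as follows. The found and revised fields come from the data; the function name active_on and the sample records are hypothetical. The comparison here is inclusive (an ad found or revised exactly on the chosen day counts as active), which is an assumption; use strict comparisons if you prefer the stricter reading.

```python
from datetime import date


def active_on(ads, day):
    """Return the ads first seen on or before day and last seen on or after it."""
    return [ad for ad in ads if ad["found"] <= day <= ad["revised"]]


# Hypothetical records; real rows would be read from apartments.csv.
ads = [
    {"id": "a1", "found": date(2018, 5, 1), "revised": date(2018, 9, 1)},
    {"id": "b2", "found": date(2018, 8, 20), "revised": date(2018, 9, 1)},
    {"id": "c3", "found": date(2018, 1, 10), "revised": date(2018, 6, 30)},
]
active_ids = [ad["id"] for ad in active_on(ads, date(2018, 8, 14))]
```

Only a1 is selected: b2 was first found after August 14, and c3 was last seen before it.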
Is it possible to visualize data of any municipality?
Yes, it is possible to navigate between regions. From the page of each country you can access its divisions, from those their provinces, and from these their municipalities.
At the bottom of each region page there are links to its subdivisions.
What do I get if I take all the ads without considering the date?
You get all the ads visited since DataHippo started running, that is, every ad that we have found at any time.
Is the owner ID the same on all platforms?
No. The owner ID is the ID that each platform uses internally to identify the owner, so it is not the same across platforms.
Can you see the price variations of a home?
No, we do not store price variations, only the base price of the ad (which is not the final price), since this kind of data is hard to handle.
The price of the same accommodation on the same day can vary a lot, since the platform allows discounts for long stays, by day of arrival, etc., as well as surcharges in some cases (for example, extra services that the accommodation may offer). If you do a little research, you will see that airbnb has its own API to calculate the price from a number of parameters (dates, number of people, ...), which is where you should start (setting the same parameters for all accommodations, so that they are on the same scale).
Why do some regions have a wrong name?
Region names (districts, municipalities, etc.) were obtained by querying the Google Maps API with this script: GoogleMapsAPI_get_region_name.py. This method is not 100% reliable and may produce errors. If you see any, please let us know.