Data Processing / Exploratory Data Analysis

CLEANING

All data manipulation, organization and cleaning were done in R with RStudio. The code and a PDF of everything computed are available for reproducing our results in the following GitHub repository:

link: https://github.com/christinegiang/cyplan101

POLLUTION

  • All of the negative / NA values were removed to prevent skewing of distributions. Negative values do not make sense for air quality measurement units, and most negative values were -999, which usually indicates missing data points.
  • The distributions were analyzed as single points on scatterplots, as well as aggregated weekly & monthly into line graphs using average PM2.5 measurements.
  • Variance was calculated after aggregation, but was not
    large enough to drastically affect interpretation.
  • Variance for monthly aggregation was much worse, so
    we decided not to use it. Instead, daily aggregation
    was plotted on an axis refitted to show months for
    better analysis and visual interpretation.
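
The cleaning and aggregation steps above could be sketched roughly as follows. The original work was done in R; this is only an illustrative Python version using the standard library. The function names and sample readings are hypothetical, and the sample standard deviation is used as the measure of spread, which is an assumption.

```python
import math
from collections import defaultdict
from datetime import date
from statistics import mean, stdev


def clean_readings(readings):
    """Drop missing (None/NaN) and negative PM2.5 values, e.g. the -999 sentinel."""
    cleaned = []
    for day, value in readings:
        if value is None or math.isnan(value) or value < 0:
            continue
        cleaned.append((day, value))
    return cleaned


def aggregate(readings, key):
    """Group readings by key(day); return {group: (mean, sample standard deviation)}."""
    groups = defaultdict(list)
    for day, value in readings:
        groups[key(day)].append(value)
    return {
        group: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for group, values in groups.items()
    }


# Hypothetical sample readings: (date, PM2.5 value)
readings = [
    (date(2016, 1, 1), 35.0),
    (date(2016, 1, 1), -999.0),  # sentinel for a missing reading
    (date(2016, 1, 2), None),    # NA value
    (date(2016, 1, 2), 410.0),
    (date(2016, 2, 1), 12.0),
    (date(2016, 2, 1), 18.0),
]

cleaned = clean_readings(readings)
daily = aggregate(cleaned, key=lambda d: d)
monthly = aggregate(cleaned, key=lambda d: (d.year, d.month))
```

Comparing the spread in `monthly` with the spread in `daily` is one way to check whether a coarser aggregation hides too much variation.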

The standard deviations of the aggregated AQI values are shown to the left. The variance is extremely high, which means these aggregated values are not reliable. Additionally, the aggregated means are too even: they suggest a nearly uniform distribution throughout the year, which is not accurate, since the individual measurements vary greatly, from 0 all the way up to almost 800.

  • Pollution data was visualized all together and also organized by city to view marginal distributions.

When all five cities were graphed together, the trends through 2016 appeared to follow the same patterns by visual inspection. Since historical smog data was only released for these five cities, this suggests that the combined data set can be used as an adequate proxy for the pollution patterns of the entire country.

DEMOGRAPHICS

We were only able to find mean income and population data for each city by year, which was not ideal for our analysis. We graphed these distributions next to the pollution distributions of the cities and did not find any obvious correlations, so we were not able to draw any conclusions from these data points. If in the future we are able to find better proxies for economic data, possibly poverty levels or land use in each region, we could further investigate whether those have any relationship with pollution.

FACE MASK SALES

This data has been organized into a monthly distribution. Additionally, the pollution data was aggregated and subset to a six-month timeline so it could be compared directly with the mask sales trends.

PROVINCES

  • The shapefile did not provide names, so over 50 provinces have been hand-labeled and coordinated with the table.
  • Indicator values have been set to the 5 provinces relevant to our region-specific pollution data.
  • We obtained a Google API key for use with ggmap, a mapping package made for R.
  • The province data was then mapped in Carto to analyze relationships between mobile activity and location.

heatmap
mobile activity data has been overlaid onto the province map
link: https://christinegiang.carto.com/builder/d3b97f31-9002-4130-bdf3-4258d49ae33b/embed

time series
animated representation of mobile activity data from May 1 to May 7, 2016.
link: https://christinegiang.carto.com/builder/73637b89-ddb6-4154-9caa-437536528ad6/embed

MOBILE ACTIVITY

A key question was which indicator we could use to represent urban activity. In a joint research paper by MIT and Tongji University (Yan, Duarte, Wang, Zheng & Ratti, 2018), the researchers used social media check-in data from Sina Weibo, a microblogging platform often described as the Chinese version of Twitter, as the indicator of urban activity. Mobile devices are similarly ubiquitous: 98% of Chinese people surveyed own at least a basic mobile phone. We thus used mobile activity data from Kaggle. The data is collected from the TalkingData SDK integrated within mobile apps, which TalkingData provides under the service terms between TalkingData and mobile app developers. Full recognition and consent from the individual users of those apps have been obtained, and appropriate anonymization has been performed to protect privacy.

Kaggle Data Sample

We started by tabling observations by device ID to retrieve the frequency of entries for each unique person in the dataset. To identify the residential city of each user, we kept only the users with at least 20 entries in a specific region and removed those with fewer, on the assumption that users with few entries do not live in that region.

  • Observations from the first day (April 30) and the last day (May 8) were removed because they only have 800 and 2 points, respectively. Compared with the rest of the days, which contain upwards of 400,000 entries each, these counts are negligible and would not provide much insight.
  • The points have been tabled by device ID to retrieve the frequency of entries from each unique person in the dataset.
  • Users with fewer than 20 entries were removed so that we could infer where each remaining user lives.
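
The filtering steps in the list above could be sketched as follows. This is an illustrative Python version, not the R code used for the project. The field names `device_id` and `day`, the function name, and the sample events are hypothetical. Counting entries after the sparse days are dropped is also an assumption, since the text does not state the order.

```python
from collections import Counter


def filter_events(events, excluded_days, min_entries=20):
    """Drop sparse days, then keep only devices with at least min_entries events."""
    # Remove observations from the days that have too few points
    kept = [e for e in events if e["day"] not in excluded_days]
    # Table the remaining observations by device ID
    counts = Counter(e["device_id"] for e in kept)
    # Keep only events from devices that appear often enough
    return [e for e in kept if counts[e["device_id"]] >= min_entries]


# Hypothetical sample: device "a" is active, "b" is sparse,
# and "c" only appears on an excluded day.
events = (
    [{"device_id": "a", "day": "2016-05-03"}] * 25
    + [{"device_id": "b", "day": "2016-05-04"}] * 5
    + [{"device_id": "c", "day": "2016-04-30"}] * 20
)

filtered = filter_events(events, excluded_days={"2016-04-30", "2016-05-08"})
remaining_devices = sorted({e["device_id"] for e in filtered})
```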

K-MEANS CLUSTERING MAP

K-means clustering, an unsupervised machine learning method, was applied to all mobile activity observations to see where most of the data points are concentrated. The center of each cluster was then plotted onto the province map. The map shows that many of the biggest clusters are near our 5 cities, which gives stronger evidence that the mobile check-in data can be used as an indicator of activity.
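
For readers unfamiliar with the method, a minimal sketch of k-means (Lloyd's algorithm) on longitude/latitude pairs is shown below. It is an illustrative Python version, not the R code used for the map. The sample coordinates and starting centers are hypothetical, and plain Euclidean distance on degrees is a simplification that ignores the curvature of the earth.

```python
def kmeans(points, initial_centers, max_iterations=100):
    """Lloyd's algorithm: alternate between assigning points and updating centers."""
    centers = list(initial_centers)
    k = len(centers)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each center to the mean of its assigned points
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centers.append((mean_x, mean_y))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return centers


# Hypothetical (longitude, latitude) observations around two locations
points = [
    (116.3, 39.9), (116.5, 40.0), (116.4, 39.8),
    (121.4, 31.2), (121.5, 31.3), (121.6, 31.1),
]

# Hypothetical starting guesses: one observation from each group
centers = kmeans(points, initial_centers=[points[0], points[3]])
```

The result depends on the starting centers, so in practice the algorithm is usually run from several random starts and the best result is kept.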
