Stories by Lily Su on Medium

How Mobile Location Data Could Help You Limit Exposure to Covid-19

Lily Su — Mon, 08 Jun 2020 14:50:19 GMT

As NYC moves into Phase 1 reopening, here is a look at peak foot traffic among popular shopping areas in the past 6 weeks.

Going outside the house for any purpose now carries some risk of exposure to Covid-19, but what measures can we take to limit that risk?

Online shopping platforms have been overloaded with orders for most of the social isolation period, making delivery nearly impossible to get in late March and early April, especially in the outer boroughs. Sometimes choices are limited and that one trip is unavoidable, whether it is to the bank, to do laundry, or to get supplies for the next few weeks.

But beyond simply getting up and heading to the grocery store or post office, is there anything one can do to avoid that line around the block?

Mobile Location Data, from Left to Right: Jackson Heights, Union Square, and Rego Park Areas

Even as confirmed cases in NYC dwindle and the weather begins to feel like summer, antibody testing is still fallible and Covid-19 testing remains in short supply. The crux of returning to “normal life” is that infected and susceptible people mix without knowing each other’s status, which leads to an influx of new cases.

As we have seen in Singapore and South Korea, reopening without clear strategies for mitigating high-exposure situations was followed by a resurgence of new cases. Singapore became the country with the most cases in Southeast Asia in late April. After its first week of reopening, South Korea saw a spike in new cases comparable to its levels of more than a month prior.

Instead of the back-and-forth pendulum swings of reopenings and reinstated lockdowns, can our awareness of pedestrian traffic contribute to flattening the curve?

Data Analytics NYC and Predicio used anonymized mobile location data of the past month to get a general idea of how foot traffic to some of the most populous retail areas in our city has fluctuated hour-by-hour, week-by-week so that you might use such insights to have a safer shopping experience and shorten the time of your trip.

The anonymous mobile data used for analysis was provided by Predicio, an up-and-coming location-based behavioral intelligence startup that provides GDPR-compliant and CCPA-compliant location data from apps where users have consented to share their location.

We first looked at the data by homing in on a random sample of 100,000 anonymized mobile location points each hour within a 10 x 10 mile area in the greater NYC region to generate our visualizations hour by hour on Tuesdays over the course of a month.

Out of our 100,000 data points in our region of interest, we looked at the foot traffic density over a quarter-mile radius around a point of interest and generated these visualizations:

We also wanted people to see how pedestrian traffic compares hour by hour over the course of a single day.

Foot Traffic Count out of a Uniform Sample of Mobile Data Quarter Mile Around our Point of Interest

Because the histogram fluctuates widely from day to day, we combined the daily histograms and took the median foot traffic for each hour over the past month.

Median Foot Traffic Between April 21 — May 15 in a Quarter Mile Around Our Point of Interest

As one can see, taking the median foot traffic, which reduces the pull of outlying observations on the general pattern, gives a smoother distribution that matches our assumptions about traffic as it relates to normal sleeping schedules.
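To make the median step concrete, here is a minimal sketch in plain Python. The function name `median_hourly_counts` and the toy counts are our own illustration, not the code or data behind the plots; it assumes each day is a list of 24 hourly counts.

```python
from collections import defaultdict
from statistics import median


def median_hourly_counts(daily_counts):
    """Return the median count for each hour of the day across several days.

    daily_counts maps a day label to a list of 24 hourly counts.
    """
    by_hour = defaultdict(list)
    for counts in daily_counts.values():
        for hour, count in enumerate(counts):
            by_hour[hour].append(count)
    return [median(by_hour[hour]) for hour in range(24)]


# Toy, made-up counts for three days (not real data).
# The third day has an unusual spike at 9 AM.
toy_days = {
    "day_1": [5] * 24,
    "day_2": [6] * 24,
    "day_3": [5] * 9 + [500] + [5] * 14,
}

medians = median_hourly_counts(toy_days)
print(medians[9])  # the 9 AM median is barely affected by the spike
```

A single crowded hour on one day does not move the median much, which is why the combined plot looks smoother than any individual day. With pandas, the same idea would be roughly a groupby on the hour followed by a median.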

We deduced that the least crowded times are right when the store opens and evenings before the establishments close.

We arrived at this conclusion from the Tuesday operating hours at our location of interest, 9 AM to 8 PM, and the foot traffic count by hour. As the establishment approaches opening time, we saw a spike in foot traffic starting from the hour before. Foot traffic does fluctuate from opening time into the afternoon, but it remains high compared to 7 PM, one hour before closing time.

It makes sense that people would not like to run errands in the evenings. Who knows if the merchandise will be out of stock? Worse, if the location is crowded, one may not accomplish the task before the establishment closes.

April 21 — May 15, Day-to-Day Fluctuations Within the 25 Days of a Quarter Mile Radius from Our Point of Interest

Another way we looked at the data was over a longer span of time, on a day-by-day basis, to build upon the patterns we discovered.

In the plot to the left and the one at the bottom, instead of dividing the data into days, we look at a continuous stream of hours. We then performed some time series analysis using ARIMA modeling to get a general sense of repeating patterns over days, weeks and months, experimenting with patterns found in various time spans.

Fitting an ARIMA Model Over a Continuous Stream of Hours. The X-axis Represents Our Sample Head Count, Our Y-axis Represents Hours Over Time Starting April 21st 12AM. (ie. Hour 25 is April 22nd 1 AM)

As you can see, a big part of the work is separating the signal from the noise. One way we are fine-tuning our observations is to isolate our mobile data points to only the block that our point of interest sits upon.

Zeroing in On Just Penn Station, Looking at One City Block

Here’s how we compare to Google Maps’ smoothed-out plot for Penn Station on Tuesdays, after filtering out all foot traffic outside the block that Penn Station sits upon (shown above):

From Google Maps for Penn Station on Tuesdays

As one can see, there is a somewhat wide variation in foot traffic in our data, for various reasons such as certain users being more active on their phones and our limited sample amplifying the swings in our results.

If you are curious as to how we arrived at our visuals, the following will be a window into the technical aspects of our analysis, presented in the form of a tutorial.

Get the data here.

From the Predicio data portal, we extracted data by hour, filtered it to a geographical area of interest, picked a specific location to focus on, then made further analyses. The following shows how we did it and how you can manipulate location information too!

You can use any location data to follow along as long as you have a longitude, latitude and time. Here are two similar open data sets to try your hand at: the NYC Open Data Portal’s pedestrian counts for people crossing the Brooklyn Bridge, or a pre-configured data set with people counts at eight specific high-traffic locations provided by the Leeds City Council.

Depending on the data you have, you might need to narrow down what you want to focus on and convert timestamps to the proper datetime format. The following code condenses some of the wrangling you might need to do.

Here are example steps you might decide to take to organize your data:

1. Narrow down on an area of focus and apply your wrangle function on just the area.

2. Create a new dataframe that filters just the distance from our point of interest.

3. Plot a heatmap for each hour and save a cropped screenshot of the plot.

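import pandas as pd

def wrangle(X):
    # work on a copy so the caller's dataframe is left untouched
    X = X.copy()
    # sample size left blank: fill in your own
    X = X.sample(, replace=True)

    # the timestamps are Unix seconds in UTC; convert them to New York time
    X['date_EST'] = pd.to_datetime(X['timestamp'], unit='s')
    X['date_EST'] = X['date_EST'].dt.tz_localize('UTC')\
                                 .dt.tz_convert('US/Eastern')
    X['date'] = X['date_EST'].dt.date
    X['hour']  = X['date_EST'].dt.hour

    return X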

In our analysis, we took a sample of 100,000 points per hour over a 14 x 14 mile area around the location of interest.

We then processed just our sample of 2,400,000 points (100,000 points for each of 24 hours) using the wrangle function to get the proper datetime format.

Because the filtering process takes a good amount of time, we wrapped our loops in tqdm to see a progress bar.

df_container = []
total_rows = 0
cols = ['timestamp', 'lat', 'lng']

for i in tqdm(range(24)):
    i = str(i).zfill(2)

    for j in range(5):
        try:
            df = pd.read_csv(f'                         {} \
                          rest_of_file_name
                         {}>', \
                          delimiter='\t',
                          error_bad_lines=False,  # pandas 1.3+ spells this on_bad_lines='skip'
                          usecols=cols)
            #keeping track of the total amount of rows for reference
            rows, _ = df.shape
            total_rows += rows

            #selecting only data that is in the 
            #area of interest while reading the data in
            df = df[(df['lat'] > ) & \
                    (df['lat'] < ) & \
                    (df['lng'] < ) & \
                    (df['lng'] > )]

            df = wrangle(df)
            df_container.append(df)

        except Exception:
            # skip files that are missing or cannot be parsed
            continue

df_sample = pd.concat(df_container)

Now that we have a dataframe with the proper time, we’re going to pinpoint our location by creating a new column with our distance from the location of interest using Geopy.

lat = 
lng =

df_sample['distance_in_km'] = [distance.distance((lat, lng), (i,j)).km for i, j in tqdm(zip(df_sample['lat'], df_sample['lng']))]

We then created a new dataframe that isolates only points whose distance to the point of interest is less than 0.5 km (about 0.31 miles, a little more than a quarter-mile radius).

df_point_of_interest = df_sample[df_sample['distance_in_km'] < 0.5]
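If you would rather not depend on Geopy, the same filter can be sketched with the haversine formula in plain Python. This is an illustrative alternative, not what we ran: `haversine_km`, `points_within` and the coordinates are assumptions for the example, and the haversine formula treats the Earth as a sphere, so its distances differ slightly from Geopy’s default geodesic distance.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def points_within(points, center, radius_km):
    """Keep only the (lat, lng) points closer than radius_km to center."""
    return [p for p in points
            if haversine_km(center[0], center[1], p[0], p[1]) < radius_km]


QUARTER_MILE_KM = 0.25 * 1.609344  # about 0.40 km

# Illustrative coordinates (assumed, not taken from the dataset)
center = (40.7506, -73.9935)
points = [
    (40.7506, -73.9935),  # the center itself
    (40.7520, -73.9935),  # roughly 0.16 km north
    (40.7606, -73.9935),  # roughly 1.1 km north
]

nearby = points_within(points, center, QUARTER_MILE_KM)
```

Note that a true quarter mile is about 0.40 km, so a 0.5 km cutoff keeps a slightly wider ring of points.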

To create a histogram in matplotlib, we plotted the dataframe as a groupby like so:

df_point_of_interest.groupby('hour') \
                    .count() \
                    .reset_index() \
                    .plot.bar(x='hour', y='timestamp', \
                              color = 'orange', width=0.9, \
                              figsize=(10,6));

Filtering the Data to Pin Point Within One Block

To isolate specific locations, we draw polygons in mapping software, not unlike connect-the-dots, then convert the polygon shape to a .shp file like the one below. We show an example of Penn Station:

To filter only mobile data points within our polygon, we geocode our longitude and latitude coordinates into a geopandas dataframe, then create a new column saving the results of a conditional for whether or not our point is in the shape.

Finally we perform dataframe filtering to extract only dataframe rows within the shape.

gdf = geopandas.GeoDataFrame(df,
      geometry=geopandas.points_from_xy(df['lng'], df['lat']))

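# True for points that fall inside any polygon in the shapefile
# (newer geopandas releases offer union_all() in place of unary_union)
gdf['within_shape'] = gdf.within(shapefile_df['geometry'].unary_union)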

new_df = gdf[gdf['within_shape'] == True]
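For intuition about what a within-shape test does, here is a minimal ray-casting sketch in plain Python. It is a simplified stand-in for the geopandas call above, not its actual implementation; the `point_in_polygon` helper and the rectangular block are hypothetical, and points lying exactly on an edge are not handled specially.

```python
def point_in_polygon(lng, lat, polygon):
    """Ray-casting test: True if (lng, lat) lies inside the polygon.

    polygon is a list of (lng, lat) vertices in order; the last vertex
    connects back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings that lie to the east of the point.
            if lng < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "block" with made-up coordinates
block = [(-73.995, 40.749), (-73.991, 40.749),
         (-73.991, 40.752), (-73.995, 40.752)]

points = [(-73.993, 40.750),   # inside the block
          (-73.990, 40.750),   # east of the block
          (-73.993, 40.755)]   # north of the block

inside_block = [p for p in points if point_in_polygon(p[0], p[1], block)]
```

An odd number of edge crossings means the point is inside; an even number means it is outside.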

The below is the result of our filtering:

Isolating Our Data to Just Within One Block

With our data filtered, we can safely eliminate data outside our scope of interest.

From Left to Right, Some Popular Shopping Destinations in the Bronx, in the Flatiron Area of Manhattan, and Penn Station

Visualizing Data with a Heatmap

To generate the heatmap above, we used Folium, a wrapper around Leaflet.js which makes beautiful interactive maps that you can view in any browser.

The script we came up with to generate the heatmaps loops over the hours and produces one visual per hour. We introduce it in multiple parts below; it works as follows:

1. For each hour, create a dataframe filtered just for the specific hour

2. The heatmap receives an array of latitude and longitude coordinates and the Folium map settings among other parameters.

3. Save the Folium map under a url name, then with Selenium, go into the url and take a screenshot and save it.

Within a for loop over hours 0 through 23, we create a new dataframe filtered to only the hour in question, then generate a plot.

shared_folder_path =

for i in tqdm(range(24)):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    # open it, go to a website, and get results
    wd = webdriver.Chrome('chromedriver',options=options)
    path = os.getcwd()

    # filtering dataframe per hour
    hour = df_point_of_interest[df_point_of_interest['hour'] == i]

For the plot, we first instantiate a map object, set an epicenter for the map to center itself to, specify a map theme, then indicate a level of zoom. The code snippet below is a continuation of the for-loop above.

    # instantiating the map
    point_of_interest_map = folium.Map(location=[lat, lng],  \
                                       tiles='Stamen Toner', \
                                       zoom_start=14)

For the heatmap object, we pass in a list of latitudes and longitudes, then specify the radius of our heatmap and the max zoom, and add it to the map that we just instantiated. The code snippet below is a continuation of the for-loop above.

    # adding mobile points to map
    HeatMap(data= hour[['lat', 'lng']] \
            .groupby(['lat', 'lng']) \
            .count().reset_index() \
            .values.tolist(), \
            radius=10, \
            max_zoom=13) \
            .add_to(point_of_interest_map)

Lastly, we save our Folium plot as an .html file, then build a file URL that points to the saved file so that it can be opened in the browser. The code snippet below is a continuation of the for-loop above.

    mapfile = f'point_of_interest_hour_{i}'

    # saving map as default Folium html
    point_of_interest_map.save(f'{mapfile}.html')

    # creating pointer towards file location
    tmpurl=f'file://{path}/{mapfile}.html'

    # getting html map into memory and giving it time to load
    wd.get(tmpurl)
    time.sleep(random.randint(5,8))

    # saving it as screenshot to convert from html to png
    wd.save_screenshot(f'{shared_folder_path}{mapfile}.png')
    wd.quit()

Above, we used the Selenium webdriver to go to the webpage, take a screenshot, save it, then quit. Next, we open the screenshot with Pillow and crop it. The code snippet below is a continuation of the for-loop above.

    # opening the screenshot in memory
    im = Image.open(f'{shared_folder_path}{mapfile}.png')

    # getting the width and height
    width, height = im.size

    # crop box: top-left corner at (150, 150),
    # bottom-right corner at (width, height)
    im = im.crop((150, 150, width, height))

Our output thus far looks like the rainbow and black and white image below.

The Folium Plot that we Saved (Left) is Then Annotated (Right)

Because we want our visualizations to be more evocative, we first made some stylistic decisions in a photo editing program like Photoshop, then we automated the process to batch-execute annotations in Python.

We used Pillow, an imaging library, to change the hue and saturation of the heatmap from a rainbow gradient to magentas and cyans. We also eased the map background of black and grays to dark blues and applied text. We plan to showcase a tutorial on image processing in a future article.

This image above, which also uses Pillow for annotation, was generated in Datashader, which handles large datasets well. Datashader integrates with Dask, Holoviews, Bokeh, and Geoviews to create dynamic, zoomable maps overlaid on map tiles. This default code snippet was used to generate each static visualization, which was then compiled via Pillow into a .gif:

agg = ds.Canvas().points(hour_df, 'lng', 'lat')
utils.export_image(tf.shade(agg, cmap=bgyw), filename=img_file_name, background="black", fmt=".png")

We, like you, are still enveloped in wonderment of the power in numbers that all cities illuminate. As New York City begins reopening in the coming weeks, we hope you stay safe.

Thanks for dropping by!

Thanks so much for learning with us. If you’d like to have a conversation surrounding this project, please don’t hesitate to reach out.

Acknowledgements: Thanks to Jef Ntungila for implementing efficient wrangling, statistical analysis, plotting and optimization.

How Mobile Location Data Could Help You Limit Exposure to Covid-19 was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Visualizing Covid-19 Cases for NYC on March 9, 2020

Lily Su — Tue, 10 Mar 2020 03:41:32 GMT

Visualizing Covid-19 Cases for NYC and NYS

The main questions on everyone’s minds these days are: “Should I make choices to stay away from all people to avoid getting infected with the Covid-19 Virus?” and “What is the # of confirmed cases nearest to me?”

Related to these questions is one pressing one: “Where have those infected been exposed to the public?”

Drawing on sources such as the New York State Department of Health, The New York Times, The Wall Street Journal, Business Insider, Patch.com, Spectrum NY1, Gothamist and AMNY, I have gathered relevant data regarding the Covid-19 virus and plotted it on a map of NYC.

To create the map, I sourced geographical shape files from the NYC Open Data Portal — specifically ones with zip codes to map out the NYC neighborhoods and truck route data to display some major roads to help NYC residents orient themselves.

If you are interested in working on visualizing the data, please find my notebook here.

Here is another visual of how the cases have been growing in the last 12 days by county. The data is coming from the New York State Department of Health.

Positive Covid-19 Cases by New York State County

Here is what the data looks like from the New York State Department of Health website, transcribed into a dataframe:

Number of cases by county as a pandas dataframe.

The county boundaries were from gis.ny.gov.

Here below are two more maps of the latest positive cases from Saturday, March 14, with 269 cases and Sunday March 15th with 329 cases in New York City.

Positive Covid-19 Cases by New York State County on March 14th, 2020

Positive Covid-19 Cases by New York State County on March 15th, 2020

Please note that, as of this time, there are many factors that prevent the information released by the New York State Department of Health from being current.

There are limited tests and the tests take time to process.

It can take up to 2 weeks for those infected to develop symptoms.

Here is a chart that shows how these factors played out in China:

Chart from the Journal of the American Medical Association, based on raw case data from the Chinese Center for Disease Control and Prevention

Based on this great analysis by Tomas Pueyo, one can better measure the number of cases by the deaths reported. Here’s what he wrote in the article relating to deaths:

If you have deaths in your region, you can use that to guess the number of true current cases. We know approximately how long it takes for that person to go from catching the virus to dying on average (17.3 days). That means the person who died on 2/29 in Washington State probably got infected around 2/12.

Then, you know the mortality rate. For this scenario, I’m using 1% (we’ll discuss later the details). That means that, around 2/12, there were already around ~100 cases in the area (of which only one ended up in death 17.3 days later).

Now, use the average doubling time for the coronavirus (time it takes to double cases, on average). It’s 6.2. That means that, in the 17 days it took this person to die, the cases had to multiply by ~8 (=2^(17/6)). That means that, if you are not diagnosing all cases, one death today means 800 true cases today.

As of today, there are two deaths reported in New York City. It is projected that the death rate of Covid-19 is around 3.6%. So according to him, with two deaths today, there are 1600 true cases in New York City.
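The arithmetic in the quoted passage can be written as a short function. This is a rough sketch of that back-of-the-envelope estimate using the numbers quoted above (a 1% fatality rate, 17.3 days from infection to death, and a 6.2-day doubling time); `estimate_true_cases` is my own name for it. The quote rounds the growth factor up to about 8, so the unrounded result comes out somewhat lower than 800 per death.

```python
def estimate_true_cases(deaths, fatality_rate=0.01,
                        days_to_death=17.3, doubling_days=6.2):
    """Rough estimate of current true cases implied by deaths reported today."""
    # Cases that existed when the people who died today were infected
    cases_at_infection = deaths / fatality_rate
    # How much the case count has multiplied since then
    growth_factor = 2 ** (days_to_death / doubling_days)
    return cases_at_infection * growth_factor


per_death = estimate_true_cases(1)
for_two_deaths = estimate_true_cases(2)
print(round(per_death), round(for_two_deaths))
```

The estimate is very sensitive to its inputs: a higher assumed fatality rate lowers the implied case count, and a shorter doubling time raises it.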

3 Things You Probably Didn’t Know About Municipal Services in NYC

Lily Su — Mon, 10 Feb 2020 18:30:42 GMT

What defines a city? Why would anyone want to live in a city?

One may think that a city is defined as the epicenter of the densest population, which has its merits. Recent studies of population density argue that cities, towns, villages, districts and counties are all constructs of the ruling state based on historical population metrics that, over time, become outdated, since boundaries are much more difficult to change than they were to set up.

Here is what the New York State Comptroller has to say regarding the concept of a village:

“All State residents live in either a city or a town, as their boundaries do not overlap. Villages, in contrast, are located within towns, and their residents pay taxes to both the village and town. Historically, villages tend to be formed from the more densely populated section of a town — the area where additional services were likely to be needed.

In essence, villages were a smaller version of a city, providing services not available in a town, such as water, sewer, police and fire protection. However, suburbanization led to changes in law that allowed such services to be provided without the creation of a village (often through “special districts”).

Today, police, water, sewer, sanitation and fire protection services are provided routinely throughout towns, and the incorporation of a village is no longer necessary for these purposes.”

Let’s take a look at some quantifiable data regarding the city and draw our own conclusions on the ingenuity of New York City’s urban planning of municipalities.

The graph to the left plots road length in feet by borough. The Department of Sanitation breaks Queens and Brooklyn down into 2 parts each.

The road length converted to miles is in the right-hand column of the spreadsheet below, condensed by borough; for example, Manhattan (MN) has 677 miles of roads.

Below is a pie chart representation of the above spreadsheet. The recorded roads and lengths are provided by the Department of Sanitation, which publishes them via the NYC Open Data Portal under Snow Plowing Priorities. Roads are mapped out by the department in order of plowing priority, as shown in a later plot.

Some of the longest roads are Northern Blvd, also known as New York State Route 25A, at 12.5 miles; Francis Lewis Blvd, named after a signer of the Declaration of Independence, at 10.8 miles; and Queens Boulevard, at 7.5 miles.

During the 1920s and 1930s the boulevard was widened in conjunction with the digging of the IND Queens Boulevard Line subway tunnels. … Trenches had to be dug up in the center of the thoroughfare, and to allow pedestrians to pass over the construction, temporary bridges were built.

Now let’s take a look at the NYC population pulling from the Census data from 2010:

In pink are the critical routes for snow plowing. Population density is shown by the colors filled in between the roads.

Below are comparisons of population by numbers of days in a week that trash is picked up.

Are there any conclusions that you can draw from these visuals?

Here is a Jupyter notebook of the code that generated the visualizations.

Works Cited:

Hevesi Alan G. “Local Government Issues In Focus.” Outdated Municipal Structures, Office Of the New York State Comptroller, Vol.2.№3., 10/2006, www.osc.state.ny.us/localgov/pubs/research/munistructures.pdf p.3. 10/02/2020.

New York State Route 25A. Wikipedia. 15/01/2020. Web. en.wikipedia.org/wiki/New_York_State_Route_25A

Queens Boulevard. Wikipedia. 15/01/2020. Web. en.wikipedia.org/wiki/Queens_Boulevard

Thoughts on: “Release Strategies and the Social Impacts of Language Models” OpenAI, Published…

Lily Su — Sat, 24 Aug 2019 20:40:05 GMT

Thoughts on: “Release Strategies and the Social Impacts of Language Models” OpenAI, Published August 2019

Photo by Einar H. Reynis on Unsplash

We have yet to see who or what will become responsible for the negative consequences to come of large-scale language models.

The OpenAI paper is the company’s explanation of the delayed release of the full GPT-2 model. It accompanies the release of a third, larger, but still not full model, and cites observations of existing uses of language models for harmful purposes as the cause of the delay.

While providing some records of OpenAI’s efforts to collaborate with various institutions in studying the risks associated with the release of language models, the paper acknowledges that, once released, the language model’s output cannot easily be traced back, with high accuracy, to its synthetic source.

The paper classifies types of risks on the social-political scale, concluding that among those with moderate programming skill there is “a minimal immediate risk of a fully-integrated malicious application” and excusing itself from taking on advanced persistent threats with:

“Given the specialization required, OpenAI cannot devote significant resources to fighting APT [advanced persistent threat] actors.”

Yes, but what are some ideas besides the obvious that OpenAI has? Brain-washing? Fraudulent misleading emails? Tampering important instructions? The mass generation of fake comments?

A more forthright approach would present further ideas of how language models can be misused beyond “generating fake news articles or building spambots for forums and social media”, and would also present more analysis of methods for preventing bias, since GPT-2 was developed by scraping links to articles from Reddit, a discussion site filled with special interest groups, many of which are controversial due to the ease of anonymizing one’s identity.

There is an obviously constrained political correctness in tone, with an appearance of empathy and a dash of recitation, that only goes deep enough to deflect the main topics of concern by means of referral, masking OpenAI’s real need as an organization to uphold its reputation.

Without a preventative entity responsible for the regulation of language model usage, we await the next big victimization prior to any policy implementation. We all know that policy development is a slow process; so will there ever be a knowledgeable entity who can operate in industry on a financially independent model? How can the largest language model development entities be prompted to come together to create one amid competition for releasing the best model?

Finding Word Patterns in Online Reviews that Help Businesses Improve

Lily Su — Sun, 11 Aug 2019 14:24:21 GMT

How We Did It: Word Trend Analysis in Online Reviews

It can be daunting for a business owner of any popular establishment to sort through their online reviews.

If we assume that each customer explained the specifics behind the star rating they gave, we can begin to sift through the comments and use repeated words to find patterns and themes in how an establishment can improve.

Though there are many types of analysis on review data, this blog post explores word and phrase usage versus the number of stars given in Yelp reviews of coffee shops in Austin, Texas for 2015–2016.

Here are resources for you to replicate this project:

Here is the URL of the interactive visualization above. (It takes a while to load.)

This is the Jupyter Notebook of my code to analyse and generate such visualizations.

The open-source dataset we used can be found here at the data.world site.

The Scattertext library source code and documentation can be found here.

This is what the original data frame looks like of coffee shop names, the text of the review, and the star ratings:

A Squarify plot of most frequent word occurrences after tokenization and removal of stop words:

Word Popularity as a Squarify Plot

A bar graph via Matplotlib by word count:

Count of Word Occurrences in All Reviews
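As a rough idea of how such word counts can be produced, here is a minimal sketch using only the Python standard library. The stop-word list and the two review snippets are made up for illustration, and the regular-expression tokenizer is a simplification of what an NLP library would do, so the notebook itself may differ.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real projects usually use a library's list
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "i", "but"}


def word_counts(reviews):
    """Count lowercase word tokens across reviews, skipping stop words."""
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# Made-up review snippets, not from the dataset
reviews = [
    "The coffee was great and the staff was friendly.",
    "Great coffee, but the line is long.",
]

counts = word_counts(reviews)
print(counts.most_common(3))
```

The resulting counts can be passed to a bar chart or a Squarify treemap, where each word’s area or bar height is its count.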

An exploration of terms, following the Scattertext tutorial, that finds the terms associated with the highest and the lowest star ratings:

Term Frequencies by High and Lowest Rating Score

Gif of Interactive Scattertext Plot, Run on D3

Motor Vehicle Injuries and Deaths from Nov. 1, 2018 — Apr. 8, 2019

Lily Su — Fri, 19 Apr 2019 17:56:24 GMT

Motor Vehicle Injuries and Deaths from November 2018 — April 2019

Click here to manipulate the above visual.

It’s important for us as humans to have a greater awareness of our state of being in regards to understanding ourselves in relation to our surroundings.

In this thread of thought, we are interested in sharing our findings on the risks of travel by motor vehicles in the NYC 5-Borough Region.

We have created visualizations that show the occurrence of accidents by motor vehicles that resulted in damages of over $1000 in the last 6 months, specifically spanning the dates of November 1st, 2018 to April 21, 2019 reported by the Traffic Stat department of the NYPD.

This information was downloaded from the NYC Open Data portal, which is publicly available from the site courtesy of the NYC government.

We found it informative that there were more accidents than we ever imagined, and more casualties than we’ve ever thought for a space so local.

Click here to manipulate the above visual.

There were 111 deaths from November 1st, 2018 to April 21, 2019 due to motor vehicle accidents. The locations of those deaths are mapped to the left.

What can we deduce from the data?

It is not surprising that more accidents, with or without casualties, tend to happen near major highways in highly trafficked areas.

The 101,304 accidents reported in this period amount to a little over 1% of the population of NYC, which is roughly 8 million.

What seems surprising is the sheer mass of casualties from these accidents that NYC residents have never heard of, while other types of casualty situations around the world are what get broadcast in the media.

There were 101,304 motor vehicle accidents in total reported by the NYPD between Nov. 01, 2018 and Apr. 21, 2019. Of these, 20,008 accidents resulted in injuries, which is about 19.8% of all accidents, and 81,296 accidents resulted in no casualties.
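The shares can be recomputed from the raw counts with a few lines of Python. The helper name `share` is our own; the counts are the ones reported in this article.

```python
total_accidents = 101_304
with_injuries = 20_008
without_casualties = 81_296


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


injury_share = share(with_injuries, total_accidents)
no_casualty_share = share(without_casualties, total_accidents)
print(injury_share, no_casualty_share)
```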

Society’s knowledge of these facts can define the society’s awareness to its human condition. The next step is to define the affinity of influencing future outcomes.

It is possible to find patterns that show where city municipalities need more work and more change. The data holds more clarity and more reasons for accountability for greater safety.

Click here for the Jupyter Notebook to see the code source.

Learning From Open NYC Data

Lily Su — Thu, 11 Apr 2019 15:36:17 GMT

How Open NYC Data Can Help You

Please note that the above graph captures complaints above 3%.

NYC’s releasing of its city operational data is a step toward a more effective city government.

When an NYC resident calls 311, the operator logs the call by complaint type and location, and city departments’ responses are tracked over time until the case is closed. Making this data accessible to all allows for more eyes and brains on ways to improve our urban standard of living. Deducing new patterns in this data may help make city municipal services more impactful for the greater good.

Data Analysis NYC is here to act as a bridge between that data and the residents of NYC, to minimize the informational asymmetry that exists between individuals and institutional entities.

Machine Learning for the Human Body — A Layman’s Summary of Sparse Inertial Poser: Automatic 3D…

Lily Su — Fri, 28 Sep 2018 21:40:57 GMT

Machine Learning for the Human Body — A Layman’s Summary of Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs

Here are some notes I took to better understand the paper: Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs

The body is broken down into its constituent vertices with sets of coordinates. Those coordinates are displaced in two ways:

The offset of the vertices is influenced by:

1: 23 joints that have to do with the pose of the body

2: 1786 high-resolution 3D scans in total, made up of multiple poses of 20 females and 20 males.

All base bodies are rigged with the 23 joints. Every point on the body is influenced by one to four of the 23 joints. This is because, for example, wrist joint movement will not affect ankle movement.

Each vertex has a weight based on how important it is in influencing the joint location.

The joint location is determined.

The weights of the vertices around the joints are determined.

The vertex locations based on scanned body shapes are determined.
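The idea that each point follows one to four joints, each with its own weight, can be sketched as linear blend skinning. This is a simplified 2D illustration under my own assumptions, not the paper's implementation: `rotate_about`, `blend_vertex`, and the two-joint arm are hypothetical, and each joint is treated as an independent rotation rather than part of a full skeleton chain.

```python
import math

def rotate_about(point, pivot, angle):
    """Rotate a 2D point around a pivot by angle (in radians)."""
    x, y = point[0] - pivot[0], point[1] - pivot[1]
    c, s = math.cos(angle), math.sin(angle)
    return (pivot[0] + c * x - s * y, pivot[1] + s * x + c * y)

def blend_vertex(vertex, joints, weights):
    """Linear blend skinning for one vertex.

    joints:  {name: (pivot, angle)} rotation applied by each joint
    weights: {name: weight} for the one to four joints influencing this vertex
    """
    assert 1 <= len(weights) <= 4, "each vertex follows one to four joints"
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    x = y = 0.0
    for name, w in weights.items():
        pivot, angle = joints[name]
        px, py = rotate_about(vertex, pivot, angle)
        # The posed vertex is the weighted average of where each joint would move it.
        x += w * px
        y += w * py
    return (x, y)

# Hypothetical arm: the shoulder stays still, the elbow bends 90 degrees.
joints = {
    "shoulder": ((0.0, 0.0), 0.0),
    "elbow": ((1.0, 0.0), math.pi / 2),
}

forearm = blend_vertex((2.0, 0.0), joints, {"elbow": 1.0})
near_elbow = blend_vertex((1.2, 0.0), joints, {"shoulder": 0.5, "elbow": 0.5})
```

A vertex on the forearm follows the elbow completely, while a vertex near the elbow crease is split between the two joints, which is what keeps the bend smooth.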

There are two different ways this process can be used:

One can put in a human body in a default pose and churn out a posed body. In this case, only the posing algorithms are used: the joints are placed, and the vertices of the model carry weights based on those joints.

Or, one can generate a new body with a unique body shape from the database that the algorithm was trained on, one that is accurate in both the base pose and various posed positions.

How it’s used: in Maya with some Python-based scripts for adjustments, and in Unity with some C# scripts for adjustments.

Machine Learning for the Human Body — A Layman’s Summary of Coregistration: Simultaneous Alignment…

Lily Su — Fri, 28 Sep 2018 20:44:03 GMT

Machine Learning for the Human Body — A Layman’s Summary of Coregistration: Simultaneous Alignment and Modeling of Articulated 3D Shape

— TL;DR: How It’s Better to Automate the Cleaning of 3D Body Scan Data By Matching Scans to Correct Templates and Learning a Formula as You Go.

I am attempting to understand the paper Coregistration: Simultaneous Alignment and Modeling of Articulated 3D Shape by regurgitating the information in a way that someone with a basic understanding of machine learning concepts can follow. Here is some additional supplemental material from the research.

To dumb down the title further for fun: How to Double Task When Fixing Messy Scans

Why Are We Doing This Research:

Having a 3D representation of a solid surface is important in making sense of coordinate data. The focus of this paper is on automatically cleaning up choppy scans with missing areas into a model that is anatomically correct and as smooth and solid as possible.

Now the issue at hand: what is the ideal body and pose to match a messy scan to? What is our guide system? We combine two techniques from previous papers into a learn-as-we-go approach that places major landmark markers on the body as we morph a template body into the messy scan.

The Scans:

First time we did it: 124 scans of 3 females in standing and sitting poses (really 2, but 1 of them was scanned 2 years apart and she changed). Of those, 30 scans of each person were used as testing data.

Second time we did it: We used the dataset from a previous research project: 337 scans of 34 different women in 35 poses. 36 landmarks were used for each scan.

To improve things: We randomly selected 4 females with 10 poses each.

Each scan had 12 landmarks manually placed.

What we did:

We trained on our data by running our method through a first batch, then used that information to calibrate the start of a second run through the same batch.

How we did it:

Transformations:

We thought of the 3D scan data as standalone triangles that can be deformed in a 3x3 Rubik’s Cube-like caged volume.

We deformed each triangle in 3 ways, and then stitched the triangles back together.

The 3 deformations were driven by what the body shape is, what the body pose is, and how fat, muscle and bone react around the joints, armpits and groin areas.
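The three deformations can be pictured as three 3x3 matrices applied to the edges of one triangle. The sketch below is my own simplification, not the paper's code: the helper names, the order in which the matrices are multiplied, and the example matrices are all assumptions, and the stitching step that reconnects neighboring triangles is left out.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, v):
    """Apply a 3x3 matrix to a 3D vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def deform_triangle(vertices, shape, soft_tissue, pose):
    """Deform one triangle's two edges with three matrices, keeping vertex 0 fixed."""
    v0, v1, v2 = vertices
    # Assumed order: shape first, then soft-tissue response, then pose rotation.
    combined = matmul(pose, matmul(soft_tissue, shape))
    edges = [[b - a for a, b in zip(v0, v)] for v in (v1, v2)]
    new_edges = [apply(combined, e) for e in edges]
    return [list(v0)] + [[a + d for a, d in zip(v0, e)] for e in new_edges]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shape = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical: wider body
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
pose = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]  # hypothetical: 90 degree turn about z

triangle = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
result = deform_triangle(triangle, shape, identity, pose)
```

Because every triangle is deformed on its own, neighboring triangles no longer share corners afterward, which is why a stitching step is needed.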

Stuff that is done:

In the first round of learning, scans with manual landmarks are placed in close alignment with the landmarks of the respective template. During the next round, we don’t work with landmarks.

We use PCA (Principal Component Analysis) to match scan to template.

For the joints, armpits and groin areas, we stiffen the trajectory of where the limbs have been projecting so it doesn’t look like Harry Potter’s bent rubber arm. Linear Blend

Calculating Error:

We warp the scan to match the template. The distance error between the scan and the template is then calculated. Geman-McClure Robust Error Function
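To see why a robust error function helps with messy scans, here is a small sketch of the Geman-McClure function in its common form, r² / (r² + σ²). The `sigma` scale and the sample distances are hypothetical, and the paper may use a differently scaled variant.

```python
def geman_mcclure(residual, sigma=1.0):
    """Geman-McClure robust error: grows like r^2 for small r, levels off below 1 for large r."""
    r2 = residual * residual
    return r2 / (r2 + sigma * sigma)

def robust_scan_error(distances, sigma=1.0):
    """Sum the robust error over scan-to-template distances."""
    return sum(geman_mcclure(d, sigma) for d in distances)

# Hypothetical distances: two good matches and one wild outlier (e.g. a hole in the scan).
distances = [0.0, 1.0, 100.0]
robust_total = robust_scan_error(distances)
squared_total = sum(d * d for d in distances)
```

With plain squared error the outlier dominates everything, while here each point can contribute less than 1, so one bad patch of the scan cannot drag the whole alignment around.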

We apply a penalty for deformations so that there is a limit to how much the scans can deform. Frobenius Norm
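The Frobenius norm is just the square root of the sum of squared matrix entries, which makes it a handy way to measure how big a deformation is. The sketch below assumes the penalty measures how far a 3x3 deformation is from "no deformation" (the identity matrix); that choice, the `weight` parameter, and the function names are my assumptions, and the paper's exact penalty terms may differ.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def deformation_penalty(deformation, weight=1.0):
    """Penalty that grows as a 3x3 deformation moves away from the identity."""
    diff = [
        [deformation[i][j] - (1.0 if i == j else 0.0) for j in range(3)]
        for i in range(3)
    ]
    return weight * frobenius_norm(diff) ** 2

no_change = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
stretched = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical stretch

penalty_none = deformation_penalty(no_change)
penalty_stretch = deformation_penalty(stretched)
```

A triangle left alone costs nothing, and the cost rises the more it is stretched or sheared.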

As the scans are matched to each other, we use two mathematical formulas in the calculation to transform the scan, and they are constantly shifting based on what is being learned.

There is a different optimal policy for each different body.

We prevent over-fitting by performing regularization regression in varying the

Making it better:

To prevent the formula from over-accounting for details that do not apply to future bodies, the calculation is run several times with the intensity of deformations set on a scale of 0.25 to 5, and the best run is then picked. Over-fitting, Regularization Regression
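The "run it several times and pick the best" idea can be sketched with a toy regularized regression. This is an analogy under my own assumptions, not the paper's procedure: I treat the 0.25 to 5 scale as a regularization strength, the in-between values are hypothetical, and the one-number model and data are made up for illustration.

```python
def fit_ridge(xs, ys, strength):
    """Closed-form 1D ridge regression with no intercept: w = sum(xy) / (sum(x^2) + strength)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + strength)

def squared_error(w, xs, ys):
    """Sum of squared prediction errors for weight w."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def pick_best_strength(train, held_out, strengths):
    """Fit once per strength and keep the run with the lowest held-out error."""
    results = []
    for s in strengths:
        w = fit_ridge(*train, s)
        results.append((squared_error(w, *held_out), s, w))
    return min(results)  # (error, strength, weight) of the best run

# Hypothetical noisy training data and clean held-out data for y = x.
train = ([1.0, 2.0, 3.0], [1.2, 1.9, 3.2])
held_out = ([4.0, 5.0], [4.0, 5.0])
strengths = [0.25, 0.5, 1.0, 2.5, 5.0]

best_error, best_strength, best_weight = pick_best_strength(train, held_out, strengths)
```

Too little regularization chases the noise in the training data and too much flattens the fit, so the best run sits somewhere in between.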

Terms Mentioned in Paper:

Rodrigues Vectors

Linear Blend

Geman-McClure Robust Error Function

PCA

Frobenius Norm