Tokyo Photographs

In [1]:

from IPython.display import display_markdown

display_markdown(open("README.md").read(), raw=True)

Tokyo data

This dataset contains a sample of geotagged images uploaded to Flickr for the Tokyo region. The original extract (generated by Meixu Chen, meixu@liverpool.ac.uk) is stored for archival purposes as tokyo.csv.

Source: Yahoo Flickr Creative Commons 100 Million Dataset
URL: https://yahooresearch.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images

Processing: the transformations applied to the original extract, including random subsetting to a more manageable size for pedagogical purposes, are documented in tokyo_cleaning.ipynb
Clean file: tokyo_clean.csv

Metadata

For every record, the following information is provided:

user_id: the unique ID number of each Flickr user.
longitude: longitude of the geotagged Flickr photo in decimal degrees, under the WGS 1984 geographic coordinate system.
latitude: latitude of the geotagged Flickr photo in decimal degrees, under the WGS 1984 geographic coordinate system.
date_taken: the date when the photo was taken.
photo/video_page_url: a URL where the photo/video content is available.
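As a rough illustration of the schema above, the sketch below loads two made-up rows and checks that the documented columns are present and that the coordinates fall within valid WGS 1984 ranges. The sample values, the `example.com` URLs, and the timestamp format are assumptions for illustration only; they are not taken from the dataset.

```python
import io

import pandas as pd

# Two synthetic rows (NOT real records); the date format is an assumption
sample_csv = io.StringIO(
    "user_id,longitude,latitude,date_taken,photo/video_page_url\n"
    "u1,139.70,35.66,2012-05-01 10:15:00,http://example.com/1\n"
    "u2,139.77,35.68,2013-11-20 18:40:00,http://example.com/2\n"
)

# Parse date_taken as a datetime column while reading
sample = pd.read_csv(sample_csv, parse_dates=["date_taken"])

# Columns listed in the metadata description
expected_columns = [
    "user_id",
    "longitude",
    "latitude",
    "date_taken",
    "photo/video_page_url",
]
missing = [c for c in expected_columns if c not in sample.columns]

# Longitude and latitude must lie within the valid geographic ranges
coords_ok = bool(
    sample["longitude"].between(-180, 180).all()
    and sample["latitude"].between(-90, 90).all()
)
```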

In [65]:

%matplotlib inline

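import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Point is needed below to build geometries from the coordinate columns
from shapely.geometry import Point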

In [72]:

db = pd.read_csv('data/tokyo.csv')

Randomly subsetting

In [73]:

# Set the "seed" so every run generates the same random numbers
np.random.seed(1234)
# Create a sequence of length equal to the number of rows in the table
ri = np.arange(len(db))
# Randomly reorganize (shuffle) the values
np.random.shuffle(ri)
# Reindex the table by using only the first 10,000 numbers 
# of the (now randomly arranged) sequence
db = db.iloc[ri[:10000], :]
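For comparison, pandas offers a built-in way to draw a reproducible random subset. The sketch below uses a small synthetic table (the values are made up) rather than the Flickr data. Note that `DataFrame.sample` with the same seed would not pick the same rows as the shuffle-based approach above, so it is an alternative, not a drop-in replacement for reproducing tokyo_clean.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Flickr table (values are illustrative only)
toy = pd.DataFrame(
    {
        "user_id": np.arange(100),
        "longitude": np.linspace(139.5, 139.9, 100),
        "latitude": np.linspace(35.5, 35.9, 100),
    }
)

# Draw 10 rows without replacement; random_state makes the draw repeatable
subset_a = toy.sample(n=10, random_state=1234)
subset_b = toy.sample(n=10, random_state=1234)
```

Running the draw twice with the same `random_state` returns the same rows, which plays the same role as calling `np.random.seed` before shuffling.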

Reproject XY coordinates in separate columns

In [74]:

%%time
pts = db.apply(lambda r: Point(r.longitude, r.latitude), axis=1)

CPU times: user 431 ms, sys: 4.86 ms, total: 436 ms
Wall time: 436 ms

In [75]:

# The {'init': 'epsg:4326'} dict form is deprecated; pass the CRS as a string
gdb = gpd.GeoDataFrame(db.assign(geometry=pts),
                       crs="EPSG:4326")
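The row-wise `apply` used to build the points can be avoided in recent geopandas releases. The sketch below, run on a tiny synthetic table with made-up coordinates, uses `gpd.points_from_xy` to create the geometries in one vectorized call. It assumes a geopandas version that provides this function and is offered as an alternative, not as the method used to produce tokyo_clean.csv.

```python
import geopandas as gpd
import pandas as pd

# Synthetic coordinates near Tokyo (illustrative values only)
toy = pd.DataFrame(
    {"longitude": [139.70, 139.77], "latitude": [35.66, 35.68]}
)

# Build all point geometries at once and tag them as WGS 1984 lon/lat
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["longitude"], toy["latitude"]),
    crs="EPSG:4326",
)
```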

In [76]:

%%time
gdb = gdb.to_crs(epsg=3857)

CPU times: user 529 ms, sys: 7.46 ms, total: 536 ms
Wall time: 747 ms
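To make the reprojection less of a black box, the sketch below writes out the spherical Web Mercator formulas that EPSG:3857 is based on, using only the standard library. The function name `lonlat_to_web_mercator` and the sample coordinate are hypothetical, chosen for illustration; in the notebook itself the transformation is delegated to `to_crs`.

```python
import math

# Sphere radius used by EPSG:3857 (equal to the WGS84 semi-major axis), in metres
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert WGS 1984 lon/lat in degrees to Web Mercator x/y in metres."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    # x grows linearly with longitude
    x = EARTH_RADIUS_M * lon
    # y stretches with latitude, which is why areas near the poles are inflated
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


# A made-up coordinate roughly in central Tokyo
x, y = lonlat_to_web_mercator(139.6917, 35.6895)
```

The resulting `x` and `y` are in metres rather than degrees, which is what makes the projected columns convenient for distance-based operations later on.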

In [77]:

%%time
xys = gdb['geometry'].apply(lambda pt: pd.Series({'x': pt.x, 'y': pt.y}))
gdb['x'] = xys['x']
gdb['y'] = xys['y']

CPU times: user 2.13 s, sys: 20.3 ms, total: 2.15 s
Wall time: 2.16 s
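Extracting the coordinates with a per-row `apply` that builds a `pd.Series` is the slowest step timed above. A possible alternative, sketched here on synthetic points with made-up coordinates, is to read the `x` and `y` attributes of the geometry column directly, which returns one value per point without a Python-level loop.

```python
import geopandas as gpd
import pandas as pd

# Synthetic coordinates near Tokyo (illustrative values only)
toy = pd.DataFrame(
    {"longitude": [139.70, 139.77], "latitude": [35.66, 35.68]}
)
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["longitude"], toy["latitude"]),
    crs="EPSG:4326",
).to_crs(epsg=3857)

# GeoSeries.x and GeoSeries.y return the projected coordinates as Series
toy_gdf["x"] = toy_gdf.geometry.x
toy_gdf["y"] = toy_gdf.geometry.y
```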

In [79]:

gdb.drop('geometry', axis=1).to_csv('tokyo_clean.csv', index=False)

Download link

{download}`Download the tokyo_clean.csv file <tokyo_clean.csv>`