---
jupyter:
  jupytext:
    encoding: '# -*- coding: utf-8 -*-'
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: worker_env
    language: python
    name: worker_env
---

<div style="width: 100%;display: flex; align-items: top;">
    <div style="float:left;width: 80%;text-align:left;position:relative">
        <h1>Part 1: Social Media Data</h1>
        <p><strong>Workshop: Social Media, Data Analysis, &amp; Cartography, WS 2023/24</strong></p>
        <p><em><a href="mailto:alexander.dunkel@tu-dresden.de">Alexander Dunkel</a>
        <br> Leibniz Institute of Ecological Urban and Regional Development, 
        Transformative Capacities & Research Data Centre & Technische Universität Dresden, 
        Institute of Cartography</em></p>
        <p><img src="https://kartographie.geo.tu-dresden.de/ad/jupyter_python_datascience/version.svg" style="float:left"></p>
    </div>
    <div style="float: right;">
    <div style="width:300px">
    <img src="https://kartographie.geo.tu-dresden.de/ad/jupyter_python_datascience/FDZ-Logo_DE_RGB-blk_bg-tra_mgn-full_h200px_web.svg" style="position:relative;width:256px;margin-top:0px;margin-right:10px;clear: both;"/>
    <img  src="https://kartographie.geo.tu-dresden.de/ad/jupyter_python_datascience/TU_Dresden_Logo_blau_HKS41.svg" style="position:relative;width:256px;margin-top:0px;margin-right:10px;clear: both;"/>
    </div>
    </div>
</div>

<img src="https://ad.vgiscience.org/links/imgs/2019-05-23_emojimap_campus.png" style="width:800px;text-align:left;position:relative;float:left">


**1. Link the workshop environment centrally from the project folder at ZIH:**

Select the cell below and click <kbd>CTRL+ENTER</kbd> to run the cell. Once the `*` (left of the cell) turns into a number (`1`), the process is finished.

```python
!cd .. && sh activate_workshop_envs.sh
```

<div class="alert alert-warning" role="alert" style="color: black;">
    <details><summary style="cursor: pointer;"><strong>Refresh the browser window afterwards with <kbd>F5</kbd> and select <code>01_intro_env</code> in the top-right corner. Wait until the <code>[*]</code> disappears.</strong></summary>
    Refresh the browser window afterwards with <kbd>F5</kbd>, so that the environment becomes available on the top-right dropdown list of kernels.
    </details>
</div>



<div class="alert alert-success" role="alert" style="color: black;">   
<details><summary style="cursor: pointer;">Well done!</summary>
        <strong>Welcome</strong> to the IfK Social Media, Data Science, & Cartography workshop.
</details></div>


This is the first notebook in a series of four notebooks:
    
1. Introduction to **Social Media data, jupyter and python spatial visualizations**
2. Introduction to **privacy issues** with Social Media data **and possible solutions** for cartographers
3. Specific visualization techniques example: **TagMaps clustering**
4. Specific data analysis: **Topic Classification**

Open these notebooks through the file explorer on the left side.


<div class="alert alert-success" role="alert" style="color: black;">
    <strong><a href="https://www.urbandictionary.com/define.php?term=tl%3Bdr">tl;dr</a></strong>
    <ul>
        <li>Please make sure that <strong>"01_intro_env"</strong> is shown on the 
            <strong>top-right corner</strong>. If not, click & select.</li>
        <li>If the "01_intro_env" is not listed, save the notebook (CTRL+S), and check the list again after a few seconds</li>
        <li>use <kbd>SHIFT+ENTER</kbd> to walk through cells in the notebook</li>
    </ul>
</div>


<div class="alert alert-warning" role="alert" style="color: black;">
    <details><summary style="cursor: pointer;"><strong>User Input</strong></summary>
    <ul>
        <li>We'll highlight sections where you can change parameters</li>
        <li>All other code is intended to be used as-is; do not change it.</li>
    </ul>
    </details>
</div>


<details><summary style="cursor: pointer;"><strong>FAQ</strong></summary>
    <br>
    If you haven't worked with jupyter, these are some tips:
    <div style="width:500px">
    <ul>
        <li><a href="https://jupyterlab.readthedocs.io/en/stable/">Jupyter Lab</a> allows to interactively execute and write annotated code</li>
      <li>There are two types of cells:
          <strong><a href="https://www.markdownguide.org/extended-syntax/">Markdown</a></strong> cells contain only text (annotations),
          <strong>Code</strong> cells contain only python code</li>
      <li>Cells can be executed by <strong>SHIFT+Enter</strong></li>
      <li>The output will appear below</li>
        <li><i>States</i> of python will be kept in-between code cells: This means that a value
          assigned to a variable in one cell remains available afterwards</li>
        <li>This is accomplished with <a href="https://ipython.org/">IPython</a>, an <strong>i</strong>nteractive version of python</li>
      <li><strong>Important: </strong> The order in which cells are executed does not have to be linear.
          It is possible to execute any cell in any order. Any code in the cell will use the
          current "state" of all other variables. This also allows you to update variables.
        </li>
    </ul>
    </div>
</details>
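As a tiny illustration of this shared state (a sketch, not part of the workshop tasks):

```python
# imagine these two lines living in two separate cells
a = 5       # "cell 1": assign a value
a = a + 1   # "cell 2": uses the current state of `a`
print(a)    # → 6; running "cell 2" again would print 7
```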


<details><summary style="cursor: pointer;"><strong>LINKS</strong></summary>
    <br>
    Some links
    <ul>
      <li>This notebook is prepared to be used from the 
            <a href="https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/JupyterHub">TUD ZIH Jupyter Hub</a></li>
      <li>.. but it can also be run locally. For example, with our 
         <a href="https://gitlab.vgiscience.de/lbsn/tools/jupyterlab">IfK Jupyter Lab Docker Container</a></li>
      <li>The contents of this workshop are available in 
         <a href="https://gitlab.vgiscience.de/tud_ifk/mobile_cart_workshop2020"> a git repository</a></li>
      <li>There, you'll also find static HTML versions of these notebooks.</li>
    </ul>
</details>


<details><summary style="cursor: pointer;"><strong>AVAILABLE PACKAGES</strong></summary>
    <br>
    This python environment is prepared for spatial data processing/ cartography. 
    <br>
    The following is a list of the most important packages, with references to documentation:
    <ul>
      <li><a href="https://geoviews.org/user_guide/index.html">Geoviews</a></li>
      <li><a href="https://holoviews.org/">Holoviews</a></li>
      <li><a href="https://docs.bokeh.org/en/latest/index.html">Bokeh</a></li>
      <li> <a href="https://hvplot.holoviz.org/">hvPlot</a></li>
      <li> <a href="https://geopandas.org/">Geopandas</a></li>
      <li> <a href="https://pandas.pydata.org/">Pandas</a></li>
      <li> <a href="https://numpy.org/">Numpy</a></li>
      <li> <a href="https://matplotlib.org/">Matplotlib</a></li>
      <li> <a href="https://contextily.readthedocs.io/en/latest/">Contextily</a></li>
      <li> <a href="https://colorcet.holoviz.org/">Colorcet</a></li>
      <li> <a href="https://scitools.org.uk/cartopy/docs/latest/">Cartopy</a></li>
      <li> <a href="https://shapely.readthedocs.io/en/stable/manual.html">Shapely</a></li>
      <li> <a href="https://pyproj4.github.io/pyproj/stable/">Pyproj</a></li>
        <li> <a href="https://github.com/rhattersley/pyepsg">pyepsg</a></li>
      <li> <a href="https://pysal.org/notebooks/viz/mapclassify/intro.html">Mapclassify</a></li>
      <li> <a href="https://seaborn.pydata.org/">Seaborn</a></li>  
      <li> <a href="http://xarray.pydata.org/en/stable/">Xarray</a></li>  
      <li> <a href="https://ad.vgiscience.org/tagmaps/docs/">Tagmaps</a></li>  
      <li> <a href="https://lbsn.vgiscience.org/">lbsnstructure</a></li>    
    </ul>
    We will explore <i>some</i> functionality of these packages in this workshop.
    <br>
    If you want to run these notebooks at home, try the <a href="https://gitlab.vgiscience.de/lbsn/tools/jupyterlab">IfK Jupyter Docker Container</a>, which includes the same packages.
</details>


## Preparations

We are creating several output graphics and temporary files.

These will be stored in the subfolder **notebooks/out/**.

```python
from pathlib import Path

OUTPUT = Path.cwd() / "out"
OUTPUT.mkdir(exist_ok=True)
```

<details><summary><strong>Syntax: pathlib.Path() / "out" ?</strong></summary>
    Python's pathlib provides convenient, OS-independent access to local filesystems. The same paths
    work on any OS (e.g. Windows or Linux). Path.cwd() returns the current working directory, where the notebook is running.
    See <a href="https://docs.python.org/3/library/pathlib.html">the docs</a>.
</details>
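As a minimal sketch of the `/` operator on paths (the filename is only an example):

```python
from pathlib import Path

# "/" joins path segments in an OS-independent way
p = Path("notebooks") / "out" / "figure.png"
print(p.name)    # → figure.png
print(p.suffix)  # → .png
```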


To reduce the code shown in this notebook, some helper methods are made available in a separate file.

Load the helper module from `../py/modules/tools.py`:

```python
import sys

module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools
```

Activate autoreload of changed python files:

```python
%load_ext autoreload
%autoreload 2
```

## Introduction: VGI and Social Media Data

<!-- #region -->
<div style="width:500px">
    
    
Broadly speaking, GI and User Generated Content can be classified into the following three categories of data:  
<ul>
    <li>Authoritative data that follows objective criteria of measurement, such as Remote Sensing, Land-Use, or Soil data.</li>
    <li>Explicitly volunteered data, such as OpenStreetMap or Wikipedia. This is typically collected by many people,
        who collaboratively work on a common goal and follow more or less specific contribution guidelines.</li>
    <li><strong>Subjective</strong> information sources</li>
    <ul>
        <li>Explicit: e.g. Surveys, Opinions etc.</li>
        <li>Implicit: e.g. Social Media</li>
    </ul>
</ul>
    
<strong>Social Media data</strong> belongs to the third category of subjective information, representing certain views held by groups of people. The difference from surveys is that no interaction is needed between those who analyze the data and those who share it online, e.g. as part of their daily communication.
<br><br>
Social Media data is used in marketing, but it is also increasingly important for understanding people's behaviour, subjective values, and human-environment interaction, e.g. in citizen science and landscape & urban planning. 
    
<strong>In this notebook, we will explore basic routines for accessing Social Media and VGI through APIs and visualizing the results in python.</strong>
</div>
<!-- #endregion -->

### Social Media APIs


- Social Media data can be accessed through public APIs. 
- This will typically only include data that is explicitly made public by users.
- Social Media APIs exist for most networks, e.g.
  [Flickr](https://www.flickr.com/services/api/),
  [Twitter](https://developer.twitter.com/en/docs), or
  [Instagram](https://www.instagram.com/developer/)
  
<details><summary style="cursor: pointer;"><strong>Privacy?</strong></summary>
    We'll discuss legal, ethical and privacy issues with Social Media data in the second notebook: <a href="02_hll_intro.ipynb">02_hll_intro.ipynb</a>
</details>


### Instagram Example


- Retrieving data from APIs requires a specific syntax that is different for each service.
- Commonly, there is an **endpoint** (a URL) that returns data in a structured format (e.g. **json**).
- Most APIs require you to authenticate, but not all (e.g. Instagram, commons.wikimedia.org).

<details><summary><strong>But the Instagram API was discontinued!</strong></summary>
    <div style="width:500px">
    <ul>
    <li>Instagram discontinued their official API in October 2018. However, their Web-API is still available,
        and can be accessed even without authentication.</li>
    <li>One rationale is that users not signed in to Instagram can have "a peek" at images,
        which provides significant attraction to join the network.</li>
        <li>We'll discuss questions of privacy and ethics in the second notebook.</li>
    </ul>
    </div>
</details>


Load Instagram data for a specific Location ID. Get the location ID from a search on Instagram first.

```python
location_id = "1893214" # "Großer Garten" Location
query_url = f'https://www.instagram.com/explore/locations/{location_id}/?__a=1&__d=dis'
```

<div class="alert alert-warning" role="alert" style="color: black;">
    <details><summary><strong>Use your own location</strong></summary>
        Optionally replace "1893214" with another location above. You can search on <a href="https://instagram.com">instagram.com</a> and extract location IDs from the URL. Examples:
        <ul>
            <li><code>657492374396619</code> <a href="https://www.instagram.com/explore/locations/657492374396619/beutlerpark/">Beutlerpark, Dresden</a></li>
            <li><code>270772352</code> <a href="https://www.instagram.com/explore/locations/270772352/knabackshusens-underbara-bad/">Knäbäckshusens beach, Sweden</a></li>
            <li>You name it</li>
        </ul>
    </details>
</div>


<details><summary><strong>Syntax: f'{}' ?</strong></summary>
    This is called an f-string, <a href="https://realpython.com/python-f-strings/">a convenient python convention</a> to concat strings and variables.
</details>
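A minimal sketch of how f-strings interpolate values (the values below are made up):

```python
location_name = "park"  # example values, not taken from the API
media_count = 3
print(f"{media_count} images tagged '{location_name}'")
# → 3 images tagged 'park'
```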

```python
from IPython.core.display import HTML
display(HTML(tools.print_link(query_url, location_id)))
```

<div style="width:500px">
<ul>
    <li><strong>If you're not signed in:</strong> chances are high that you'll see a "Login" page. Since we are working in a workshop, only a limited
  number of requests to Instagram's non-login API are allowed.</li>
    <li>Otherwise, you'll see a <strong>json object</strong> with the latest feed content.</li>
    <li>In the following, we will try to <strong>retrieve this json object</strong> and display it.</li>
</ul>
</div>


<div class="alert alert-info" role="alert" style="color: black;">
    <details><summary><strong>Use your own json, in case automatic download did not work.</strong></summary>
        <div style="width:500px">
            <ul>
                <li>Since it is likely that access without login will not be possible for all in the workshop,
                    we have provided <strong>a sample json</strong>, that will be retrieved if no access is possible</li>
                <li><strong>If automatic download didn't work above</strong>, you can use your own json below, by saving the result from the link above (e.g. park.json) and moving
                it, via drag-and-drop, to the <strong>out</strong> folder on the left.</li>
                <li>This is optional, we also provide a sample download link that will automatically request alternate data</li>
            </ul>
        </div>
    </details>
</div>


First, try to get the json-data without login. This may or may not work:

```python
import requests

json_text = None
response = requests.get(
    url=query_url, headers=tools.HEADER)
```

```python
if response.status_code == 429 \
    or "/login/" in response.url \
    or '"status":"fail"' in response.text \
    or '<!DOCTYPE html>' in response.text:
    print(f"Loading live json failed: {response.text[:250]}")
else:
    # assign first, then write to a temporary file
    json_text = response.text
    with open(OUTPUT / f"live_{location_id}.json", 'w') as f:
        f.write(json_text)
    print("Loaded live json")
```

<div class="alert alert-info" role="alert" style="color: black;">
    <details><summary><strong>"Loaded live json"? Successful</strong></summary>
        <br>
        <div style="width:500px">
            If you see <strong>Loaded live json</strong>, loading of json was successful. if nothing is shown, simply continue, to load sample json below.
        </div>
    </details>
</div>


<div style="width:500px">
    If the the url refers to the "login" page (or status_code 429), access is blocked. In this case, you can open the link using your browser and Instagram-Account to manually download the json and store it in Jupyter (left side) in <code>our/sample.json</code>. If such a file exists, it will be loaded:</div>

```python
if not json_text:
    # check if manual json exists
    local_json = [path for path in OUTPUT.glob('*.json')]
    if len(local_json) > 0:
        # read local json
        with open(local_json[0], 'r') as f:
            json_text = f.read()
        print("Loaded local json")
```

<details><summary><strong>Syntax: [x for x in y] ?</strong></summary>
    This is called a list comprehension, <a href="https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions">
    a convenient python convention</a> to create lists (from e.g. generators etc.).
</details>
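A short sketch comparing a list comprehension with its explicit-loop equivalent:

```python
# list comprehension
squares = [x * x for x in range(5)]

# equivalent explicit loop
squares_loop = []
for x in range(5):
    squares_loop.append(x * x)

print(squares)  # → [0, 1, 4, 9, 16]
```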


Otherwise, if neither live nor local json has been loaded, load sample json from archive:

```python
if not json_text:
    sample_url = tools.get_sample_url()
    sample_json_url = f'{sample_url}/download?path=%2F&files=1893214.json'
    
    response = requests.get(url=sample_json_url)
    json_text = response.text
    print("Loaded sample json")
```

Turn text into json format:

```python
import json
json_data = json.loads(json_text)
```

**Have a peek at the returned data.**

We can use `json.dumps` for this and limit the output to the first 550 characters `[0:550]`.

```python
print(json.dumps(json_data, indent=2)[0:550])
```

The json data is nested. Values can be accessed with dictionary keys.

```python
total_cnt = json_data["native_location_data"]["location_info"].get("media_count")

display(HTML(
    f'''<details><summary>Working with the JSON Format</summary>
    The json data is nested. Values can be accessed with dictionary keys. <br>For example,
    for the location <strong>{location_id}</strong>, 
    the total count of available images on Instagram is <strong>{total_cnt:,.0f}</strong>.
    </details>
    '''))
```

You can find the media under:

```python
for ix in range(1, 5):
    display(str(json_data["native_location_data"]["ranked"]["sections"][ix]["layout_content"]["medias"][0])[0:100])
```

Where `[ix]` is a pointer to a list of three media per row. Below, we loop through these lists and combine them to a single dataframe.


Dataframes are a flexible data analytics interface that is available with `pandas.DataFrame()`. Most tabular data can be turned into a dataframe.


<details><summary><strong>Dataframe ?</strong></summary>
    A <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">pandas dataframe</a>
    is the typical tabular data format used in python data science. Most data can be directly converted to a DataFrame.
</details>
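As a small illustration of how `pd.json_normalize` flattens nested records (the records below are made up, only mirroring the idea of the json structure):

```python
import pandas as pd

# toy nested records (made-up values)
records = [
    {"media": {"pk": "1", "like_count": 10}},
    {"media": {"pk": "2", "like_count": 5}},
]
df_demo = pd.json_normalize(records)
# nested keys become dot-separated column names,
# e.g. "media.pk" and "media.like_count"
print(df_demo.columns.tolist())
```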

```python
import pandas as pd

df = None
# loop through media, skip first item
for data in json_data["native_location_data"]["ranked"]["sections"][1:]:
    df_new = pd.json_normalize(
        data, errors="ignore")
    if df is None:
        df = df_new
    else:
        df = pd.concat([df, df_new])
```

Display the transposed dataframe (attributes as rows):

```python
df.transpose()
```

See an overview of all columns/attributes available at this json level:

```python
from IPython.core.display import HTML
display(HTML(f"<details><summary>Click</summary><code>{[col for col in df.columns]}</code></details>"))
```

<div class="alert alert-info" role="alert" style="color: black;">
    You can have a peek at all attributes of the first item (from the list of images) with <code>df['layout_content.medias'].iloc[0]</code>.
</div>

We want to extract the URLs of images, so that we can download the images in python and display them inside the notebook.

```python
url_list = []
for media_grid in df['layout_content.medias']:
    for media_row in media_grid:
        # collect one url per media: the last entry in the list of image
        # "candidates" (the same image in different resolutions)
        candidates = media_row["media"]["image_versions2"]["candidates"]
        url_list.append(candidates[-1]["url"])
```

**View the first few (15) images**


First, define a function.

<div class="alert alert-info" role="alert" style="color: black;">
    <details><summary><strong>PIL Library</strong></summary>
        <br>
        <div style="width:500px">
            The <a href="https://pillow.readthedocs.io/en/5.1.x/reference/ImageFilter.html">PIL</a> library
            allows transformation of images. In the example below, we apply the <code>resize</code> function
            and the <code>ImageFilter.BLUR</code> filter. The Image is processed in-memory.
            Afterwards, <code>plt.subplot()</code> is used to plot images in a row. Can you modify the code to plot images
            in <em>a multi-line grid</em>?
        </div>
    </details>
</div>

```python
from typing import List
import matplotlib.pyplot as plt

from PIL import Image, ImageFilter
from io import BytesIO

def image_grid_fromurl(url_list: List[str]):
    """Load and show images in a grid from a list of urls"""
    count = len(url_list)
    plt.figure(figsize=(11, 18))
    for ix, url in enumerate(url_list[:15]):
        r = requests.get(url=url)
        i = Image.open(BytesIO(r.content))
        resize = (150, 150)
        i = i.resize(resize)
        i = i.filter(ImageFilter.BLUR)
        plt.subplots_adjust(bottom=0.3, right=0.8, top=0.5)
        ax = plt.subplot(3, 5, ix + 1)
        ax.axis('off')
        plt.imshow(i)
```

Use the function to display images from the collected `url_list`.


All images are public and available without Instagram login, but we still blur images a bit, as a precaution and a measure of privacy.

```python
image_grid_fromurl(
    url_list)
```

<div class="alert alert-info" role="alert" style="color: black;">
    <details><summary><strong>Get Images for a hashtag</strong></summary>
        <br>
        <div style="width:500px">
            Similar to locations, we can get results for a specific hashtag. However, accessibility and paths of media/data constantly change,
            so it is up to the developer to adjust the code.
            <br><br>
            For example, the tag-feed for the hashtag <strong>park</strong> is available at:
            <a href="https://www.instagram.com/explore/tags/park/">https://www.instagram.com/explore/tags/park/</a>.
        </div>
    </details>
</div>


## Creating Maps


<div style="width:500px">
<ul>
    <li>Frequently, VGI and Social Media data contains references to locations such as places or coordinates.</li>
    <li>Most often, spatial references will be available as latitude and longitude (decimal degrees, WGS 1984 projection).</li>
    <li>To demonstrate integration of data, we are now going to query another API, 
        <strong><a href="https://commons.wikimedia.org/">commons.wikimedia.com</a></strong>, to get a list of places near
certain coordinates.</li>
</ul>
</div>


<div class="alert alert-warning" role="alert" style="color: black;">
    <details><summary><strong>Choose a coordinate</strong></summary>
        <div style="width:500px">
            <ul>
                <li>Below, coordinates for the Großer Garten are used.
                They can be found in the <a href="https://www.instagram.com/explore/locations/1893214/?__a=1">json link</a>.</li>
                <li>Substitute with your own coordinates of a chosen place.</li>
            </ul>
        </div>
    </details>
</div>

```python
lat = 51.03711
lng = 13.76318
```

**Get list of nearby places using commons.wikimedia.org's API:**

```python
query_url = f'https://commons.wikimedia.org/w/api.php'
params = {
    "action":"query",
    "list":"geosearch",
    "gsprimary":"all",
    "gsnamespace":14,
    "gslimit":50,
    "gsradius":1000,
    "gscoord":f'{lat}|{lng}',
    "format":"json"
    }
```

```python
response = requests.get(
    url=query_url, params=params)
if response.status_code == 200:
    print(f"Query successful. Query url: {response.url}")
```

```python
json_data = json.loads(response.text)
print(json.dumps(json_data, indent=2)[0:500])
```

Get List of places.

```python
location_dict = json_data["query"]["geosearch"]
```

Turn into DataFrame.

```python
df = pd.DataFrame(location_dict)
display(df.head())
```

```python
df.shape
```

If we have queried 50 records, we have reached the limit specified in our query. There is likely more available, which would need to be queried using subsequent queries (e.g. by grid/bounding box). However, for the workshop, 50 locations are enough.
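One way to go beyond the 50-record limit (hinted at above) would be to repeat the query over a grid of centers. The sketch below is hypothetical: `grid_coords`, the step size, and the grid size are made up for illustration; each yielded center could serve as a separate `gscoord` value.

```python
def grid_coords(lat, lng, step=0.01, n=2):
    """Yield an n x n grid of (lat, lng) query centers around a point.

    Hypothetical helper, not part of the workshop code.
    """
    offsets = [(i - (n - 1) / 2) * step for i in range(n)]
    for dy in offsets:
        for dx in offsets:
            yield lat + dy, lng + dx

centers = list(grid_coords(51.03711, 13.76318))
print(len(centers))  # → 4 query centers for a 2 x 2 grid
```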


**Modify data:** Remove the "Category:" prefix from the title column.

- Functions can be easily applied to subsets of records in DataFrames.
- Although it is tempting, do not iterate through records.
- Dataframe vector-functions are almost always faster and more pythonic.

```python
df["title"] = df["title"].str.replace("Category:", "")
df.rename(
    columns={"title":"name"},
    inplace=True)
```

Turn DataFrame into a **Geo**DataFrame


<details><summary><strong>GeoDataframe ?</strong></summary>
    A <a href="https://geopandas.org/reference/geopandas.GeoDataFrame.html">geopandas GeoDataFrame</a>
    is the spatial equivalent of a pandas dataframe. It supports all operations of DataFrames, plus
    spatial operations. A GeoDataFrame can be compared to a Shapefile in (e.g.), QGis.
</details>

```python
import geopandas as gp
gdf = gp.GeoDataFrame(
    df, geometry=gp.points_from_xy(df.lon, df.lat))
```

Set projection, reproject


<details><summary><strong>Projections in Python</strong></summary>
    <div style="width:500px">
        <ul>
            <li>Most available spatial packages have more or less agreed on a standard format for handling projections in python.</li>
            <li>The recommended way is to define projections using their <strong>epsg ids</strong>, which can be found using <a href="http://epsg.io/">epsg.io</a></li>
            <li>Note that, sometimes, the projection-string refers to other providers, e.g. for <a href="http://epsg.io/54009">Mollweide</a>, it is "ESRI:54009"</li>
        </ul>
    </div>
</details>

```python
CRS_PROJ = "epsg:3857" # Web Mercator
CRS_WGS = "epsg:4326" # WGS1984
gdf.crs = CRS_WGS # Set projection
gdf = gdf.to_crs(CRS_PROJ) # Project
```

```python
gdf.head()
```

**Display location on a map**


- Matplotlib and contextily provide one way to plot static maps.
- We're going to show another, interactive map renderer afterwards.


Import [contextily](https://contextily.readthedocs.io/en/latest/), which provides static background tiles to be used with the matplotlib renderer.

```python
import contextily as cx
```

**1. Create a bounding box for the map**

```python
x = gdf.loc[0].geometry.x
y = gdf.loc[0].geometry.y

margin = 1000 # meters
bbox_bottomleft = (x - margin, y - margin)
bbox_topright = (x + margin, y + margin)
```

<details><summary><strong>gdf.loc[0] ?</strong></summary>
  <ul>  
      <li><code>gdf.loc[0]</code> is the loc-indexer from pandas. It means: access the first record of the (Geo)DataFrame.</li>
      <li><code>.geometry.x</code> is used to access the (projected) x coordinate geometry (point). This is only available for GeoDataFrame (geopandas)</li>
    </ul>
</details>


**2. Create point layer, annotate and plot.**

- With matplotlib, it is possible to adjust almost every pixel individually.
- However, the more fine-tuning is needed, the more complex the plotting code gets.
- In this case, it is better to define methods and functions, to structure and reuse code.


<div class="alert alert-info" role="alert" style="color: black;">
    <details><summary><strong>Code complexity</strong></summary>
        <br>
        <div style="width:500px">
            <li>Matplotlib is not made for spatial visualizations.</li> 
            <li>The code below simply illustrates, how much fine-tuning is possible.</li>
            <li>But it also shows the limits of using matplotlib as a backend. </li>
            <li>We will learn to use easier methods in the following.</li>
        </div>
    </details>
</div>

```python
from matplotlib.patches import ArrowStyle
# create the point-layer
ax = gdf.plot(
    figsize=(10, 15),
    alpha=0.5,
    edgecolor="black",
    facecolor="red",
    markersize=300)
# set display x and y limit
ax.set_xlim(
    bbox_bottomleft[0], bbox_topright[0])
ax.set_ylim(
    bbox_bottomleft[1], bbox_topright[1])
# turn off axes display
ax.set_axis_off()
# add callouts 
# for the name of the places
for index, row in gdf.iterrows():
    # offset labels by odd/even
    label_offset_x = 30
    if (index % 2) == 0:
        label_offset_x = -100
    label_offset_y = -30
    if (index % 4) == 0:
        label_offset_y = 100
    ax.annotate(
        text=row["name"].replace(' ', '\n'),
        xy=(row["geometry"].x, row["geometry"].y),
        xytext=(label_offset_x, label_offset_y),
        textcoords="offset points",
        fontsize=8,
        bbox=dict(
            boxstyle='round,pad=0.5',
            fc='white',
            alpha=0.5),
        arrowprops=dict(
            mutation_scale=4,
            arrowstyle=ArrowStyle(
                "simple, head_length=2, head_width=2, tail_width=.2"), 
        connectionstyle='arc3,rad=-0.3',
            color='black',
            alpha=0.2))
cx.add_basemap(
    ax, alpha=0.5,
    source=cx.providers.OpenStreetMap.Mapnik)
```

<!-- #region -->
There is space for some improvements:
- Labels overlap. The package [adjust_text](https://github.com/Phlya/adjustText) can automatically reduce overlapping annotations in matplotlib. This takes more processing time, however.
- Line breaks after short words don't look good. Use Python's built-in `textwrap` module.


<div class="alert alert-info" role="alert" style="color: black;">
    <details><summary><strong>Code execution time</strong></summary>
        <br>
        <div style="width:500px">
            <ul>
            <li>Adding adjust_text significantly increases code execution time.</li>
            <li>We are reaching the limits of what can be done in plain matplotlib.</li>
            <li>In Notebook 4, we will use Mapnik, which is explicitly made for map label placement, to place thousands of labels in a short time.</li>
            </ul>
        </div>
    </details>
</div>
<!-- #endregion -->
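Before using it for the labels, a quick look at what `textwrap.wrap` does: it splits a string into lines of at most the given width, breaking only between words (the place name here is just a made-up example):

```python
import textwrap

# wrap a (hypothetical) place name at 18 characters per line
lines = textwrap.wrap(
    "Dresden Frauenkirche Neumarkt", 18, break_long_words=True)
print(lines)
# → ['Dresden', 'Frauenkirche', 'Neumarkt']
```

Joining the result with `'\n'.join(...)`, as done below, turns it into a multi-line label.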

```python
from adjustText import adjust_text
import textwrap
import matplotlib.pyplot as plt

# create the point-layer
ax = gdf.plot(
    figsize=(15, 25),
    alpha=0.5,
    edgecolor="black",
    facecolor="red",
    markersize=300)
# set display x and y limit
ax.set_xlim(
    bbox_bottomleft[0], bbox_topright[0])
ax.set_ylim(
    bbox_bottomleft[1], bbox_topright[1])
# turn off axes display
ax.set_axis_off()
# add callouts 
# for the name of the places
texts = []
for index, row in gdf.iterrows():
    texts.append(
        plt.text(
            s='\n'.join(textwrap.wrap(
                row["name"], 18, break_long_words=True)),
            x=row["geometry"].x,
            y=row["geometry"].y,
            horizontalalignment='center',
            fontsize=8,
            bbox=dict(
                boxstyle='round,pad=0.5',
                fc='white',
                alpha=0.5)))
adjust_text(
    texts, autoalign='y', ax=ax,
    arrowprops=dict(
        arrowstyle="simple, head_length=2, head_width=2, tail_width=.2",
        color='black', lw=0.5, alpha=0.2, mutation_scale=4, 
        connectionstyle='arc3,rad=-0.3'))

cx.add_basemap(
    ax, alpha=0.5,
    source=cx.providers.OpenStreetMap.Mapnik)
```

<div class="alert alert-warning" role="alert" style="color: black;">
    <details><summary><strong>Further improvements</strong></summary>
        <div style="width:500px">
            <ul>
                <li>Try adding a title. Suggestion: Use the explicit <code>ax</code> object.</li>
                <li>Add a scale bar. Suggestion: Use the pre-installed package <code>matplotlib-scalebar</code></li>
                <li>Change the basemap to Aerial.</li>
            </ul>
        </div>
    </details>
</div>
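A starting point for the first two suggestions (a sketch on an empty figure, not tied to the data above; the `matplotlib-scalebar` and aerial-basemap lines are commented suggestions based on those packages' documented interfaces, not tested here):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, for this sketch only
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 5))
# a title via the explicit ax object
ax.set_title("Places around the queried location")
ax.set_axis_off()

# a scale bar could be added with the pre-installed matplotlib-scalebar
# package (dx=1, since the projected units are metres):
# from matplotlib_scalebar.scalebar import ScaleBar
# ax.add_artist(ScaleBar(1))

# for an aerial basemap, swap the contextily source, e.g.:
# cx.add_basemap(ax, source=cx.providers.Esri.WorldImagery)
```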



Have a look at the available basemaps:

```python
cx.providers.keys()
```

And a look at the basemaps for a specific provider:

```python
cx.providers.CartoDB.keys()
```

## Interactive Maps


**Plot with HoloViews/GeoViews (Bokeh)**

```python
import holoviews as hv
import geoviews as gv
from cartopy import crs as ccrs
hv.notebook_extension('bokeh')
```

Create point layer:

```python
places_layer = gv.Points(
    df,
    kdims=['lon', 'lat'],
    vdims=['name', 'pageid'],
    label='Place') 
```

Make an additional query to request pictures shown in the area from commons.wikimedia.org:

```python
query_url = 'https://commons.wikimedia.org/w/api.php'
params = {
    "action":"query",
    "list":"geosearch",
    "gsprimary":"all",
    "gsnamespace":6,
    "gsradius":1000,
    "gslimit":500,
    "gscoord":f'{lat}|{lng}',
    "format":"json"
    }
```
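Note the `gscoord` parameter: the API expects latitude and longitude joined by a pipe, which is percent-encoded as `%7C` in the final URL. A minimal sketch of how such a request URL is assembled, using only the standard library and made-up coordinates:

```python
from urllib.parse import urlencode

# made-up example coordinates (lat|lon)
params = {
    "action": "query",
    "list": "geosearch",
    "gscoord": "51.0504|13.7373",
    "format": "json",
}
url = "https://commons.wikimedia.org/w/api.php?" + urlencode(params)
print(url)
# → https://commons.wikimedia.org/w/api.php?action=query&list=geosearch&gscoord=51.0504%7C13.7373&format=json
```

`requests` performs the same encoding internally when given a `params` dict.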

```python
response = requests.get(
        url=query_url, params=params)
print(response.url)
```

```python
json_data = json.loads(response.text)
```

```python
df_images = pd.DataFrame(json_data["query"]["geosearch"])
```

```python
df_images.head()
```

- Unfortunately, this didn't return any information about the pictures themselves. We want to query the thumbnail URL, to show previews on our map.
- For this, we'll first set the pageid as the index (= the key),
- and then use this key to update our DataFrame with thumbnail URLs retrieved from an additional API call.


Set the column type to integer:

```python
df_images["pageid"] = df_images["pageid"].astype(int)
```

Set the index to pageid:

```python
df_images.set_index("pageid", inplace=True)
```

```python
df_images.head()
```
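The same two steps on a toy DataFrame (made-up data), to show why they matter: once `pageid` is an integer index, single records can be updated by key with `.loc`:

```python
import pandas as pd

toy = pd.DataFrame({
    "pageid": ["101", "102"],  # strings, as they might arrive from JSON
    "title": ["File:A.jpg", "File:B.jpg"],
})
toy["pageid"] = toy["pageid"].astype(int)  # ensure integer keys
toy = toy.set_index("pageid")              # pageid becomes the index
print(toy.loc[101, "title"])
# → File:A.jpg
```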

Load additional data from the API: place image URLs

```python
params = {
    "action":"query",
    "prop":"imageinfo",
    "iiprop":"timestamp|user|userid|comment|canonicaltitle|url",
    "iiurlwidth":200,
    "format":"json"
    }
```

See the full list of [available attributes](https://commons.wikimedia.org/w/api.php?action=help&modules=query%2Bimageinfo).


Query the API for a random sample of 50 images:

```python
%%time
from IPython.display import clear_output
from datetime import datetime

count = 0
df_images["userid"] = 0 # set default value
for pageid, row in df_images.sample(n=50).iterrows():
    params["pageids"] = pageid
    try:
        response = requests.get(
            url=query_url, params=params)
    except OSError:
        print(
            "Connection error: Either try again or "
            "continue with limited number of items.")
        break
    json_data = json.loads(response.text)
    image_json = json_data["query"]["pages"][str(pageid)]
    if not image_json:
        continue
    image_info = image_json.get("imageinfo")
    if image_info:
        thumb_url = image_info[0].get("thumburl")
        count += 1
        df_images.loc[pageid, "thumb_url"] = thumb_url
        clear_output(wait=True)
        display(HTML(
            f"Queried {count} image urls, "
            f"<a href='{response.url}'>last query-url</a>."))
        # assign additional attributes
        df_images.loc[pageid, "user"] = image_info[0].get("user")
        df_images.loc[pageid, "userid"] = image_info[0].get("userid")
        timestamp = pd.to_datetime(image_info[0].get("timestamp"))
        df_images.loc[pageid, "timestamp"] = timestamp
    df_images.loc[pageid, "title"] = image_json.get("title")
```
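The loop above navigates a nested JSON structure. A minimal, entirely made-up excerpt of an `imageinfo` response, illustrating the keys the loop reads (all values are hypothetical):

```python
import json

# hypothetical response excerpt; real responses contain more fields
response_text = """
{"query": {"pages": {"123": {
    "pageid": 123,
    "title": "File:Example.jpg",
    "imageinfo": [{
        "user": "ExampleUser",
        "userid": 42,
        "timestamp": "2020-01-01T12:00:00Z",
        "thumburl": "https://upload.wikimedia.org/example-thumb.jpg"}]}}}}
"""
json_data = json.loads(response_text)
# same access pattern as in the loop above
image_json = json_data["query"]["pages"]["123"]
image_info = image_json.get("imageinfo")
thumb_url = image_info[0].get("thumburl")
print(thumb_url)
# → https://upload.wikimedia.org/example-thumb.jpg
```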

<details><summary><code>Connection error</code> ?</summary>
    The Jupyter Hub @ ZIH is behind a proxy. Sometimes, connections will get reset.
    In this case, either execute the cell again or continue with the limited number of items
    retrieved so far.
</details>
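If resets happen frequently, a `requests.Session` with an automatic retry policy can help (a sketch using the documented `urllib3` `Retry` helper; the retry parameters are illustrative assumptions, not values from this workshop):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry up to 3 times with exponential backoff on typical proxy errors
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
# session.get(query_url, params=params) would now retry transient failures
```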


<details><summary><code>%%time</code> ?</summary>
    IPython has a number of built-in "<a href="https://ipython.readthedocs.io/en/stable/interactive/magics.html">magics</a>",