02_hll_intro.md 41.2 KiB
Alexander Dunkel committed

```python tags=["active-ipynb"]
union_de_fr = pd.concat([grid_de, grid_fr])
union_de_uk = pd.concat([grid_de, grid_uk])
union_uk_fr = pd.concat([grid_uk, grid_fr])
```

**Calculate union**

```python tags=["active-ipynb"]
grid_unions = {
    "de-uk": union_de_uk,
    "de-fr": union_de_fr,
    "uk-fr": union_uk_fr
}
distinct_common = {}
# use a separate loop variable so the dict being iterated is not shadowed
for country_tuple, grid_union in grid_unions.items():
    cardinality = union_all_hll(
        grid_union["usercount_hll"].dropna())
    distinct_common[country_tuple] = cardinality
    print(
        f"{distinct_common[country_tuple]} distinct total users "
        f"who shared YFCC100M photos from either {country_tuple.split('-')[0]} "
        f"or {country_tuple.split('-')[1]} (union)")
```

**Calculate intersection**

```python tags=["active-ipynb"]
distinct_intersection = {}
for a, b in [("de", "uk"), ("de", "fr"), ("uk", "fr")]:
    a_total = distinct_users_total[a]
    b_total = distinct_users_total[b]
    common_ref = f'{a}-{b}'
    intersection_count = a_total + b_total - distinct_common[common_ref]
    distinct_intersection[common_ref] = intersection_count
    print(
        f"{distinct_intersection[common_ref]} distinct users "
        f"who shared YFCC100M photos from {a} and {b} (intersection)")
```
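
The subtraction above is the two-set inclusion-exclusion principle, $|A \cap B| = |A| + |B| - |A \cup B|$. As a minimal sketch with exact Python sets (the user IDs below are made up for illustration and are not YFCC100M data), the same arithmetic recovers the true intersection size:

```python
# Hypothetical user IDs, for illustration only
users_de = {"u1", "u2", "u3", "u4"}
users_uk = {"u3", "u4", "u5"}

# Sizes that can be estimated directly: the two totals and the union
total_de = len(users_de)
total_uk = len(users_uk)
union_de_uk_count = len(users_de | users_uk)

# Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
intersection_de_uk = total_de + total_uk - union_de_uk_count

# With exact sets, the result equals the directly computed intersection
assert intersection_de_uk == len(users_de & users_uk)
print(intersection_de_uk)
```

With HLL sets the three inputs are estimates, so the result is an estimate as well.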

Finally, let's get the number of users who have shared pictures from all three countries, based on the [formula for three sets](https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle):

$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$

which can also be written as:

$|A \cap B \cap C| = |A \cup B \cup C| - |A| - |B| - |C| + |A \cap B| + |A \cap C| + |B \cap C|$
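
The rearranged formula can be checked on small exact sets. This is only a sketch; the three sets below are hypothetical and chosen so the result is easy to verify by hand:

```python
# Hypothetical sets, for illustration only
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7}
set_c = {5, 7, 8}

# Inputs of the formula: union of all three and the pairwise intersections
union_abc = len(set_a | set_b | set_c)
pair_ab = len(set_a & set_b)
pair_ac = len(set_a & set_c)
pair_bc = len(set_b & set_c)

# |A ∩ B ∩ C| = |A ∪ B ∪ C| - |A| - |B| - |C| + |A ∩ B| + |A ∩ C| + |B ∩ C|
triple = (
    union_abc
    - len(set_a) - len(set_b) - len(set_c)
    + pair_ab + pair_ac + pair_bc)

# Only element 5 is contained in all three sets
assert triple == len(set_a & set_b & set_c)
print(triple)
```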


**Calculate distinct users of all three countries:**

```python tags=["active-ipynb"]
union_de_fr_uk = pd.concat(
    [grid_de, grid_fr, grid_uk])
cardinality = union_all_hll(
    union_de_fr_uk["usercount_hll"].dropna())
union_count_all = cardinality
union_count_all
```

```python tags=["active-ipynb"]
country_a = "de"
country_b = "uk"
country_c = "fr"
```

**Calculate intersection**

```python tags=["active-ipynb"]
intersection_count_all = union_count_all - \
    distinct_users_total[country_a] - \
    distinct_users_total[country_b] - \
    distinct_users_total[country_c] + \
    distinct_intersection[f'{country_a}-{country_b}'] + \
    distinct_intersection[f'{country_a}-{country_c}'] + \
    distinct_intersection[f'{country_b}-{country_c}']
    
print(intersection_count_all)
```

### Visualize intersection using Venn diagram


To visualize this with [matplotlib-venn](https://github.com/konstantint/matplotlib-venn),
we need one value for each of the seven regions labelled below:

```python tags=["active-ipynb"]
from matplotlib_venn import venn3, venn3_circles
v = venn3(
    subsets=(
        500,
        500, 
        100,
        500,
        100,
        100,
        10),
    set_labels = ('A', 'B', 'C'))
v.get_label_by_id('100').set_text('Abc')
v.get_label_by_id('010').set_text('aBc')
v.get_label_by_id('001').set_text('abC')
v.get_label_by_id('110').set_text('ABc')
v.get_label_by_id('101').set_text('AbC')
v.get_label_by_id('011').set_text('aBC')
v.get_label_by_id('111').set_text('ABC')
plt.show()
```

We already have `ABC`; the other values can be calculated:

```python tags=["active-ipynb"]
ABC = intersection_count_all
```

```python tags=["active-ipynb"]
ABc = distinct_intersection[f'{country_a}-{country_b}'] - ABC
```

```python tags=["active-ipynb"]
aBC = distinct_intersection[f'{country_b}-{country_c}'] - ABC
```

```python tags=["active-ipynb"]
AbC = distinct_intersection[f'{country_a}-{country_c}'] - ABC
```

```python tags=["active-ipynb"]
# |A| consists of four regions: Abc, ABc, AbC and ABC
Abc = distinct_users_total[country_a] - ABc - AbC - ABC
```

```python tags=["active-ipynb"]
aBc = distinct_users_total[country_b] - ABc - aBC - ABC
```

```python tags=["active-ipynb"]
abC = distinct_users_total[country_c] - aBC - AbC - ABC
```
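
As a cross-check, the seven exclusive regions must add up to the size of the union of all three sets. The following sketch wraps the derivation in a hypothetical helper, `venn_regions` (not part of the notebook or of matplotlib-venn), and tests it on small exact sets with made-up members:

```python
from typing import Tuple


def venn_regions(
        a: int, b: int, c: int,
        ab: int, ac: int, bc: int,
        abc: int) -> Tuple[int, int, int, int, int, int, int]:
    """Return exclusive region sizes (Abc, aBc, ABc, abC, AbC, aBC, ABC)

    a, b, c are set totals, ab, ac, bc are pairwise intersections
    and abc is the intersection of all three sets.
    """
    # pairwise regions without the part shared by all three sets
    only_ab = ab - abc
    only_ac = ac - abc
    only_bc = bc - abc
    # each total minus the three other regions inside that circle
    only_a = a - only_ab - only_ac - abc
    only_b = b - only_ab - only_bc - abc
    only_c = c - only_ac - only_bc - abc
    return (only_a, only_b, only_ab, only_c, only_ac, only_bc, abc)


# Hypothetical sets, for illustration only
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7}
set_c = {5, 7, 8}
regions = venn_regions(
    len(set_a), len(set_b), len(set_c),
    len(set_a & set_b), len(set_a & set_c), len(set_b & set_c),
    len(set_a & set_b & set_c))

# The seven exclusive regions partition the union
assert sum(regions) == len(set_a | set_b | set_c)
print(regions)
```

With HLL estimates, small regions can come out slightly negative because of estimation error, so the same check will only hold approximately.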

## Illustrate intersection (Venn diagram)

Order of values handed over: Abc, aBc, ABc, abC, AbC, aBC, ABC

Define a function to plot the Venn diagram.

```python
from typing import List, Tuple

def plot_venn(
    subset_sizes: List[int],
    colors: List[str], 
    names: List[str],
    subset_sizes_raw: List[int] = None,
    total_sizes: List[Tuple[int, int]] = None,
    ax = None,
    title: str = None):
    """Plot Venn Diagram"""
    if not ax:
        fig, ax = plt.subplots(1, 1, figsize=(5,5))
    set_labels = (
        'A', 'B', 'C')
    v = venn3(
        subsets=(
            [subset_size for subset_size in subset_sizes]),
        set_labels = set_labels,
        ax=ax)    
    for ix, idx in enumerate(
        ['100', '010', '001']):
        v.get_patch_by_id(
            idx).set_color(colors[ix])
        v.get_patch_by_id(
            idx).set_alpha(0.8)
        v.get_label_by_id(
            set_labels[ix]).set_text(
            names[ix])
        if not total_sizes:
            continue
        raw_count = total_sizes[ix][0]
        hll_count = total_sizes[ix][1]
        # signed difference, so the "+" format below can show over- and underestimates
        difference = hll_count - raw_count
        v.get_label_by_id(set_labels[ix]).set_text(
            f'{names[ix]}, {hll_count},\n'
            f'{difference/(raw_count/100):+.1f}%')
    if subset_sizes_raw:
        for ix, idx in enumerate(
            ['100', '010', None, '001']):
            if not idx:
                continue
            dif_abs = subset_sizes[ix] - subset_sizes_raw[ix]
            dif_perc = dif_abs / (subset_sizes_raw[ix] / 100)
            v.get_label_by_id(idx).set_text(
                f'{subset_sizes[ix]}\n{dif_perc:+.1f}%')            
    label_ids = [
        '100', '010', '001',
        '110', '101', '011',
        '111', 'A', 'B', 'C']
    for label_id in label_ids:
        v.get_label_by_id(
            label_id).set_fontsize(14)
    # draw borders
    c = venn3_circles(
        subsets=(
            [subset_size for subset_size in subset_sizes]),
        linestyle='dashed',
        lw=1,
        ax=ax)
    if title:
        ax.title.set_text(title)
```

Plot Venn Diagram:

```python tags=["active-ipynb"]
subset_sizes = [
    Abc, aBc, ABc, abC, AbC, aBC, ABC]
colors = [
    color_de, color_uk, color_fr]
names = [
    'Germany', 'United Kingdom','France']
plot_venn(
    subset_sizes=subset_sizes,
    colors=colors,
    names=names,
    title="Common User Count")
```

**Combine Map & Venn Diagram**

```python tags=["active-ipynb"]
# figure with subplot (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize=(10, 24))
plot_map(
    grid=grid, sel_grids=sel_grids, 
    sel_colors=sel_colors, ax=ax[0])
plot_venn(
    subset_sizes=subset_sizes,
    colors=colors,
    names=names,
    ax=ax[1])
# store as png
fig.savefig(
    OUTPUT / "hll_intersection_ukdefr.png", dpi=300, format='PNG',
    bbox_inches='tight', pad_inches=1)
```

<div class="alert alert-info" role="alert" style="color: black;">
    <details><summary><strong>Error rates</strong></summary>
        <div style="width:500px"><ul>
        <li>Guaranteed error rates (2-3%) apply to any HLL union operation</li>  
        <li>When intersecting HLL sets, error rates may <strong>increase</strong>, depending on the size of sets</li> 
        <li>This is a limitation, but also provides a protection that prevents identifying individual users through intersection</li>
        <li>Have a look at the <a href="https://ad.vgiscience.org/yfcc_gridagg/04_interpretation.html">YFCC100M paper notebook</a>, where we have created the Venn diagram with raw and HLL data, illustrating error rates</li>
            </ul>
</div>
    </details>
</div>     
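
To see why intersections are more sensitive than unions, consider a rough worst-case bound. The sketch below assumes that each of the three estimates ($|A|$, $|B|$, $|A \cup B|$) deviates by at most 2% (the lower end of the range stated above) and that all deviations point in the unfavourable direction. This is a simplification for illustration, not a description of how HLL errors are actually distributed, and `worst_case_intersection_error` is a hypothetical helper:

```python
def worst_case_intersection_error(
        size_a: int, size_b: int, size_intersection: int,
        rel_error: float = 0.02) -> float:
    """Worst-case relative error of |A| + |B| - |A ∪ B|

    Assumes each of the three estimates is off by at most rel_error.
    """
    size_union = size_a + size_b - size_intersection
    # absolute errors of the three estimates add up in the worst case
    max_abs_error = rel_error * (size_a + size_b + size_union)
    return max_abs_error / size_intersection


# Two sets of 10,000 users each, with a large and a small overlap
large_overlap = worst_case_intersection_error(10_000, 10_000, 5_000)
small_overlap = worst_case_intersection_error(10_000, 10_000, 100)
print(f"{large_overlap:.0%}, {small_overlap:.0%}")
```

Under these assumptions the bound is about 14% for the large overlap and several hundred percent for the small one: the smaller the intersection relative to the sets, the less reliable the estimate.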


## Create Notebook HTML

**Save the Notebook**, then execute the following cell to convert to HTML (archive format).

```python
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./02_hll_intro.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-
```

## Summary


<div class="alert alert-warning" role="alert" style="color: black;">
    <details><summary><strong>Notes</strong></summary>
        <div style="width:500px">
        <ul>
            <li><a href="https://github.com/AdRoll/python-hll/">Python-hll</a>, the <a href="https://lbsn.vgiscience.org/">lbsn structure</a> and other tools shown in this work are in an early stage of development</li>
            <li>Adapting workflows to the privacy-aware data structure requires effort</li>
            <li>Many, but not all, visualizations are suited to be used with HLL data</li>
            <li>The <a href="https://lbsn.vgiscience.org/">lbsn structure</a> is a <strong>convention</strong>; there are many different ways to use and apply HLL in visual analytics. With the structure, we have specifically looked at the utility of HLL for privacy.</li>
           </ul>
        </div>
    </details>
</div>


<div class="alert alert-info" role="alert" style="color: black;">
    <details><summary><strong>Further work</strong></summary>
        <div style="width:500px">
        <ul>
            <li>Have a look at the <a href="https://lbsn.vgiscience.org/yfcc-introduction/">tutorial section</a></li>
            <li>Try to replicate the <a href="https://lbsn.vgiscience.org/environment/">Minimal example</a>, which explains how to start <code>rawdb</code> and <code>hlldb</code> locally using Docker</li>
            <li>Clone and run <a href="https://gitlab.vgiscience.de/ad/yfcc_gridagg">YFCC100M grid aggregation notebooks</a>, which demonstrate the full pipeline of importing, processing and visualizing data</li>
           </ul>
        </div>
    </details>
</div>


```python tags=["hidden"]
root_packages = [
    'python', 'colorcet', 'holoviews', 'ipywidgets', 'geoviews', 'hvplot',
    'geopandas', 'mapclassify', 'memory_profiler', 'python-dotenv', 'shapely',
    'matplotlib', 'sklearn', 'numpy', 'pandas', 'bokeh', 'fiona',
    'matplotlib-venn', 'xarray', 'panel']
tools.package_report(root_packages)
```