# <details><summary style="cursor: pointer;"><strong>Unfamiliar with HLL?</strong></summary>
# <ul>
# <li>A basic introduction to working with HLL data is provided <a href="https://kartographie.geo.tu-dresden.de/python_datascience_course/02_hll_intro.html">in this jupyter notebook</a></li>
# <li>The scoping study for the sunset-sunrise paper was based on the Flickr YFCC 100M dataset, and the corresponding notebooks can be found <a href="https://gitlab.vgiscience.de/ad/yfcc_gridagg">here</a></li>
# <li>The notebooks for the sunset-sunrise article can be found <a href="https://gitlab.vgiscience.de/ad/sunset-sunrise-paper">here</a></li>
# </ul>
# </details>
# First, let's test the slower Python implementation [python-hll](https://github.com/AdRoll/python-hll) to get the cardinality for some of the original HLL strings. Cardinality means the estimated count of distinct items in a set, which in our case is the user count.
# + tags=[]
from python_hll.util import NumberUtil
from python_hll.hll import HLL
# + tags=[]
def hll_from_byte(hll_set: str):
    """Return HLL set from binary representation"""
    hex_string = hll_set[2:]
    return HLL.from_bytes(
        NumberUtil.from_hex(
            hex_string, 0, len(hex_string)))
# -
# Have a look at the first HLL set:
# + tags=[]
df["user_hll"][0]
# -
# Cast to HLL and calculate cardinality in one step:
# + tags=[]
hll_from_byte(df["user_hll"][0]).cardinality() - 1
# -
# An estimated 12 users have been observed at location `0.040050,-179.750586` (lat, lng).
#
# These latitude and longitude coordinates refer to the centroids of the original 50 km grid.
# ## Test Postgres HLL
# To speed up processing, we can connect to a Postgres database running the faster [postgresql-hll](https://github.com/citusdata/postgresql-hll) extension from Citus.
#
# Password and username for connecting to the local [hllworker](https://gitlab.vgiscience.de/lbsn/databases/pg-hll-empty) database are loaded from the environment.
# + tags=[]
# set connection variables
DB_USER = "hlluser"
DB_PASS = os.getenv('READONLY_USER_PASSWORD')
DB_HOST = "hllworkerdb"
DB_PORT = "5432"
DB_NAME = "hllworkerdb"
# -
# Connect to empty Postgres database running HLL Extension:
# + tags=[]
DB_CONN = psycopg2.connect(
    host=DB_HOST,
    port=DB_PORT,
    dbname=DB_NAME,
    user=DB_USER,
    password=DB_PASS
)
DB_CONN.set_session(
    readonly=True)
DB_CALC = tools.DbConn(
    DB_CONN)
CUR_HLL = DB_CONN.cursor()
# -
# Test connection:
# + tags=[]
CUR_HLL.execute("SELECT 1;")
print(CUR_HLL.statusmessage)
# -
# Test with an actual HLL set calculation:
# + tags=[]
db_query = f"""
SELECT lat, lng, hll_cardinality(user_hll)::int as usercount
# Next, we will:
#
# 1. load the full list of HLL sets for all coordinates
# 2. assign coordinates to continents
# 3. union all HLL sets per continent, to retrieve the combined number of estimated distinct users having visited each continent on Flickr
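#
# Note that the cardinality query started above is truncated in this excerpt. As a minimal sketch (an assumption, not the original code), the HLL strings can be passed to the otherwise empty hll worker as literal values; the `latitude`/`longitude` column names in `df` are hypothetical:
# + tags=[]
# Sketch: build a VALUES list from a few sample rows and let Postgres
# compute the cardinalities (column names in df are assumed)
values = ",\n".join(
    f"({lat}, {lng}, '{hll}'::hll)"
    for lat, lng, hll in zip(
        df["latitude"].head(5), df["longitude"].head(5), df["user_hll"].head(5)))
db_query = f"""
    SELECT lat, lng, hll_cardinality(user_hll)::int AS usercount
    FROM (VALUES {values}) AS t(lat, lng, user_hll);
    """
CUR_HLL.execute(db_query)
CUR_HLL.fetchall()
# -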
# ### Load continent geometries
# + tags=[]
world = gp.read_file(
    gp.datasets.get_path('naturalearth_lowres'),
    crs=CRS_WGS)
world = world.to_crs(CRS_PROJ)
# + tags=[]
world = world[['continent', 'geometry']]
# + tags=[]
ax = world.plot(column='continent')
ax.set_axis_off()
# -
# There are slight inaccuracies in the continent geometries (overlapping polygons), which can be fixed with a small buffer before dissolving the country geometries:
# + tags=[]
world['geometry'] = world.buffer(0.01)
# + tags=[]
continents = world.dissolve(by='continent')
# + tags=[]
continents.head()
# -
# Assign the index back as a column, to allow classification by color via `column='continent'`:
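#
# A minimal sketch of that step (the original cell is not included in this excerpt); after `dissolve`, the continent name only lives in the index:
# + tags=[]
continents['continent'] = continents.index
ax = continents.plot(column='continent')
ax.set_axis_off()
# -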
# It is also possible to intersect HLL sets, to some degree. This can be used to estimate common visitor counts for (e.g.) countries. The average error, however, will be larger, especially for intersections of HLL sets of very different sizes.
# Intersect the HLL coordinates with the country geometries. Since the HLL coordinates are pre-aggregated (50 km), direct intersection will introduce some error (an instance of the [MAUP](https://en.wikipedia.org/wiki/Modifiable_areal_unit_problem)).
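#
# The code that intersects the coordinates with the country geometries and unions the HLL sets per country is not included in this excerpt. Below is a minimal sketch of how `name_ref`, `hll_series` and `cardinality_hll` (used further down) could be derived; the `latitude`/`longitude` column names, the GeoDataFrame construction and the helper implementations are assumptions, with the unions again computed on the hll worker via `hll_union_agg()`:
# + tags=[]
# Assign each pre-aggregated (50 km) coordinate to a country (sketch only)
gdf = gp.GeoDataFrame(
    df, crs=CRS_WGS,
    geometry=gp.points_from_xy(df["longitude"], df["latitude"]))
countries = gp.read_file(
    gp.datasets.get_path('naturalearth_lowres'))[["iso_a3", "geometry"]]
gdf = gp.sjoin(gdf, countries)

# The three countries discussed below (ISO codes are assumptions)
name_ref = {"USA": "United States", "DEU": "Germany", "CAN": "Canada"}

def union_hll(hll_sets) -> str:
    """Union HLL sets on the hll worker, return the combined set."""
    values = ",".join(f"('{hll}'::hll)" for hll in hll_sets)
    CUR_HLL.execute(
        f"SELECT hll_union_agg(s.hll_set)::text FROM (VALUES {values}) s(hll_set);")
    return CUR_HLL.fetchone()[0]

def cardinality_hll(hll_set) -> int:
    """Return the estimated cardinality of a single HLL set."""
    CUR_HLL.execute(f"SELECT hll_cardinality('{hll_set}'::hll)::int;")
    return CUR_HLL.fetchone()[0]

hll_series = {
    iso: union_hll(gdf.loc[gdf["iso_a3"] == iso, "user_hll"])
    for iso in name_ref}
# -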
# Return the distinct user counts with their error margin (±2.3%):
# + tags=[]
for ref, hll_set in hll_series.items():
    cardinality = cardinality_hll(hll_set)
    print(f'Estimated {cardinality} (±{int(cardinality / 100 * 2.3)}) users for {name_ref.get(ref)}')
f"who shared Flickr photos from {name_ref.get(a)} and {name_ref.get(b)} (intersection)")
# -
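# The ±2.3% margin used above corresponds to the expected relative error of HyperLogLog, $\pm 1.04/\sqrt{2^{log2m}}$; with the postgresql-hll default of `log2m = 11` registers this is $1.04/\sqrt{2048} \approx 2.3\%$ (assuming the HLL sets were created with that default).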
# Finally, let's get the number of users who have shared pictures from all three countries, based on the [formula for three sets](https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle):
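#
# Because HLL only supports unions directly, the triple intersection is derived from union cardinalities by rearranging the inclusion-exclusion formula:
#
# $|A \cap B \cap C| = |A| + |B| + |C| - |A \cup B| - |A \cup C| - |B \cup C| + |A \cup B \cup C|$
#
# A minimal sketch, building on the hedged `union_hll()`/`cardinality_hll()` helpers above (the `hll_series` keys are assumptions):
# + tags=[]
# hll_series keys ("USA", "DEU", "CAN") are assumptions from the sketch above
hll_a, hll_b, hll_c = (
    hll_series["USA"], hll_series["DEU"], hll_series["CAN"])
card = cardinality_hll
usercount_abc = (
    card(hll_a) + card(hll_b) + card(hll_c)
    - card(union_hll([hll_a, hll_b]))
    - card(union_hll([hll_a, hll_c]))
    - card(union_hll([hll_b, hll_c]))
    + card(union_hll([hll_a, hll_b, hll_c])))
usercount_abc
# -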
# Since negative visitor counts are impossible, the increasing inaccuracy of nested intersections is easy to observe here. In other words, there is simply too little overlap between the visitors of all three countries to be measurable with HLL intersections.
# ## Visualize as Venn diagram
# Since we're going to visualize this with [matplotlib-venn](https://github.com/konstantint/matplotlib-venn), we need the following variables:
# + tags=[]
from matplotlib_venn import venn3, venn3_circles

plt.figure(figsize=(3, 3))
v = venn3(
    subsets=(500, 500, 100, 500, 100, 100, 10),
    set_labels=('A', 'B', 'C'))
v.get_label_by_id('100').set_text('Abc')
v.get_label_by_id('010').set_text('aBc')
v.get_label_by_id('001').set_text('abC')
v.get_label_by_id('110').set_text('ABc')
v.get_label_by_id('101').set_text('AbC')
v.get_label_by_id('011').set_text('aBC')
v.get_label_by_id('111').set_text('ABC')
plt.show()
# -
# We already have `ABC`; the other values can be calculated:
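#
# A minimal sketch (building on the hedged helpers and names introduced above) of how the remaining exclusive region sizes for `venn3` could be derived; `venn3` expects them in the order `(Abc, aBc, ABc, abC, AbC, aBC, ABC)`:
# + tags=[]
# Total, pairwise and triple estimates (variable names are illustrative)
A, B, C = card(hll_a), card(hll_b), card(hll_c)
AB = A + B - card(union_hll([hll_a, hll_b]))
AC = A + C - card(union_hll([hll_a, hll_c]))
BC = B + C - card(union_hll([hll_b, hll_c]))
ABC = usercount_abc
# Exclusive (non-overlapping) region sizes
Abc = A - AB - AC + ABC
aBc = B - AB - BC + ABC
abC = C - AC - BC + ABC
ABc = AB - ABC
AbC = AC - ABC
aBC = BC - ABC
plt.figure(figsize=(3, 3))
venn3(
    subsets=(Abc, aBc, ABc, abC, AbC, aBC, ABC),
    set_labels=('US', 'DE', 'CA'))
plt.show()
# -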
# ## Conclusions
#
# Each dot on the left side is a coordinate shared in the original dataset, with an attached HLL set abstracting all users who shared photographs from this part of the world. By unioning all HLL sets for each of the three countries, based on the coordinate-country intersection, we were able to calculate the intersections between the different sets. These estimated common visitor counts are labeled on the right side of the Venn diagram.
#
# The number of distinct users who shared photos from the US by far outweighs Germany and Canada. Not surprisingly, there is a larger overlap of common visitors between Canada and the USA. By union of HLL sets, we are able to estimate distinct visitor counts for arbitrary areas or regions. This ability to re-use HLL sets is, however, limited by the 50 km lower resolution limit of the shared benchmark dataset.