This is an update of my previous analysis of subreddit gender ratios project. This time I have access to better data because /u/Stuck_In_the_Matrix has downloaded every publicly available Reddit comment and made them publicly available. Additionally, the dataset is now available in Google BigQuery allowing easy analysis.
The dataset includes users' flair like the Reddit flair API I used previously. Notably, the new dataset also allows us to generate a list of all submitters in a subreddit, which was extremely difficult previously.
Basically we will be deriving the gender for as many users as possible from their flair in various subreddits. Then we will use that set of users as a random sample of every other subbreddit and guessing the gender ratio.
We have no idea if the combined set of sampled users are representative of all Reddit users. In fact, some subreddits in the results look over represented in the results because they share many users with one of the subreddits we used for deriving gender. We just use as many subreddits as we can for gender and hope it's good enough.
The number of sampled users is very small compared to the set of all commenters of each subreddit.
In my previous analysis, it was near impossible to get a list of all commenters or subscribers of a subreddit. Instead the list of users with flair in a given subreddit was used as a proxy. That had major issues because in many subreddits the commenters and set of flaired users are not always the same. For example, some subreddits only give flair to submitters, which biased the results significantly. With the new dataset of every Reddit comment, that is no longer an issue.
In other words, this analysis is only mostly terrible now instead of completely terrible. I really doubt the exact percentages calculated are correct, but the relative comparisons between them and rankings look plausible.
Data through the end of November 2017 is included in this analysis.
We will be doing almost all the analysis in BigQuery using SQL directly.
I'm using Jupyter and IPython, so I'd like a nice interface to run BigQuery SQL statements. Let's quickly define %%bigquery
as an IPython cell magic.
This is pretty hacky, I'm using tons of internal functions from pandas-gbq. This is mostly a modified version of run_query() that limits the retrieved results to a specified number of rows. It doesn't really bother handle the data types of the results either. We also add SQL syntax highlighting in the Jupyter notebook for our new magic.
If you reuse this code, change PROJECT_ID
to your own project ID from the Google Cloud Platform Console. You will also have to enable billing on the project and ensure BigQuery is enabled. You can follow the quickstart in their documentation.
You can add the --verbose
flag to the magic to get status updates as it runs.
Becauase my %%bigquery
code is horribly hacky, make sure you have the exact version of pandas-gbq
that I used:
pip3 install -e git+https://github.com/pydata/pandas-gbq@4cab83cefb7df190bf3e5b43168c2cb53f596c31#egg=pandas-gbq
PROJECT_ID = 'bburky'
import pandas
from pandas_gbq import gbq
from IPython.core.magic import register_cell_magic
from IPython.core.magic_arguments import (
argument, magic_arguments, parse_argstring)
from google.auth.exceptions import RefreshError
from google.cloud.bigquery import QueryJobConfig
from concurrent.futures import TimeoutError
import itertools
@register_cell_magic
@magic_arguments()
@argument('-d', '--dialect', default='standard',
choices=['standard', 'legacy'])
@argument('--writedisposition', default='WRITE_EMPTY',
choices=['WRITE_TRUNCATE', 'WRITE_APPEND', 'WRITE_EMPTY'])
@argument('-v', '--verbose', action='store_true')
@argument('--limit', type=int, default=0,
help='Limit rows or retreived results to this number')
@argument('--destinationtable',
help='Save query results to destination table')
def bigquery(line, cell):
args = parse_argstring(bigquery, line)
query = cell
connector = gbq.GbqConnector(
PROJECT_ID, reauth=False, verbose=args.verbose,
dialect=args.dialect, auth_local_webserver=False)
job_config = {
'query': {
'writeDisposition': args.writedisposition,
}
}
if args.destinationtable:
dataset_id, table_id = args.destinationtable.rsplit('.', 1)
job_config['query']['destinationTable'] = {
'projectId': PROJECT_ID,
'datasetId': dataset_id,
'tableId': table_id,
}
connector._start_timer()
try:
connector._print('Requesting query... ', end="")
query_reply = connector.client.query(
query,
job_config=QueryJobConfig.from_api_repr(job_config['query']))
connector._print('ok.')
except (RefreshError, ValueError):
if connector.private_key:
raise gbq.AccessDenied(
"The service account credentials are not valid")
else:
raise gbq.AccessDenied(
"The credentials have been revoked or expired, "
"please re-run the application to re-authorize")
except connector.http_error as ex:
connector.process_http_error(ex)
job_id = query_reply.job_id
connector._print('Job ID: %s\nQuery running...' % job_id)
while query_reply.state != 'DONE':
connector.print_elapsed_seconds(' Elapsed', 's. Waiting...')
timeout_ms = job_config['query'].get('timeoutMs')
if timeout_ms and timeout_ms < connector.get_elapsed_seconds() * 1000:
raise QueryTimeout('Query timeout: {} ms'.format(timeout_ms))
timeout_sec = 1.0
if timeout_ms:
# Wait at most 1 second so we can show progress bar
timeout_sec = min(1.0, timeout_ms / 1000.0)
try:
query_reply.result(timeout=timeout_sec)
except TimeoutError:
# Use our own timeout logic
pass
except connector.http_error as ex:
connector.process_http_error(ex)
if connector.verbose:
if query_reply.cache_hit:
connector._print('Query done.\nCache hit.\n')
else:
bytes_processed = query_reply.total_bytes_processed or 0
bytes_billed = query_reply.total_bytes_billed or 0
connector._print('Query done.\nProcessed: {} Billed: {}'.format(
connector.sizeof_fmt(bytes_processed),
connector.sizeof_fmt(bytes_billed)))
connector._print('Standard price: ${:,.2f} USD\n'.format(
bytes_billed * connector.query_price_for_TB))
connector._print('Retrieving results...')
try:
rows_iter = query_reply.result()
except connector.http_error as ex:
connector.process_http_error(ex)
if args.limit:
connector._print(' Limiting results to {} rows'.format(args.limit))
result_rows = list(itertools.islice(rows_iter, args.limit))
else:
result_rows = list(rows_iter)
total_rows = rows_iter.total_rows
schema = {
'fields': [
field.to_api_repr()
for field in rows_iter.schema],
}
# print basic query stats
connector._print('Got {} of {} rows.\n'.format(
len(result_rows), total_rows))
return gbq._parse_data(schema, result_rows)
%%javascript
IPython.CodeCell.options_default.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%bigquery/]};
First get a list of every user on reddit's flair in each subreddit. Save this to a userflair
table. Limit the returned results to 10 rows to avoid downloading the entire table.
%%bigquery --limit=10 --destinationtable=flair.userflair
SELECT
DISTINCT author,
subreddit,
author_flair_text,
author_flair_css_class
FROM
`fh-bigquery.reddit_comments.201*`
WHERE
author_flair_text IS NOT NULL
OR author_flair_css_class IS NOT NULL;
author | subreddit | author_flair_text | author_flair_css_class | |
---|---|---|---|---|
0 | SuperSonicMC | friendsafari | SonicBoom: 3926-5720-7723 [Fighting:Mienfoo,Sa... | mienfoo sawk tyrogue fighting |
1 | Jakester5112 | MLBTheShow | Rating: 1513 | empty-athletics |
2 | tommofrog | csgobetting | virtuspro | |
3 | TentacleBird | changemyview | 1∆ | points |
4 | StSob | learndota2 | Silencer | silencer |
5 | ScradleyGymson | pcmasterrace | SenorOiram | i-desk-steam |
6 | MikeWizowski | Gunners | liked Per before it was cool | |
7 | Zerockk | leagueoflegends | twistedfate | |
8 | SlipClutch | MMA | Team Cruz | |
9 | WhaterBlue | GlobalOffensive | Legendary Eagle | rank legendary-eagle |
Then, search for all potentially gendered flair. We're searching for flair css classes that contains the substrings "male" or "female" with a regex. Order the list by the number of users with each flair css class and limit the results to flair that occurs greater than 1000 times.
I don't claim to be the best at SQL, so some of these queries could be more optimal.
%%bigquery
SELECT
subreddit,
num_flair,
gender,
flair.author_flair_css_class
FROM (
SELECT
subreddit,
author_flair_css_class,
REGEXP_EXTRACT(
LOWER(author_flair_css_class),
'(?:fe)?male') AS gender,
COUNT(*) AS num_flair
FROM
flair.userflair
GROUP BY
subreddit,
author_flair_css_class,
gender) AS flair
WHERE
gender IS NOT NULL
AND num_flair > 1000
ORDER BY
num_flair DESC
subreddit | num_flair | gender | author_flair_css_class | |
---|---|---|---|---|
0 | AskMen | 32676 | male | male |
1 | AskWomen | 21870 | female | female |
2 | AskWomen | 16886 | male | male |
3 | AskMen | 12245 | female | female |
4 | AskMenOver30 | 5558 | male | male |
5 | AskWomenOver30 | 2477 | female | female |
6 | sexover30 | 1285 | male | male |
After looking at the results, and manually checking how the the subreddits use flair, we can use flair from the following subreddits:
Some subreddits use pink/blue colors to indicate genders. Let's search the flair CSS classes for that too. We use a regular expression to easily search for the strings anywhere within the flair css class.
%%bigquery
SELECT
subreddit,
num_flair,
CASE
WHEN color = 'pink' THEN 'female'
WHEN color = 'blue' THEN 'male'
END AS gender,
author_flair_css_class
FROM (
SELECT
subreddit,
REGEXP_EXTRACT(
LOWER(author_flair_css_class),
'pink|blue') AS color,
COUNT(*) AS num_flair,
author_flair_css_class
FROM
flair.userflair
GROUP BY
subreddit,
color,
author_flair_css_class)
WHERE
color IS NOT NULL
AND num_flair > 1000
ORDER BY
num_flair DESC;
subreddit | num_flair | gender | author_flair_css_class | |
---|---|---|---|---|
0 | tall | 22745 | male | blue |
1 | Android | 15075 | male | userBlue |
2 | short | 6819 | male | blue |
3 | MechanicalKeyboards | 6713 | male | mxblue |
4 | gainit | 5847 | male | blue |
5 | CasualConversation | 5304 | male | blue |
6 | bjj | 5064 | male | blue-belt |
7 | tall | 5026 | female | pink |
8 | weddingplanning | 4845 | male | blue |
9 | Random_Acts_Of_Amazon | 3706 | female | hotpink |
10 | Cubers | 3242 | male | blue |
11 | gameswap | 2949 | male | blueshroom |
12 | Agario | 2632 | male | blue |
13 | gonewildcurvy | 2524 | female | verifypink |
14 | weddingplanning | 2506 | female | pink |
15 | gonewildaudio | 2099 | male | verifyblackblue |
16 | UIUC | 2085 | male | bluei |
17 | destiny2 | 1991 | male | blue |
18 | mylittlepony | 1920 | female | pinkiepie |
19 | PS4 | 1877 | male | ps-blue |
20 | DebateAChristian | 1857 | male | blue brackets |
21 | Agario | 1830 | female | pink |
22 | Supernatural | 1816 | male | blue |
23 | short | 1781 | female | pink |
24 | gonewildaudio | 1574 | female | verifyblackpink |
25 | ModelUSGov | 1488 | male | Blue |
26 | mildlyinfuriating | 1438 | male | Blue |
27 | ABDL | 1436 | male | blue |
28 | bjj | 1419 | male | blue-belt-1 |
29 | curlyhair | 1387 | male | blue2 |
30 | MechanicalKeyboards | 1337 | male | mxblue2 |
31 | proED | 1335 | male | blue |
32 | ACTrade | 1308 | female | Pink |
33 | CasualConversation | 1302 | female | pink |
34 | bjj | 1257 | male | blue-belt-2 |
35 | cemu | 1219 | male | blackandblue |
36 | curlyhair | 1201 | female | pink |
37 | MechanicalKeyboards | 1139 | male | blue |
38 | Android | 1111 | female | userPink |
39 | androidapps | 1034 | male | bluegrey |
40 | PowerTV | 1020 | male | blueflaircop |
Let's attempt to parse all flair text on Reddit as A/S/L strings. Parse the flair text using a regular expression and look for a substring containing two slashes around an "M" or "F", case insensitively, with optional whitespace. The M or F is extracted with a capturing group and is then converted to the string "male" or "female".
%%bigquery
SELECT
subreddit,
num_flair,
CASE
WHEN asl = 'm' THEN 'male'
WHEN asl = 'f' THEN 'female'
END AS gender,
sample_flair
FROM (
SELECT
subreddit,
REGEXP_EXTRACT(
LOWER(author_flair_text),
'^[^/]+/(m|f)/[^/]+$') AS asl,
COUNT(*) AS num_flair,
ANY_VALUE(author_flair_text) as sample_flair
FROM
flair.userflair
GROUP BY
subreddit,
asl)
WHERE
asl IS NOT NULL
AND num_flair > 1000
ORDER BY
num_flair DESC;
subreddit | num_flair | gender | sample_flair | |
---|---|---|---|---|
0 | OkCupid | 10839 | male | 19/m/CO |
1 | keto | 5454 | male | 21/M/5'8" | SW: 295 | CW: 235 | GW: 165 |
2 | OkCupid | 4282 | female | 28/F/PA |
3 | keto | 4180 | female | 37/F/5'10" SW:252 CW:150 Goal Weight Achieved |
4 | childfree | 2962 | female | 23/F/Furbabies 4 lyfe |
5 | xxketo | 2356 | female | 26/F/5'3" | SW-156 | CW-148 | GW-120 |
6 | childfree | 2038 | male | 28/M/Alberta |
7 | LGBTeens | 1657 | male | 18/M/NY |
8 | loseit | 1326 | female | 26/F/5'6'' SW:248 CW:202 GW:154 |
9 | Tinder | 1261 | male | 24/M/S |
The following subreddits have greater than 1000 male or female users with A/S/L flair:
After looking more carefully at the A/S/L results, I noticed some subreddits use flair of the format "27M", "M 27" or just an "M" within a string of loosely structured text. We use \p{L}
in the regex to not match any Unicode letters (otherwise things like "Bayern München" match the regex). Apostrophes must also be excluded from our delimiters or the "m" will be matched from the word "I'm". A period and dollar sign must be excluded because things like "1.71m" or "$1.5M" could be matched.
%%bigquery
SELECT *
FROM (
SELECT
subreddit,
REGEXP_EXTRACT(
LOWER(author_flair_text),
"(?:^|[^\\p{L}0-9'\\.\\$])\\s*(?:\\d\\d)?\\s*(f|m)\\s*(?:\\d\\d)?\\s*(?:$|[^\\p{L}0-9'\\.])") AS asl,
COUNT(*) AS num,
ANY_VALUE(author_flair_text) as author_flair_text
FROM
flair.userflair
GROUP BY
subreddit,
asl
ORDER BY num DESC)
WHERE
asl IS NOT NULL
AND num > 1000;
subreddit | asl | num | author_flair_text | |
---|---|---|---|---|
0 | keto | m | 31974 | 40/M/6'2" | SW: 255 | CW: 184 | GW: 185 |
1 | loseit | f | 27250 | 28F 5'7" | SW 212.6 | CW 202.6 | GW 147 |
2 | keto | f | 22952 | 27/f/5'3|SW:198.5|CW:161.6|GW:135 |
3 | OkCupid | m | 15235 | 22/M gay af |
4 | pokemontrades | m | 14531 | 2595-5042-4008 || Joshua (M) |
5 | loseit | m | 13804 | M27/6'3"/SW 410 CW 360 GW 240 |
6 | xxketo | f | 13730 | F/27/5'4" | SW 161 | CW 150 | GW 125 |
7 | medicalschool | m | 8100 | M-1 |
8 | proED | f | 7644 | 4'11| CW 104 | GW: 100| 20F |
9 | OkCupid | f | 6480 | 21/F/NY |
10 | childfree | f | 6343 | 32/F/furbabies |
11 | LGBTeens | m | 5192 | 16 | Gay | M | IL |
12 | CFB | m | 4903 | Texas A&M Aggies |
13 | childfree | m | 4101 | 29/M/Upstate NY and Single |
14 | Pokemongiveaway | m | 3195 | 0319-1418-1967 | Slaapzak (M) |
15 | spikes | m | 2820 | S: In Flux | M: Living End |
16 | SVExchange | m | 2611 | 4270-3630-0138 || Ryan (UM, M, αS, X) || 2507 |
17 | fatlogic | f | 2319 | 27F 5'3" | SW: 151 | CW: 125 | GW: 115 |
18 | MechanicalKeyboards | m | 2238 | 93 Model M | QFR MX Blues |
19 | CompetitiveWoW | m | 2027 | 4/10M Windwalker |
20 | financialindependence | m | 1890 | 32m/37k :: 52m/37k/1m (2017 USD / 2037 ER) |
21 | rawdenim | f | 1889 | N&F E3 WG / UB 101 / OldBlueCo 14.5oz Extr... |
22 | Tinder | m | 1755 | 20/M |
23 | infj | m | 1705 | INFJ, 31M |
24 | 100DaysofKeto | f | 1484 | 31F | 5'3'' | SW: 151.2 | CW: 146.8 | GW1: 130 |
25 | headphones | m | 1359 | ATH M50, HiFiMAN RE-400 |
26 | infj | f | 1323 | INFJ, 33, f |
27 | LetItDie | f | 1224 | 30F |
28 | infertility | f | 1040 | 31F, Severe MFI, Secondary Infertitlity, TTC 1... |
29 | fatlogic | m | 1015 | M | 6'0 | Sep 2014 - 350 | CW 223 |
This new regex should capture everything the A/S/L regex did, but will also accept things like "27M or "M". It has some false positives, but it actually works fairly well if we limit it to the following subredits:
Now that we have identified queries to find gendered flair, let's collect a list of all users and their identified gender into a table.
%%bigquery --limit=10 --destinationtable=flair.usersubredditgenders
SELECT DISTINCT *
FROM (
-- male/female
SELECT
subreddit,
author,
REGEXP_EXTRACT(
LOWER(author_flair_css_class),
'(?:fe)?male') AS gender
FROM
flair.userflair
WHERE
subreddit IN (
'AskMen',
'AskWomen',
'AskMenOver30',
'AskWomenOver30',
'sexover30')
UNION ALL
-- pink/blue
SELECT
subreddit,
author,
CASE
WHEN author_flair_css_class = 'blue' THEN 'male'
WHEN author_flair_css_class = 'pink' THEN 'female'
END AS gender
FROM
flair.userflair
WHERE
subreddit IN (
'tall',
'short')
UNION ALL
-- A/S/L
SELECT
subreddit,
author,
CASE
WHEN asl = 'm' THEN 'male'
WHEN asl = 'f' THEN 'female'
END AS gender
FROM (
SELECT
subreddit,
author,
REGEXP_EXTRACT(
LOWER(author_flair_text),
"(?:^|[^\\p{L}0-9'\\.\\$])\\s*(?:\\d\\d)?\\s*(f|m)\\s*(?:\\d\\d)?\\s*(?:$|[^\\p{L}0-9'\\.])") AS asl
FROM
flair.userflair
WHERE
subreddit IN (
'OkCupid',
'keto',
'childfree',
'xxketo',
'LGBTeens',
'loseit',
'Tinder',
'proED',
'fatlogic',
'financialindependence',
'infj',
'infertility',
'100DaysofKeto')) )
WHERE
gender IS NOT NULL;
subreddit | author | gender | |
---|---|---|---|
0 | infj | AsianSwagger | male |
1 | infj | galifanasana | male |
2 | infj | P-mac | male |
3 | infj | Yoshimotso | male |
4 | infj | Thunder_54 | male |
5 | infj | TeenageMutant1 | male |
6 | infj | Shannester | male |
7 | infj | TheAmberbrew | male |
8 | infj | ejgamer13 | male |
9 | infj | NotVeryWellFriend | male |
For various reasons, users may report different genders over time or in different subreddits.
Let's create a new table with only users and their gender and omit users that report multiple genders, otherwise these users would be counted twice in our final analysis.
%%bigquery --limit=10 --destinationtable=flair.users
SELECT
author,
ANY_VALUE(gender) as gender
FROM (
SELECT
author,
gender
FROM
flair.usersubredditgenders
GROUP BY
author,
gender )
GROUP BY
author
HAVING COUNT(*) = 1;
author | gender | |
---|---|---|
0 | fisherpriceman | male |
1 | alltimeisrelative | male |
2 | Betrubnis | male |
3 | VoicelikeHoney | male |
4 | Pathomann | male |
5 | JohnDylena | male |
6 | SpookyWagons | male |
7 | FerrisTriangle | male |
8 | Jerry132 | male |
9 | Juswantedtono | male |
Now we can do our analysis of subreddits using the genders of authors of comments.
We use the standard read_gbq()
function from pandas-gbq now because we want its handling of types and indexes.
import pandas as pd
query="""
SELECT
subreddit,
subreddit_size,
COUNT(CASE WHEN gender='male' THEN 1 END) AS male,
COUNT(CASE WHEN gender='female' THEN 1 END) AS female
FROM
flair.users
JOIN (
SELECT DISTINCT
subreddit,
author,
subreddit_size
FROM
`fh-bigquery.reddit_comments.201*`
JOIN (
SELECT
subreddit,
COUNT(*) AS subreddit_size
FROM (
SELECT DISTINCT
subreddit,
author
FROM
`fh-bigquery.reddit_comments.201*`)
GROUP BY
subreddit)
USING
(subreddit)
JOIN
flair.users
USING
(author))
USING
(author)
GROUP BY
subreddit,
subreddit_size
ORDER BY
subreddit_size DESC;"""
df = pd.read_gbq(
query,
PROJECT_ID,
index_col='subreddit',
verbose=False,
dialect='standard')
df.head(20)
subreddit_size | male | female | |
---|---|---|---|
subreddit | |||
AskReddit | 6744464 | 86595 | 50572 |
funny | 3689875 | 68965 | 30025 |
pics | 3608386 | 68255 | 30453 |
gaming | 2517109 | 57656 | 13114 |
videos | 2448897 | 55836 | 17154 |
IAmA | 2413052 | 58177 | 24672 |
todayilearned | 2300584 | 59938 | 23739 |
WTF | 2075080 | 53287 | 22445 |
aww | 1956664 | 42377 | 26527 |
worldnews | 1904960 | 48800 | 14189 |
gifs | 1813010 | 44012 | 14308 |
AdviceAnimals | 1765520 | 50610 | 20777 |
movies | 1649322 | 44204 | 14338 |
news | 1480259 | 40769 | 14347 |
Music | 1465879 | 38535 | 12851 |
mildlyinteresting | 1359446 | 34267 | 14909 |
politics | 1337245 | 39622 | 11478 |
Showerthoughts | 1330610 | 32930 | 12688 |
explainlikeimfive | 1061782 | 35026 | 13598 |
leagueoflegends | 1038240 | 13388 | 1825 |
Create columns for the ratio of male to female users.
df['male_ratio'] = df.male / (df.female + df.male)
df['female_ratio'] = df.female / (df.female + df.male)
Plot the gender ratio for the 100 subreddits with the most comments.
%matplotlib inline
from matplotlib.ticker import FuncFormatter
df.sort_values('subreddit_size', ascending=False, inplace=True)
plot = df[['male_ratio', 'female_ratio']].head(100)[::-1].plot(
kind='barh',
stacked=True,
figsize=(8, 20),
xlim=[0, 1],
title='Subreddit Gender Ratios',
legend=True,
)
plot.xaxis.set_major_formatter(FuncFormatter('{0:.0%}'.format))
plot.set_ylabel('Subreddit')
plot.set_xlabel('Gender Ratio of Commenters')
plot.legend(loc='upper left', bbox_to_anchor=(-0.45, 1.005), fontsize=12)
plot.get_legend().get_texts()[0].set_text('Male')
plot.get_legend().get_texts()[1].set_text('Female')
Plot the gender ratio for the subreddits that are predominantly female. Only include subreddits that we have sampled at least 1000 users.
df_female = df[(df.male + df.female) > 1000][['male_ratio', 'female_ratio']]
plot = df_female.sort_values('female_ratio', ascending=False).head(100)[::-1].plot(
kind='barh',
stacked=True,
figsize=(8, 20),
xlim=[0, 1],
title='Subreddit Gender Ratios of Predominantly Female Subreddits',
legend=True,
)
plot.xaxis.set_major_formatter(FuncFormatter('{0:.0%}'.format))
plot.set_ylabel('Subreddit')
plot.set_xlabel('Gender Ratio of Commenters')
plot.legend(loc='upper left', bbox_to_anchor=(-0.42, 1.005), fontsize=12)
plot.get_legend().get_texts()[0].set_text('Male')
plot.get_legend().get_texts()[1].set_text('Female')
Plot the gender ratio for the subreddits that are predominantly male. Only include subreddits that we have sampled at least 1000 users.
df_male = df[(df.male + df.female) > 1000][['male_ratio', 'female_ratio']]
plot = df_male.sort_values('male_ratio', ascending=False).head(100)[::-1].plot(
kind='barh',
stacked=True,
figsize=(8, 20),
xlim=[0, 1],
title='Subreddit Gender Ratios of Predominantly Male Subreddits',
legend=True,
)
plot.xaxis.set_major_formatter(FuncFormatter('{0:.0%}'.format))
plot.set_ylabel('Subreddit')
plot.set_xlabel('Gender Ratio of Commenters')
plot.legend(loc='upper left', bbox_to_anchor=(-0.38, 1.005), fontsize=12)
plot.get_legend().get_texts()[0].set_text('Male')
plot.get_legend().get_texts()[1].set_text('Female')
Plot the total number of submitters and the number of users we actually sampled for each subreddit. Note that the number of male and female users barely even visible on the graph.
Or data is somewhat terrible. This is okay if our sampled users are somehow actually a perfect random sample. That isn't especially likely though.
df.sort_values('subreddit_size', ascending=False, inplace=True)
plot = df[['male', 'female', 'subreddit_size']].head(100)[::-1].plot(
kind='barh',
figsize=(8, 30),
title='Subreddit Gender Ratios',
legend=True,
)
plot.set_ylabel('Subreddit')
plot.set_xlabel('Submitters')
plot.get_xaxis().get_major_formatter().set_scientific(False)
plot.legend(loc='upper left', bbox_to_anchor=(-0.5, 1.005), fontsize=12)
plot.text(-0.45, -0.02, 'This is how bad the data is. I promise the total number of male and female users that we have data for are actually on this plot.', fontsize=10,
horizontalalignment='left',
verticalalignment='top',
transform=plot.axes.transAxes)
plot.get_legend().get_texts()[0].set_text('Male')
plot.get_legend().get_texts()[1].set_text('Female')
plot.get_legend().get_texts()[2].set_text('Total Commenters')