%%capture
%run shared.ipynb
questioning_KID = JUST Kjonnsidentitet questioning_Cis = JUST Cisness questioning_gender = Kjønnsidentitet and/or Cisness questioning_plus = JUST anyone who has SU in their orientation questioning = any or all of the above
NOTES FROM MIN:
DICTIONARY = the thing in squigly braces with colons. it is made up of:
KEYS on the left of the colon (usually strings but can be anything hashable)
hashable = things that can be turned into a number
list = modifiable, tuple = not, so tuples are hashable and can be keys; lists are not and cannot
VALUES on the right of the colon (can be literally anything)
It looks like this:
dictionary_name = {
"key" : value(s),
"another_key" : more_value(s),
}
Dictionary of tuples (values look like (thing, other_thing, "thing", etc.))
Dictionary of dictionaries, the values can be variable names for other dictionaries (like making a dictionary of all my groups), or nested literals
- global namespace is the dictionary that when I type LG_df, what does that actually correspond to - how does it find the dataframe?
- 'get a handle on' - how to find the thing you want to change, find a method or a function that returns that thing so that you can modify or interact with it in some way
EXPRESSION = a bunch of symbols that has a result, calling a function can be an expression
what you put inside brackets is an expression, whose result is always a series
one expression can have many terms
you can always store the result of an expression in a variable
rather than having df[big complicated expression]
variable = whatever you had in brackets
df[variable]
ARRAY = a sequence of things, a special list
INDEX = the thing (number or string) you use to look up items in a container ("indexing into an array" = the position in the array, 0 = the one at the beginning)
SERIES = the main thing that distinguishes a series from an array is that a series has a customizable index
is two arrays, kind of like a dictionary - the arrays are the index and the values
a column is a series - the index is like a row ID, the actual contents of the column are the values
if you do something like sorting or slicing it maintains the relationship between the index & value
a dataframe is just a collection of series that have the same index (that's why nans are there)
when you do arethmetic or something on a series, it usually means doing that operation element-wise (for each value,
do the thing)
orientation == bi - actually doing that operation on every item in the values, returning a new series with the
same index where the values are the result of whatever operation you did
df["Hyppighet_n"] = df.Hyppighet.apply(Hyppighet_map.get)
In the order that it gets 'done':
- get the dataframe called df
- .Hyppighet = look up the column called Hyppighet (now we have a series)
- .apply() is a method on series that says call this function and return a new series where each item in the
series is the result of calling that function on each item in the hyppighet series
- .get is getting an item out of a dictionary
- df["Hyppighet_n"] = store the resulting series in a new column with this name same kind of things .isin() or
.str.contains() or == or > (all booleans that return true/false)
THING[OTHERTHING] = "get item" (or "set item" if it's on the left side of an equals sign)
- how you get things out of a container of things (like a list or a tuple or a dataframe or a dictionary)
- In Pandas, if you treat it like a dictionary and give it a string, that will return a column. If you give it a 'mask'
it will return a new dataframe or series (if you give a series a mask you get a series back, df get df back) where
that mask is true
- with things like arrays and dataframes, one of the things you can pass to get item is a 'mask'
MASK: series of booleans (true/false) that has the same index as the dataframe or series (columns are series, series are not necessarily columns) you can mask a series or a dataframe, masking a dataframe is the same as masking all the series at once
df[df.column_name] or df[df.column_name == "something"] or df[df.column_name.isin("something")]
Different examples of applying one mask
saying give me the subset (rows) where this mask is true - what you need to do when you are f.eks. creating a group
or 'if this is true or this is true...' - combining a bunch of masks into one mask:
df[df.column_name == "something" | df.column_name.isin("something")]
inside the brackets is a mask - there are lots of ways to make a mask, any series of operations - the result of whatever you put together is a series of booleans with an index that matches the thing you are trying to mask
.dropna() changes the index because it's applying a mask - series.isna() = a series
if you use dropna inside your filter, you have to also do it outside (filter and mask are the same thing)
creating a mask vs. applying a mask
.isna() creates a mask
.dropna() applies the inverse of that mask is equivalent to df[~df.isna()]
When you have df[mask] you're changing the index (picking a subset) to be the items where the mask is true
when do you apply the mask? "apply the mask" means another_df = df[mask] or another_series = series[mask]
never use len on the mask (doesn't provide information, creating a mask doesn't change the length - always has one value for every row)
if you're interested in the values of the same column or another column where the mask is true, that's when you apply the mask
Kira's way: use brackets when applying a mask, no brackets means making it but not applying it
Min's way: *shakes head in disappointment*
Step 1: Creating the mask (big boolean expession that usually starts with df.) is saying find where these things are true
Step 2: Give me the subset of rows where it was true in a way that I can interact with it (apply the mask in order to get that subset of another column)
With a mask, there's only one question you can answer - for how many rows is this condition true (or false) mask.value_counts() or sum(mask)
If you want to answer other questions about a group for which those things are true, then apply the mask (which creates a new df or series) and store it as a new variable so you can work with it
.isin("T", "OT") = value of the column is exactly one of the items in the list (identical to a chained == "T" or == "OT" with separate parentheses for each).
.str.contains("T") will get T, Thing, Thingy, etc. = that substring appears anywhere in the column.
*df*?
KH_compare_df LGBTQIA_df LGBTQIA_norsk_df LG_df Skeiv_ID_df bib_compare2_df bib_compare3_df bib_compare_df bruk_df ch_age_df ch_df ch_region_df ch_urban_groups_df chcsts_df cis_region_df cishet_skeive_df cisness_df cs_urban_groups_df df exclusive_orientation_df gender_df gender_stuff_df intersectional_groups_df mwct_df n_orientations_df overall_df s_age_df s_region_df s_urban_groups_df synlighet_df tnb_region_df tnb_urban_groups_df
#pd.set_option('max_rows', None)
#Show all the columns and rows
#pd.options.display.max_columns = None
#pd.options.display.max_rows = None
#all column names
list(df.columns)
['Unnamed: 0', 'NR', 'Malgruppe', 'Alder', 'Region', 'Kjonnsidentitet', 'KID_egne_ord', 'Cis', 'Spesifiser_Cis', 'Pronomener', 'Intersex', 'Seksuell_orientering_1', 'Seksuell_orientering_2', 'Seksuell_orientering_3', 'Seksuell_orientering_4', 'Seksuell_orientering_5', 'Seksuell_orientering_6', 'Seksuell_orientering_7', 'Seksuell_orientering_8', 'SO_egne_ord', 'Annen_RO', 'RO_1', 'RO_2', 'RO_3', 'RO_4', 'RO_5', 'RO_6', 'RO_7', 'RO_8', 'RO_egne_ord', 'Synlig_Skeiv_1', 'Synlig_Skeiv_2', 'Synlig_Skeiv_3', 'Synlig_Skeiv_4', 'Synlig_Skeiv_5', 'Skeiv_ID', 'Marginalisert_ident', 'Spesifiser_marg_ident', 'Sist_besok', 'Hyppighet', 'Urban', 'Deichman', 'Valgt_andre_bib', 'UV', 'UBU', 'Andre_temaer', 'Arr', 'utlan', 'bla_i', 'datamaskin', 'still_spm', 'utrygt', 'm_med_respekt', 'feilkjonnet', 'antatt_orientering', 'Interaksjon_bibansatt_1', 'Interaksjon_bibansatt_2', 'Interaksjon_bibansatt_3', 'Interaksjon_bibansatt_4', 'Interaksjon_brukere_1', 'Interaksjon_brukere_2', 'Interaksjon_brukere_3', 'Interaksjon_brukere_4', 'Utvalg_KID', 'Utvalg_Orientering', 'Utvalg_Intersex', 'Utvalg_lykkelig', 'Utvalg_fag', 'Aldri_tenkt', 'Utrygt_stille_spm', 'Rep_matters_B', 'Bib_pleier_ha', 'Alltid_velkommen', 'Trygge_rom_gen_B', 'Ingen_rolle', 'Ingenting_tilby', 'Bibs_ansvar', 'Minoritetsstress_ansatte', 'Minoritetsstress_brukere', 'Lhbtiq_vennlig', 'Rom_for_forbedring', 'Andre_brukere', 'Helt_meg_selv', 'Ikke_velkommen', 'Ingen_rolle_IB', 'Trygge_rom_pers_IB', 'Trygge_rom_gen_IB', 'Godt_utvalg', 'Skeiv_medieforbruk_pos', 'Skeiv_medieforbruk_neg', 'Rep_matters_IB', 'Mer_skj_item', 'Mer_skj_skaper', 'Mer_skj_mangfold', 'Mer_faglitt', 'Mer_BU', 'Bedre_gjenfinning', 'Info', 'Kompetanseheving', 'Apne_ansatte', 'Tredje_KI', 'Selvbestemt_KID', 'Endre_KID', 'Toaletter', 'Pronomen_bruk', 'Nulltoleranse', 'Skilting', 'Behov_kompetanseheving', 'Overall_tilfredshet', 'Svartid', 'survey', 'KID_egne_ord2', 'Elapsed time', 'Er du over 16 år og IKKE skeiv/LHBTIQ+?', 'Jeg er interessert i å lese bøker om skeive/LHBTIQ+-karakterer', 'Holdning_UV', 'Holdning_UBU', 'Holdning_Andre_temaer', 'Holdning_Arr', 'Undervisning_om_skeive', 'Oblig_emne', 'HP_Arr', 'Overall_behov_KH', 'Individ_behov_KH', 'Interesse_KH', 'Skeiv', 'Seksuell_orientering', 'RO', 'Synlig_Skeiv', 'Interaksjon_bibansatt', 'Interaksjon_brukere', 'Orienteringer', 'mapped_age', 'Sist_besok_måneder', 'Hyppighet_n', 'Bruk', 'Tokjønnsnorm', 'Synlighet', 'Rep_matters_all', 'Trygge_rom_gen_all', 'Urban_rural']
#Format decimal as percentage:
queer_frac = (sum(df.Seksuell_orientering.str.contains("Q"))/len(alle_skeive))
print("{:.1%}".format(queer_frac))
#or
print (f"{queer_frac:.1%}")
41.1% 41.1%
#Make new categories
bibliotekarer = df[df.survey =="Bibliotekarer"]
cishet_bibliotekarer = df[(df.survey =="Bibliotekarer") & (~df.Skeiv)]
skeive_bibliotekarer = df[(df.survey =="Bibliotekarer") & (df.Skeiv)]
print(len(skeive_bibliotekarer))
ace_plus = df[df.Seksuell_orientering.str.contains("Ace")|df.RO.str.contains("Aro")]
len(ace_plus)
12
90
#x_groups is a dictionary, and the strings are the keys, and the dataframes (LG, gay, etc.) are the values.
#You can get one of the values by asking for one of the keys, so if you type x_groups["X"] the you will get the gay dataframe.
#Make them into a group df
cisness_df["feilkjonnet"].hist()
<AxesSubplot:>
alle_skeive.Orienteringer.str.strip(",").str.split(",").apply(len).value_counts(normalize=True).plot(kind='bar')
alle_skeive.Orienteringer.str.strip(",").str.split(",").apply(len).value_counts()
#alle_skeive.Orienteringer.unique()
1 279 2 217 3 71 4 26 5 19 6 19 8 5 7 4 10 1 14 1 Name: Orienteringer, dtype: int64
#Drops all 'neutral' responses from those who did not change the pre-set neutral on ANY questions (N=12)
# columns = ["Utvalg_KID", "Utvalg_Orientering", "Utvalg_Intersex", "Utvalg_lykkelig", "Utvalg_fag", "Aldri_tenkt", "Utrygt_stille_spm", "Rep_matters_B", "Bib_pleier_ha", "Alltid_velkommen", "Trygge_rom_gen_B", "Ingen_rolle", "Ingenting_tilby", "Bibs_ansvar", "Minoritetsstress_ansatte", "Minoritetsstress_brukere", "Lhbtiq_vennlig", "Rom_for_forbedring", "Andre_brukere", "Helt_meg_selv"]
# non_participants = True
# for column in columns:
# non_participants &= df[column] == 0
# df.loc[non_participants, columns] = pd.NA
for key in gender_keys.keys():
print_info_by_gender(key, column="Forklare_SIAN_B_U")
print()
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/mambaforge/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3079 try: -> 3080 return self._engine.get_loc(casted_key) 3081 except KeyError as err: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'Forklare_SIAN_B_U' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-13-dcfad6d8fa39> in <module> 1 for key in gender_keys.keys(): ----> 2 print_info_by_gender(key, column="Forklare_SIAN_B_U") 3 print() <ipython-input-2-63b01baa4024> in print_info_by_gender(key, column) 9 def print_info_by_gender(key, column="column"): 10 group, label = gender_keys[key] ---> 11 group_na = len(group) - sum((group[column].isna() & group[column].isna())) 12 whole_group = len(group) 13 NA_frac = group_na/whole_group ~/mambaforge/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key) 3022 if self.columns.nlevels > 1: 3023 return self._getitem_multilevel(key) -> 3024 indexer = self.columns.get_loc(key) 3025 if is_integer(indexer): 3026 indexer = [indexer] ~/mambaforge/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3080 return self._engine.get_loc(casted_key) 3081 except KeyError as err: -> 3082 raise KeyError(key) from err 3083 3084 if tolerance is not None: KeyError: 'Forklare_SIAN_B_U'
sum(s_women.Avrunding_B.isna())/len(s_women)
#Show non-captured entries (those with more than one orientation selected) by creating a union of all group indices and locate all rows not in that index
noncaptured = df[~df.NR.isin(exclusive_orientation_df.NR)]
#noncaptured.loc[noncaptured.Orienteringer == "Het,", ["RO", "Seksuell_orientering", "Kjonnsidentitet"]]
#noncaptured.Orienteringer.value_counts()