Summary Analysis of the 2017 GitHub Open Source Survey¶

By R. Stuart Geiger (@staeiou), Berkeley Institute for Data Science

Overview¶

This notebook analyzes the 2017 Open Source Survey, conducted by staff at GitHub, Inc. and other collaborators (see https://opensourcesurvey.org/2017 and https://github.com/github/open-source-survey). The survey was run in 2017, asking over 50 questions on a variety of topics. The survey's designers explain the motivation, design, and distribution of the survey:

"In collaboration with researchers from academia, industry, and the community, GitHub designed a survey to gather high quality and novel data on open source software development practices and communities. We collected responses from 5,500 randomly sampled respondents sourced from over 3,800 open source repositories on GitHub.com, and over 500 responses from a non-random sample of communities that work on other platforms. The results are an open data set about the attitudes, experiences, and backgrounds of those who use, build, and maintain open source software."

Purpose and goal¶

The GitHub survey team presented analyses of some questions when releasing the survey, but the survey asked many more questions that are relevant to researchers and community members. This report is an exploratory analysis of all questions asked in the survey. For each question, it presents summary statistics (mostly frequency counts and proportions), followed by a bar graph of the frequencies or proportions. Most questions are presented individually, with panel questions grouped together as appropriate. There are no correlations, regressions, or descriptive breakouts between subgroups. Likert-style questions (e.g. strongly agree <-> strongly disagree) have not been recoded to numerical, scalar values. There are no discussions or interpretations of results; these are left for future work.

The purpose of this notebook is to facilitate future research on this dataset by giving an overview of the kinds of questions asked in the survey, and to serve as the basis for a PDF report, published on SocArXiv and OSF at https://osf.io/preprints/socarxiv/qps53/. The notebook is public on GitHub at https://github.com/staeiou/github-survey-analysis and others are encouraged to extend it as they see fit.
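
For readers who want to take the recoding step that this report leaves for future work, the following is a minimal sketch of mapping Likert-style text answers to ordered numeric scores. The toy answers and the exact wording of the five-point scale are assumptions for illustration; the labels should be checked against questionnaire.txt before anything like this is applied to the real data.

```python
import pandas as pd

# Hypothetical toy answers; None stands for a skipped question.
answers = pd.Series(["Strongly agree", "Somewhat disagree", None, "Strongly agree"])

# Assumed scale wording, ordered from most negative to most positive.
scale = ["Strongly disagree", "Somewhat disagree", "Neither agree nor disagree",
         "Somewhat agree", "Strongly agree"]

ordered = pd.Categorical(answers, categories=scale, ordered=True)
raw_codes = pd.Series(ordered.codes, index=answers.index).astype(float)

# Categorical codes start at 0 and use -1 for missing values,
# so keep only valid codes (others become NaN) and shift to 1..5.
scores = raw_codes.where(raw_codes >= 0) + 1
print(scores.tolist())
```

Keeping missing answers as NaN, rather than as a numeric code, means they drop out of any later mean or median instead of distorting it.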

In [1]:

!pip install pandas seaborn

Requirement already satisfied: pandas in /home/staeiou/conda/lib/python3.5/site-packages
Requirement already satisfied: seaborn in /home/staeiou/conda/lib/python3.5/site-packages
Requirement already satisfied: python-dateutil>=2 in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: pytz>=2011k in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /home/staeiou/conda/lib/python3.5/site-packages (from python-dateutil>=2->pandas)

In [2]:

import pandas as pd
import matplotlib, matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline
pd.options.display.float_format = '{:.2f}%'.format # add % to all floats, all floats here are percentages
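
Setting `float_format` globally, as above, appends a percent sign to every float the notebook displays, including any that are not percentages. As a hedged alternative, the sketch below scopes the same format to a single display with `pd.option_context`; the toy value is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"percent": [71.44]})  # hypothetical value for illustration

# The percent format applies only inside the with-block.
with pd.option_context("display.float_format", "{:.2f}%".format):
    inside = repr(toy)

outside = repr(toy)  # rendered with whatever format was active before the block
print(inside)
```

This notebook only displays percentages as floats, so the global setting is harmless here; the scoped version matters mainly if the notebook is extended with other float-valued statistics.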

In [3]:

## For making pretty tables when nbconverting to latex

pd.set_option('display.notebook_repr_html', True)

def _repr_latex_(self):
    return "\centering{%s}" % self.to_latex()

pd.DataFrame._repr_latex_ = _repr_latex_  # monkey patch pandas DataFrame

Download and unzip data¶

In [4]:

!unzip -o data_for_public_release.zip

Archive:  data_for_public_release.zip
   creating: data_for_public_release/
  inflating: data_for_public_release/negative_incidents.csv  
  inflating: __MACOSX/data_for_public_release/._negative_incidents.csv  
  inflating: data_for_public_release/notes.txt  
  inflating: data_for_public_release/questionnaire.txt  
  inflating: __MACOSX/data_for_public_release/._questionnaire.txt  
  inflating: data_for_public_release/README.txt  
  inflating: __MACOSX/data_for_public_release/._README.txt  
  inflating: data_for_public_release/survey_data.csv  
  inflating: __MACOSX/data_for_public_release/._survey_data.csv
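
The archive listing above includes `__MACOSX/` entries, which are macOS metadata rather than survey data. As an alternative to the shell command, here is a minimal sketch that extracts the archive from Python and skips those entries. The function name `extract_without_macosx` is hypothetical, not part of the original notebook.

```python
import zipfile

def extract_without_macosx(zip_path, dest="."):
    """Extract a zip archive, skipping macOS metadata entries.

    Returns the list of archive member names that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.startswith("__MACOSX/"):
                continue  # resource-fork metadata, not survey data
            archive.extract(name, path=dest)
            extracted.append(name)
    return extracted

# Example call (assumes the zip file is in the working directory):
# extract_without_macosx("data_for_public_release.zip")
```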

In [5]:

!ls data_for_public_release/

negative_incidents.csv	questionnaire.txt  survey_data.csv
notes.txt		README.txt

Data processing¶

Main dataset¶

Load main dataset into pandas¶

In [6]:

pd.options.display.max_rows = 500

In [7]:

survey_df = pd.read_csv("data_for_public_release/survey_data.csv")

In [8]:

print("survey_data.csv length:", len(survey_df))

survey_data.csv length: 6029

In [9]:

survey_complete_df = survey_df.query("STATUS == 'Complete'")
print("survey_data.csv completed responses:", len(survey_complete_df))

survey_data.csv completed responses: 3746

Explore the main dataset with some sample responses¶

In [10]:

survey_complete_df[0:3].transpose()

Out[10]:

	3	4	6
RESPONSE.ID	48	49	51
DATE.SUBMITTED	3/21/17 15:42	3/21/17 15:38	3/21/17 15:41
STATUS	Complete	Complete	Complete
PARTICIPATION.TYPE.FOLLOW	1	1	1
PARTICIPATION.TYPE.USE.APPLICATIONS	1	1	1
PARTICIPATION.TYPE.USE.DEPENDENCIES	1	1	1
PARTICIPATION.TYPE.CONTRIBUTE	1	1	0
PARTICIPATION.TYPE.OTHER	0	0	0
CONTRIBUTOR.TYPE.CONTRIBUTE.CODE	Frequently	Occasionally	NaN
CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS	Rarely	Rarely	NaN
CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE	Frequently	Rarely	NaN
CONTRIBUTOR.TYPE.FILE.BUGS	Frequently	Frequently	NaN
CONTRIBUTOR.TYPE.FEATURE.REQUESTS	Frequently	Frequently	NaN
CONTRIBUTOR.TYPE.COMMUNITY.ADMIN	Never	Occasionally	NaN
EMPLOYMENT.STATUS	Employed full time	Full time student	Employed full time
PROFESSIONAL.SOFTWARE	Frequently	NaN	Frequently
FUTURE.CONTRIBUTION.INTEREST	Very interested	Very interested	Very interested
FUTURE.CONTRIBUTION.LIKELIHOOD	Very likely	Very likely	Somewhat unlikely
OSS.USER.PRIORITIES.LICENSE	Very important to have	Very important to have	Very important to have
OSS.USER.PRIORITIES.CODE.OF.CONDUCT	Somewhat important not to have	Somewhat important to have	Not important either way
OSS.USER.PRIORITIES.CONTRIBUTING.GUIDE	Somewhat important to have	Very important to have	Somewhat important to have
OSS.USER.PRIORITIES.CLA	Not important either way	Very important to have	Don't know what this is
OSS.USER.PRIORITIES.ACTIVE.DEVELOPMENT	Somewhat important to have	Very important to have	Very important to have
OSS.USER.PRIORITIES.RESPONSIVE.MAINTAINERS	Somewhat important to have	Very important to have	Very important to have
OSS.USER.PRIORITIES.WELCOMING.COMMUNITY	Very important to have	Very important to have	Somewhat important to have
OSS.USER.PRIORITIES.WIDESPREAD.USE	Somewhat important to have	Not important either way	Somewhat important to have
OSS.CONTRIBUTOR.PRIORITIES.LICENSE	Not important either way	NaN	NaN
OSS.CONTRIBUTOR.PRIORITIES.CODE.OF.CONDUCT	Somewhat important not to have	NaN	NaN
OSS.CONTRIBUTOR.PRIORITIES.CONTRIBUTING.GUIDE	Not important either way	NaN	NaN
OSS.CONTRIBUTOR.PRIORITIES.CLA	Not important either way	NaN	NaN
OSS.CONTRIBUTOR.PRIORITIES.ACTIVE.DEVELOPMENT	Somewhat important to have	NaN	NaN
OSS.CONTRIBUTOR.PRIORITIES.RESPONSIVE.MAINTAINERS	Somewhat important to have	NaN	NaN
OSS.CONTRIBUTOR.PRIORITIES.WELCOMING.COMMUNITY	Somewhat important to have	NaN	NaN
OSS.CONTRIBUTOR.PRIORITIES.WIDESPREAD.USE	Somewhat important to have	NaN	NaN
SEEK.OPEN.SOURCE	Sometimes	Always	Always
OSS.UX	Generally easier to use	About the same	Generally easier to use
OSS.SECURITY	Generally more secure	Generally more secure	About the same
OSS.STABILITY	About the same	Generally less stable	About the same
INTERNAL.EFFICACY	Strongly agree	Strongly agree	Strongly agree
EXTERNAL.EFFICACY	Strongly agree	Strongly agree	Neither agree nor disagree
OSS.IDENTIFICATION	Neither agree nor disagree	Strongly agree	Neither agree nor disagree
USER.VALUES.STABILITY	Moderately important	Extremely important	Extremely important
USER.VALUES.INNOVATION	Not at all important	Very important	Moderately important
USER.VALUES.REPLICABILITY	Very important	Very important	Moderately important
USER.VALUES.COMPATIBILITY	Very important	Very important	Extremely important
USER.VALUES.SECURITY	Very important	Very important	Extremely important
USER.VALUES.COST	Very important	Not at all important	Very important
USER.VALUES.TRANSPARENCY	Very important	Extremely important	Extremely important
USER.VALUES.USER.EXPERIENCE	Extremely important	Moderately important	Very important
USER.VALUES.CUSTOMIZABILITY	Extremely important	Very important	Extremely important
USER.VALUES.SUPPORT	Slightly important	Moderately important	Not at all important
USER.VALUES.TRUSTED.PRODUCER	Very important	Slightly important	Moderately important
TRANSPARENCY.PRIVACY.BELIEFS	People should be able to contribute code witho...	People should be able to contribute code witho...	People should be able to contribute code witho...
INFO.AVAILABILITY	A lot of information about me	A lot of information about me	A little information about me
INFO.JOB	Yes	No	No
TRANSPARENCY.PRIVACY.PRACTICES.GENERAL	I include my real name.	I include my real name.	I don't publish this kind of content online.
TRANSPARENCY.PRIVACY.PRACTICES.OSS	I include my real name.	I include my real name.	NaN
RECEIVED.HELP	Yes	Yes	Yes
FIND.HELPER	Other - Please describe	I asked for help in a public forum (e.g. in a ...	I asked a specific person for help.
HELPER.PRIOR.RELATIONSHIP	We knew each other well.	Total strangers, I didn't know of them previou...	We knew each other well.
RECEIVED.HELP.TYPE	Writing code or otherwise implementing ideas.	Installing or using an application.	Installing or using an application.
PROVIDED.HELP	Yes	Yes	Yes
FIND.HELPEE	I reached out to them to offer unsolicited help.	They asked for help in a public forum (e.g. in...	They asked me directly for help.
HELPEE.PRIOR.RELATIONSHIP	Total strangers, I didn't know of them previou...	Total strangers, I didn't know of them previou...	We knew each other well.
PROVIDED.HELP.TYPE	Writing code or otherwise implementing ideas.	Installing or using an application.	Installing or using an application.
DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE	Yes	Yes	Yes
DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION	Yes	No	No
DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE	Yes	Yes	Yes
DISCOURAGING.BEHAVIOR.BAD.DOCS	Yes	Yes	Yes
DISCOURAGING.BEHAVIOR.CONFLICT	Yes	Yes	No
DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE	No	No	No
OSS.AS.JOB	Yes, directly- some or all of my work duties ...	NaN	NaN
OSS.AT.WORK	Frequently	NaN	Frequently
OSS.IP.POLICY	I am free to contribute without asking for per...	NaN	I'm not sure.
EMPLOYER.POLICY.APPLICATIONS	Use of open source applications is acceptable ...	NaN	Use of open source applications is encouraged.
EMPLOYER.POLICY.DEPENDENCIES	Use of open source dependencies is acceptable ...	NaN	Use of open source dependencies is encouraged.
OSS.HIRING	Very important	NaN	NaN
IMMIGRATION	No, I live in the country where I was born.	No, I live in the country where I was born.	Yes, and I intend to stay permanently.
MINORITY.HOMECOUNTRY	NaN	NaN	No
MINORITY.CURRENT.COUNTRY	No	No	No
GENDER	Man	Man	Man
TRANSGENDER.IDENTITY	No	No	No
SEXUAL.ORIENTATION	No	Yes	No
WRITTEN.ENGLISH	Very well	Very well	Very well
AGE	35 to 44 years	17 or younger	35 to 44 years
FORMAL.EDUCATION	Bachelor's degree	Secondary (high) school graduate or equivalent	Vocational/trade program or apprenticeship
PARENTS.FORMAL.EDUCATION	Bachelor's degree	Master's degree	Bachelor's degree
AGE.AT.FIRST.COMPUTER.INTERNET	13 - 17 years old	Younger than 13 years old	13 - 17 years old
LOCATION.OF.FIRST.COMPUTER.INTERNET	At home (belonging to me or a family member)	At home (belonging to me or a family member)	At home (belonging to me or a family member)
PARTICIPATION.TYPE.ANY.REPONSE	1	1	1
POPULATION	github	github	github
OFF.SITE.ID	NaN	NaN	NaN
TRANSLATED	0	0	0
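
The NaN entries in the sample above show that not every respondent answered every question, so each question has its own number of usable answers. A minimal sketch of counting answered responses per question follows; the toy rows are hypothetical and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    "OSS.HIRING": ["Very important", None, None],
    "GENDER": ["Man", None, "Man"],
})

answered = toy.notna().sum()                # non-missing answers per question
share_answered = answered / len(toy) * 100  # as a percentage of all rows
print(answered)
print(share_answered.round(2))
```

On the real data the equivalent would be `survey_df.notna().sum()`. This matters when reading the percentage tables later in the notebook: `value_counts(normalize=True)` divides by the number of answers to that question, not by all 6029 rows.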

Create lists of variables for bulk analysis¶

In [11]:

participation_type_vars = ['PARTICIPATION.TYPE.FOLLOW',
       'PARTICIPATION.TYPE.USE.APPLICATIONS',
       'PARTICIPATION.TYPE.USE.DEPENDENCIES', 'PARTICIPATION.TYPE.CONTRIBUTE',
       'PARTICIPATION.TYPE.OTHER']

contrib_type_vars = ['CONTRIBUTOR.TYPE.CONTRIBUTE.CODE',
       'CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS',
       'CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE', 'CONTRIBUTOR.TYPE.FILE.BUGS',
       'CONTRIBUTOR.TYPE.FEATURE.REQUESTS', 'CONTRIBUTOR.TYPE.COMMUNITY.ADMIN']

contrib_other_vars = ['EMPLOYMENT.STATUS', 'PROFESSIONAL.SOFTWARE',
       'FUTURE.CONTRIBUTION.INTEREST', 'FUTURE.CONTRIBUTION.LIKELIHOOD']

contrib_ident_vars = participation_type_vars + contrib_type_vars + contrib_other_vars

In [12]:

user_pri_vars = ['OSS.USER.PRIORITIES.LICENSE', 'OSS.USER.PRIORITIES.CODE.OF.CONDUCT',
       'OSS.USER.PRIORITIES.CONTRIBUTING.GUIDE', 'OSS.USER.PRIORITIES.CLA',
       'OSS.USER.PRIORITIES.ACTIVE.DEVELOPMENT',
       'OSS.USER.PRIORITIES.RESPONSIVE.MAINTAINERS',
       'OSS.USER.PRIORITIES.WELCOMING.COMMUNITY',
       'OSS.USER.PRIORITIES.WIDESPREAD.USE']

contrib_pri_vars = ['OSS.CONTRIBUTOR.PRIORITIES.LICENSE',
       'OSS.CONTRIBUTOR.PRIORITIES.CODE.OF.CONDUCT',
       'OSS.CONTRIBUTOR.PRIORITIES.CONTRIBUTING.GUIDE',
       'OSS.CONTRIBUTOR.PRIORITIES.CLA',
       'OSS.CONTRIBUTOR.PRIORITIES.ACTIVE.DEVELOPMENT',
       'OSS.CONTRIBUTOR.PRIORITIES.RESPONSIVE.MAINTAINERS',
       'OSS.CONTRIBUTOR.PRIORITIES.WELCOMING.COMMUNITY',
       'OSS.CONTRIBUTOR.PRIORITIES.WIDESPREAD.USE']

oss_values_vars = [ 'SEEK.OPEN.SOURCE',
       'OSS.UX', 'OSS.SECURITY', 'OSS.STABILITY', 'INTERNAL.EFFICACY',
       'EXTERNAL.EFFICACY', 'OSS.IDENTIFICATION']

user_values_vars = ['USER.VALUES.STABILITY',
       'USER.VALUES.INNOVATION', 'USER.VALUES.REPLICABILITY',
       'USER.VALUES.COMPATIBILITY', 'USER.VALUES.SECURITY', 'USER.VALUES.COST',
       'USER.VALUES.TRANSPARENCY', 'USER.VALUES.USER.EXPERIENCE',
       'USER.VALUES.CUSTOMIZABILITY', 'USER.VALUES.SUPPORT',
       'USER.VALUES.TRUSTED.PRODUCER']

values_pri_vars = user_pri_vars + contrib_pri_vars + user_values_vars + oss_values_vars 

In [13]:

privacy_transp_vars = ['TRANSPARENCY.PRIVACY.BELIEFS',
       'INFO.AVAILABILITY', 'INFO.JOB',
       'TRANSPARENCY.PRIVACY.PRACTICES.GENERAL',
       'TRANSPARENCY.PRIVACY.PRACTICES.OSS']

In [14]:

help_vars = ['RECEIVED.HELP', 'FIND.HELPER',
       'HELPER.PRIOR.RELATIONSHIP', 'RECEIVED.HELP.TYPE', 'PROVIDED.HELP',
       'FIND.HELPEE', 'HELPEE.PRIOR.RELATIONSHIP', 'PROVIDED.HELP.TYPE']

In [15]:

paid_work_vars = ['OSS.AS.JOB',
       'OSS.AT.WORK', 'OSS.IP.POLICY', 'EMPLOYER.POLICY.APPLICATIONS',
       'EMPLOYER.POLICY.DEPENDENCIES', 'OSS.HIRING']

In [16]:

discouraging_vars = ['DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE',
       'DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION',
       'DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE',
       'DISCOURAGING.BEHAVIOR.BAD.DOCS', 'DISCOURAGING.BEHAVIOR.CONFLICT',
       'DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE']

In [17]:

demographic_vars = ['IMMIGRATION',
       'MINORITY.HOMECOUNTRY', 'MINORITY.CURRENT.COUNTRY', 'GENDER',
       'TRANSGENDER.IDENTITY', 'SEXUAL.ORIENTATION', 'WRITTEN.ENGLISH', 'AGE',
       'FORMAL.EDUCATION', 'PARENTS.FORMAL.EDUCATION',
       'AGE.AT.FIRST.COMPUTER.INTERNET', 'LOCATION.OF.FIRST.COMPUTER.INTERNET',
       'PARTICIPATION.TYPE.ANY.REPONSE', 'POPULATION', 'OFF.SITE.ID',
       'TRANSLATED']

In [18]:

survey_vars = [contrib_ident_vars, values_pri_vars, privacy_transp_vars,
               help_vars, paid_work_vars, discouraging_vars, demographic_vars]
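
Because the variable lists above are typed out by hand, it can help to check that they cover the dataset's columns without omissions or repeats. The sketch below is one way to do that; `check_variable_groups` is a hypothetical helper and the toy column names are for illustration only.

```python
def check_variable_groups(columns, groups):
    """Return (missing, duplicated) column names for a list of variable groups.

    missing:    columns that appear in no group
    duplicated: names that appear more than once across the groups
    """
    seen = [name for group in groups for name in group]
    missing = sorted(set(columns) - set(seen))
    duplicated = sorted({name for name in seen if seen.count(name) > 1})
    return missing, duplicated

# Hypothetical toy example
toy_columns = ["RESPONSE.ID", "GENDER", "AGE", "POPULATION"]
toy_groups = [["GENDER", "AGE"], ["POPULATION", "AGE"]]
missing, duplicated = check_variable_groups(toy_columns, toy_groups)
print(missing)     # ['RESPONSE.ID']
print(duplicated)  # ['AGE']
```

On the real data, `check_variable_groups(survey_df.columns, survey_vars)` would be expected to report RESPONSE.ID, DATE.SUBMITTED, and STATUS as missing, since none of the lists above include them.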

Negative incidents¶

Load into pandas¶

In [19]:

neg_df = pd.read_csv("data_for_public_release/negative_incidents.csv")

In [20]:

print("negative_incidents.csv length:", len(neg_df))

negative_incidents.csv length: 6029

Explore the negative dataset with some sample responses¶

In [21]:

neg_df[0:3].transpose()

Out[21]:

	0	1	2
NEGATIVE.WITNESS.RUDENESS	1	1	0
NEGATIVE.WITNESS.NAME.CALLING	1	0	0
NEGATIVE.WITNESS.THREATS	0	0	0
NEGATIVE.WITNESS.IMPERSONATION	0	0	1
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT	0	0	0
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT	0	0	0
NEGATIVE.WITNESS.STALKING	0	0	0
NEGATIVE.WITNESS.SEXUAL.ADVANCES	0	0	0
NEGATIVE.WITNESS.STEREOTYPING	0	0	0
NEGATIVE.WITNESS.DOXXING	0	0	1
NEGATIVE.WITNESS.OTHER	0	0	0
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE	0	0	0
NEGATIVE.EXPERIENCE.RUDENESS	0	1	0
NEGATIVE.EXPERIENCE.NAME.CALLING	0	0	0
NEGATIVE.EXPERIENCE.THREATS	0	0	0
NEGATIVE.EXPERIENCE.IMPERSONATION	0	0	0
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT	0	0	0
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT	0	0	0
NEGATIVE.EXPERIENCE.STALKING	0	0	0
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES	0	0	0
NEGATIVE.EXPERIENCE.STEREOTYPING	0	0	0
NEGATIVE.EXPERIENCE.DOXXING	0	0	0
NEGATIVE.EXPERIENCE.OTHER	0	0	0
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE	1	0	1
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP	0	0	0
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT	0	0	0
NEGATIVE.RESPONSE.BLOCKED.USER	0	0	0
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS	0	0	0
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP	0	0	0
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL	0	0	0
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT	0	0	0
NEGATIVE.RESPONSE.OTHER	0	0	0
NEGATIVE.RESPONSE.IGNORED	0	1	0
RESPONSE.EFFECTIVENESS.ASKED.USER.TO.STOP	NaN	NaN	NaN
RESPONSE.EFFECTIVENESS.SOLICITED.COMMUNITY.SUPPORT	NaN	NaN	NaN
RESPONSE.EFFECTIVENESS.BLOCKED.USER	NaN	NaN	NaN
RESPONSE.EFFECTIVENESS.REPORTED.TO.MAINTAINERS	NaN	NaN	NaN
RESPONSE.EFFECTIVENESS.REPORTED.TO.HOST.OR.ISP	NaN	NaN	NaN
RESPONSE.EFFECTIVENESS.CONSULTED.LEGAL.COUNSEL	NaN	NaN	NaN
RESPONSE.EFFECTIVENESS.CONTACTED.LAW.ENFORCEMENT	NaN	NaN	NaN
RESPONSE.EFFECTIVENESS.OTHER	NaN	NaN	NaN
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING	0	0	1
NEGATIVE.CONSEQUENCES.PSEUDONYM	0	0	0
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE	0	0	0
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME	0	0	0
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE	0	0	0
NEGATIVE.CONSEQUENCES.SUGGEST.COC	0	0	0
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION	0	0	0
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION	0	1	0
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES	0	0	0
NEGATIVE.CONSEQUENCES.OTHER	0	0	0
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE	1	0	0
NEGATIVE.WITNESS.ANY.RESPONSE	1	1	1
NEGATIVE.EXPERIENCE.ANY.RESPONSE	1	1	1
NEGATIVE.RESPONSE.ANY.RESPONSE	0	1	0
NEGATIVE.CONSEQUENCES.ANY.RESPONSE	1	1	1
POPULATION	github	github	github

Create lists of variables for bulk analysis¶

In [22]:

neg_witness_vars = ['NEGATIVE.WITNESS.RUDENESS', 'NEGATIVE.WITNESS.NAME.CALLING',
       'NEGATIVE.WITNESS.THREATS', 'NEGATIVE.WITNESS.IMPERSONATION',
       'NEGATIVE.WITNESS.SUSTAINED.HARASSMENT',
       'NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT',
       'NEGATIVE.WITNESS.STALKING', 'NEGATIVE.WITNESS.SEXUAL.ADVANCES',
       'NEGATIVE.WITNESS.STEREOTYPING', 'NEGATIVE.WITNESS.DOXXING',
       'NEGATIVE.WITNESS.OTHER', 'NEGATIVE.WITNESS.NONE.OF.THE.ABOVE', 'NEGATIVE.WITNESS.ANY.RESPONSE']

In [23]:

neg_exp_vars = ['NEGATIVE.EXPERIENCE.RUDENESS', 'NEGATIVE.EXPERIENCE.NAME.CALLING',
       'NEGATIVE.EXPERIENCE.THREATS', 'NEGATIVE.EXPERIENCE.IMPERSONATION',
       'NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT',
       'NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT',
       'NEGATIVE.EXPERIENCE.STALKING', 'NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES',
       'NEGATIVE.EXPERIENCE.STEREOTYPING', 'NEGATIVE.EXPERIENCE.DOXXING',
       'NEGATIVE.EXPERIENCE.OTHER', 'NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE', 'NEGATIVE.EXPERIENCE.ANY.RESPONSE']

In [24]:

neg_resp_vars = ['NEGATIVE.RESPONSE.ASKED.USER.TO.STOP',
       'NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT',
       'NEGATIVE.RESPONSE.BLOCKED.USER',
       'NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS',
       'NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP',
       'NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL',
       'NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT',
       'NEGATIVE.RESPONSE.OTHER', 'NEGATIVE.RESPONSE.IGNORED', 'NEGATIVE.RESPONSE.ANY.RESPONSE']

In [25]:

neg_effect_vars = ['RESPONSE.EFFECTIVENESS.ASKED.USER.TO.STOP',
       'RESPONSE.EFFECTIVENESS.SOLICITED.COMMUNITY.SUPPORT',
       'RESPONSE.EFFECTIVENESS.BLOCKED.USER',
       'RESPONSE.EFFECTIVENESS.REPORTED.TO.MAINTAINERS',
       'RESPONSE.EFFECTIVENESS.REPORTED.TO.HOST.OR.ISP',
       'RESPONSE.EFFECTIVENESS.CONSULTED.LEGAL.COUNSEL',
       'RESPONSE.EFFECTIVENESS.CONTACTED.LAW.ENFORCEMENT',
       'RESPONSE.EFFECTIVENESS.OTHER']

In [26]:

neg_conseq_vars = ['NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING',
       'NEGATIVE.CONSEQUENCES.PSEUDONYM',
       'NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE',
       'NEGATIVE.CONSEQUENCES.CHANGE.USERNAME',
       'NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE',
       'NEGATIVE.CONSEQUENCES.SUGGEST.COC',
       'NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION',
       'NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION',
       'NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES', 'NEGATIVE.CONSEQUENCES.OTHER',
       'NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE', 'NEGATIVE.CONSEQUENCES.ANY.RESPONSE']

In [27]:

neg_anyresp_vars = ['NEGATIVE.WITNESS.ANY.RESPONSE', 'NEGATIVE.EXPERIENCE.ANY.RESPONSE',
       'NEGATIVE.RESPONSE.ANY.RESPONSE', 'NEGATIVE.CONSEQUENCES.ANY.RESPONSE']

Analysis¶

In [28]:

sns.set(font_scale=1.5)

Contributor identity¶

People participate in open source in different ways. Which of the following activities do you engage in?¶

Choose all that apply.

In [29]:

participation_type_resp = survey_df[participation_type_vars].apply(pd.Series.value_counts).transpose()
participation_type_resp.columns = ["No", "Yes"]
participation_type_resp

Out[29]:

	No	Yes
PARTICIPATION.TYPE.FOLLOW	1287	4742
PARTICIPATION.TYPE.USE.APPLICATIONS	454	5575
PARTICIPATION.TYPE.USE.DEPENDENCIES	946	5083
PARTICIPATION.TYPE.CONTRIBUTE	1722	4307
PARTICIPATION.TYPE.OTHER	5742	287

In [30]:

participation_type_prop = survey_df[participation_type_vars].mean() * 100
participation_type_prop = participation_type_prop.sort_values()
pd.DataFrame(participation_type_prop, columns=["percent"])

Out[30]:

	percent
PARTICIPATION.TYPE.OTHER	4.76%
PARTICIPATION.TYPE.CONTRIBUTE	71.44%
PARTICIPATION.TYPE.FOLLOW	78.65%
PARTICIPATION.TYPE.USE.DEPENDENCIES	84.31%
PARTICIPATION.TYPE.USE.APPLICATIONS	92.47%
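
Because this is a "choose all that apply" question, each option is stored as its own 0/1 column, which is why the mean of a column times 100 gives the percentage of respondents selecting it and why the percentages do not sum to 100. A minimal sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: three respondents, 1 = selected, 0 = not selected.
toy = pd.DataFrame({
    "PARTICIPATION.TYPE.FOLLOW": [1, 0, 1],
    "PARTICIPATION.TYPE.CONTRIBUTE": [1, 0, 0],
})

percent_selected = toy.mean() * 100  # share of respondents choosing each option
n_selected = toy.sum(axis=1)         # number of options each respondent chose

print(percent_selected.round(2))
print(n_selected.tolist())  # [2, 0, 1]
```

The per-respondent count in `n_selected` is not used in this report, but it is one way a follow-up analysis could look at how many kinds of participation respondents combine.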

In [31]:

ax = participation_type_prop.plot(kind='barh')

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[19:].replace(".", " ") # cut off "PARTICIPATION.TYPE."
        
    labels.append(title_text)
    
plt.xlim(0,100)
ax.set_yticklabels(labels)

ax.set_xlabel("Percent of respondents")
t = plt.title("% of people who participate in the following activities:")

Contribution type: How often do you engage in each of the following activities?¶

In [32]:

contrib_type_responses = survey_df[contrib_type_vars].apply(pd.Series.value_counts).transpose()

# Order the answer columns from most to least frequent participation
contrib_type_responses = contrib_type_responses[["Frequently", "Occasionally", "Rarely", "Never"]]
contrib_type_responses = contrib_type_responses.sort_values(by='Frequently')
contrib_type_responses

Out[32]:

	Frequently	Occasionally	Rarely	Never
CONTRIBUTOR.TYPE.COMMUNITY.ADMIN	287	417	867	2412
CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS	460	1214	1665	661
CONTRIBUTOR.TYPE.FEATURE.REQUESTS	573	1625	1346	451
CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE	996	944	974	1090
CONTRIBUTOR.TYPE.FILE.BUGS	1067	2073	768	106
CONTRIBUTOR.TYPE.CONTRIBUTE.CODE	1160	1383	1301	189
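
Selecting the answer columns by name, as in the cell above, raises a `KeyError` if some answer option was never chosen for any item in the panel. That does not happen with this dataset, but as a sketch of a more defensive variant, `reindex` can impose the answer order and fill absent options with zero. The toy panel below is hypothetical.

```python
import pandas as pd

# Hypothetical toy panel: two items; nobody chose "Occasionally" or "Never".
toy = pd.DataFrame({
    "ITEM.A": ["Frequently", "Rarely", "Frequently"],
    "ITEM.B": ["Rarely", "Rarely", None],
})
answer_order = ["Frequently", "Occasionally", "Rarely", "Never"]

counts = toy.apply(pd.Series.value_counts).transpose()

# reindex adds any answer option missing from the data; fill those with 0.
counts = counts.reindex(columns=answer_order).fillna(0).astype(int)
print(counts)
```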

In [33]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
contrib_type_responses.plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[17:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("How often do you engage in each of the following activities?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

Employment status¶

EMPLOYMENT.STATUS

In [34]:

prop_df = pd.DataFrame((survey_df['EMPLOYMENT.STATUS'].value_counts()))
prop_df.columns=["count"]
prop_df

Out[34]:

	count
Employed full time	3615
Full time student	1048
Employed part time	349
Temporarily not working	314
Other - please describe	184
Retired or permanently not working (e.g. due to disability)	90

In [35]:

prop_df = pd.DataFrame((survey_df['EMPLOYMENT.STATUS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[35]:

	percent
Employed full time	64.55%
Full time student	18.71%
Employed part time	6.23%
Temporarily not working	5.61%
Other - please describe	3.29%
Retired or permanently not working (e.g. due to disability)	1.61%
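
The count table and percent table above follow a pattern that repeats for most single-answer questions in this notebook. As a sketch of how the two could be produced together, the hypothetical helper `summarize_question` below returns both columns in one DataFrame; the toy answers are made up.

```python
import pandas as pd

def summarize_question(series):
    """Return counts and percentages of each answer, ignoring missing values."""
    counts = series.value_counts()  # drops NaN by default
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})

# Hypothetical toy answers; None stands for a skipped question.
toy_answers = pd.Series(["Yes", "No", "Yes", None, "Yes"])
summary = summarize_question(toy_answers)
print(summary)
```

On the real data this would be called as, for example, `summarize_question(survey_df['EMPLOYMENT.STATUS'])`. Rounding after multiplying by 100 differs slightly from the notebook's `.round(4)*100` but should agree to two decimal places.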

In [36]:

ax = pd.DataFrame(survey_df['EMPLOYMENT.STATUS'].value_counts()).plot(kind='barh')
plt.suptitle("Employment status")
t = ax.set_xlabel("Count of responses")

In your main job, how often do you write or otherwise directly contribute to producing software?¶

PROFESSIONAL.SOFTWARE

In [37]:

prop_df = pd.DataFrame((survey_df['PROFESSIONAL.SOFTWARE'].value_counts()))
prop_df.columns=["count"]
prop_df

Out[37]:

	count
Frequently	2747
Occasionally	542
Rarely	339
Never	279

In [38]:

prop_df = pd.DataFrame((survey_df['PROFESSIONAL.SOFTWARE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[38]:

	percent
Frequently	70.31%
Occasionally	13.87%
Rarely	8.68%
Never	7.14%

In [39]:

ax = pd.DataFrame(survey_df['PROFESSIONAL.SOFTWARE'].value_counts()).plot(kind='barh')
plt.title("In your main job, how often do you write or\notherwise directly contribute to producing software?")
t = ax.set_xlabel("Count of responses")

How interested are you in contributing to open source projects in the future?¶

FUTURE.CONTRIBUTION.INTEREST

In [40]:

prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts()))
prop_df.columns=["count"]
prop_df

Out[40]:

	count
Very interested	3929
Somewhat interested	1430
Not too interested	125
Not at all interested	24

In [41]:

prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[41]:

	percent
Very interested	71.33%
Somewhat interested	25.96%
Not too interested	2.27%
Not at all interested	0.44%

In [42]:

ax = pd.DataFrame(survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts()).plot(kind='barh')
plt.title("How interested are you in contributing\nto open source projects in the future?")
t = ax.set_xlabel("Count of responses")

How likely are you to contribute to open source projects in the future?¶

FUTURE.CONTRIBUTION.LIKELIHOOD

In [43]:

prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts()))
prop_df.columns=["count"]
prop_df

Out[43]:

	count
Very likely	3271
Somewhat likely	1719
Somewhat unlikely	440
Very unlikely	81

In [44]:

prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[44]:

	percent
Very likely	59.35%
Somewhat likely	31.19%
Somewhat unlikely	7.98%
Very unlikely	1.47%

In [45]:

ax = pd.DataFrame(survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts()).plot(kind='barh')
plt.title("How likely are you to contribute to\nopen source projects in the future?")
t = ax.set_xlabel("Count of responses")

Priorities and values¶

When thinking about whether to use open source software, how important are the following things?¶

OSS.USER.PRIORITIES.*

In [46]:

user_pri_responses = survey_df[user_pri_vars].apply(pd.Series.value_counts).transpose()

user_pri_responses = user_pri_responses[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
user_pri_responses = user_pri_responses.sort_values(by="Very important to have")

In [47]:

idx = []
for i in user_pri_responses.index:
    idx.append(i[20:])
idx = pd.Series(idx)    
user_pri_responses.set_index(idx)

Out[47]:

	Very important to have	Somewhat important to have	Not important either way	Somewhat important not to have	Very important not to have	Don't know what this is
CLA	490	1024	2282	336	157	488
CODE.OF.CONDUCT	848	1461	1993	166	120	209
WIDESPREAD.USE	984	2067	1576	114	47	28
CONTRIBUTING.GUIDE	1212	1866	1516	95	62	62
WELCOMING.COMMUNITY	2062	1822	812	67	33	18
RESPONSIVE.MAINTAINERS	2575	1850	302	31	35	20
ACTIVE.DEVELOPMENT	2768	1722	267	30	31	16
LICENSE	3125	1160	435	31	33	47

In [48]:

user_pri_responses_prop = survey_df[user_pri_vars].apply(pd.Series.value_counts, normalize=True).round(4).transpose()

user_pri_responses_prop = user_pri_responses_prop[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
user_pri_responses_prop = user_pri_responses_prop.sort_values(by="Very important to have")
user_pri_responses_prop = user_pri_responses_prop * 100

In [49]:

idx = []
for i in user_pri_responses_prop.index:
    idx.append(i[20:])
idx = pd.Series(idx)    
user_pri_responses_prop.set_index(idx)

Out[49]:

	Very important to have	Somewhat important to have	Not important either way	Somewhat important not to have	Very important not to have	Don't know what this is
CLA	10.26%	21.44%	47.77%	7.03%	3.29%	10.22%
CODE.OF.CONDUCT	17.68%	30.46%	41.55%	3.46%	2.50%	4.36%
WIDESPREAD.USE	20.43%	42.92%	32.72%	2.37%	0.98%	0.58%
CONTRIBUTING.GUIDE	25.18%	38.77%	31.50%	1.97%	1.29%	1.29%
WELCOMING.COMMUNITY	42.83%	37.85%	16.87%	1.39%	0.69%	0.37%
RESPONSIVE.MAINTAINERS	53.50%	38.44%	6.27%	0.64%	0.73%	0.42%
ACTIVE.DEVELOPMENT	57.26%	35.62%	5.52%	0.62%	0.64%	0.33%
LICENSE	64.69%	24.01%	9.00%	0.64%	0.68%	0.97%

In [50]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.coolwarm
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson", "xkcd:green"]
user_pri_responses.plot.barh(stacked=True, ax=ax, figsize=[12,8], color=colors)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[20:].replace(".", " ") # cut off "OSS.USER.PRIORITIES."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)

plt.title("When thinking about whether to *use* open source software,\n how important are the following things?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.1), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

When thinking about whether to contribute to an open source project, how important are the following things?¶

OSS.CONTRIBUTOR.PRIORITIES.*

In [51]:

contrib_pri_responses = survey_df[contrib_pri_vars].apply(pd.Series.value_counts).transpose()

contrib_pri_responses = contrib_pri_responses[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]

contrib_pri_responses = contrib_pri_responses.sort_values(by="Very important to have")

In [52]:

idx = []
for i in contrib_pri_responses.index:
    idx.append(i[27:])
idx = pd.Series(idx)    
contrib_pri_responses.set_index(idx)

Out[52]:

	Very important to have	Somewhat important to have	Not important either way	Somewhat important not to have	Very important not to have	Don't know what this is
WIDESPREAD.USE	387	1016	1666	70	30	12
CLA	419	712	1266	327	166	280
CODE.OF.CONDUCT	655	1145	1085	119	84	96
CONTRIBUTING.GUIDE	1198	1396	500	41	18	24
ACTIVE.DEVELOPMENT	1368	1333	448	21	18	5
WELCOMING.COMMUNITY	1533	1199	411	21	15	7
RESPONSIVE.MAINTAINERS	1994	1022	138	7	16	7
LICENSE	2199	610	337	16	15	18

In [53]:

contrib_pri_responses_prop = survey_df[contrib_pri_vars].apply(pd.Series.value_counts, normalize=True).round(4).transpose()

# Put the response columns in scale order, with "Don't know what this is" last
contrib_pri_responses_prop = contrib_pri_responses_prop[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
contrib_pri_responses_prop = contrib_pri_responses_prop.sort_values(by="Very important to have")
contrib_pri_responses_prop = contrib_pri_responses_prop * 100

In [54]:

idx = []
for i in contrib_pri_responses_prop.index:
    idx.append(i[27:])
idx = pd.Series(idx)    
contrib_pri_responses_prop.set_index(idx)

Out[54]:

	Very important to have	Somewhat important to have	Not important either way	Somewhat important not to have	Very important not to have	Don't know what this is
WIDESPREAD.USE	12.17%	31.94%	52.37%	2.20%	0.94%	0.38%
CLA	13.22%	22.46%	39.94%	10.32%	5.24%	8.83%
CODE.OF.CONDUCT	20.57%	35.96%	34.08%	3.74%	2.64%	3.02%
CONTRIBUTING.GUIDE	37.71%	43.94%	15.74%	1.29%	0.57%	0.76%
ACTIVE.DEVELOPMENT	42.84%	41.75%	14.03%	0.66%	0.56%	0.16%
WELCOMING.COMMUNITY	48.12%	37.63%	12.90%	0.66%	0.47%	0.22%
RESPONSIVE.MAINTAINERS	62.63%	32.10%	4.33%	0.22%	0.50%	0.22%
LICENSE	68.83%	19.09%	10.55%	0.50%	0.47%	0.56%
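The index-trimming loops above hard-code the prefix length (27 characters for OSS.CONTRIBUTOR.PRIORITIES.). A minimal sketch of one alternative, using pandas' vectorized string slicing on the index and deriving the length from the prefix itself. The table below is a toy with made-up counts, not survey results, and the names prefix, toy and short are only for illustration:

```python
import pandas as pd

# Toy table: the counts are invented; only the variable-name prefix mirrors the survey.
prefix = "OSS.CONTRIBUTOR.PRIORITIES."
toy = pd.DataFrame(
    {"Very important to have": [3, 5], "Not important either way": [4, 1]},
    index=[prefix + "LICENSE", prefix + "CLA"],
)

# Slice every index label by the prefix length instead of a hard-coded number.
short = toy.copy()
short.index = toy.index.str[len(prefix):]
print(short)
```

Deriving the offset from the prefix string keeps the slice and the variable names in sync if the same cell is copied for another panel of questions.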

In [55]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.coolwarm
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson", "xkcd:green"]
contrib_pri_responses.plot.barh(stacked=True, ax=ax, figsize=[12,8], color=colors)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[27:].replace(".", " ") # cut off "OSS.CONTRIBUTOR.PRIORITIES."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)

plt.title("When thinking about whether to *contribute* to an open source project,\nhow important are the following things?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.1), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

How often do you try to find open source options over other kinds of software?¶

SEEK.OPEN.SOURCE

In [56]:

count_df = pd.DataFrame(data=survey_df['SEEK.OPEN.SOURCE'].value_counts())
count_df.columns = ["count"]
count_df

Out[56]:

	count
Always	3407
Sometimes	1111
Rarely	100
Never	25

In [57]:

prop_df = pd.DataFrame((survey_df['SEEK.OPEN.SOURCE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[57]:

	percent
Always	73.38%
Sometimes	23.93%
Rarely	2.15%
Never	0.54%

In [58]:

ax = pd.DataFrame(survey_df['SEEK.OPEN.SOURCE'].value_counts()).plot(kind='barh')
plt.title("How often do you try to find open\nsource options over other kinds of software?")
t = ax.set_xlabel("Count of responses")
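The count and percent tables for single-choice questions follow the same pattern throughout this notebook. A minimal sketch of a helper that builds both columns in one table; summarize is a hypothetical name that does not appear in the notebook, and the responses are made up for illustration:

```python
import pandas as pd

def summarize(responses: pd.Series) -> pd.DataFrame:
    """Return counts and percentages of the non-missing responses."""
    summary = responses.value_counts().to_frame("count")
    # Percentages are aligned to the counts by response label.
    summary["percent"] = (responses.value_counts(normalize=True) * 100).round(2)
    return summary

# Toy responses (not survey data); the missing value is ignored by value_counts().
toy = pd.Series(["Always", "Always", "Always", "Sometimes", None])
summary = summarize(toy)
print(summary)
```

With the survey loaded, a call such as summarize(survey_df['SEEK.OPEN.SOURCE']) would be the assumed usage.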

Open source software usability¶

OSS.UX: Do you believe that open source software is generally easier to use than closed source (proprietary) software, harder to use, or about the same?

In [59]:

count_df = pd.DataFrame(data=survey_df['OSS.UX'].value_counts())
count_df.columns = ["count"]
count_df

Out[59]:

	count
About the same	2027
Generally easier to use	1597
Generally harder to use	897

In [60]:

prop_df = pd.DataFrame((survey_df['OSS.UX'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[60]:

	percent
About the same	44.84%
Generally easier to use	35.32%
Generally harder to use	19.84%

In [61]:

ax = pd.DataFrame(survey_df['OSS.UX'].value_counts()).plot(kind='barh')
plt.title("Do you believe that open source software is generally\neasier to use than closed source (proprietary)\nsoftware, harder to use, or about the same?")
t = ax.set_xlabel("Count of responses")

Open source software security¶

OSS.SECURITY: Do you believe that open source software is generally more secure than closed source (proprietary) software, less secure, or about the same?

In [62]:

count_df = pd.DataFrame(data=survey_df['OSS.SECURITY'].value_counts())
count_df.columns = ["count"]
count_df

Out[62]:

	count
Generally more secure	2688
About the same	1537
Generally less secure	295

In [63]:

prop_df = pd.DataFrame((survey_df['OSS.SECURITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[63]:

	percent
Generally more secure	59.47%
About the same	34.00%
Generally less secure	6.53%

In [64]:

ax = pd.DataFrame(survey_df['OSS.SECURITY'].value_counts()).plot(kind='barh')
plt.title("Do you believe that open source software is\ngenerally more secure than closed source (proprietary)\nsoftware, less secure, or about the same?")
t = ax.set_xlabel("Count of responses")

Open source software stability¶

OSS.STABILITY: Do you believe that open source software is generally more stable than closed source (proprietary) software, less stable, or about the same?

In [65]:

count_df = pd.DataFrame(data=survey_df['OSS.STABILITY'].value_counts())
count_df.columns = ["count"]
count_df

Out[65]:

	count
About the same	2240
Generally more stable	1399
Generally less stable	877

In [66]:

prop_df = pd.DataFrame((survey_df['OSS.STABILITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[66]:

	percent
About the same	49.60%
Generally more stable	30.98%
Generally less stable	19.42%

In [67]:

ax = pd.DataFrame(survey_df['OSS.STABILITY'].value_counts()).plot(kind='barh')
plt.title("Do you believe that open source software is\ngenerally more stable than closed source\n(proprietary), less stable, or about the same?")
t = ax.set_xlabel("Count of responses")

Identification with open source¶

How much do you agree or disagree with the following statements:

EXTERNAL.EFFICACY: The open source community values contributions from people like me.
INTERNAL.EFFICACY: I have the skills and understanding necessary to make meaningful contributions to open source projects.
OSS.IDENTIFICATION: I consider myself to be a member of the open source (and/or the Free/Libre software) community.

In [68]:

oss_id_vars = ["INTERNAL.EFFICACY", "EXTERNAL.EFFICACY", "OSS.IDENTIFICATION"]

In [69]:

oss_id_responses = survey_df[oss_id_vars].apply(pd.Series.value_counts).transpose()

# Put the response columns in scale order, from agreement to disagreement
oss_id_responses = oss_id_responses[["Strongly agree",
                                     "Somewhat agree",
                                     "Neither agree nor disagree",
                                     "Somewhat disagree",
                                     "Strongly disagree"]]
oss_id_responses = oss_id_responses.sort_values(by="Strongly agree")
oss_id_responses

Out[69]:

	Strongly agree	Somewhat agree	Neither agree nor disagree	Somewhat disagree	Strongly disagree
EXTERNAL.EFFICACY	1518	1610	1116	150	58
OSS.IDENTIFICATION	1579	1513	863	351	150
INTERNAL.EFFICACY	2052	1685	418	240	62

In [70]:

oss_id_responses_prop = survey_df[oss_id_vars].apply(pd.Series.value_counts, normalize=True).round(4) * 100 
oss_id_responses_prop.transpose()

Out[70]:

	Neither agree nor disagree	Somewhat agree	Somewhat disagree	Strongly agree	Strongly disagree
INTERNAL.EFFICACY	9.38%	37.81%	5.38%	46.04%	1.39%
EXTERNAL.EFFICACY	25.07%	36.16%	3.37%	34.10%	1.30%
OSS.IDENTIFICATION	19.37%	33.95%	7.88%	35.44%	3.37%
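The proportion table above lists its columns alphabetically, while the count table uses the agree-to-disagree order. A minimal sketch of how the columns could be put in scale order with reindex, shown on toy responses (not survey data); likert_order and toy are illustrative names:

```python
import pandas as pd

likert_order = ["Strongly agree",
                "Somewhat agree",
                "Neither agree nor disagree",
                "Somewhat disagree",
                "Strongly disagree"]

# Toy responses for two of the three statements; values are invented.
toy = pd.DataFrame({
    "INTERNAL.EFFICACY": ["Strongly agree", "Somewhat agree",
                          "Strongly agree", "Somewhat disagree"],
    "EXTERNAL.EFFICACY": ["Somewhat agree", "Neither agree nor disagree",
                          "Strongly disagree", "Strongly agree"],
})

prop = toy.apply(pd.Series.value_counts, normalize=True).round(4) * 100
# One row per statement, columns in scale order; options nobody chose become 0.
prop = prop.transpose().reindex(columns=likert_order).fillna(0)
print(prop)
```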

In [71]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.coolwarm
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson"]
oss_id_responses.plot.barh(stacked=True, ax=ax, figsize=[12,5], cmap=matplotlib.cm.coolwarm, edgecolor='black', linewidth=1)

#print(str(ax.get_yticklabels()))

ax.set_yticklabels(["The open source community values\ncontributions from people like me.",
                    "I consider myself to be a member\nof the open source (and/or the\nFree/Libre software) community.",
                    "I have the skills and understanding\nnecessary to make meaningful\ncontributions to open source projects."])


plt.title("How much do you agree or disagree with the following statements:")

plt.xlabel("Number of responses")


legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.25), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

Transparency vs privacy¶

Attribution¶

TRANSPARENCY.PRIVACY.BELIEFS: Which of the following statements is closest to your beliefs about attribution in software development?

Records of authorship should be required so that end users know who created the source code they are working with.
People should be able to contribute code without attribution, if they wish to remain anonymous.

In [72]:

counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts())
counts_df.columns=["count"]
counts_df

Out[72]:

	count
People should be able to contribute code without attribution, if they wish to remain anonymous.	2454
Records of authorship should be required so that end users know who created the source code they are working with.	1594

In [73]:

prop_df = pd.DataFrame((survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[73]:

	percent
People should be able to contribute code without attribution, if they wish to remain anonymous.	60.62%
Records of authorship should be required so that end users know who created the source code they are working with.	39.38%

In [74]:

ax = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts()).plot(kind='barh', figsize=[10,6])
plt.title("Which of the following statements is closest to your\nbeliefs about attribution in software development?")
ax.set_yticklabels(["People should be able to contribute\ncode without attribution, if\nthey wish to remain anonymous.",
                    "Records of authorship should be\nrequired so that end users know\nwho created the source code they are working with."])
t = ax.set_xlabel("Count of responses")

In general, how much information about you is publicly available online?¶

INFO.AVAILABILITY

In [75]:

count_df = pd.DataFrame(survey_df['INFO.AVAILABILITY'].value_counts())
count_df.columns=["count"]
count_df

Out[75]:

	count
Some information about me	1776
A little information about me	1133
A lot of information about me	1011
No information at all about me	140

In [76]:

prop_df = pd.DataFrame((survey_df['INFO.AVAILABILITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[76]:

	percent
Some information about me	43.74%
A little information about me	27.91%
A lot of information about me	24.90%
No information at all about me	3.45%

In [77]:

ax = pd.DataFrame(survey_df['INFO.AVAILABILITY'].value_counts()).plot(kind='barh')
plt.title("In general, how much information about\nyou is publicly available online?")
t = ax.set_xlabel("Count of responses")

Do you feel that you need to make information available about yourself online for professional reasons?¶

INFO.JOB

In [78]:

count_df = pd.DataFrame(survey_df['INFO.JOB'].value_counts())
count_df.columns = ["count"]
count_df

Out[78]:

	count
Yes	2327
No	1638

In [79]:

prop_df = pd.DataFrame((survey_df['INFO.JOB'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[79]:

	percent
Yes	58.69%
No	41.31%

In [80]:

ax = pd.DataFrame(survey_df['INFO.JOB'].value_counts()).plot(kind='barh', figsize=[10,6])
plt.title("Do you feel that you need to make information available\nabout yourself online for professional reasons?")
t = ax.set_xlabel("Count of responses")

General privacy practices¶

TRANSPARENCY.PRIVACY.PRACTICES.GENERAL

"Which of the following best describes your practices around publishing content online, such as posts on social media (e.g. Facebook, Instagram, Twitter, etc.), blogs, and other platforms (not including contributions to open source projects)?" (single choice)

In [81]:

counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.PRACTICES.GENERAL'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[81]:

	count
I include my real name.	1718
I usually use a consistent pseudonym that is easily linked to my real name online.	1141
I don't publish this kind of content online.	517
I usually use a consistent pseudonym that is not linked anywhere with my real name online.	363
I take precautions to use different pseudonymns on different platforms.	270

In [82]:

prop_df = pd.DataFrame((survey_df['TRANSPARENCY.PRIVACY.PRACTICES.GENERAL'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[82]:

	percent
I include my real name.	42.85%
I usually use a consistent pseudonym that is easily linked to my real name online.	28.46%
I don't publish this kind of content online.	12.90%
I usually use a consistent pseudonym that is not linked anywhere with my real name online.	9.05%
I take precautions to use different pseudonymns on different platforms.	6.73%

In [83]:

plot_counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.PRACTICES.GENERAL'].value_counts())
idx = ['I include my real name.',
       'I usually use a consistent pseudonym that\nis easily linked to my real name online.',
       'I don\'t publish this kind of content online.',
       'I usually use a consistent pseudonym that\nis not linked anywhere with my real name online.',
       'I take precautions to use different\npseudonyms on different platforms.']
plot_counts_df.index = idx

In [84]:

ax = plot_counts_df.plot(kind='barh', figsize=[12,6])
plt.title("Which of the following best describes your\npractices around publishing content online [...] \nnot including contributions to open source projects?")
t = ax.set_xlabel("Count of responses")

OSS privacy practices¶

TRANSPARENCY.PRIVACY.PRACTICES.OSS

"Which of the following best describes your practices when making open source contributions?"

In [85]:

counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.PRACTICES.OSS'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[85]:

	count
I include my real name.	1845
I usually contribute using a consistent pseudonym that is easily linked to my real name online.	766
I usually contribute using a consistent pseudonym that is not linked anywhere with my real name online.	273
I take precautions to use different usernames in different projects.	42

In [86]:

prop_df = pd.DataFrame((survey_df['TRANSPARENCY.PRIVACY.PRACTICES.OSS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[86]:

	percent
I include my real name.	63.06%
I usually contribute using a consistent pseudonym that is easily linked to my real name online.	26.18%
I usually contribute using a consistent pseudonym that is not linked anywhere with my real name online.	9.33%
I take precautions to use different usernames in different projects.	1.44%

In [87]:

plot_counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.PRACTICES.OSS'].value_counts())
idx = ['I include my real name.',
       'I usually contribute using a consistent pseudonym\nthat is easily linked to my real name online.',
       'I usually contribute using a consistent pseudonym\nthat is not linked anywhere with my real name online.',
       'I take precautions to use different\nusernames in different projects.']
plot_counts_df.index = idx

In [88]:

ax = plot_counts_df.plot(kind='barh', figsize=[12,6])
plt.title("Which of the following best describes your\npractices when making open source contributions?")
t = ax.set_xlabel("Count of responses")
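The hand-written label lists in the two privacy-practice plots have to match the order returned by value_counts(), and they can drift from the actual response text. A minimal sketch of wrapping the labels automatically with the standard-library textwrap module; the width of 45 characters is an arbitrary choice, and the responses are a toy sample rather than survey data:

```python
import textwrap
import pandas as pd

# Toy responses using two of the answer options from the question above.
toy = pd.Series([
    "I usually contribute using a consistent pseudonym that is easily linked to my real name online.",
    "I usually contribute using a consistent pseudonym that is easily linked to my real name online.",
    "I include my real name.",
])

counts = toy.value_counts()
# Wrap each label at 45 characters so long answers break across lines in a plot.
counts.index = [textwrap.fill(str(label), width=45) for label in counts.index]
print(counts)
```

Because the wrapped labels are derived from the index itself, they stay attached to the right counts regardless of sort order.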

Mentorship / Help¶

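RECEIVED.HELP: Have you ever received any kind of help from other people related to using or contributing to an open source project?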

In [89]:

counts_df = pd.DataFrame(survey_df['RECEIVED.HELP'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[89]:

	count
Yes	2845
No	1064

In [90]:

prop_df = pd.DataFrame((survey_df['RECEIVED.HELP'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[90]:

	percent
Yes	72.78%
No	27.22%

In [91]:

ax = pd.DataFrame(survey_df['RECEIVED.HELP'].value_counts()).plot(kind='barh')
plt.title("Have you ever received any kind of help from other people\nrelated to using or contributing to an open source project?")
t = ax.set_xlabel("Count of responses")

Thinking of the most recent case where someone helped you, how did you find someone to help you?¶

FIND.HELPER

In [92]:

counts_df = pd.DataFrame(survey_df['FIND.HELPER'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[92]:

	count
I asked for help in a public forum (e.g. in a GitHub Issue, project mailing list, etc.) and someone responded.	2057
I asked a specific person for help.	403
Someone offered me unsolicited help.	272
Other - Please describe	64

In [93]:

prop_df = pd.DataFrame((survey_df['FIND.HELPER'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[93]:

	percent
I asked for help in a public forum (e.g. in a GitHub Issue, project mailing list, etc.) and someone responded.	73.57%
I asked a specific person for help.	14.41%
Someone offered me unsolicited help.	9.73%
Other - Please describe	2.29%

In [94]:

ax = pd.DataFrame(survey_df['FIND.HELPER'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
plt.title("How did you find someone to help you?")
t = ax.set_yticklabels(['I asked for help in a public forum\n(e.g. in a GitHub Issue, project mailing list, etc.)\nand someone responded.',
 'I asked a specific person for help.',
 'Someone offered me unsolicited help.',
 'Other - Please describe'])

Which best describes your prior relationship with the person who helped you?¶

HELPER.PRIOR.RELATIONSHIP

In [95]:

counts_df = pd.DataFrame(survey_df['HELPER.PRIOR.RELATIONSHIP'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[95]:

	count
Total strangers, I didn't know of them previously.	1565
I knew of them through their contributions to projects, but didn't know them personally.	809
We knew each other a little.	211
We knew each other well.	208

In [96]:

prop_df = pd.DataFrame((survey_df['HELPER.PRIOR.RELATIONSHIP'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[96]:

	percent
Total strangers, I didn't know of them previously.	56.03%
I knew of them through their contributions to projects, but didn't know them personally.	28.97%
We knew each other a little.	7.55%
We knew each other well.	7.45%

In [97]:

ax = pd.DataFrame(survey_df['HELPER.PRIOR.RELATIONSHIP'].value_counts()).plot(kind='barh')
plt.title("Which best describes your prior\nrelationship with the person who helped you?")
t = ax.set_xlabel("Count of responses")

What kind of problem did they help you with?¶

RECEIVED.HELP.TYPE

In [98]:

counts_df = pd.DataFrame(survey_df['RECEIVED.HELP.TYPE'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[98]:

	count
Writing code or otherwise implementing ideas.	1633
Installing or using an application.	820
Understanding community norms (e.g. how to submit a contribution, how to communicate effectively).	181
Other (please describe)	142
Introductions to other people	13

In [99]:

prop_df = pd.DataFrame((survey_df['RECEIVED.HELP.TYPE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[99]:

	percent
Writing code or otherwise implementing ideas.	58.55%
Installing or using an application.	29.40%
Understanding community norms (e.g. how to submit a contribution, how to communicate effectively).	6.49%
Other (please describe)	5.09%
Introductions to other people	0.47%

In [100]:

ax = pd.DataFrame(survey_df['RECEIVED.HELP.TYPE'].value_counts()).plot(kind='barh')
plt.title("What kind of problem did they help you with?")
t = ax.set_xlabel("Count of responses")

t = ax.set_yticklabels(['Writing code or otherwise implementing ideas.',
       'Installing or using an application.',
       'Understanding community norms (e.g. how to submit\na contribution, how to communicate effectively).',
       'Other (please describe)', 'Introductions to other people'])

Have you ever provided help for another person on an open source project?¶

PROVIDED.HELP

In [101]:

counts_df = pd.DataFrame(survey_df['PROVIDED.HELP'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[101]:

	count
Yes	2891
No	1013

In [102]:

prop_df = pd.DataFrame((survey_df['PROVIDED.HELP'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[102]:

	percent
Yes	74.05%
No	25.95%

In [103]:

ax = pd.DataFrame(survey_df['PROVIDED.HELP'].value_counts()).plot(kind='barh')
plt.title("Have you ever provided help for another\nperson on an open source project?")
t = ax.set_xlabel("Count of responses")

Thinking of the most recent case where you helped someone, how did you come to help this person?¶

FIND.HELPEE

In [104]:

counts_df = pd.DataFrame(survey_df['FIND.HELPEE'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[104]:

	count
They asked for help in a public forum (e.g. in a GitHub Issue, project mailing list, etc.) and I responded.	1839
They asked me directly for help.	566
I reached out to them to offer unsolicited help.	405
Other (please describe)	28

In [105]:

prop_df = pd.DataFrame((survey_df['FIND.HELPEE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[105]:

	percent
They asked for help in a public forum (e.g. in a GitHub Issue, project mailing list, etc.) and I responded.	64.80%
They asked me directly for help.	19.94%
I reached out to them to offer unsolicited help.	14.27%
Other (please describe)	0.99%

In [106]:

ax = pd.DataFrame(survey_df['FIND.HELPEE'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("How did you come to help this person?")
t = ax.set_yticklabels(['They asked for help in a public forum\n(e.g. in a GitHub Issue, project mailing list, etc.)\nand I responded.',
       'They asked me directly for help.',
       'I reached out to them to offer unsolicited help.',
       'Other (please describe)'])

Which best describes your prior relationship with the person you helped?¶

HELPEE.PRIOR.RELATIONSHIP

In [107]:

counts_df = pd.DataFrame(survey_df['HELPEE.PRIOR.RELATIONSHIP'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[107]:

	count
Total strangers, I didn't know of them previously.	1984
We knew each other well.	292
I knew of them through their contributions to projects, but didn't know them personally.	288
We knew each other a little.	275

In [108]:

prop_df = pd.DataFrame((survey_df['HELPEE.PRIOR.RELATIONSHIP'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[108]:

	percent
Total strangers, I didn't know of them previously.	69.88%
We knew each other well.	10.29%
I knew of them through their contributions to projects, but didn't know them personally.	10.14%
We knew each other a little.	9.69%

In [109]:

ax = pd.DataFrame(survey_df['HELPEE.PRIOR.RELATIONSHIP'].value_counts()).plot(kind='barh')
plt.title("Which best describes your prior\nrelationship with the person you helped?")
t = ax.set_xlabel("Count of responses")

What kind of problem did you help them with?¶

PROVIDED.HELP.TYPE

In [110]:

counts_df = pd.DataFrame(survey_df['PROVIDED.HELP.TYPE'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[110]:

	count
Writing code or otherwise implementing ideas.	1602
Installing or using an application.	1028
Other (please describe)	101
Understanding community norms (e.g. how to submit a contribution, how to communicate effectively).	99
Introductions to other people.	8

In [111]:

prop_df = pd.DataFrame((survey_df['PROVIDED.HELP.TYPE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[111]:

	percent
Writing code or otherwise implementing ideas.	56.45%
Installing or using an application.	36.22%
Other (please describe)	3.56%
Understanding community norms (e.g. how to submit a contribution, how to communicate effectively).	3.49%
Introductions to other people.	0.28%

In [112]:

ax = pd.DataFrame(survey_df['PROVIDED.HELP.TYPE'].value_counts()).plot(kind='barh')
plt.title("What kind of problem did you help them with?")
t = ax.set_xlabel("Count of responses")
t = ax.set_yticklabels(['Writing code or otherwise implementing ideas.',
       'Installing or using an application.', 'Other (please describe)',
       'Understanding community norms (e.g. how to\nsubmit a contribution, how to communicate effectively).',
       'Introductions to other people.'])

Open Source Software in Paid Work¶

Do you contribute to open source as part of your professional work?¶

OSS.AS.JOB: Do you contribute to open source as part of your professional work? In other words, are you paid for any of your time spent on open source contributions?

Yes, indirectly- I contribute to open source in carrying out my work duties, but I am not required or expected to do so.
No.
Yes, directly- some or all of my work duties include contributing to open source projects.

In [113]:

counts_df = pd.DataFrame(survey_df['OSS.AS.JOB'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[113]:

	count
Yes, indirectly- I contribute to open source in carrying out my work duties, but I am not required or expected to do so.	896
No.	687
Yes, directly- some or all of my work duties include contributing to open source projects.	464

In [114]:

prop_df = pd.DataFrame((survey_df['OSS.AS.JOB'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[114]:

	percent
Yes, indirectly- I contribute to open source in carrying out my work duties, but I am not required or expected to do so.	43.77%
No.	33.56%
Yes, directly- some or all of my work duties include contributing to open source projects.	22.67%

In [115]:

oss_as_job_df = pd.DataFrame(survey_df['OSS.AS.JOB'].value_counts())
oss_as_job_df.index = ["Yes, indirectly", "No", "Yes, directly"]
ax = oss_as_job_df.plot(kind='barh')
plt.title("Do you contribute to open source\nas part of your professional work?")
t = ax.set_xlabel("Count of responses")
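Assigning the short labels by position, as in the cell above, relies on value_counts() returning the answers in exactly that order. A minimal sketch of an order-independent alternative that renames the index with a mapping from the full answer text; the counts come from a toy series, not the survey, and short_labels is an illustrative name:

```python
import pandas as pd

# Map each full answer to its short plot label (answer text as listed above).
short_labels = {
    "Yes, indirectly- I contribute to open source in carrying out my work duties, "
    "but I am not required or expected to do so.": "Yes, indirectly",
    "No.": "No",
    "Yes, directly- some or all of my work duties include contributing to "
    "open source projects.": "Yes, directly",
}

full = list(short_labels)  # the three full answer strings
# Toy responses: 2 indirect, 3 no, 1 direct (invented counts).
toy = pd.Series([full[0]] * 2 + [full[1]] * 3 + [full[2]])

counts = toy.value_counts().rename(index=short_labels)
print(counts)
```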

How often do you use open source software in your professional work?¶

OSS.AT.WORK

In [116]:

counts_df = pd.DataFrame(survey_df['OSS.AT.WORK'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[116]:

	count
Frequently	2191
Sometimes	300
Rarely	110
Never	65

In [117]:

prop_df = pd.DataFrame((survey_df['OSS.AT.WORK'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[117]:

	percent
Frequently	82.18%
Sometimes	11.25%
Rarely	4.13%
Never	2.44%

In [118]:

ax = pd.DataFrame(survey_df['OSS.AT.WORK'].value_counts()).plot(kind='barh')
plt.title("How often do you use open source\nsoftware in your professional work?")
t = ax.set_xlabel("Count of responses")

How does your employer's intellectual property agreement/policy affect your free-time contributions to open source unrelated to your work?¶

OSS.IP.POLICY

In [119]:

counts_df = pd.DataFrame(survey_df['OSS.IP.POLICY'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[119]:

	count
I am free to contribute without asking for permission.	1178
My employer doesn't have a clear policy on this.	695
I am permitted to contribute to open source, but need to ask for permission.	287
I'm not sure.	238
Not applicable	180
I am not permitted to contribute to open source at all.	63

In [120]:

prop_df = pd.DataFrame((survey_df['OSS.IP.POLICY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[120]:

	percent
I am free to contribute without asking for permission.	44.60%
My employer doesn't have a clear policy on this.	26.32%
I am permitted to contribute to open source, but need to ask for permission.	10.87%
I'm not sure.	9.01%
Not applicable	6.82%
I am not permitted to contribute to open source at all.	2.39%

In [121]:

ax = pd.DataFrame(survey_df['OSS.IP.POLICY'].value_counts()).plot(kind='barh')
plt.title("How does your employer's intellectual property\nagreement/policy affect your free-time contributions\nto open source unrelated to your work?")
t = ax.set_xlabel("Count of responses")

Which is closest to your employer’s policy on using open source software applications?¶

EMPLOYER.POLICY.APPLICATIONS

In [122]:

counts_df = pd.DataFrame(survey_df['EMPLOYER.POLICY.APPLICATIONS'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[122]:

	count
Use of open source applications is encouraged.	1174
Use of open source applications is acceptable if it is the most appropriate tool.	916
My employer doesn't have a clear policy on this.	338
Not applicable	88
I'm not sure.	83
Use of open source applications is rarely, if ever, permitted.	42

In [123]:

prop_df = pd.DataFrame((survey_df['EMPLOYER.POLICY.APPLICATIONS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[123]:

	percent
Use of open source applications is encouraged.	44.45%
Use of open source applications is acceptable if it is the most appropriate tool.	34.68%
My employer doesn't have a clear policy on this.	12.80%
Not applicable	3.33%
I'm not sure.	3.14%
Use of open source applications is rarely, if ever, permitted.	1.59%

In [124]:

ax = pd.DataFrame(survey_df['EMPLOYER.POLICY.APPLICATIONS'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Which is closest to your employer’s policy\non using open source software applications?")

How important do you think your involvement in open source was to getting your current job?¶

OSS.HIRING

In [125]:

counts_df = pd.DataFrame(survey_df['OSS.HIRING'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[125]:

	count
Very important	618
Somewhat important	448
Not at all important	361
Not too important	352
Not applicable-I hadn't made any contributions when I got this job.	254

In [126]:

prop_df = pd.DataFrame((survey_df['OSS.HIRING'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[126]:

	percent
Very important	30.40%
Somewhat important	22.04%
Not at all important	17.76%
Not too important	17.31%
Not applicable-I hadn't made any contributions when I got this job.	12.49%

In [127]:

ax = pd.DataFrame(survey_df['OSS.HIRING'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("How important do you think your involvement\nin open source was to getting your current job?")

Demographics¶

Do you currently live in a country other than the one in which you were born?¶

IMMIGRATION

In [128]:

counts_df = pd.DataFrame(survey_df['IMMIGRATION'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[128]:

	count
No, I live in the country where I was born.	2764
Yes, and I intend to stay permanently.	513
Yes, and I am not sure about my future plans.	292
Yes, and I intend to stay temporarily.	165

In [129]:

prop_df = pd.DataFrame((survey_df['IMMIGRATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[129]:

	percent
No, I live in the country where I was born.	74.02%
Yes, and I intend to stay permanently.	13.74%
Yes, and I am not sure about my future plans.	7.82%
Yes, and I intend to stay temporarily.	4.42%

In [130]:

ax = pd.DataFrame(survey_df['IMMIGRATION'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Do you currently live in a country other\nthan the one in which you were born?")

Thinking of where you were born, are you a member of an ethnicity or nationality that is considered a minority in that country?¶

MINORITY.HOMECOUNTRY

In [131]:

counts_df = pd.DataFrame(survey_df['MINORITY.HOMECOUNTRY'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[131]:

	count
No	754
Yes	124
Not sure	45
Prefer not to say	34

In [132]:

prop_df = pd.DataFrame((survey_df['MINORITY.HOMECOUNTRY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[132]:

	percent
No	78.79%
Yes	12.96%
Not sure	4.70%
Prefer not to say	3.55%

In [133]:

ax = pd.DataFrame(survey_df['MINORITY.HOMECOUNTRY'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Thinking of where you were born, are\nyou a member of an ethnicity or nationality that\nis considered a minority in that country?")

Thinking of where you currently live, are you a member of an ethnicity or nationality that is considered a minority in that country?¶

MINORITY.CURRENT.COUNTRY

In [134]:

counts_df = pd.DataFrame(survey_df['MINORITY.CURRENT.COUNTRY'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[134]:

	count
No	2837
Yes	546
Not sure	193
Prefer not to say	156

In [135]:

prop_df = pd.DataFrame((survey_df['MINORITY.CURRENT.COUNTRY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[135]:

	percent
No	76.02%
Yes	14.63%
Not sure	5.17%
Prefer not to say	4.18%

In [136]:

ax = pd.DataFrame(survey_df['MINORITY.CURRENT.COUNTRY'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Thinking of where you currently live, are you\na member of an ethnicity or nationality that is\nconsidered a minority in that country?")

What is your gender?¶

GENDER

In [137]:

counts_df = pd.DataFrame(survey_df['GENDER'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[137]:

	count
Man	3387
Prefer not to say	173
Woman	125
Non-binary or Other	39

In [138]:

prop_df = pd.DataFrame((survey_df['GENDER'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[138]:

	percent
Man	90.95%
Prefer not to say	4.65%
Woman	3.36%
Non-binary or Other	1.05%

In [139]:

ax = pd.DataFrame(survey_df['GENDER'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("What is your gender?")

Do you identify as transgender?¶

TRANSGENDER.IDENTITY

In [140]:

counts_df = pd.DataFrame(survey_df['TRANSGENDER.IDENTITY'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[140]:

	count
No	3494
Prefer not to say	158
Yes	33
Not sure	30

In [141]:

prop_df = pd.DataFrame((survey_df['TRANSGENDER.IDENTITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[141]:

	percent
No	94.05%
Prefer not to say	4.25%
Yes	0.89%
Not sure	0.81%

In [142]:

ax = pd.DataFrame(survey_df['TRANSGENDER.IDENTITY'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Do you identify as transgender?")

Do you identify as gay, lesbian, or bisexual, asexual, or any other minority sexual orientation?¶

SEXUAL.ORIENTATION

In [143]:

counts_df = pd.DataFrame(survey_df['SEXUAL.ORIENTATION'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[143]:

	count
No	3187
Yes	246
Prefer not to say	201
Not sure	85

In [144]:

prop_df = pd.DataFrame((survey_df['SEXUAL.ORIENTATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[144]:

	percent
No	85.70%
Yes	6.61%
Prefer not to say	5.40%
Not sure	2.29%

In [145]:

ax = pd.DataFrame(survey_df['SEXUAL.ORIENTATION'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Do you identify as gay, lesbian, or bisexual,\nasexual, or any other minority sexual orientation?")

How well can you read and write in English?¶

WRITTEN.ENGLISH

In [146]:

counts_df = pd.DataFrame(survey_df['WRITTEN.ENGLISH'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[146]:

	count
Very well	2865
Moderately well	742
Not very well	108
Not at all	6

In [147]:

prop_df = pd.DataFrame((survey_df['WRITTEN.ENGLISH'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[147]:

	percent
Very well	77.00%
Moderately well	19.94%
Not very well	2.90%
Not at all	0.16%

In [148]:

ax = pd.DataFrame(survey_df['WRITTEN.ENGLISH'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("How well can you read and write in English?")

What is your age?¶

AGE

In [149]:

counts_df = pd.DataFrame(survey_df['AGE'].value_counts().sort_index())
counts_df.columns = ["count"]
counts_df

Out[149]:

	count
17 or younger	139
18 to 24 years	871
25 to 34 years	1400
35 to 44 years	772
45 to 54 years	267
55 to 64 years	93
65 years or older	36

In [150]:

prop_df = pd.DataFrame((survey_df['AGE'].value_counts(normalize=True).round(4)*100).sort_index())
prop_df.columns=["percent"]
prop_df

Out[150]:

	percent
17 or younger	3.88%
18 to 24 years	24.34%
25 to 34 years	39.13%
35 to 44 years	21.58%
45 to 54 years	7.46%
55 to 64 years	2.60%
65 years or older	1.01%

In [151]:

ax = pd.DataFrame(survey_df['AGE'].value_counts().sort_index()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("What is your age?")

What is the highest level of formal education that you have completed?¶

FORMAL.EDUCATION

In [152]:

counts_df = pd.DataFrame(survey_df['FORMAL.EDUCATION'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[152]:

	count
Bachelor's degree	1321
Master's degree	852
Some college, no degree	640
Secondary (high) school graduate or equivalent	375
Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)	256
Vocational/trade program or apprenticeship	127
Less than secondary (high) school	126

In [153]:

prop_df = pd.DataFrame((survey_df['FORMAL.EDUCATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[153]:

	percent
Bachelor's degree	35.73%
Master's degree	23.05%
Some college, no degree	17.31%
Secondary (high) school graduate or equivalent	10.14%
Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)	6.92%
Vocational/trade program or apprenticeship	3.44%
Less than secondary (high) school	3.41%

In [154]:

order = ["Less than secondary (high) school",
         "Secondary (high) school graduate or equivalent",
         "Vocational/trade program or apprenticeship",
         "Some college, no degree",
         "Bachelor's degree",
         "Master's degree",
         "Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)"]

edu_counts_df = survey_df['FORMAL.EDUCATION'].value_counts()[order]

ax = edu_counts_df.plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("What is the highest level of formal education that you have completed?")
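
Indexing the counts with `[order]` assumes that every label in `order` occurs at least once in the data; in recent pandas versions a label with no responses raises a `KeyError`. A hedged alternative, shown here on toy data with a shortened category list, is `reindex` with `fill_value=0`, which keeps the requested order and reports empty categories as zero.

```python
import pandas as pd

order = ["Less than secondary (high) school",
         "Secondary (high) school graduate or equivalent",
         "Bachelor's degree"]

# Toy responses (illustrative only); nobody picked the first category
responses = pd.Series(["Bachelor's degree", "Bachelor's degree",
                       "Secondary (high) school graduate or equivalent"])

# reindex keeps the requested order and fills absent categories with 0
ordered_counts = responses.value_counts().reindex(order, fill_value=0)
print(ordered_counts)
```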

What is the highest level of formal education that either of your parents completed?¶

PARENTS.FORMAL.EDUCATION

In [155]:

counts_df = pd.DataFrame(survey_df['PARENTS.FORMAL.EDUCATION'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[155]:

	count
Bachelor's degree	961
Master's degree	871
Secondary (high) school graduate or equivalent	566
Some college, no degree	388
Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)	387
Vocational/trade program or apprenticeship	257
Less than secondary (high) school	243

In [156]:

prop_df = pd.DataFrame((survey_df['PARENTS.FORMAL.EDUCATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[156]:

	percent
Bachelor's degree	26.16%
Master's degree	23.71%
Secondary (high) school graduate or equivalent	15.41%
Some college, no degree	10.56%
Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)	10.54%
Vocational/trade program or apprenticeship	7.00%
Less than secondary (high) school	6.62%

In [157]:

order = ["Less than secondary (high) school",
         "Secondary (high) school graduate or equivalent",
         "Vocational/trade program or apprenticeship",
         "Some college, no degree",
         "Bachelor's degree",
         "Master's degree",
         "Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)"]

edu_counts_df = survey_df['PARENTS.FORMAL.EDUCATION'].value_counts()[order]

ax = edu_counts_df.plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("What is the highest level of formal education that either of your parents completed?")

How old were you when you first had regular access to a computer with an internet connection?¶

AGE.AT.FIRST.COMPUTER.INTERNET

In [158]:

counts_df = pd.DataFrame(survey_df['AGE.AT.FIRST.COMPUTER.INTERNET'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[158]:

	count
Younger than 13 years old	1478
13 - 17 years old	1313
18 - 24 years old	695
25 - 45 years old	202
Older than 45 years old	23

In [159]:

prop_df = pd.DataFrame((survey_df['AGE.AT.FIRST.COMPUTER.INTERNET'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[159]:

	percent
Younger than 13 years old	39.83%
13 - 17 years old	35.38%
18 - 24 years old	18.73%
25 - 45 years old	5.44%
Older than 45 years old	0.62%

In [160]:

ax = pd.DataFrame(survey_df['AGE.AT.FIRST.COMPUTER.INTERNET'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("How old were you when you first had regular\naccess to a computer with an internet connection?")

Where did you first have regular access to a computer with internet connection?¶

LOCATION.OF.FIRST.COMPUTER.INTERNET

In [161]:

counts_df = pd.DataFrame(survey_df['LOCATION.OF.FIRST.COMPUTER.INTERNET'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[161]:

	count
At home (belonging to me or a family member)	2520
In a classroom, computer lab, or library at school	746
At an internet cafe or similar space	182
Other (please describe)	106
At a public library or community center	87
At work (recoded from open ended)	70

In [162]:

prop_df = pd.DataFrame((survey_df['LOCATION.OF.FIRST.COMPUTER.INTERNET'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[162]:

	percent
At home (belonging to me or a family member)	67.91%
In a classroom, computer lab, or library at school	20.10%
At an internet cafe or similar space	4.90%
Other (please describe)	2.86%
At a public library or community center	2.34%
At work (recoded from open ended)	1.89%

In [163]:

ax = pd.DataFrame(survey_df['LOCATION.OF.FIRST.COMPUTER.INTERNET'].value_counts()).plot(kind='barh', figsize=[8.5,6])
ax.set_xlabel("Count of responses")
t = plt.title("Where did you first have regular access to a\ncomputer with internet connection?")

Where was the respondent surveyed from?¶

POPULATION

In [164]:

counts_df = pd.DataFrame(survey_df['POPULATION'].value_counts())
counts_df.columns = ["count"]
counts_df

Out[164]:

	count
github	5495
off site community	534

In [165]:

prop_df = pd.DataFrame((survey_df['POPULATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df

Out[165]:

	percent
github	91.14%
off site community	8.86%

In [166]:

ax = pd.DataFrame(survey_df['POPULATION'].value_counts()).plot(kind='barh', figsize=[8.5,6])
ax.set_xlabel("Count of responses")
t = plt.title("Where was the respondent surveyed from?")

Harassment / Inclusiveness of OSS¶

Have you ever observed any of the following in the context of an open source project?¶

DISCOURAGING.BEHAVIOR.*

In [167]:

discouraging_responses = survey_df[discouraging_vars].apply(pd.Series.value_counts).transpose()[["Yes", "No"]]
discouraging_responses

Out[167]:

	Yes	No
DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE	3017	792
DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION	1210	2580
DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE	2195	1598
DISCOURAGING.BEHAVIOR.BAD.DOCS	3559	263
DISCOURAGING.BEHAVIOR.CONFLICT	1830	1966
DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE	649	3158

In [168]:

discouraging_percent = pd.DataFrame(discouraging_responses["Yes"] / discouraging_responses.sum(axis=1) * 100, columns=["percent_yes"]).sort_values(by="percent_yes")
discouraging_percent.round(2)

Out[168]:

	percent_yes
DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE	17.05%
DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION	31.93%
DISCOURAGING.BEHAVIOR.CONFLICT	48.21%
DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE	57.87%
DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE	79.21%
DISCOURAGING.BEHAVIOR.BAD.DOCS	93.12%

In [169]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
discouraging_responses.sort_values(by="No").plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[22:].replace(".", " ") # cut off "DISCOURAGING.BEHAVIOR."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("Have you ever observed any of the following in the context of an open source project?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')
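
The tick labels above are shortened by slicing off a fixed number of characters, which has to match the length of the variable prefix exactly (22 characters for `DISCOURAGING.BEHAVIOR.`). As a sketch of a less brittle approach, the prefix can be removed by name; `label_from_variable` is a hypothetical helper, not something defined in the notebook.

```python
def label_from_variable(name, prefix):
    """Strip a known prefix and turn dots into spaces (hypothetical helper)."""
    if name.startswith(prefix):
        name = name[len(prefix):]
    return name.replace(".", " ")

label = label_from_variable("DISCOURAGING.BEHAVIOR.BAD.DOCS",
                            "DISCOURAGING.BEHAVIOR.")
print(label)
```

The same helper would work for the other panels by passing their own prefix, so no character counts need to be kept in sync.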

Have you ever witnessed any of the following behaviors directed at another person in the context of an open source project? (not including something directed at you)¶

NEGATIVE.WITNESS.*

In [170]:

neg_witness_responses = neg_df[neg_witness_vars].apply(pd.Series.value_counts).transpose()[[1,0]]
neg_witness_responses.columns = ["Yes", "Blank"]
neg_witness_responses

Out[170]:

	Yes	Blank
NEGATIVE.WITNESS.RUDENESS	1753	4276
NEGATIVE.WITNESS.NAME.CALLING	789	5240
NEGATIVE.WITNESS.THREATS	162	5867
NEGATIVE.WITNESS.IMPERSONATION	177	5852
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT	237	5792
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT	175	5854
NEGATIVE.WITNESS.STALKING	108	5921
NEGATIVE.WITNESS.SEXUAL.ADVANCES	136	5893
NEGATIVE.WITNESS.STEREOTYPING	423	5606
NEGATIVE.WITNESS.DOXXING	151	5878
NEGATIVE.WITNESS.OTHER	78	5951
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE	1721	4308
NEGATIVE.WITNESS.ANY.RESPONSE	3664	2365

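Only 3,664 respondents ticked at least one box in this question (counting "none of the above" as a box), meaning the other 2,365 either skipped the question or never reached it. Because every unticked box is recorded as blank, we subtract those 2,365 non-respondents from each blank count so that the remaining blanks can be read as "No" answers.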
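
To make the adjustment concrete, here is the arithmetic for a single row, using the counts shown in the table above. The variable names are illustrative only.

```python
total_rows = 6029          # Yes + Blank for any row in the table above
answered_question = 3664   # NEGATIVE.WITNESS.ANY.RESPONSE, "Yes" column
skipped_question = total_rows - answered_question   # 2365

rudeness_yes = 1753
rudeness_blank = 4276

# Blanks from people who did answer the question count as "No"
rudeness_no = rudeness_blank - skipped_question      # 1911
percent_yes = round(rudeness_yes / (rudeness_yes + rudeness_no) * 100, 2)
print(rudeness_no, percent_yes)
```

The denominator `rudeness_yes + rudeness_no` equals 3,664, so each percentage is a share of the people who answered this question.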

In [171]:

neg_witness_responses_adj = neg_witness_responses.copy()  # copy so the raw counts are not modified in place
neg_witness_responses_adj["Blank"] = neg_witness_responses_adj["Blank"] - 2365
neg_witness_responses_adj_df = pd.DataFrame(neg_witness_responses_adj["Yes"] / (neg_witness_responses_adj["Yes"] + neg_witness_responses_adj["Blank"]) * 100, columns=["percent_yes"])

In [172]:

neg_witness_responses_adj.columns = ["Yes", "No"]
neg_witness_responses_adj[:-1]

Out[172]:

	Yes	No
NEGATIVE.WITNESS.RUDENESS	1753	1911
NEGATIVE.WITNESS.NAME.CALLING	789	2875
NEGATIVE.WITNESS.THREATS	162	3502
NEGATIVE.WITNESS.IMPERSONATION	177	3487
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT	237	3427
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT	175	3489
NEGATIVE.WITNESS.STALKING	108	3556
NEGATIVE.WITNESS.SEXUAL.ADVANCES	136	3528
NEGATIVE.WITNESS.STEREOTYPING	423	3241
NEGATIVE.WITNESS.DOXXING	151	3513
NEGATIVE.WITNESS.OTHER	78	3586
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE	1721	1943

In [173]:

neg_witness_responses_adj_df.sort_values(by="percent_yes").round(2)

Out[173]:

	percent_yes
NEGATIVE.WITNESS.OTHER	2.13%
NEGATIVE.WITNESS.STALKING	2.95%
NEGATIVE.WITNESS.SEXUAL.ADVANCES	3.71%
NEGATIVE.WITNESS.DOXXING	4.12%
NEGATIVE.WITNESS.THREATS	4.42%
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT	4.78%
NEGATIVE.WITNESS.IMPERSONATION	4.83%
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT	6.47%
NEGATIVE.WITNESS.STEREOTYPING	11.54%
NEGATIVE.WITNESS.NAME.CALLING	21.53%
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE	46.97%
NEGATIVE.WITNESS.RUDENESS	47.84%
NEGATIVE.WITNESS.ANY.RESPONSE	100.00%

In [174]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
neg_witness_responses_adj[:-1].sort_values(by='No').plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[17:].replace(".", " ") # cut off "NEGATIVE.WITNESS."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("Have you ever witnessed any of the following behaviors\ndirected at another person in the context of an open source\nproject? (not including something directed at you)")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

Have you ever experienced any of the following behaviors directed at you in the context of an open source project?¶

NEGATIVE.EXPERIENCE.*

In [175]:

neg_exp_responses = neg_df[neg_exp_vars].apply(pd.Series.value_counts).transpose()[[1,0]]
neg_exp_responses.columns = ["Yes", "Blank"]
neg_exp_responses

Out[175]:

	Yes	Blank
NEGATIVE.EXPERIENCE.RUDENESS	646	5383
NEGATIVE.EXPERIENCE.NAME.CALLING	192	5837
NEGATIVE.EXPERIENCE.THREATS	43	5986
NEGATIVE.EXPERIENCE.IMPERSONATION	45	5984
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT	55	5974
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT	42	5987
NEGATIVE.EXPERIENCE.STALKING	35	5994
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES	25	6004
NEGATIVE.EXPERIENCE.STEREOTYPING	114	5915
NEGATIVE.EXPERIENCE.DOXXING	23	6006
NEGATIVE.EXPERIENCE.OTHER	39	5990
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE	2900	3129
NEGATIVE.EXPERIENCE.ANY.RESPONSE	3638	2391

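Only 3,638 respondents ticked at least one box in this question (counting "none of the above" as a box), meaning the other 2,391 either skipped the question or never reached it. Because every unticked box is recorded as blank, we subtract those 2,391 non-respondents from each blank count so that the remaining blanks can be read as "No" answers.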
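
The same adjustment is applied to several questions with a different hard-coded number each time. As a sketch under stated assumptions, the number of non-respondents can instead be read from the `ANY.RESPONSE` row. The helper `adjust_for_nonresponse` is hypothetical, and the toy frame holds three rows copied from the table above.

```python
import pandas as pd

def adjust_for_nonresponse(responses, any_response_row):
    """Convert Yes/Blank counts into Yes/No counts among people who answered.

    Hypothetical helper: it reads the number of skipped respondents from
    the "Blank" count of the ANY.RESPONSE row and returns a new frame,
    leaving the input untouched.
    """
    skipped = responses.loc[any_response_row, "Blank"]
    adjusted = responses.copy()
    adjusted["No"] = adjusted.pop("Blank") - skipped
    adjusted["percent_yes"] = (
        adjusted["Yes"] / (adjusted["Yes"] + adjusted["No"]) * 100
    ).round(2)
    return adjusted

# Three rows copied from the table above
toy = pd.DataFrame(
    {"Yes": [646, 2900, 3638], "Blank": [5383, 3129, 2391]},
    index=["NEGATIVE.EXPERIENCE.RUDENESS",
           "NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE",
           "NEGATIVE.EXPERIENCE.ANY.RESPONSE"],
)
adjusted = adjust_for_nonresponse(toy, "NEGATIVE.EXPERIENCE.ANY.RESPONSE")
print(adjusted)
```

Returning a copy means the cell can be re-run without subtracting the non-respondents a second time.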

In [176]:

neg_exp_responses_adj = neg_exp_responses.copy()  # copy so the raw counts are not modified in place
neg_exp_responses_adj["Blank"] = neg_exp_responses_adj["Blank"] - 2391
neg_exp_responses_adj_df = pd.DataFrame(neg_exp_responses_adj["Yes"] / (neg_exp_responses_adj["Yes"] + neg_exp_responses_adj["Blank"]) * 100, columns=["percent_yes"])

In [177]:

neg_exp_responses_adj.columns = ["Yes", "No"]
neg_exp_responses_adj[:-1]

Out[177]:

	Yes	No
NEGATIVE.EXPERIENCE.RUDENESS	646	2992
NEGATIVE.EXPERIENCE.NAME.CALLING	192	3446
NEGATIVE.EXPERIENCE.THREATS	43	3595
NEGATIVE.EXPERIENCE.IMPERSONATION	45	3593
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT	55	3583
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT	42	3596
NEGATIVE.EXPERIENCE.STALKING	35	3603
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES	25	3613
NEGATIVE.EXPERIENCE.STEREOTYPING	114	3524
NEGATIVE.EXPERIENCE.DOXXING	23	3615
NEGATIVE.EXPERIENCE.OTHER	39	3599
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE	2900	738

In [178]:

neg_exp_responses_adj_df.sort_values(by="percent_yes").round(2)

Out[178]:

	percent_yes
NEGATIVE.EXPERIENCE.DOXXING	0.63%
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES	0.69%
NEGATIVE.EXPERIENCE.STALKING	0.96%
NEGATIVE.EXPERIENCE.OTHER	1.07%
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT	1.15%
NEGATIVE.EXPERIENCE.THREATS	1.18%
NEGATIVE.EXPERIENCE.IMPERSONATION	1.24%
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT	1.51%
NEGATIVE.EXPERIENCE.STEREOTYPING	3.13%
NEGATIVE.EXPERIENCE.NAME.CALLING	5.28%
NEGATIVE.EXPERIENCE.RUDENESS	17.76%
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE	79.71%
NEGATIVE.EXPERIENCE.ANY.RESPONSE	100.00%

In [179]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
neg_exp_responses_adj[:-1].sort_values(by='No').plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[20:].replace(".", " ") # cut off "NEGATIVE.EXPERIENCE."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("Have you ever experienced any of the following behaviors\ndirected at you in the context of an open source project?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

Thinking of the last time you experienced harassment, how did you respond?¶

NEGATIVE.RESPONSE.*

In [180]:

neg_resp_responses = neg_df[neg_resp_vars].apply(pd.Series.value_counts).transpose()[[1,0]]
neg_resp_responses.columns = ["Yes", "Blank"]
neg_resp_responses

Out[180]:

	Yes	Blank
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP	194	5835
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT	112	5917
NEGATIVE.RESPONSE.BLOCKED.USER	170	5859
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS	95	5934
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP	20	6009
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL	8	6021
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT	9	6020
NEGATIVE.RESPONSE.OTHER	71	5958
NEGATIVE.RESPONSE.IGNORED	350	5679
NEGATIVE.RESPONSE.ANY.RESPONSE	719	5310

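Only 719 respondents ticked at least one box in this question (counting "I did not react / ignored the incident" as a box), meaning the other 5,310 either skipped the question or never reached it. Because every unticked box is recorded as blank, we subtract those 5,310 non-respondents from each blank count so that the remaining blanks can be read as "No" answers.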

In [181]:

neg_resp_responses_adj = neg_resp_responses.copy()  # copy so the raw counts are not modified in place
neg_resp_responses_adj["Blank"] = neg_resp_responses_adj["Blank"] - 5310
neg_resp_responses_adj_df = pd.DataFrame(neg_resp_responses_adj["Yes"] / (neg_resp_responses_adj["Yes"] + neg_resp_responses_adj["Blank"]) * 100, columns=["percent_yes"])

In [182]:

neg_resp_responses_adj.columns = ["Yes", "No"]
neg_resp_responses_adj[:-1]

Out[182]:

	Yes	No
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP	194	525
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT	112	607
NEGATIVE.RESPONSE.BLOCKED.USER	170	549
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS	95	624
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP	20	699
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL	8	711
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT	9	710
NEGATIVE.RESPONSE.OTHER	71	648
NEGATIVE.RESPONSE.IGNORED	350	369

In [183]:

neg_resp_responses_adj_df.sort_values(by="percent_yes").round(2)

Out[183]:

	percent_yes
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL	1.11%
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT	1.25%
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP	2.78%
NEGATIVE.RESPONSE.OTHER	9.87%
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS	13.21%
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT	15.58%
NEGATIVE.RESPONSE.BLOCKED.USER	23.64%
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP	26.98%
NEGATIVE.RESPONSE.IGNORED	48.68%
NEGATIVE.RESPONSE.ANY.RESPONSE	100.00%

In [184]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
neg_resp_responses_adj[:-1].sort_values(by='No').plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[18:].replace(".", " ") # cut off "NEGATIVE.RESPONSE."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("Thinking of the last time you experienced\nharassment, how did you respond?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

How effective were the following responses?¶

RESPONSE.EFFECTIVENESS.*

In [185]:

neg_effect_responses = neg_df[neg_effect_vars].apply(pd.Series.value_counts).transpose()
neg_effect_responses = neg_effect_responses.replace(np.nan, 0).sort_values(by="Mostly effective")
neg_effect_responses = neg_effect_responses[["Not at all effective", "A little effective", "Somewhat effective", "Mostly effective", "Completely effective"]]

In [186]:

idx = []
for i in neg_effect_responses.index:
    idx.append(i[23:].replace(".", " "))
neg_effect_responses.index = idx
neg_effect_responses.astype(int)

Out[186]:

	Not at all effective	A little effective	Somewhat effective	Mostly effective	Completely effective
CONTACTED LAW ENFORCEMENT	4	0	2	0	3
CONSULTED LEGAL COUNSEL	1	1	3	2	1
REPORTED TO HOST OR ISP	6	4	6	3	1
OTHER	4	0	4	10	11
REPORTED TO MAINTAINERS	10	11	31	30	13
SOLICITED COMMUNITY SUPPORT	6	22	38	32	14
ASKED USER TO STOP	48	51	50	33	11
BLOCKED USER	6	20	28	56	58

In [187]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues
neg_effect_responses.plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

plt.title("How effective were the following responses?\n(counting number of responses)")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

In [188]:

neg_effect_responses_prop = neg_df[neg_effect_vars].apply(pd.Series.value_counts, normalize=True).round(4).transpose()
neg_effect_responses_prop = neg_effect_responses_prop.replace(np.nan, 0).sort_values(by="Completely effective")
neg_effect_responses_prop = neg_effect_responses_prop[["Not at all effective", "A little effective", "Somewhat effective", "Mostly effective", "Completely effective"]]
neg_effect_responses_prop = neg_effect_responses_prop * 100

In [189]:

idx = []
for i in neg_effect_responses_prop.index:
    idx.append(i[23:].replace(".", " "))
neg_effect_responses_prop.index = idx
neg_effect_responses_prop

Out[189]:

	Not at all effective	A little effective	Somewhat effective	Mostly effective	Completely effective
REPORTED TO HOST OR ISP	30.00%	20.00%	30.00%	15.00%	5.00%
ASKED USER TO STOP	24.87%	26.42%	25.91%	17.10%	5.70%
SOLICITED COMMUNITY SUPPORT	5.36%	19.64%	33.93%	28.57%	12.50%
CONSULTED LEGAL COUNSEL	12.50%	12.50%	37.50%	25.00%	12.50%
REPORTED TO MAINTAINERS	10.53%	11.58%	32.63%	31.58%	13.68%
CONTACTED LAW ENFORCEMENT	44.44%	0.00%	22.22%	0.00%	33.33%
BLOCKED USER	3.57%	11.90%	16.67%	33.33%	34.52%
OTHER	13.79%	0.00%	13.79%	34.48%	37.93%
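
Each percentage in this table is a count divided by the total for its own row, so rows are not comparable in absolute terms. As a quick check, the BLOCKED USER row can be recomputed by hand from the counts table shown earlier. The variable names are illustrative.

```python
import pandas as pd

# Counts for one row of the counts table (BLOCKED USER)
blocked_user = pd.Series(
    [6, 20, 28, 56, 58],
    index=["Not at all effective", "A little effective", "Somewhat effective",
           "Mostly effective", "Completely effective"],
)

# Each percentage is the count divided by that row's own total
blocked_user_percent = (blocked_user / blocked_user.sum() * 100).round(2)
print(blocked_user_percent)
```

Row totals differ widely (9 ratings for CONTACTED LAW ENFORCEMENT versus 193 for ASKED USER TO STOP), so percentages for the rarely chosen responses rest on very few answers.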

In [190]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues
neg_effect_responses_prop.plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

plt.title("How effective were the following responses?\n(proportion of responses)")

plt.xlabel("Proportion of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

As a result of experiencing or witnessing harassment, which, if any, of the following have you done?¶

NEGATIVE.CONSEQUENCES.*

In [191]:

neg_conseq_responses = neg_df[neg_conseq_vars].apply(pd.Series.value_counts).transpose()[[1,0]]
neg_conseq_responses.columns = ["Yes", "Blank"]
neg_conseq_responses

Out[191]:

	Yes	Blank
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING	390	5639
NEGATIVE.CONSEQUENCES.PSEUDONYM	50	5979
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE	166	5863
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME	48	5981
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE	79	5950
NEGATIVE.CONSEQUENCES.SUGGEST.COC	116	5913
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION	301	5728
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION	248	5781
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES	85	5944
NEGATIVE.CONSEQUENCES.OTHER	90	5939
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE	1094	4935
NEGATIVE.CONSEQUENCES.ANY.RESPONSE	1953	4076

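Only 1,953 respondents ticked at least one box in this question (counting "none of the above" as a box), meaning the other 4,076 either skipped the question or never reached it. Because every unticked box is recorded as blank, we subtract those 4,076 non-respondents from each blank count so that the remaining blanks can be read as "No" answers.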

In [192]:

neg_conseq_responses_adj = neg_conseq_responses.copy()  # copy so the raw counts are not modified in place
neg_conseq_responses_adj["Blank"] = neg_conseq_responses_adj["Blank"] - 4076
neg_conseq_responses_adj_df = pd.DataFrame(neg_conseq_responses_adj["Yes"] / (neg_conseq_responses_adj["Yes"] + neg_conseq_responses_adj["Blank"]) * 100, columns=["percent_yes"])

In [193]:

neg_conseq_responses_adj.columns = ['Yes', 'No']
neg_conseq_responses_adj[:-1]

Out[193]:

	Yes	No
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING	390	1563
NEGATIVE.CONSEQUENCES.PSEUDONYM	50	1903
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE	166	1787
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME	48	1905
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE	79	1874
NEGATIVE.CONSEQUENCES.SUGGEST.COC	116	1837
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION	301	1652
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION	248	1705
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES	85	1868
NEGATIVE.CONSEQUENCES.OTHER	90	1863
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE	1094	859

In [194]:

neg_conseq_responses_adj_df.sort_values(by="percent_yes").round(2)

Out[194]:

	percent_yes
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME	2.46%
NEGATIVE.CONSEQUENCES.PSEUDONYM	2.56%
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE	4.05%
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES	4.35%
NEGATIVE.CONSEQUENCES.OTHER	4.61%
NEGATIVE.CONSEQUENCES.SUGGEST.COC	5.94%
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE	8.50%
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION	12.70%
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION	15.41%
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING	19.97%
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE	56.02%
NEGATIVE.CONSEQUENCES.ANY.RESPONSE	100.00%

In [195]:

sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
neg_conseq_responses_adj[:-1].sort_values(by='No').plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[22:].replace(".", " ") # cut off "NEGATIVE.CONSEQUENCES."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("As a result of experiencing or witnessing\nharassment, which, if any, of the following have you done?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')


Where did you first have regular access to a computer with internet connection?¶

Where was the respondent surveyed from?¶

Harassment / Inclusiveness of OSS¶

Have you ever observed any of the following in the context of an open source project?¶

Have you ever witnessed any of the following behaviors directed at another person in the context of an open source project? (not including something directed at you)¶

Have you ever experienced any of the following behaviors directed at you in the context of an open source project?¶

Thinking of the last time you experienced harassment, how did you respond?¶

How effective were the following responses?¶

As a result of experiencing or witnessing harassment, which, if any, of the following have you done?¶