#!/usr/bin/env python
# coding: utf-8

# # Users' Data: Legal & Ethical Considerations

# Before we dive into collecting data from the internet, we need to discuss some serious questions. Is it legal or ethical to computationally collect data from the internet? Is it legal or ethical to publish research that includes internet users' data without their knowledge?
# 
# ## Legal Considerations
# 
# If internet data is publicly available (e.g., tweets from a public Twitter account), it is generally considered legal to collect this data, even if a particular platform says that you cannot. In 2019, the Ninth Circuit Court of Appeals ruled in *hiQ v. LinkedIn* that scraping publicly accessible websites likely does not violate the Computer Fraud and Abuse Act (CFAA), the main federal anti-hacking law. You can [read more about this legal ruling](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data#:~:text=Linkedin%20Protects%20Scraping%20of%20Public%20Data,-Share%20It%20Share&text=In%20a%20long%2Dawaited%20decision,and%20Abuse%20Act%20(CFAA).) from the Electronic Frontier Foundation.
# 
# ## Institutional Review Boards (IRBs)
# 
# Research that involves human participants (e.g., surveys, interviews, blood draws) needs to be approved by an Institutional Review Board (IRB). But research about publicly available internet data does not typically require IRB approval.
# 
# The [Cornell Institutional Review Board](https://researchservices.cornell.edu/sites/default/files/2019-05/IRB%20Policy%2020.pdf), however, recommends caution when mining data from the internet and advises researchers to seek "formal confirmation of non-human participant research status":
# 
# > If the individual or social media/network site has not placed any restrictions on
# access to information about himself/herself (e.g., information available on a public
# website, blog, twitter feed, chat room, etc.), the following best practices should be
# followed:
# > - The researcher should send a project description to the IRB office and seek a
# formal confirmation of non-human participant research status for the study. We
# believe that in most cases, this will not be considered human participant
# research, but caution is recommended before a researcher makes his/her own
# determination, because of the emerging ethical sensitivities in this area. 
# 
# 
# ## Publishing, Privacy, & Citation
# 
# Just because something is legal or gets approved by an IRB does not mean it is ethical. Collecting, sharing, and publishing internet data created by or about individuals can lead to unwanted public scrutiny, harm, and other negative consequences for those individuals. For these reasons, some researchers attempt to anonymize internet data before sharing it or before publishing an article that cites a specific post. Yet anonymizing internet data has a cost of its own: internet users no longer receive credit as creators and authors.
# 
# There is no single, simple answer to the many difficult questions raised by internet data collection. It is important to develop an ethical framework that responds to the specifics of your particular research project or use case (e.g., the platform, the people involved, the context, the potential consequences, etc.).
# 
# In my own research, I have started seeking explicit permission from internet users when I want to quote them in a published article. In this book, I only share internet data that meets a certain threshold of publicness, such as tweets from verified Twitter accounts or Reddit posts with a certain number of upvotes. This is an approach that I have developed based on some of the models and readings included below.
# 
# 
# ## Models & Examples of Social Media Data in Published Research
# 
# Below are a few examples of how researchers have approached social media data in published research:
# 
# ### Paraphrasing Posts
# 
# * In Maria Antoniak, David Mimno, and Karen Levy's [article about a Reddit subcommunity dedicated to birth stories (r/BabyBumps)](https://maria-antoniak.github.io/resources/2019_cscw_birth_stories.pdf), they paraphrased the Reddit submissions discussed in the article and then deleted all collected Reddit data after the article was published.
# 
# ### Linking to Posts & Using "Reasonably Public" Thresholds
# 
# * In Deen Freelon, Charlton McIlwain, and Meredith D. Clark's [report about the #BlackLivesMatter movement](https://cmsimpact.org/wp-content/uploads/2016/03/beyond_the_hashtags_2016.pdf), they included links to tweets rather than the full text of tweets. They only linked to tweets that had a minimum of 100 retweets and were published by Twitter users who had at least 3,000 followers or were verified. They embargoed their Twitter data for a year and then publicly released a list of tweet IDs. Tweet IDs can be used by third parties to re-download any tweets that have not yet been deleted, as I discuss in the lesson ["Twitter Data Sharing"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Twitter-Data-Sharing.html).
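# 
# The "reasonably public" threshold described above can be expressed as a simple filter. The following is a minimal sketch in Python, not the report authors' actual code: the `Tweet` record, its field names, and the sample values are hypothetical and do not correspond to any real API schema. Only the cutoffs (at least 100 retweets, and an author with at least 3,000 followers or a verified account) come from the report as summarized above.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    """Hypothetical minimal record; field names are illustrative only."""
    tweet_id: str
    retweet_count: int
    author_followers: int
    author_verified: bool


def is_reasonably_public(tweet, min_retweets=100, min_followers=3000):
    """Return True if a tweet meets both parts of the threshold."""
    # Part 1: the tweet itself has circulated widely.
    popular_enough = tweet.retweet_count >= min_retweets
    # Part 2: the author is verified OR has a large following.
    prominent_author = (
        tweet.author_verified or tweet.author_followers >= min_followers
    )
    return popular_enough and prominent_author


def shareable_ids(tweets):
    """Keep only the IDs (not the text) of tweets that pass the threshold."""
    return [t.tweet_id for t in tweets if is_reasonably_public(t)]


# Made-up sample data for illustration.
sample = [
    Tweet("1", retweet_count=250, author_followers=12000, author_verified=False),
    Tweet("2", retweet_count=40, author_followers=90000, author_verified=True),
    Tweet("3", retweet_count=500, author_followers=150, author_verified=True),
    Tweet("4", retweet_count=120, author_followers=800, author_verified=False),
]

print(shareable_ids(sample))  # ['1', '3']
```

# Returning only tweet IDs, rather than tweet text, mirrors the practice of sharing IDs so that tweets deleted by their authors cannot be re-downloaded.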
# 
# ### Direct Collaboration & Conversation with Users
# 
# * In Moya Bailey's [article about the #GirlsLikeUs hashtag](http://www.digitalhumanities.org/dhq/vol/9/2/000209/000209.html), created by trans advocate Janet Mock, she asked for Mock's permission to work on the project before it began and collaborated with Mock to develop research questions and determine the project's direction.
# 
# ## Further Recommended Reading

# * [Doc Now White Paper](https://www.docnow.io/docs/docnow-whitepaper-2018.pdf), Bergis Jules, Ed Summers, Dr. Vernon Mitchell, Jr.
# * [No Robots, Spiders, or Scrapers: Legal and Ethical Regulation of Data Collection Methods in Social Media Terms of Service](https://cmci.colorado.edu/~cafi5706/ICWSM2020_datascraping.pdf), Casey Fiesler, Nathan Beard, Brian C. Keegan
# * [#transform(ing)DH Writing and Research: An Autoethnography of Digital Humanities and Feminist Ethics](http://www.digitalhumanities.org/dhq/vol/9/2/000209/000209.html), Moya Bailey
# * [The #TwitterEthics Manifesto](https://modelviewculture.com/pieces/the-twitterethics-manifesto), Dorothy Kim and Eunsong Kim