Please read the assignment overview page carefully before proceeding. The page contains information about formatting (including file formats), group sizes, and many other aspects of handing in the assignment.
If you fail to follow these simple instructions, it will negatively impact your grade!
Due date and time: The assignment is due on Feb 27th at 23:59. Hand in your Jupyter notebook file (with extension .ipynb) via DTU Learn (Assignment 1).
Remember to include in the first cell of your notebook:
Exercise: Web-scraping the list of participants to the International Conference in Computational Social Science
You can find the programme of the 2023 edition of the conference at this link. As you can see, the conference programme included many different contributions: keynote presentations, parallel talks, tutorials, and posters.
- Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible, including keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accurate as reported on the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spellings.
- Some instructions for success:
- First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.
- Use the BeautifulSoup Python package to navigate through the hierarchy and extract the elements you need from the page.
- You can use the find_all method to find elements that match specific filters. Check the documentation of the library for detailed explanations on how to set filters.
- Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas or other unwanted characters).
- The overall idea is to adapt the procedure I have used here for the specific page you are scraping.
- Create the set of unique researchers that joined the conference and store it into a file.
- Important: If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible.
- Optional: For a more complete representation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, which can be found at this link; (ii) the organizers of tutorials, which can be found at this link.
- How many unique researchers do you get?
- Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices (answer in max 150 words).
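The cleaning and de-duplication steps above can be sketched with the standard library alone. This is only one possible strategy, not the reference solution: it assumes the raw strings have already been extracted from the page (for example with BeautifulSoup), and the helper names `clean_name`, `dedup_key` and `unique_names` are made up for illustration.

```python
import re
import unicodedata


def clean_name(raw):
    """Collapse whitespace and trim stray separators around a scraped name."""
    name = unicodedata.normalize("NFKC", raw)
    name = re.sub(r"\s+", " ", name)      # collapse runs of whitespace
    return name.strip(" ,;")              # drop leading/trailing separators


def dedup_key(name):
    """Accent- and case-insensitive key, used only to spot near-duplicates."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^\w ]", "", no_accents.casefold()).strip()


def unique_names(raw_names):
    """Keep the first spelling seen per key; skip strings with a single token."""
    seen = {}
    for raw in raw_names:
        name = clean_name(raw)
        if len(name.split()) < 2:         # require first name and family name
            continue
        seen.setdefault(dedup_key(name), name)
    return sorted(seen.values())
```

A key of this kind only catches differences in accents, case and punctuation. Abbreviated first names or swapped name order would need a separate strategy, and it is worth inspecting the merged pairs by hand before trusting them.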
Exercise: Ready made data vs Custom made data
In this exercise, I want to make sure you have understood the key points of my lecture and the reading.
- What are the pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book (answer in max 150 words).
- How do you think these differences can influence the interpretation of the results in each study? (answer in max 150 words)
Exercise: Collecting Research Articles from IC2S2 Authors
In this exercise, we'll leverage the OpenAlex API to gather information on research articles authored by participants of the IC2S2 2023 conference, referred to as IC2S2 authors. Before you start, please ensure you read through the entire exercise.
Steps:
Retrieve Data: Starting with the authors you identified in Week 2, Exercise 2, use the OpenAlex API works endpoint to fetch the research articles they have authored. For each article, retrieve the following details:
- id: The unique OpenAlex ID for the work.
- publication_year: The year the work was published.
- cited_by_count: The number of times the work has been cited by other works.
- author_ids: The OpenAlex IDs for the authors of the work.
- title: The title of the work.
- abstract_inverted_index: The abstract of the work, formatted as an inverted index.
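Since abstract_inverted_index maps each word to the positions where it occurs, it can help to see how the plain text is recovered from it. A minimal sketch, assuming the field is a dictionary of the form {word: [positions]} (or missing for works without an abstract). The function name is made up for illustration.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:                # abstract may be None or empty
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positioned))
```

For example, `{"Social": [0], "networks": [1, 4], "shape": [2], "other": [3]}` is rebuilt as "Social networks shape other networks".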
Important Note on Paging: By default, the OpenAlex API limits responses to 25 works per request. For more efficient data retrieval, I suggest adjusting this limit to 200 works per request. Even with this adjustment, you will need to implement pagination to access all available works for a given query. This ensures you can systematically retrieve the complete set of works beyond the initial 200. Find guidance on implementing pagination here.
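A possible shape for the paging loop is sketched below. It assumes cursor paging as I understand it from the OpenAlex documentation: the request carries `per-page` and `cursor` parameters (starting from `cursor=*`), and each response holds the works under `results` and the next cursor under `meta.next_cursor`. Please check these names against the pagination guide before relying on them. The function names and the injectable `fetch` argument are choices made for this sketch.

```python
import json
import urllib.parse
import urllib.request

WORKS_URL = "https://api.openalex.org/works"


def fetch_json(params):
    """Send one GET request to the works endpoint and decode the JSON body."""
    url = WORKS_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_all_works(filter_string, fetch=fetch_json, per_page=200):
    """Follow cursors until the API no longer returns a next cursor."""
    works, cursor = [], "*"
    while cursor:
        page = fetch({"filter": filter_string, "per-page": per_page, "cursor": cursor})
        works.extend(page["results"])
        cursor = page["meta"].get("next_cursor")   # None once the last page is reached
    return works
```

Passing the request function as an argument makes it possible to try the loop on canned pages before sending any real request.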
Data Storage: Organize the retrieved information into two Pandas DataFrames and save them to two files in a suitable format:
- The IC2S2 papers dataset should include: id, publication_year, cited_by_count, author_ids.
- The IC2S2 abstracts dataset should include: id, title, abstract_inverted_index.
Filters: To ensure the data we collect is relevant and manageable, apply the following filters:
- Only include IC2S2 authors with a total work count between 5 and 5,000.
- Retrieve only works that have received more than 10 citations.
- Limit to works authored by fewer than 10 individuals.
- Include only works relevant to Computational Social Science (focusing on: Sociology OR Psychology OR Economics OR Political Science) AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science), as defined by their Concepts. Note: here we only consider Concepts at level=0 (the most coarse definition of concepts).
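To make the AND/OR logic of these filters concrete, here is a sketch of how a filter string for the works endpoint could be assembled. It assumes the OpenAlex convention that commas combine clauses with AND and pipes combine values with OR, and it assumes the filter names `author.id`, `cited_by_count`, `authors_count` and `concepts.id`. Verify these against the API documentation. The concept IDs are passed in as arguments because they have to be looked up first, and the function name is made up for illustration.

```python
def build_works_filter(author_ids, social_concepts, quantitative_concepts):
    """Join clauses with commas (AND) and alternatives with pipes (OR)."""
    clauses = [
        "author.id:" + "|".join(author_ids),
        "cited_by_count:>10",                              # more than 10 citations
        "authors_count:<10",                               # fewer than 10 authors
        "concepts.id:" + "|".join(social_concepts),        # any social science concept
        "concepts.id:" + "|".join(quantitative_concepts),  # AND any quantitative concept
    ]
    return ",".join(clauses)
```

With placeholder IDs, `build_works_filter(["A1", "A2"], ["C_SOC", "C_PSY"], ["C_MATH"])` gives `author.id:A1|A2,cited_by_count:>10,authors_count:<10,concepts.id:C_SOC|C_PSY,concepts.id:C_MATH`. The threshold on an author's total work count is a property of the author rather than of the work, so it is not part of this string and would be applied when selecting the authors.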
Efficiency Tips: Writing efficient code in this exercise is crucial. To speed up your process:
- Apply filters directly in your request: When possible, use the filter parameter of the works endpoint to apply the filters above directly in your API request, ensuring only relevant data is returned. Learn about combining multiple filters here.
- Bulk requests: Instead of sending one request for each author, you can use the filter parameter to query works by multiple authors in a single request. Note: My testing suggests that you can only include up to 25 authors per request.
- Use multiprocessing: Implement multiprocessing to handle multiple requests simultaneously. I highly recommend Joblib’s Parallel function for that, and tqdm can help monitor the progress of your jobs. Remember to stay within the rate limit of 10 requests per second.
For reference, employing these strategies allowed me to fetch the data in about 30 seconds using 5 cores on my laptop. I obtained a dataset of approximately 25 MB (including both the IC2S2 abstracts and IC2S2 papers files).
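The batching idea can be sketched as follows. To keep the example free of extra dependencies it uses `concurrent.futures.ThreadPoolExecutor` from the standard library instead of Joblib, so treat it as an illustration of the pattern rather than the setup described above. `fetch_batch` is a hypothetical callable that takes a list of author IDs and returns the list of works for that batch.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size=25):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_in_batches(author_ids, fetch_batch, batch_size=25, workers=5):
    """Call fetch_batch once per batch of authors and flatten the results."""
    batches = chunked(author_ids, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_batch, batches)   # results keep the batch order
        return [work for batch_result in results for work in batch_result]
```

This sketch does not enforce the rate limit by itself. With several workers running at once, some throttling (for instance a short pause inside `fetch_batch`) would still be needed to stay under 10 requests per second.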
Data Overview and Reflection questions: Answer the following questions:
- Dataset summary. How many works are listed in your IC2S2 papers dataframe? How many unique researchers have co-authored these works?
- Efficiency in code. Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? (answer in max 150 words)
- Filtering Criteria and Dataset Relevance. Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? (answer in max 150 words)
Exercise: Constructing the Computational Social Scientists Network
In this exercise, we will create a network of researchers in the field of Computational Social Science using the NetworkX library. In our network, nodes represent authors of academic papers, and a link between node A and node B indicates that A and B have written at least one paper together. The link's weight reflects the number of papers written by both A and B.
Part 1: Network Construction
Weighted Edgelist Creation: Start with your dataframe of papers. Construct a weighted edgelist where each list element is a tuple containing three elements: the author ids of two collaborating authors and the total number of papers they've co-authored. Ensure each author pair is listed only once.
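One way to build such an edgelist is sketched below. It assumes the author ids of each paper are available as a list (for instance one entry per row of the author_ids column), and the function name is made up for illustration.

```python
from collections import Counter
from itertools import combinations


def weighted_edgelist(author_lists):
    """Count co-authored papers for every unordered pair of authors."""
    pair_counts = Counter()
    for authors in author_lists:
        # set() drops ids repeated within a paper; sorted() fixes the order
        # inside each pair, so (A, B) and (B, A) are counted as one pair.
        for pair in combinations(sorted(set(authors)), 2):
            pair_counts[pair] += 1
    return [(a, b, weight) for (a, b), weight in pair_counts.items()]
```

Papers with a single author produce no pairs, so their authors do not appear in the edgelist at all. Whether those authors should still be added to the network as isolated nodes is a choice worth making explicitly.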
Graph Construction:
- Use NetworkX to create an undirected Graph.
- Employ the add_weighted_edges_from function to populate the graph with the weighted edgelist from step 1, creating a weighted, undirected graph.
Node Attributes:
- For each node, add attributes for the author's display name, country, citation count, and the year of their first publication in Computational Social Science. The display name and country can be retrieved from your authors dataset. The year of their first publication and the citation count can be retrieved from the papers dataset.
- Save the network as a JSON file.
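The construction steps above might look roughly like this. It is a sketch under a few assumptions: node attributes are supplied as a dictionary keyed by author id, the attribute names are left to the caller, and the node-link format (`nx.node_link_data`) is used as the JSON representation, which is one of several reasonable choices. The function names are made up for illustration.

```python
import json

import networkx as nx


def build_graph(weighted_edges, node_attributes):
    """Create a weighted, undirected graph and attach per-node attributes."""
    graph = nx.Graph()
    graph.add_weighted_edges_from(weighted_edges)     # (u, v, weight) tuples
    # node_attributes maps node id -> {attribute name: value};
    # ids that are not nodes of the graph are ignored by NetworkX.
    nx.set_node_attributes(graph, node_attributes)
    return graph


def graph_to_json(graph):
    """Serialise the graph to a JSON string in node-link format."""
    return json.dumps(nx.node_link_data(graph))
```

The returned string can be written to a file with a plain `open(..., "w")`, and the graph can be restored later by passing the parsed JSON to `nx.node_link_graph`.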
Part 2: Preliminary Network Analysis
Now, with the network constructed, perform a basic analysis to explore its features.
Network Metrics:
- What is the total number of nodes (authors) and links (collaborations) in the network?
- Calculate the network's density (the ratio of actual links to the maximum possible number of links). Would you say that the network is sparse? Justify your answer.
- Is the network fully connected (i.e., is there a direct or indirect path between every pair of nodes within the network), or is it disconnected?
- If the network is disconnected, how many connected components does it have? A connected component is defined as a subset of nodes within the network where a path exists between any pair of nodes in that subset.
- How many isolated nodes are there in your network? An isolated node is defined as a node with no connections to any other node in the network.
- Discuss the results above on network density and connectivity. Are your findings in line with what you expected? Why? (answer in max 150 words)
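The quantities asked for above can be collected in one place, as in the following sketch. It assumes the network is an undirected `nx.Graph`, and the function name and dictionary keys are made up for illustration. For an undirected graph with N nodes and L links, the density is 2L / (N(N-1)).

```python
import networkx as nx


def network_summary(graph):
    """Collect basic size, density and connectivity figures for a graph."""
    n_nodes = graph.number_of_nodes()
    return {
        "nodes": n_nodes,
        "links": graph.number_of_edges(),
        "density": nx.density(graph),
        # nx.is_connected raises an exception on an empty graph, hence the guard
        "is_connected": nx.is_connected(graph) if n_nodes > 0 else False,
        "components": nx.number_connected_components(graph),
        "isolated_nodes": nx.number_of_isolates(graph),
    }
```

If the graph was populated only from an edgelist, every node has at least one link by construction, so the count of isolated nodes will be zero unless authors without co-authors were added separately.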
Degree Analysis:
- Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree). What do these metrics tell us about the network? (answer in max 150 words)
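The summary statistics can be computed with the standard library once the degrees are in a list. In this sketch the values are assumed to come from NetworkX, for example `[d for _, d in G.degree()]` for degree and `[s for _, s in G.degree(weight="weight")]` for strength. The function name is made up for illustration.

```python
import statistics


def summarize(values):
    """Return average, median, mode(s), minimum and maximum of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),   # list, since ties are possible
        "min": min(values),
        "max": max(values),
    }
```

Comparing the mean with the median is a quick way to see whether a few highly connected authors pull the average up, which is common in collaboration networks.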
Top Authors:
- Identify the top 5 authors by degree. What role do these nodes play in the network?
- Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons? (answer in max 150 words)