A Summary of lecture "AWS ML Foundations Course", via Udacity
Testing your code is essential before deployment. It helps you catch errors and faulty conclusions before they make any major impact. Today, employers are looking for data scientists with the skills to properly prepare their code for an industry setting, wich includes testing their code.
Problems that could occur in data science aren't always easily detectable; you might have values being encoded incorrectly, features being used inappropriately, unexpected data breaking assumptions.
To catch these errors, you have to check for the quality and accuracy of your analysis in addition to the quality of your code. Proper testing is necessary to avoid unexpected surprises and have confidence in your results.
TEST DRIVEN DEVELOPMENT: a development process where you write tests for tasks before you even write the code to implement those tasks.
UNIT TEST: a test that covers a "unit" of code, usually a single function, independently from the rest of the program.
We want to test our functions in a way that is repeatable and automated. Ideally, we'd run a test program that runs all our unit tests and cleanly lets us know which ones failed and which ones succeeded. Fortunately, there are great tools available in Python that we can use to create effective unit tests!
Unit Test Advantages and Disadvantages
The advantage of unit tests is that they are isolated from the rest of your program, and thus, no dependencies are involved. They don't require access to databases, APIs, or other external sources of information. However, passing unit tests isn’t always enough to prove that our program is working successfully. To show that all the parts of our program work with each other properly, communicating and transferring data between them correctly, we use integration tests. In this lesson, we'll focus on unit tests; however, when you start building larger programs, you will want to use integration tests as well.
You can read about integration testing and how integration tests relate to unit tests here. That article contains other very useful links as well.
To install pytest
, run pip install -U pytest
in your terminal. You can see more information on getting started here.
test_
test_
inside the test filepytest
into your terminal in the directory of your test file and it will detect these tests for you!test_
is the default - if you wish to change this, you can learn how to in this pytest
configuration
In the test output, periods represent successful unit tests and F's represent failed unit tests. Since all you see is what test functions failed, it's wise to have only one assert
statement per test. Otherwise, you wouldn't know exactly how many tests failed, and which tests failed.
Your tests won't be stopped by failed assert
statements, but it will stop if you have syntax errors.
root@1c2d457fb26f:/home/workspace# pytest
========================================================================== test session starts ===========================================================================
platform linux -- Python 3.6.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
rootdir: /home/workspace
plugins: requests-mock-1.5.2
collected 4 items
test_compute_launch.py ..F. [100%]
================================================================================ FAILURES ================================================================================
___________________________________________________________________ test_days_until_launch_0_negative ____________________________________________________________________
def test_days_until_launch_0_negative():
> assert(days_until_launch(83, 64) == 0)
E assert -19 == 0
E + where -19 = days_until_launch(83, 64)
test_compute_launch.py:10: AssertionError
======================================================================== short test summary info =========================================================================
FAILED test_compute_launch.py::test_days_until_launch_0_negative - assert -19 == 0
====================================================================== 1 failed, 3 passed in 0.32s =======================================================================
After fixing code,
root@1c2d457fb26f:/home/workspace# pytest
========================================================================== test session starts ===========================================================================
platform linux -- Python 3.6.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
rootdir: /home/workspace
plugins: requests-mock-1.5.2
collected 4 items
test_compute_launch.py .... [100%]
=========================================================================== 4 passed in 0.05s ============================================================================
Test driven development for data science is relatively new and has a lot of experimentation and breakthroughs appearing, which you can learn more about in the resources below.
Logging is valuable for understanding the events that occur while running your program. For example, if you run your model over night and see that it's producing ridiculous results the next day, log messages can really help you understand more about the context in which this occurred. Lets learn about the qualities that make a log message effective.
Logging is the process of recording messages to describe events that have occurred while running your software. Let's take a look at a few examples, and learn tips for writing good log messages.
Tip: Be professional and clear
Bad: Hmmm... this isn't working???
Bad: idk.... :(
Good: Couldn't parse file.
Tip: Be concise and use normal capitalization
Bad: Start Product Recommendation Process
Bad: We have completed the steps necessary and will now proceed with the recommendation process for the records in our product database.
Good: Generating product recommendations.
Tip: Choose the appropriate level for logging
Tip: Provide any useful information
Bad: Failed to read location data
Good: Failed to read location data: store_id 8324971
Code reviews benefit everyone in a team to promote best programming practices and prepare code for production. Let's go over what to look for in a code review and some tips on how to conduct one.
First, let's look over some of the questions we may ask ourselves while reviewing code. These are simply from the concepts we've covered in these last two lessons!
Is the code clean and modular?
Is the code efficient?
Is documentation effective?
Is the code well tested?
Is the logging effective?
Now that we know what we are looking for, let's go over some tips on how to actually write your code review. When your coworker finishes up some code that they want to merge to the team's code base, they might send it to you for review. You provide feedback and suggestions, and then they may make changes and send it back to you. When you are happy with the code, you approve and it gets merged to the team's code base.
As you may have noticed, with code reviews you are now dealing with people, not just computers. So it's important to be thoughtful of their ideas and efforts. You are in a team and there will be differences in preferences. The goal of code review isn't to make all code follow your personal preferences, but a standard of quality for the whole team.
Tip: Use a code linter
This isn't really a tip for code review, but can save you lots of time from code review! Using a Python code linter like pylint can automatically check for coding standards and PEP 8 guidelines for you! It's also a good idea to agree on a style guide as a team to handle disagreements on code style, whether that's an existing style guide or one you create together incrementally as a team.
Tip: Explain issues and make suggestions
Rather than commanding people to change their code a specific way because it's better, it will go a long way to explain to them the consequences of the current code and suggest changes to improve it. They will be much more receptive to your feedback if they understand your thought process and are accepting recommendations, rather than following commands. They also may have done it a certain way intentionally, and framing it as a suggestion promotes a constructive discussion, rather than opposition.
BAD: Make model evaluation code its own module - too repetitive.
BETTER: Make the model evaluation code its own module. This will simplify models.py to be less repetitive and focus primarily on building models.
GOOD: How about we consider making the model evaluation code its own module? This would simplify models.py to only include code for building models. Organizing these evaluations methods into separate functions would also allow us to reuse them with different models without repeating code.
Tip: Keep your comments objective
Try to avoid using the words "I" and "you" in your comments. You want to avoid comments that sound personal to bring the attention of the review to the code and not to themselves.
BAD: I wouldn't groupby genre twice like you did here... Just compute it once and use that for your aggregations.
BAD: You create this groupby dataframe twice here. Just compute it once, save it as groupby_genre and then use that to get your average prices and views.
GOOD: Can we group by genre at the beginning of the function and then save that as a groupby object? We could then reference that object to get the average prices and views without computing groupby twice.
Tip: Provide code examples
When providing a code review, you can save the author time and make it easy for them to act on your feedback by writing out your code suggestions. This shows you are willing to spend some extra time to review their code and help them out. It can also just be much quicker for you to demonstrate concepts through code rather than explanations.
Let's say you were reviewing code that included the following lines:
first_names = []
last_names = []
for name in enumerate(df.name):
first, last = name.split(' ')
first_names.append(first)
last_names.append(last)
df['first_name'] = first_names
df['last_names'] = last_names
BAD: You can do this all in one step by using the pandas str.split method.
GOOD: We can actually simplify this step to the line below using the pandas str.split method. Found this on this stack overflow post: https://stackoverflow.com/questions/14745022/how-to-split-a-column-into-two-columns
df['first_name'], df['last_name'] = df['name'].str.split(' ', 1).str