Chapter 2: Coding Basics for Cancer Data Science

Do you need to know computer science to do data science? Aren’t data scientists just computer scientists? In this chapter, we’ll answer those questions and set you up with tips and resources to write code for cancer research projects.

Watch the Video

Watch 5 Tips for Learning to Code (approx. 6 minutes long).

Test Your Knowledge

What are two common programming languages used by data scientists?

A. Python, R
B. Python, TensorFlow
C. Panda, R

The correct answer is A. Many data scientists working in cancer research have skills in Python, R, or both!

The incorrect answers are: 

B. TensorFlow is an open-source machine learning platform, not a programming language. Python is the recommended programming language for using TensorFlow.

C. Panda (properly, pandas) is a Python package, not a programming language. You may use it to organize data into objects, such as DataFrames, that make the data easier to analyze.

Why is it important to write comments in your code for others to see how you did your work?

A. Reminds you that you need to fix a bug in the code.
B. Allows others to see the logic of your work so they can replicate it.
C. Allows others to comment back on where you can improve your code.

The correct answer is B. Comments are a useful way to let others know what each section, class, function, or object does. This helps them navigate and troubleshoot the code as they use your script.

The incorrect answers are: 

A. While programming, you can use comments to help you debug and navigate code, but when you are ready to share that code, leaving reminders about bugs in the code can be confusing for others.

C. Unless someone shares their file with you, you won’t see comments from others trying to troubleshoot and work with your script.
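As a rough illustration of comments that explain the logic of your work, here is a minimal Python sketch. The function name, the input format, and the example data are all hypothetical and chosen only for illustration; they do not come from any NCI tool or dataset.

```python
def mutation_frequency(sample_calls):
    """Return the fraction of samples in which a mutation was detected.

    sample_calls: list of booleans, one per sample (True = mutation found).
    This function and its input format are hypothetical examples.
    """
    # Guard against an empty cohort so we never divide by zero.
    if not sample_calls:
        return 0.0
    # sum() counts the True values because True equals 1 in Python.
    return sum(sample_calls) / len(sample_calls)


# Made-up example data: four samples, one of which carries the mutation.
calls = [True, False, False, False]
print(f"Mutation frequency: {mutation_frequency(calls):.2f}")
```

Each comment here says why a line exists, not just what it does, which is the kind of note that helps someone else replicate or troubleshoot the script.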

Why is it important for your code to be reproducible?

A. It allows others to replicate your study and build on it as needed.
B. It allows others to apply your code to their own study.
C. It allows others to validate your work.
D. All of the above.

The correct answer is D. Yes, all of these are benefits to making your code reproducible.

The incorrect answers are:

A. While this does allow others to replicate and build upon your study, it is not limited to just this benefit.

B. While this does allow your code to be applied to another person’s study, it is not limited to just this benefit.

C. While it does lend itself to validation, it is not limited to just this benefit.
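One common habit that supports reproducible code is fixing the random seed, so that anyone who reruns the script gets the same result. The sketch below assumes a hypothetical task (splitting sample IDs into training and test groups); the function name, parameters, and sample IDs are invented for illustration.

```python
import random


def split_samples(sample_ids, test_fraction=0.25, seed=42):
    """Split sample IDs into (train, test) lists in a repeatable way.

    All names and default values here are hypothetical examples.
    """
    # A dedicated generator with a fixed seed gives the same shuffle
    # on every run, without touching Python's global random state.
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(sample_ids)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up sample IDs for demonstration.
ids = [f"sample_{i}" for i in range(8)]
train, test = split_samples(ids)
print("Train:", train)
print("Test:", test)
```

Because the seed is recorded in the code, a colleague running the same script on the same IDs should get the same split, which makes the downstream analysis easier to validate and build on.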

  • Python Introductory Series, Lesson 1: Watch this course recording from NCI’s Bioinformatics Training & Education Program (BTEP) for information and guidance on Python (i.e., its command syntax, where you can find Python packages, and how it’s used to start a Jupyter Lab session on Biowulf). Note: Given this is a recording, you will have to access the resource tools specified within the video to follow along with the instructor.
  • An Introduction to Python for  (Part 1) (Part 2): Watch this beginner-oriented, two-part series on Python, presented by NIH’s National Institute on Minority Health and Health Disparities.    
  • Introduction to R: Watch this NCI BTEP course recording for an introduction to R and RStudio. Note: Given this is a recording, you will have to access the resource tools specified within the video (i.e., DNAnexus and RStudio) to follow along with the instructor.
  • GitHub Automation for Scientists: With this course, you’ll walk through the “why’s” and “how’s” for using automation to enhance your scientific software development process. This course is designed particularly for students in the biomedical sciences and researchers who use informatics tools.
  • Choose a Programming Language: Refer to this recording from the NCI Center for Cancer Research’s Bioinformatics Training and Education Program to learn about R and Python—from installation to execution.

Keep Going

Continue to Chapter 5 to learn about big data technologies we think can accelerate your education and research.

Instructor

Daoud Meerzaman, Ph.D., NCI Center for Biomedical Informatics and Information Technology (CBIIT)
Dr. Meerzaman is the branch chief for the Computational Genomics and Bioinformatics Branch at NCI CBIIT. He leads a team of bioinformaticists, computational biologists, and developers to analyze cancer research data for NCI.

For questions and feedback about this chapter, email our team at ncicbiit@mail.nih.gov

If you would like to reproduce some or all of this content, see Reuse of NCI Information for guidance about copyright and permissions. In the case of permitted digital reproduction, please credit the National Cancer Institute as the source and link to the original NCI product using the original product's title; e.g., “Chapter 2: Coding Basics for Cancer Data Science was originally published by the National Cancer Institute.”
