This course introduces programming techniques for data analytics. Topics include programming languages, relevant software packages, good programming practices, linear algebra in data analytics, numerical computing, and four or five machine learning algorithms used as running problems. After completing the course, students will have the skills to implement a data analytics pipeline (data collection, data retrieval, data analysis, data visualization) and several "handy" machine learning algorithms.

Piazza Discussion Forum

We will use Piazza for discussion (e.g., homework, project). Post your questions there, and the teaching staff and your fellow classmates will be able to help answer them quickly. You can also use Piazza to find project teammates.

T-square will only be used for submission of assignments and projects.

Office Hours

Instructor: Da Kuang, Thu 2-3pm, Klaus 1315
Instructor: Polo Chau, Thu 4-5pm, Klaus 1315
TA: Lianxiao (Shawn) Qiu, Mon 1-2pm, Klaus 2108

Prerequisites

Students should have some programming experience in any language (for example, working knowledge of variables, operators, statements, control flow, references, functions, and classes) and should feel comfortable reading documentation.
Additional formal prerequisites
Undergraduate semester level CS 1371 (Computing for Engineers) or a different programming course, minimum grade of D.

Schedule (tentative)

Each entry lists the class dates (Wed, Fri), the topics, the materials posted for each day, and any events.

Aug 20, 22
* Course introduction
* Course survey
* Introduction to Python and its data structures
Wed: Slides. Fri: Slides.

Aug 27, 29
* Python exercises Q&A
* Data collection
  • wget, urllib/urllib2, API
Fri: Slides.

Sep 3, 5
* Data collection (cont'd)
  • BeautifulSoup
* Topics in Python
Wed: Slides. Fri: Slides.
Events: HW1 out (Wed)

Sep 10, 12
* Charting/Visualization
  • Charting in R
R resources: (Link 1) (Link 2) (Link 3)
Wed: Slides.

Sep 17, 19
* Data storage and retrieval in SQLite
* Basic linear algebra overview
  • Vectors, matrices
Wed: Slides. Fri: Notes.
Events: HW1 due (Fri)

Sep 24, 26
* Dense and sparse matrices (including NumPy)
* Good programming practices
Wed: Slides.

Oct 1, 3
* Basic linear algebra overview
  • Matrix-vector multiplication
  • Norms
Wed: Notes. Fri: Notes.
Events: HW2 out (Mon)

Oct 8, 10
* Linear regression
  • Least squares
  • Computing least squares
  • Case study
Wed: Notes. Fri: Scripts.
Events: HW2 due (Fri), HW3 out (Sat)

Oct 15, 17
* Logistic regression
  • Regression vs. classification
  • Gradient descent
  • Case study
Wed: Slides. Fri: Scripts.

Oct 22, 24
* Computer architecture overview
* Vectorization in NumPy and R
Wed: Slides. Fri: Scripts.
Events: HW3 due (Fri/Sat)

Oct 29, 31
* K-means clustering
* Project proposal presentations
Wed: Slides.
Events: Project proposal due (Thu)

Nov 5, 7
* K-means clustering: Case studies
* Efficient implementation of K-means
Wed: Scripts (README).

Nov 12, 14
* Numerical software stacks
* Singular value decomposition (SVD)
Wed: Slides. Fri: Notes.

Nov 19, 21
* SVD, eigenvalue decomposition (EVD), and PCA
* Computing SVD, EVD, and PCA
Wed: Notes.
Events: Progress report due (Wed)

Nov 26, 28
(Thanksgiving holiday)

Dec 3, 5
* Latest research; popular topics
* Final project presentations
Wed: Slides.
Events: Final report due (Thu)

Grading

Late Submissions Policy

Textbooks, references, and reading materials

Homework (tentative)

Please note that while collaboration is allowed, individual collaborators *must* write up their own answers. All GT students must observe the honor code.

Project

Team project: 2-3 people. Description and grading policy available now! (proposal + presentation, progress report, final report + presentation)

Dataset Ideas

Auditors

Auditors must first obtain the instructor's permission, then enroll in the course. Auditors must attend all lectures and may optionally complete the assignments.

Acknowledgements & Related Classes

Many thanks to our colleagues for sharing their course materials:
Prof. Le Song - Introduction to Computational Data Analysis - Spring 2014