COMP 306/406
Data Mining
Syllabus
Spring, 2022


Course Information
Comp 306-001 (4631)
Comp 406-001 (4632)

Time/Place:  Asynchronous Online
Instructor:  Channah F. Naiman
Email:  cnaiman@luc.edu

When emailing me, please be sure to put COMP 306/406 in the subject line!!
MS Teams Code:  1ke9vnc

(If you have not yet used MS Teams, download the app, which you can install on your phone or laptop.  Once in Teams, click on Teams-->Join a Team, and then enter our code.)


Catalog Description

As more data are collected by businesses and scientific institutions, knowledge exploration techniques are needed to gain useful business intelligence. This course covers the theory and practice of the analysis (mining) of extremely large datasets.

Data mining is for relatively unstructured data, for which more sophisticated techniques are needed. The course aims to cover powerful data mining techniques including clustering, association rules, and classification. We also briefly introduce high-volume data processing mechanisms by modeling warehouse schemas such as snowflake and star. OLAP query retrieval techniques are also introduced.  We learn the basics of Data Warehousing structure and query formulation, as they impact a data miner.  We do not query against an actual data warehouse.  There are other courses for that, if you are interested in Data Warehousing.


Outcome

Students will be able to define, apply and critically analyze data mining approaches for fields such as security, health care, science, marketing and text analysis.


Prerequisites

COMP 231:  Data Structures and Algorithms for Informatics  or COMP 251: Introduction to Database Systems or COMP 271: Data Structures 

AND

STAT 103: Fundamentals of Statistics  or STAT 203: Statistics or ISSCM 241: Business Statistics or PSYC 304: Statistics or instructor permission

Statistics is listed as a prerequisite; however, we have a "crash course" in basic statistics for those coming into the course with little statistics, or for those who need a review.  In addition, although database design is not listed as a prerequisite, there will be several references to relevant database design topics.  


Textbooks and references

We are exploring different types of concepts and software tools in this course.  Therefore, different texts and resources will be required for the different modules.  However, several of them are free or very inexpensive.

Required:  For general reference (language and platform independent); for homework problems, lectures and examples:

This book is useful for reference and conceptual examples (language independent, with nice illustrations) of some of the underlying concepts/algorithms (such as apriori, basic classification and clustering algorithms, and more).  The book doesn't have a more recent edition, but it is something of a classic as a data mining text.  Since you can find this quite inexpensively on the internet, I am including it here for reference. (Also, because it is not a new book, I have found the entire pdf for free at a legitimate university site, see below.)  The book is now approaching 10 years old, which is about the limit for how long we can really use it, even for a classic.  That's really a shame, since the newer book that shares a lot of the examples, figures and topics is by Tan, and I have found its explanations and overall structure to be less helpful.  So this semester, you kind of lucked out.

Title: Data Mining: Concepts and Techniques, Third Edition
Authors: Jiawei Han, Micheline Kamber, Jian Pei
Publisher: Morgan Kaufmann; 3rd edition (2012)
ISBN-10: 0123814790  or try this


Required:  For the RapidMiner and R lab part of the course:

Data Mining for the Masses

****There is an excellent fourth edition out that is an online and updated version of the third edition.  It has updated powerpoint slides, short explanatory videos, review questions and other support materials.  Some students have really liked this version of the lab text, so I have created a course link where you can purchase this online text for $69.99.

The third edition also has the R applications.  It is not as updated as the 4th edition, which is constantly being updated.  However, you may be able to get by with it.  It also doesn't come with the support materials that many students have found helpful.  For lab assignments, I have included both  page numbers to the 3ed and the section and figure numbers to the 4ed.

Title:  Data Mining for the Masses, 3ed, with Implementations in RapidMiner and R
Author:  Matthew North
ISBN-13: 978-1727102475

Support site for the third edition



Reference:  For implementations in R (assignments, labs, cases, etc.), for reference and some good examples:
Title: Data Mining for Business Analytics: Concepts, Techniques, and Applications in R
Authors: Shmueli, Bruce, Yahav, Patel, Lichtendahl
Publisher: Wiley; 1st edition (2017)
ISBN-10: 1118879368
ISBN-13: 978-1118879368

For ggplot examples: (You don't have to buy the book.  It is based off of his website.  Illustrated examples, if you are interested in Data Visualization for your project presentation.)  (Or just take the DataViz class.)
Alboukadel Kassambara.  Guide to Create Beautiful Graphics in R,  STHDA, 2013.  ISBN:  9781532916960.  Most examples, with small modifications, are available on his wonderful website and his R support website.

 

Course Objectives and Goals

After taking this course, students should be able to:

What this course is NOT:

Software

We will be using the data mining applications package RapidMiner in this course.  You may download the current version here. (You will have to register for an account, which is free.)  Please check the Orientation Module on Sakai for more information and instructions on installation.  The Community Edition is free, but it has a limitation of 10,000 rows.  If you sign up with your luc email address, you should automatically have an educational license.  This is important, as we have a major lab that requires more than the 10,000 rows, and you may require many rows for your project.  You can check this inside of RM by clicking on Settings-->Manage Licenses.  If it does not show up correctly, then you can request an educational license directly from RM.   Please install RM as soon as possible.  Although I cannot enforce deadlines before the course begins, I do request that you submit a screen shot of your RM installation in the Orientation Module, which is sent out shortly before classes begin.

We are also using R, with RStudio as an IDE (although you are welcome to use any other IDE of choice, such as Jupyter notebook or anything else).  The Orientation also walks you through the installation of R and RStudio.

Weeks 1 and 2 have some introductory labs and videos to familiarize you with both RapidMiner and R.

Academic Honesty

Students are expected to have read the statement on academic integrity available at http://www.luc.edu/academics/catalog/undergrad/reg_academicintegrity.shtml. This policy applies to the course. The minimum penalty for academic dishonesty is a grade of F for that assignment. Multiple instances or a single severe instance on a major exam or assignment may result in a grade of F for the course. All cases of academic dishonesty will be reported to the department office and the relevant college office where they will be placed in your school record.

Academic dishonesty includes, but is not limited to, working together on assignments that are not group assignments, copying or sharing assignments or exam information with other students except in group assignments, submitting as your own information from current or former students of this course, copying information from anywhere on the web and submitting it as your own work, and submitting anything as your own work which you have not personally created for this course. If you do wish to use materials that are not your own, please check with me ahead of time and cite your source clearly. When in doubt, ask first!

Be aware that I have updated the midterm exam with modified questions and with additional questions on classification.  I have changed the values for many of the textbook problems that are used for homework problems.  For those problems that require open-ended answers, please be very careful to state the answers in your own words, not in the words of the Instructor's Manual, nor in the words of students who have previously taken this course.

Regarding the project:  Project requirements must be approved by me, and I may modify the requirements for a specific dataset/team.  Late changes to the project requirements will usually not be allowed and may not be made without permission.  Teams must document participation by posting versions to GitHub or similar.  A completed project with no record of intermediate versions will not receive credit. Team members who cannot demonstrate participation in the project will not receive credit, or may receive reduced credit.

Lateness Policy:

"There's no such thing as an emergency.  There is only poor planning."  While this clearly does not apply to actual (and verifiable) medical and family emergencies, if you wait until the last day before something is due, and then your Internet connection goes down, this does not qualify as an emergency.  Give yourself plenty of time to submit your assignments on time.  If I see that most of the class needs extra time for a specific assignment (and has been working on it!!), I may be willing to extend the deadline.  But in general, your poor planning or poor time management does not constitute a reason for me to extend the deadline for you.  I am especially careful not to do so, as this would be unfair to the other students who turn in their work on time.  We have a limited number of sessions, during which time we have an exam, a project, labs (some quite intense), and homework assignments.  Do not fall behind in your work.  Do not wait until the last minute.  I will not be sympathetic.  You may have heard that I am, in fact, sympathetic.  That is no longer the case.  I have evolved.

Late assignments are worth only half credit.  This is true even if you have a valid reason for submitting the homework late.  Usually, late assignments must be submitted within one week of the due date for half credit.  For some assignments, you can't submit late at all.  And for some, I do not allow an entire week for late submission, but only a few days.  Please check Sakai for exact due dates and the last time for a late submission for a specific assignment.  Further, assignments can only be submitted late if I have not posted the answers to the homework.  After one week (or the late submission deadline), you will receive zero points for any unsubmitted assignments.  No exceptions.

Due dates.  Assignments are due as specified in the syllabus Course Schedule and on Sakai.  I scheduled due dates in order to give you appropriate time to work on and complete assignments.  Do not assume that they are all due on the same day of the week.  They are not.  All assignments and due dates are posted on the Course Schedule on the Syllabus, and also on the Course Calendar on Sakai.

Late Credit. Do not assume that there is an automatic "half credit" for late assignments.  There is not.

Extensions and "submit until".  Any extensions in due dates will be sent as email announcements on Sakai.  In the rare event that I allow an individual student to submit an assignment late, it will be graded as half credit.  Some assignments have a “submit until” date listed on the Sakai assignment.  That is not the due date.  The “submit until” date is only valid when I give permission to a student or to the class to extend the due date.

Help at the last minute.  The purpose of the due dates is so that you won't fall behind.  I take due dates seriously.  So should you.  It is in your best interest NOT to wait until the last minute to begin working on your homework and labs.  I cannot guarantee that I will be able to help you on the due date.  I have many other students, and generally, when a student waits until the last minute, he or she is less prepared and needs even more time.  This would not allow me to maximize my availability to all students.

There's no such thing as an emergency.  This is an online class.  Assume that there will be technical issues, or that your internet connection may occasionally go down.  Barring a catastrophic internet disaster or a true (and verifiable!) last-minute medical emergency, there is no such thing as an emergency.  There is only poor planning.

Religious Holidays:  Students with religious holiday conflicts:  Please let me know within the first two weeks of class if you have a religious holiday conflict with any exam or homework due date, so that we can plan on an accommodation.

Students with Disabilities: Loyola University Chicago provides reasonable accommodations for students with disabilities. Any student requesting accommodations related to a disability or other condition is required to register with the Student Accessibility Center (SAC). Professors will receive an accommodation notification from SAC, preferably within the first two weeks of class. Students are encouraged to meet with their professor individually in order to discuss their accommodations. All information will remain confidential.  Please note that in this class, software may be used to audio record class lectures in order to provide equal access to students with disabilities.  Students approved for this accommodation use recordings for their personal study only and recordings may not be shared with other people or used in any way against the faculty member, other lecturers, or students whose classroom comments are recorded as part of the class activity.  Recordings are deleted at the end of the semester.  For more information about registering with SAC or questions about accommodations, please contact SAC at 773-508-3700 or SAC@luc.edu.

Students who are allowed to take their exams in the SAC office are encouraged to do so.  Should you choose to take the exam in the classroom, I cannot guarantee that the classroom environment will be quiet enough to provide you with the environment that your disability may require.  If you choose to take the exam in the classroom, you are taking that risk.


Additional notes for this course:


Online Recording Policy

In this class software may be used to record live class discussions. As a student in this class, your participation in live class discussions will be recorded. These recordings will be made available only to students enrolled in the class, to assist those who cannot attend the live session or to serve as a resource for those who would like to review content that was presented. All recordings will become unavailable to students in the class when the Sakai course is unpublished (i.e. shortly after the course ends, per the Sakai administrative schedule: https://www.luc.edu/itrs/sakai/sakaiadministrativeschedule/). Students who prefer to participate via audio only will be allowed to disable their video camera so only audio will be captured. Please discuss this option with your professor. The use of all video recordings will be in keeping with the University Privacy Statement shown below:
Privacy Statement
Assuring privacy among faculty and students engaged in online and face-to-face instructional activities helps promote open and robust conversations and mitigates concerns that comments made within the context of the class will be shared beyond the classroom. As such, recordings of instructional activities occurring in online or face-to-face classes may be used solely for internal class purposes by the faculty member and students registered for the course, and only during the period in which the course is offered. Students will be informed of such recordings by a statement in the syllabus for the course in which they will be recorded. Instructors who wish to make subsequent use of recordings that include student activity may do so only with informed written consent of the students involved or if all student activity is removed from the recording. Recordings including student activity that have been initiated by the instructor may be retained by the instructor only for individual use. 



Course Components and Grading
-->Important note about team submissions:  Repeating what was written above under Homework:  If I announce that an assignment may be worked on in a team (for instance, pair programming), each team member must submit something on Sakai.  If you are the team member submitting the assignment, you must also submit a note on Sakai, listing each team member for whom you are submitting the assignment.  If someone else is submitting the assignment, you must submit a note in the Assignment comment box telling me who is submitting the assignment for your team.  Do not assume that just because your team member submitted the assignment that you will automatically get credit.  You will not. Both of you MUST submit a comment letting me know who submitted it on whose behalf.

93 - 100      A
90 - 92       A-
87 - 89       B+
83 - 86       B
80 - 82       B-
77 - 79       C+
73 - 76       C
70 - 72       C-
67 - 69       D+
60 - 66       D
59 and lower  F
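
If you would like to estimate your letter grade from your point total, here is a minimal Python sketch of the scale above. It assumes that your percentage is your points earned divided by the 1000 total course points, and that the percentage is compared to each cutoff without rounding. The syllabus does not say how fractional percentages (for example, 92.5) are handled, so treat the no-rounding rule as an assumption. The names letter_grade and GRADE_CUTOFFS are hypothetical, chosen only for this illustration.

```python
# Letter-grade cutoffs copied from the scale above: (minimum percentage, letter).
GRADE_CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (60, "D"),
]

TOTAL_POINTS = 1000  # total points available in the course


def letter_grade(points_earned, total_points=TOTAL_POINTS):
    """Return the letter grade for a point total.

    Assumption (not stated in the syllabus): the percentage is compared
    against each cutoff as-is, with no rounding up.
    """
    percent = 100.0 * points_earned / total_points
    # Cutoffs are listed from highest to lowest, so the first match wins.
    for minimum, letter in GRADE_CUTOFFS:
        if percent >= minimum:
            return letter
    return "F"


if __name__ == "__main__":
    for points in (955, 899, 742, 590):
        print(points, letter_grade(points))
```

Under the no-rounding assumption, 899 points (89.9%) falls in the B+ band rather than the A- band.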


The table below lists the points value for each graded component of the course.

Week Beginning   Week         Assignment Type              Assignment Name                             Points  Due Date
before semester  Orientation  Orientation                  Video Tour and Syllabus                     10      24-Jan
                              Orientation                  Greetings Forum                             5       24-Jan
                              Orientation                  Install RM                                  5       24-Jan
                              Orientation                  Install R-Studio                            5       24-Jan
18-Jan           Week 1       Lab (RM)                     Install RM Repositories                     10      26-Jan
24-Jan           Week 2       Lab (RM)                     RM Getting Started                          15      31-Jan
                              Lab (R)                      Intro R                                     10      31-Jan
                              Homework                     Chapter 2                                   15      31-Jan
31-Jan           Week 3       Homework                     Chapter 3                                   15      7-Feb
7-Feb            Week 4       Lab (RM and R)               DMM-Ch3: Data Prep; DMM-Ch4: Correlation    15      10-Feb
                              Lab (RM) (links for R info)  Visualization, Discretization 3 ways        15      14-Feb
                              Homework                     Chapter 4                                   15      14-Feb
14-Feb           Week 5       Lab (RM and R)               DMM-Ch 5 (RM): Assoc -FP                    10      17-Feb
                                                           DMM-Ch 5 (R): Assoc Rules                   10      17-Feb
                              Lab (RM)                     202_Single-Rule                             10      17-Feb
                              Homework                     Chapter 6                                   20      20-Feb
21-Feb           Week 6       Project                      Exploring Datasets, prelim.                 20      23-Feb
28-Feb           Week 7       Lab: Text Mining             FP/Clustering                               20      16-Mar
                              Lab: Text Mining             Zipf/Mandelbrot                             35      16-Mar
                              Lab: Text Mining             Web crawling/Word Clouds                    35      16-Mar
                              Project Lab                  Explore Datasets, continued                 15      02-Mar
14-Mar           Week 8       Lab (RM)                     Classification Models: Decision Trees,      30      20-Mar (depends on Midterm)
                                                           Bayes, CrossValidation, ROC/LIFT
                              Homework                     Chapter 8                                   20      20-Mar (depends on Midterm)
21-Mar           Week 9       Midterm Exam                                                             250     23-Mar
28-Mar           Week 10      Project Zoom Meetings        Project Proposal Zoom Meetings              15      30-Mar
                              Lab (RM)                     KNN, NN, CTS using NN                       25      4-Apr
                              Homework                     Chapter 9                                   15      4-Apr
4-Apr            Week 11      Lab (RM)                     Affinity Marketing                          50      13-Apr
11-Apr           Week 12      Project                      Progress Report                             5       18-Apr
                              Project                      Project Freeze                              0       18-Apr
18-Apr           Week 13      Homework                     Chapter 10                                  10      25-Apr
                              Lab (RM and R)               DMM: K-Means, Clustering                    10      25-Apr
2-May            Week 15      Project                      Models (+interpretation)                    125     02-May
                              Project                      Presentation                                50      02-May
                              Project                      Report                                      25      02-May
                              Project                      Excellence                                  50      02-May
                              Participation, Prompt Submissions, meeting attendance, etc.              10
                              TOTAL POINTS                                                             1000



Course Schedule

This schedule is a guide.  Exact dates and topics may be subject to change.  It is my best estimate, but we may have to adjust the schedule slightly.  You are responsible for all announcements/changes made in class or posted on Sakai.

Week  Week Beginning  Topic  Text/Files/Links  Due

Before Class Begins
  • Orientation Module (see Sakai Orientation module!!)
  • DUE (1/24), but preferably before class starts
1 1/18
  • Chapter 1: Intro to Course
  • Intro to Data Mining
  • Crash Course in Stats, Part 1 (central tendency, dispersion)
  • Getting to know your data
  • Lab (RM):  Install Repositories
2 1/24
  • Chapter 2:  Data Visualization and Similarity Measures
  • Crash Course in Stats, Part 2 (Probability Distributions)
  • DUE (1/24): all Orientation assignments
  • Data Visualization, additional material
  • Lab (RM):  Getting started (off the RM website)
  • Lab (R):  Intro to R (time permitting)
  • Getting started:  RM website, follow-along files
  • DUE (1/26):  Week 1 lab
3 1/31
  • Chapter 3:  Data Preparation; Data Reduction;Attribute Reduction;
    Discretization; Missing Values
  • Crash Course in Stats, Part 3 (Hypothesis Testing)
  • Project Team Signup
  • DUE (1/31):
    • HW, Chapter 2
    • Lab (RM):  Getting Started
    • Lab(R):  Intro R
4 2/07
  • Chapter 4: Data Warehousing, briefly
  • DUE (2/7): HW, Chapter 3
  • DUE (2/10): DMM Labs Ch 3-4
  • Lab (RM and R):  Visualization, Discretization, Correlation
  • Lab:  Discretization 3 ways
  • DUE (2/14):  Discretization Lab
  • DUE (2/14):  HW, Chapter 4
5 2/14
  • Chapter 6:  Frequent Patterns
  • Demo Problem 6.6
  • Demo p. 257 FP Growth
  • Project Team Signup (on Sakai)
  • DUE (2/17):
    • DMM Ch. 5 Association Rules
    • DMM Ch. 5 FP Growth
    • 202_SingleRule Lab
  • DUE (2/20):  HW, Chapter 6
  • Labs:  FP and Association Rules (including DMM with RM and R, and also an additional lab named "202_SingleRule", which is not in DMM)
6 2/21
  • catch-up Frequent Patterns and labs
  • Begin discussion of Dataset Exploration for Project
  • DUE (2/23) or during zoom meeting:
    • Project:  Explore Datasets (preliminary)
  • DUE (2/20):  HW, Chapter 6
7 2/28
  • Midterm Review
  • Three Text Mining Labs (see Sakai for instructions, videos, and process downloads).
    These are much more serious labs than in earlier weeks.  You will love them!!  Do NOT wait until the last minute to work on them.
    • Text Mining using FP and Clustering
    • Text Mining using Zipf-Mandelbrot Distribution
    • Web crawling and Word Clouds
  • Documentation for RapidMiner charts (pdf)
  • Optional Zoom meetings re:  team datasets!!
  • DUE (3/02) or during zoom meeting:
    • Project:  Explore Datasets (Final)
8 3/14
  • Project Proposal
  • Chapter 8: Classification
  • Lab:  Rules, Decision Trees, KNN, Bayes, CrossValidation,
    ROC Charts, Lift Chart.  Many short labs to demonstrate the concepts.
  • DUE (3/16) Labs:  Text Mining (3 labs)
  • DUE (3/20) Labs:  Classification
    • This due date may change, depending on the exact date of Midterm
  • DUE (3/20):  HW, Chapter 8
9 3/21 Midterm Exam (scheduled for 3/23)
(withdraw deadline is still TBA on the Academic Calendar)
  • zoom team meetings for project proposal
10 3/28
  • Classification, continued (KNN, Neural Networks)
  • Lab:  Neural Networks, possible Medical lab
  • Project Proposal Zoom meetings.  Sign up on doodle (link will be posted)

  • zoom team meetings for project proposal
11 4/04
  • Lab:  Affinity Marketing using RapidMiner
    (very complex lab, do NOT wait until the last minute!!)
  • Project Solidification:  Team Meetings
    Project Freeze!! (No changes to the project or its requirements after this date!)
  • DUE (4/04):  Lab: KNN, NN
  • DUE (4/04):  HW (Ch 9)
12 4/11
  • Project Progress Report--detail progress and plans
  • Optional zoom meetings
  • Continue working on Affinity Marketing

13 4/18
  • Clustering
  • Lab:  K-Means Clustering, DMM Chapter 13 (No video for this lab). Complete both the RM and R labs, through page 120. Please submit screen shots similar to p. 114, Figure 6-6 and p. 119, Figure 6-12. For the 4ed online, it is Section 6.7, Figure 6.6 and Section 6.9, Figure 6.12.
  • Optional team zoom meetings
  • DUE (4/25):  Lab:  DMM KMeans Clustering
  • DUE (4/25):  HW (Ch 10)
14 4/25
  • Work on Projects, Questions, Project Team zoom meetings
15 5/02
  • Project Presentations, video or zoom

Academic Calendar:  Undergraduate

Academic Calendar:  Graduate