Kambiz Saffarizadeh

Assistant Professor of Management at Marquette University

CIS4730 - Unstructured Data Management

Schedule and Syllabus

This is the syllabus for the Summer 2019 iteration of the course.

Week | Date | Lecture | Lab
1 | Jun 11 | Introduction to unstructured data | Introducing Python, Jupyter, and Virtual Environment
2 | Jun 13 | XML and JSON | Data types in Python, ‘numpy’, and ‘pandas’
3 | Jun 18 | NoSQL databases | Data input, output, manipulation, and summary in ‘pandas’
4 | Jun 20 | MongoDB | Robo 3T (Robomongo); pymongo
5 | Jun 25 | Web crawling and web APIs | Flow control in Python; Web scraping with ‘requests’
6 | Jun 27 | Regular expressions | Text processing with ‘bs4’ and ‘re’
7 | Jul 2 | Information retrieval | Tokenization, stemming, and lemmatization using ‘nltk’ and ‘re’
8 | Jul 4 | Independence Day / No class |
9 | Jul 9 | Scoring, weighting, and vector space | TF-IDF
10 | Jul 11 | Text categorization and clustering | Text-mining using ‘sklearn’
11 | Jul 16 | Deep learning applications | Visualization with ‘matplotlib’
12 | Jul 18 | Team project overview; Managing analytics projects (guest lecturer); Review and course wrap-up |
13 | Jul 23 | Exam |
14 | Jul 25 | Python labs review | Working on your project
15 | Jul 30 | Emerging topics and applications | Working on your project
16 | Aug 1 | Project presentation |

Course Details

Name: Kambiz Saffari

Office: Robinson College of Business, 35 Broad Street, 9th Floor, Room 910

Email: ksaffarizadeh1@gsu.edu

Office Hours: By appointment

Semester: Summer 2019

Class Hours: Tuesday and Thursday 4:45PM–7:15PM

Class Location: Aderhold Learning Center 31


Prerequisites:

  • CIS 3260 Introduction to Programming
  • CIS 3730 Database Management System

Over 90 percent of digital data is unstructured, and much of it is locked away in a variety of data stores, in different locations and in varying formats. This course discusses the issues and challenges of unstructured data management. It also introduces the best practices, underlying principles, and emerging technologies for storing, retrieving, and analyzing unstructured data.

This course aims to prepare students with fundamental knowledge and skills in unstructured data management. After successful completion of this course, students will be able to:

  1. Articulate the similarities and differences in managing structured and unstructured data
  2. Collect and integrate unstructured data from multiple sources
  3. Apply techniques to manage and store unstructured data
  4. Prepare unstructured data for analysis
  5. Use unstructured data to answer managerial questions and support decision-making
  6. Develop and apply Python programs for unstructured data management

In general, each class meeting consists of two parts, separated by a 15-minute break:

  • Lecture: The first half of the class (before the break) in which we introduce and discuss theoretical knowledge on the weekly subject.
  • Lab: The second half of the class (after the break) in which we develop and practice hands-on skills for unstructured data management.

All course materials (lecture slides, lab notes, readings, project instructions, etc.) will be distributed electronically through the course website on iCollege.

  • Automate the Boring Stuff with Python: Practical Programming for Total Beginners, by Al Sweigart (No Starch Press, 2015)

https://www.amazon.com/dp/1593275994

  • Web Scraping with Python: Collecting More Data from the Modern Web, 2nd Edition, by Ryan Mitchell (O’Reilly Media, 2018)

https://www.amazon.com/dp/1491985577

  • The Definitive Guide to MongoDB: A Complete Guide to Dealing with Big Data Using MongoDB, Second Edition, by David Hows, Peter Membrey, Eelco Plugge and Tim Hawkins (Apress, 2013). Free from GSU library online access link:

https://gsu.skillport.com/skillportfe/main.action?path=summary/BOOKS/106761#summary/BOOKS/RW$5354:_ss_book:106761

Because of the lab components of this course, students are required to either use the computers in the classroom or bring their own laptop. The computer/laptop should be able to install and execute the following software (all free!):

  1. Python 3: Python is a popular programming language for data analytics. We will learn basic Python programming in our lab meetings. The final project and all lab exercises will be based on Python 3.
  2. Jupyter Notebook: Jupyter Notebook is an environment for Python programming with many user-friendly features for data analytics and documentation.
  3. MongoDB: MongoDB is one of the most popular NoSQL databases. We will learn how to operate and query MongoDB databases.

Gradable items are listed below (see the Expectations section for more details on each of these items). There will be occasional opportunities for extra credit.

Exam: 20%

Lecture Assignments: 10%

Lab Assignments: 40%

Final Project: 20%

Participation: 10%
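As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is scored as a percentage from 0 to 100; the function name `weighted_total` and the sample scores are hypothetical, and extra credit is not modeled.

```python
# Component weights from the grading breakdown, as whole percentages.
WEIGHTS = {
    "exam": 20,
    "lecture_assignments": 10,
    "lab_assignments": 40,
    "final_project": 20,
    "participation": 10,
}


def weighted_total(scores):
    """Return the weighted course total (0-100).

    `scores` maps each component name to a percentage (assumption:
    every component is graded on a 0-100 scale).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    # Multiply by whole-number weights first, divide once at the end.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores for one student.
example_scores = {
    "exam": 80,
    "lecture_assignments": 100,
    "lab_assignments": 90,
    "final_project": 85,
    "participation": 70,
}
print(weighted_total(example_scores))  # 86.0
```

Because lab assignments carry 40% of the total, a 10-point change in the lab score moves the course total by 4 points, twice the effect of the same change on the exam.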

 

Exam questions will be a mix of multiple choice, true/false, and short-answer questions. The exam will cover only materials from lectures, including the slides and the required readings. In other words, the exam will not include lab-related materials. The exam will only be given in class on the day and time listed in the syllabus. Students missing an exam will receive a zero on that exam.

Make-up exams will not be given, with one exception: the exam may be taken on a different date if the instructor is given a legitimate reason (jury duty, religious holiday, scheduled surgery, pregnancy, etc.). Proof of the reason must be scanned and sent to the instructor ahead of time (unless the absence was due to a legitimate emergency, in which case legal proof must be sent afterward).

We have two lecture assignments for this course. In each lecture assignment, you will be given a short video on a specific topic. The two topics (and their respective videos) for this semester are the following:

  1. What is a vector?
    • Video URL: https://www.youtube.com/watch?v=fNk_zzaMoSs
    • Due: 4:45 PM, July 9, 2019
    • Note: Do not confuse this with the vectors in Python. In this assignment, you should summarize the idea of vectors solely based on content in the video. You won’t earn any points if your summary is mainly about the vectors in Python.
  2. How does a neural network recognize handwritten digits?

Lab sessions will have several hands-on exercises and assignments. Requirements and details of these lab assignments will be provided in class handouts/slides. Each assignment is usually just a few lines of Python that reinforce topics we have covered in class. In four of these lab sessions, you will be asked to upload your work to iCollege. You are encouraged to discuss questions with your peers, but each person must turn in their own work individually.

Past experience shows that many students are able to submit their work by the end of the lab session, but submissions are open until the beginning of the next class.

Submissions must be made through the iCollege course website by their deadlines. Emailed or late submissions will not be accepted and will receive a zero on the assignment. It is the student’s responsibility to ensure that submissions go through on time. Network or technical issues (either on your machine or on iCollege) cannot be used as an excuse.

The final project requires group work, and each group should consist of 3 students. For the group project, you need to identify a business problem that can be solved by acquiring, filtering, extracting, validating, cleansing, and analyzing unstructured data. The business problem can come from your organization (or a team member’s organization). You need to identify a source from which you can collect the unstructured data required to solve the business problem, and write the code to collect, analyze, visualize, and interpret the data. Based on your findings, you need to provide recommendations for solving the business problem you identified.
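To make the project steps above more concrete, here is a minimal sketch of a cleanse, validate, and analyze pass over a few text snippets, using only the Python standard library. Everything in it is an assumption for illustration: the sample reviews are invented, the stopword list is a tiny hypothetical one, and a real project would collect its data from an actual source and would likely use the libraries covered in the labs.

```python
import re
from collections import Counter

# Hypothetical raw data, standing in for text collected from a real source.
raw_reviews = [
    "  Great battery life!! <br> Battery lasts 2 days. ",
    "",
    "Screen cracked after a week... terrible SCREEN quality",
    "Battery is OK, screen is great.",
]

# Tiny illustrative stopword list (assumption, not a standard list).
STOPWORDS = {"a", "after", "is", "the", "ok"}


def clean(text):
    """Cleanse: strip HTML tags, lowercase, keep only letters and spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Extract: split on whitespace and filter out stopwords."""
    return [token for token in text.split() if token not in STOPWORDS]


cleaned = [clean(review) for review in raw_reviews]
valid = [doc for doc in cleaned if doc]  # Validate: drop empty documents.

# Analyze: count how often each term appears across all valid documents.
counts = Counter(token for doc in valid for token in tokenize(doc))
print(counts.most_common(3))
```

In this toy data, "battery" and "screen" come out as the most frequent terms, the kind of signal that could point a team toward the product features customers talk about most.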

Detailed information about the project will be given in a separate handout. Project code is due by 4:45 PM, Aug 1, 2019. To encourage team spirit and minimize free riding, team members will evaluate each other’s contributions right after the project presentation.

Submissions must be made through the iCollege course website by their deadlines. Emailed or late submissions will not be accepted and will receive a zero on the assignment. It is the student’s responsibility to ensure that submissions go through on time. Network or technical issues (either on your machine or on iCollege) cannot be used as an excuse.

Class attendance, while not mandatory, is expected. The instructor encourages everyone to participate in class activities and discussions and to respond to questions from other students. In evaluating your participation in class discussions, both the quantity and the quality of participation are taken into account. Principles for class participation include:

  • Show a respectful and positive attitude toward yourself, your classmates, and the instructor
  • Assist peers during lab sessions
  • Work with others in paired/group-based exercises
  • Contribute to classroom discussion
  • Attend classes and focus on class work (that is, not using social media, sending emails/texts, or doing anything irrelevant to class activities)

Participation is both self- and instructor-evaluated (5 points each). For the self-evaluation, I will provide a self-evaluation form to you in the last class meeting. For the instructor-evaluated part, I first give everyone a baseline score (3 points), which is then adjusted upward or downward.

If, in my impression, your overall participation is slightly higher or lower than that of other students, I add or subtract 0.5 points from the baseline score. If your participation is significantly higher or lower than that of other students, I add or subtract 1 or more points from the baseline.
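The participation arithmetic described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and clamping the instructor-evaluated part to the 0 to 5 range is an assumption, since the syllabus states only that each part is worth 5 points.

```python
BASELINE = 3.0   # instructor-evaluated baseline from the syllabus
PART_MAX = 5.0   # each part (self, instructor) is worth 5 points


def instructor_score(adjustment):
    """Baseline plus an adjustment (e.g. +0.5, -0.5, +1, -1).

    Assumption: the result is kept within 0..PART_MAX.
    """
    return min(PART_MAX, max(0.0, BASELINE + adjustment))


def participation_total(self_eval, adjustment):
    """Sum of the self-evaluated and instructor-evaluated parts (0-10)."""
    if not 0 <= self_eval <= PART_MAX:
        raise ValueError("self-evaluation must be between 0 and 5")
    return self_eval + instructor_score(adjustment)


# Slightly above-average participation with a self-evaluation of 4.5.
print(participation_total(4.5, 0.5))  # 8.0
```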

All assignments for this course are governed by GSU’s policy on academic honesty regarding plagiarism and cheating. Violation of these policies carries penalties up to and including receiving a grade of 0 for the assignment and/or receiving a grade of F for the course. Students are expected to be familiar with these policies.

Any changes to this syllabus will be based on class feedback and announced during lectures and via e-mail.