CIS4730: Unstructured Data Management

Schedule and Syllabus

This is the syllabus for the Summer 2019 iteration of the course.

Week | Date | Lecture | Lab
1 | Jun 11 | Introduction to unstructured data | Introducing Python, Jupyter, and virtual environments
2 | Jun 13 | XML and JSON | Data types in Python, ‘numpy’, and ‘pandas’
3 | Jun 18 | NoSQL databases | Data input, output, manipulation, and summary in ‘pandas’
4 | Jun 20 | MongoDB | Robo 3T (Robomongo), ‘pymongo’
5 | Jun 25 | Web crawling and web APIs | Flow control in Python; web scraping with ‘requests’
6 | Jun 27 | Regular expressions | Text processing with ‘bs4’ and ‘re’
7 | Jul 2 | Information retrieval | Tokenization, stemming, and lemmatization using ‘nltk’ and ‘re’
8 | Jul 4 | Independence Day / No class |
9 | Jul 9 | Scoring, weighting, and vector space | TF-IDF
10 | Jul 11 | Text categorization and clustering | Text mining using ‘sklearn’
11 | Jul 16 | Deep learning applications | Visualization with ‘matplotlib’
12 | Jul 18 | Team project overview; managing analytics projects (guest lecturer) | Review and course wrap-up
13 | Jul 23 | Exam |
14 | Jul 25 | Python labs review | Working on your project
15 | Jul 30 | Emerging topics and applications | Working on your project
16 | Aug 1 | Project presentation |

Course Details

Name: Kambiz Saffari

Office: Robinson College of Business, 35 Broad Street, 9th Floor, Room 910

Email: ksaffarizadeh1@gsu.edu

Office Hours: By appointment

Semester: Summer 2019

Class Hours: Tuesday and Thursday 4:45PM–7:15PM

Class Location: Aderhold Learning Center 31

Prerequisites:

  • CIS 3260 Introduction to Programming
  • CIS 3730 Database Management System

Over 90 percent of digital data is unstructured, and much of it is locked away across a variety of data stores, in different locations and in varying formats. This course discusses the issues and challenges in unstructured data management and introduces the best practices, underlying principles, and emerging technologies for storing, retrieving, and analyzing unstructured data.

This course aims to prepare students with fundamental knowledge and skills in unstructured data management. After successful completion of this course, students will be able to:

  1. Articulate the similarities and differences in managing structured and unstructured data
  2. Collect and integrate unstructured data from multiple sources
  3. Apply techniques to manage and store unstructured data
  4. Prepare unstructured data for analysis
  5. Use unstructured data to answer managerial questions and support decision-making
  6. Develop and apply Python programs for unstructured data management

In general, each class meeting consists of two parts, separated by a 15-minute break:

  • Lecture: The first half of the class (before the break) in which we introduce and discuss theoretical knowledge on the weekly subject.
  • Lab: The second half of the class (after the break) in which we develop and practice hands-on skills for unstructured data management.

All course materials (lecture slides, lab notes, readings, project instructions, etc.) will be distributed electronically through the course website on iCollege.

  • Automate the Boring Stuff with Python: Practical Programming for Total Beginners, by Al Sweigart (No Starch Press, 2015)

https://www.amazon.com/dp/1593275994

  • Web Scraping with Python: Collecting More Data from the Modern Web, 2nd Edition, by Ryan Mitchell (O'Reilly Media, 2018)

https://www.amazon.com/dp/1491985577

  • The Definitive Guide to MongoDB: A Complete Guide to Dealing with Big Data Using MongoDB, Second Edition, by David Hows, Peter Membrey, Eelco Plugge and Tim Hawkins (Apress, 2013). Free from GSU library online access link:

https://gsu.skillport.com/skillportfe/main.action?path=summary/BOOKS/106761#summary/BOOKS/RW$5354:_ss_book:106761

Because of the lab components of this course, students are required to either use the computers in the classroom or bring their own laptop. The computer/laptop should be able to install and execute the following software (all free!):

  1. Python 3: Python is a popular programming language for data analytics. We will learn basic Python programming in our lab meetings. The final project and all lab exercises will be based on Python 3.
  2. Jupyter Notebook: Jupyter Notebook is an environment for Python programming with many user-friendly features for data analytics and documentation.
  3. MongoDB: MongoDB is one of the most popular NoSQL databases. We will learn how to operate and query MongoDB databases.
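As a rough way to confirm that a laptop is ready for the labs, a short script along the following lines could be run once the software is installed. This is only a sketch: the function name `check_environment` and the module list are illustrative assumptions, not part of the course materials. It checks the Python version and whether the Jupyter Notebook package (`notebook`) and the MongoDB driver (`pymongo`) can be imported. It does not check that a MongoDB server is installed or running.

```python
import importlib.util
import sys

# Hypothetical list of importable module names to look for.
# "notebook" is the Jupyter Notebook package; "pymongo" is the Python driver for MongoDB.
REQUIRED_MODULES = ["notebook", "pymongo"]


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    report = {"python3": sys.version_info.major >= 3}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'OK' if ok else 'missing'}")
```

Any item reported as missing would need to be installed before the corresponding lab session.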

Gradable items are listed below (see the Expectations section for more details on each of these items). There will be occasional opportunities for extra credit.

Exam: 20%
Lecture Assignments: 10%
Lab Assignments: 40%
Final Project: 20%
Participation: 10%
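To see how these weights combine, here is a minimal sketch of the arithmetic. It assumes each component is scored on a 0 to 100 scale and ignores extra credit. The function name `weighted_grade` and the sample scores are made up for illustration and do not describe the official grade calculation.

```python
# Component weights in percent, taken from the list above.
WEIGHTS = {
    "exam": 20,
    "lecture_assignments": 10,
    "lab_assignments": 40,
    "final_project": 20,
    "participation": 10,
}


def weighted_grade(scores):
    """Combine per-component scores (assumed 0-100 each) into one course percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    # Multiply each score by its weight, then divide by the total weight (100).
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student scores, for illustration only.
example = {
    "exam": 80,
    "lecture_assignments": 100,
    "lab_assignments": 90,
    "final_project": 85,
    "participation": 90,
}
print(weighted_grade(example))  # 88.0
```

Because lab assignments carry 40% of the grade, a change in the lab score moves the total twice as much as the same change in the exam or project score.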

 

Exam questions will be a mix of multiple-choice, true/false, and short-answer questions. The exam will cover only materials from lectures, including the slides and the required readings. In other words, the exam will not include lab-related materials. The exam will only be given in class on the day and time listed in the syllabus. Students who miss the exam will receive a zero on it.

No make-up exam will be given. The exam may be taken on a different date only if the instructor is given a legitimate reason (jury duty, religious holiday, scheduled surgery, pregnancy, etc.). Proof of the reason must be scanned and sent to the instructor ahead of time (unless the absence is due to a legitimate emergency, in which case legal proof must be sent afterward).

We have two lecture assignments for this course. In each lecture assignment, you will be given a short video on a specific topic. The two topics (and their respective videos) for this semester are the following:

  1. What is a vector?
    • Video URL: https://www.youtube.com/watch?v=fNk_zzaMoSs
    • Due: 4:45 PM, July 9, 2019
    • Note: Do not confuse this with the vectors in Python. In this assignment, you should summarize the idea of vectors solely based on content in the video. You won’t earn any points if your summary is mainly about the vectors in Python.
  2. How does a neural network recognize handwritten digits?

Lab sessions will have several hands-on exercises and assignments. Requirements and details of these lab assignments will be provided in class handouts/slides. Each assignment is usually just a few lines of Python script that reinforce the topics we have learned during the class. In four of these lab sessions, you will be asked to upload your work to iCollege. You are encouraged to discuss questions with your peers, but each person needs to turn in their work individually.

Past experience shows that many students are able to submit their work by the end of the lab session, but submissions are open until the beginning of the next class.

Submissions should be made through the iCollege course website by their deadlines. Email or late submissions will not be accepted and will receive a zero on the assignment. It is the student’s responsibility to ensure that submissions go through in time. Network or technical issues (either on your machine or iCollege) cannot be used as an excuse.

The final project requires group work, and each group should consist of 3 students. For the group project, you need to identify a business problem that can be solved by acquiring, filtering, extracting, validating, cleansing, and analyzing unstructured data. The business problem can be from your organization (or a team member’s organization). You need to identify a source from which you can collect the unstructured data needed to solve the business problem, and write the code to collect, analyze, visualize, and interpret the data. Based on your findings, you need to provide recommendations for solving the business problem you identified.

Detailed information about the project will be given in a separate handout. Project code is due by 4:45 PM, Aug 1, 2019. To encourage team spirit and minimize the free-rider phenomenon, team members will evaluate each other’s contributions right after the project presentation.

Submissions should be made through the iCollege course website by their deadlines. Email or late submissions will not be accepted and will receive a zero on the assignment. It is the student’s responsibility to ensure that submissions go through in time. Network or technical issues (either on your machine or iCollege) cannot be used as an excuse.

Class attendance, while not mandatory, is expected. The instructor encourages everyone to participate in class activities and discussions and to respond to questions from other students. In evaluating your class participation in discussions, both the quantity and quality of participation are taken into account. Principles for class participation include:

  • Show a respectful and positive attitude toward yourself, your classmates, and the instructor
  • Assist peers during lab sessions
  • Work with others in paired/group-based exercises
  • Contribute to classroom discussion
  • Attend classes and focus on class work (that is, not using social media, sending emails/texts, or doing anything irrelevant to class activities)

Participation is both self- and instructor-evaluated (5 points each). For the self-evaluation, I will provide a self-evaluation form to you in the last class meeting. For the instructor-evaluated part, I first give everyone a baseline score (3 points), which is then adjusted upward or downward.

Based on my impression, if your overall participation is slightly higher or lower than that of other students, I add or subtract 0.5 points from the baseline score. If your participation is significantly higher or lower than that of other students, I add or subtract 1 or more points.
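The participation arithmetic described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and keeping each half within 0 to 5 points is an assumption based on the "5 points each" split, not a stated rule.

```python
BASELINE = 3.0     # instructor-evaluated baseline score, from the text above
MAX_PER_PART = 5.0  # each part (self and instructor) is worth 5 points


def instructor_score(adjustment):
    """Baseline plus an adjustment (e.g. +0.5, -0.5, +1, -1).

    Assumption: the result is kept within 0..5 points.
    """
    return max(0.0, min(MAX_PER_PART, BASELINE + adjustment))


def participation_total(self_score, adjustment):
    """Sum of the self-evaluated and instructor-evaluated parts (out of 10)."""
    if not 0 <= self_score <= MAX_PER_PART:
        raise ValueError("self_score must be between 0 and 5")
    return self_score + instructor_score(adjustment)


# Slightly above-average participation with a self-evaluation of 4.5:
print(participation_total(4.5, 0.5))  # 8.0
```

For example, a student with average participation (no adjustment) who gives themselves 4 points would end up with 7 of the 10 participation points under these assumptions.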

All assignments for this course are governed by GSU’s policy on academic honesty regarding plagiarism and cheating. Violation of these policies carries penalties up to and including receiving a grade of 0 for the assignment and/or receiving a grade of F for the course. Students are expected to be familiar with these policies.

Any changes to this syllabus will be based on class feedback and announced during lectures and via e-mail.