MOOCdb

From moocdb
Jump to: navigation, search

Template:Infobox software


MOOCdb is a shared data model standard for the data emanating from Massive Open Online Courses.

Contents

Introduction

Advantages

The benefits of standardization 
The data schema standardization implies that the raw data from every course offering will eventually be formatted the same way. It establishes simple conventions like “storing event timestamps in the same format” to common tables, fields in the tables, and relational links between different tables. It makes it possible to populate a database, which because of its organization can appropriately support scientific discovery and inquiry. Standardization supports cross-platform collaborations, sharing query scripts, and the definition of variables which can be derived in exactly the same way irrespective of which MOOC database they come from.
Concise data storage 
MOOCdb’s proposed schema is “loss-less” with respect to research relevant information, i.e. no information is lost in translating raw data to it. However, the use of multiple related tables provides more efficient storage. Some information, such as fields specific to a provider that is not generalizable and irrelevant to researchers is culled.
Access Control Levels for Anonymized Data 
The data schema offers an organized means of structuring anonymized user identities safeguard them further. It provides 3 different levels of assurance via cross-referencing information across 3 tables. By controlling which tables are released, a different level of assurance is available. See Section 5 for details.
Savings in effort 
A schema speeds up database compilation by eliminating repeated schema design. Investigating a dataset using one or more existing scripts helps speed up research. Sharing of data extraction scripts Scripts for data extraction and descriptive statistics extraction can be open source and shared by everyone because they reference data organized according to the schema. Some of these scripts could be very general and widely applicable, for example: “For every video component, provide the distribution of time spent by each student watching it” and some would be specific for a research question, for example generation of data for Bayesian knowledge tracing on the problem responses. These scripts would evolve to become optimized by the community and perpetually updated.
Crowd source potential 
Machine learning frequently involves humans identifying explanatory variables that could drive a response. Enabling the crowd to help propose variables could greatly scale the community’s progress in mining MOOC data. We intentionally consider the data schema to be independent of the data itself so that people at large, when shown the schema, optional prototypical synthetic data and a problem, can posit an explanatory variable, write a script, test it with the prototypical data and submit it to an analyst. The analyst can assess the information content in the variable with regards to the problem at hand and rank and feed it back to the crowd, eventually incorporating highly rated variables into learning.
A unified description for external experts 
For experts from external fields like“Very Large Databases/Big Data” or ”Data Privacy”, MOOCdb’s standardization presents data science in education as unified. This allows them to technically assist us with techniques such as new database efficiencies or privacy protection methods.
Sharing and reproducing the results 
When they publish research, analysts can share the scripts by depositing them into a public archive where they are retrievable and cross-referenced to their donor and publication. This allows results to be tested on additional data with precisely the same methods.

History

Schema description

The schema organizes the information in terms of 4 different behavioral interaction modes:

  1. Observing
  2. Submitting
  3. Collaborating
  4. Feedback

It organizes user identification information in a manner that allows different privacy preserving options. Our goal is that the schema becomes agreed upon by the community, generalizes across platforms and preserves all the information needed for data science and analytics.

Observing mode

In this mode, students simply browse and observe a variety of resources available on the website. These resources include the wiki, forums, lecture videos, homeworks, book, tutorials. Each unique resource is usually identifable by a unique resource identifier uri. A single webpage, identifed by an url, can contain multiple uris that might not be visible to the user, but her access is recorded in the data as an access to the specifc uri within the url. For the case where there is only resource on a webpage, the url and uri are the same. Additionally, uris can occur on multiple urls. Through our schema we capture: the exact resource which the user accessed, and when available the context in which the resource was accessed by the user.

Observing mode tables.
Observing mode tables.

Observed Events Table

Holds all the observed events for a student. Each row corresponds to a student's visit to a resource URI. The url_id stores the url where the particular resource was accessed by the user.

Field Type Example Description
observed_event_id int 1 observed_events table id.
user_id int 1 users table id.
url_id int 1 urls table id.
observed_event_timestamp datetime 2012-07-23 01:06:07 Date time when the observed event occurred.
observed_event_duration int 10 Duration of the event in minutes. Maximum set to 60.
oberved_event_ip int 847858707 IP stored as an integer. For example 847858707 corresponds to the IP 50.137.76.19.
oberved_event_os int 3 Key value for the operating system.
oberved_event_agent int 5 Key value for the user agent.

Resources Table

Maps a uri to a resource. It also stores the information about the hierarchy between resources, e.g. exercises and sub-exercises.

Field Type Example Description
resource_id int 2 resources table id.
resource_name varchar Resource information Place-holder for the resource URI. It can be used to give a more descriptive name to the URI.
resource_uri varchar http://6002x.mitx.mit.edu/info Actual URI of the resource.
resource_type_id int 1 resource_types table id.
resource_parent_id int 1 resource_id of the parent resource.
resource_child_number int 3 resource_id of the child resource.
resource_relevant_week int 3 The week that this resource is associated with. For example, when the material was introduced.
resource_release_timestamp datatime 2012-03-02 18:34:04 Date time when the resource was added.

Resource Types Table

Maps each resource with its type. See the type_name description for the different kinds available.

Field Type Example Description
resource_type_id int 0 resource_types table id.
resource_type_content varchar lecture Element in the set: {testing, informational, problem, exam, tutorial, profile, book, wiki, lecture, other}.
resource_type_medium varchar video Element in the set: {video, text}.

URLs Table

Univocally identifies URLs in the website.

Field Type Example Description
url_id int 1 urls table id.
url varchar http://6002x.mitx.mit.edu/courseware/ Actual URL of the resource.

Resource URLs Table

Field Type Example Description
resources_urls_id int 1 resources_urls table id.
resource_id int 6 resources table id.
url_id int 7 urls table id.

Submitting mode

This mode records student interactions with the assessment modules of the course. In this mode the student submits a response/answer to a problem/question in the course and receives feedback and evaluation.

A typical MOOC student is in the submitting mode when trying homework problems, exams, quizzes, exercises in between lectures and labs (for engineering and computer science). In contrast to submissions in campus settings, MOOC submissions are more complex. While question types rarely difer much, that is they can be multiple choice or require an analytical answer, a program or an essay, submission software enables students to submit answers and check them multiple times, save their intermediate work and submit later. Some courses limit the maximum number of submissions per question for either all or some subset of questions [2,5]. The homeworks can have soft and/or hard deadlines [2] resulting in data indicating what deadline a student met.

Due to the online nature of submissions, assessments are handled in different ways. Assessments could be done by the computer via simple check mechanisms or automated algorithms [1], peer review [3]., evaluation by instructors and/or graders. For some courses multiple assessors are used [6]. The MOOCdb schema captures this situations.

Submitting mode tables.
Submitting mode tables.

Problem Types Table

Maps each problem with its type: homework, exercise, midterm or final.

Field Type Example Description
problem_type_id int 0 problem_types table id.
problem_type_name varchar Homework Element in the set: {Unknown, Homework, Lecture quiz, Lab, Midterm example, Midterm exam, Final exam, Sandbox}.

Problems Table

Each row corresponds to a problem, capturing the problem hierarchy (i.e. problem, sub-problem, ...).

problem hierarchy
Example of problem hierarchy tree. The node numeration correspond to the problem_id
Problem Table example
Problems Table associated. Notice the order_id indicating the order as a sub-problem with respect to the parent.
Field Type Example Description
problem_id int 1 problems table id.
problem_name varchar filenameL6e2_3_1 Identifier for the problem.
problem_parent_id int 38 problem_id of the parent. NULL if no parent.
problem_child_number int 2 Relative order as a subproblem to the parent. Order set by the instructor (i.e. subproblem (a), (b), ...). NULL if no parent.
problem_type_id int 2 problem_types table id.
problem_release_timestamp datetime 2012-03-10 02:12:44 Date and time when the problem was released.
problem_soft_deadline datetime 2012-04-02 12:59:59 Soft deadline after which you get credit with penalty.
problem_hard_deadline datetime 2012-04-02 12:59:59 Hard deadline after which you get no credit.
problem_max_submission int 1000 Maximum number of submissions allowed.
problem_max_duration int 1000000 tbd.
problem_weight int 1 tbd.
resource_id int 74 resources table id.

Submissions Table

Each entry corresponds to a submission made by a student.

Field Type Example Description
submission_id int 1 submissions table id.
user_id int 1 users table id.
problem_id int 620 problems table id.
submission_timestamp datetime 2012-04-14 16:09:53 Date and time where the resoruce was submitted.
submission_attempt_number int 1 Number of the attempt.
submission_answer text a Actual answer submitted by the student. It can take many different forms depending on the kind of problem associated with (number answers, multiple choice, entering formulas,...).
submission_is_submitted bit 1 Allows for courses to include submissions that are not graded yet (such as a draft).
submission_ip int 847858707 IP stored as an integer. For example 847858707 corresponds to the IP 50.137.76.19.
submission_os int 3 Key value for the operating system.
submission_agent int 5 Key value for the user agent.

Assessments Table

Each row contains an assessment for a submission. With this table we allow multiple assessments per submission. At the same time this allows multiple graders per submission.

Field type example Description
assessment_id int 1 assessments table id.
submission_id int 1 submissions table id.
assessment_feedback text Well done! Feedback text given by the instructor.
assessment_grade double 1 Grade on the assessment. Double from 0 to 1.
assessment_grade_with_penalty double 0.52 Grade on the assessment with penalty after the soft deadline. Double from 0 to 1. Always lower or equal than assessment_grade.
assessment_grader_id int 1 users table id. The grader must be registered as another user.
assessment_timestamp datetime 2012-06-05 04:56:22 Date and time that the assessment was published.

Collaborating mode

This mode captures the collaboration between students in mediums such as forum or wiki.

In a forum, a student either initiates a new thread or responds to an existing thread. Additionally students can up vote and down vote the answers from other students.

In a wiki, students edit, add, delete and initiate a new topic. Although the method of collaboration differs, different student-to-student interactions can be captured with a single record system. This generalizability also allows for future types of student collaborations.

Another advantage of this method of recording wiki and forum data is that the log is per-student and transaction based. This allows researchers to more easily get data about a student, rather than a thread or wiki page. Figure 6 shows the schema and the links.

Collaborating mode tables.
Collaborating mode tables.


Collaborations Table

Each row corresponds to a student attempt to collaborate. Examples of collaboration are a student is deleting on a wiki or the comment on a forum the student is voting on.

The table captures the collaboration hierarchy in the same fashion the problems table did. For example, to record a response on a forum, the post would be the collaboration. Its parent would be the question (also a collaboration) that the response replies to.


Field Type Example Description
collaboration_id int 1 collaborations table id.
user_id int 7 users table id.
collaboration_type_id int 1 collaboration_type table id.
collaboration_timestamp datetime 2012-07-16 05:23:23 Date time that the collaboration was submitted.
collaboration_content mediumtext tbd. Text submitted in the collaboration.
collaboration_parent_id int 2 collaboration_id of the parent. Examples of parent-son are: post - response, wiki - deletion, comment - up-vote. NULL if it has no parent.
collaboration_child_number int 2 Relative order as a sub-collaboration to the parent. Order set by the collaboration time stamp. NULL if no parent.
collaborations_ip int 847858707 IP stored as an integer. For example 847858707 corresponds to the IP 50.137.76.19.
collaborations_os int 3 Key value for the operating system.
collaborations_agent int 5 Key value for the user agent.
resource_id int 1677 resources table id. Resource associated with the collaboration.
collaboration_thread_id int 132 If the collaboration is a forum post, it will have a thread id. Null if N/A.

Collaboration Types Table

Maps each collaboration with each type. See the field description for the different types.

Field Type Example Description
collaboration_type_id int 1 collaboration_types table id.
collaboration_type_name varchar forum_question Element in the set:{forum_question, forum_answer, forum_comment, wiki revision}.

Feedback mode

This mode captures the student feedback to the MOOC providers or lecturers. Often this is in the form of a survey response or a course rating, but the schema is flexible enough to allow other kind of interactions.

Feedback mode tables.
Feedback mode tables.

Feedbacks Table

Each row corresponds to a student feedback. Each student feedback consists of a question (optional) and its answer.

Field Type Example Description
feedback_id int 1 feedback table id.
user_id int 3 users table id.
answer_id int 2 answers table id.
question_id int 4 questions table id.
feedback_timestamp dateteime 2012-07-16 05:23:23 Date time the feedback was registered.

Questions Table

Each row contains a question the student has to respond to. For example, survey questions or the topic of what is being rated.

Field Type Example Description
question_id int 1 questions table id.
question_content text What did you like most about the course Question of the feedback.
question_type tbd. tbd. tbd.
question_reference int 2 resource_id of the resource the question applies to.
survey_id int 5 Used to group questions together into a survey.

Answers Table

Maps the student actual feedback with its id.

Field Type Example Description
answer_id int 1 answers table id.
answer_content text It was an interesting course. Contents of the students answer. It can handle multiple types of answers.

Surveys Table

Defines a survey which is a group of questions.

Field Type Example Description
survey_id int 1 surveys table id.
survey_start_timestamp datetime 2012-03-10 00:00:00 Date time the survey is released.
survey_end_timestamp datetime 2012-03-12 00:00:00 Date time the survey is closed.



User information

A final set of tables in MOOCdb structure anonymous, access controlled personal information concerning MOOC users. They decrease the likelihood that the anonymized information will be used to re-identify a student or group.

TBD: Overall view of the tables.

The advantage of having multiple localized anonymized identifiers for users is that it supports a flexible access policy.

Access Mode User PII Global User Course User Collaboration Likelihood of re-identification
1 NO YES YES NO Probably
2 NO NO YES NO Not likely
3 NO NO NO NO Not likely
4 NO YES YES YES Easily
5 NO NO YES YES Easily
6 NO NO NO YES Probably
Access Mode 1
At this level, the table that maps the global user id to course user id, global user table is shared along with the every course's course user table. This allows researchers to track the user across multiple courses. Even though the user PII table and collaborations mode tables are not provided across multiple courses, in some cases, with extremely hard work, a user can be identified. Such breaches usually happen when users themselves share a part of their information which is linked to some unique data item.
Access Mode 2
At this level, the course user table is shared, but the table that maps the user id at the course level to the global user id, global user table is not shared. This does not allow researchers to track through multiple courses, but allows linking the information that pertains to the same user in multiple tables. Without the access to the collaborations table, it is highly unlikely that one can identify the user unless a user identifies several rare or specific pieces of information about his behavior in the course.
Access Mode 3
At this level, the user table for the course is not shared. This leaves almost no possible way that the researchers can link the information across multiple tables to the same user, thus achieving highest possible privacy protection in the current framework. We recommend that data at this level be guarded with minimal restrictions.
Access Mode 4
At this level, the table that maps the user global id to user course id is shared along with the entire user table for the course. This allows researchers to track the user across multiple courses. Though the username and other identifiable information is not made available, due to availability of the collaborations table , this allows identifiability of at least (if not all) the subset of the users who post on the forums.
Access Mode 5
At this level, the user table for a course is shared, but the table that maps the user id at the course level to the global user id is not shared. This does not allow researchers to track the user through multiple courses, but allows linking the information that pertains to the same user in multiple tables. Again due to the availability of collaborations table, a subset of users can be identified based on their forum posts.
Access Mode 6
At this level, the user table for a course is shared, but the table that maps the user id at the course level to the global user id is not shared. This does not allow researchers to track the user through multiple courses, but allows linking the information that pertains to the same user in multiple tables. Again due to the availability of collaborations table, a subset of users can be identified based on their forum posts.


User tables.
User mode tables.


User PII Table

Stores the demographics of the user. This table should require control authorization to be shared. It should reside encrypted at a secure, isolated location.

Field Description
username tbd.
global_user_id tbd.
gender tbd.
age tbd.
IP tbd.
country tbd.
timezone_offset tbd.


Global User Table

Captures a user across different courses.

Field Description
global_user_id tbd.
course_id tbd.
course_user_id tbd.


Course User Table

Decouples the user id associated to a course with its different modes: observed, submissions, collaborations and feedback. It adds a layer of anonymity by grating access per mode.

Field Description
course_id tbd.
course_user_id tbd.
observing_user_id tbd.
submitting_user_id tbd.
collaborating_user_id tbd.
feedback_user_id tbd.
final_grade tbd.
type tbd.


Course Table

Stores information about the course.

Field Description
course_id tbd.
course_name tbd.
course_start_date tbd.
course_end_date tbd.


Frameworks

We provide 3 initial frameworks that make use of the MOOCdb schema. The first is an analytics framework whose output can be plugged into different visualization techniques. The second provides scripts to create data exports for some specific research studies. The third framework develops tools that provide access to the DB from some well known statistical analysis packages.

MOOC En Images

MOOC En Images is an analytics framework that uses the schema. In this an analyst specifies a statistic and defines the data over which this statistic is calculated. The data is defined by defining cuts along the three axes: time, student cohort and space shown in the next figure:

MOOC En Images Axes.
Through the MOOC En Images analytics framework an analyst can cut the data along different axes. An analyst can define the a student cohort, a time period and the country/location if needed. The framework then is able to extract the data and evaluate the statistic the analyst is interested in and present it visually. .

An example visualization of an aggregate statistic is shown in the next image. In the example, the aggregate statistic is the average number of home work submissions. This statistic is then calculated for students who got the certificate from each country seperately. To enable interesting attractive visualizations we are building tools that integrate visualization frameworks like flotr2, flot, Dojo, Google charts and d3js.

MOOC En Images Example.
An example of visualization generated from data stored in the MOOCdb. The data is from the first offering of the MITx course: 6.002x. Here we show the average number of submissions made by the students who got certificate calculated country-by-country. We see that Mongolia had a very high average.

Access tools

To enable scientists to query the data easily without the need to know the database querying language, we built plugins for MATLAB, R, and Python. This allows the users to download the sql queries from the public repository and use them in the language of their choice. This has two advantages. First the plugin allows users to easily access the data and second it allows processing the data in memory without the need to use the disk.

Data exports

For some standard research studies, researchers expect data in a certain format and stored in csv. Since the schema is standardized the scripts that will export the data for different studies are generated and shared. For our research purposes we are developing scripts that will extract the data for Bayesian knowledge tracing and item response theory.

See also

MOOCDB Edx and Coursera use cases

References

Contributors

Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox