"False"
Main menu hidden.
Syllabus:

# Data preprocessing and visualisation, 7.5 Credits

Swedish name: Bearbetning och visualisering av data

This syllabus is valid: 2021-01-04 to 2023-12-31 (a newer version of the syllabus exists)

Course code: 5DV217

Credit points: 7.5

Education level: First cycle

Main Field of Study and progress level: Computing Science: First cycle, has less than 60 credits in first-cycle course/s as entry requirements
Mathematical Statistics: First cycle, has less than 60 credits in first-cycle course/s as entry requirements

Grading scale: Three-grade scale

Responsible department: Department of Computing Science

Established by: Faculty Board of Science and Technology, 2021-01-13

## Contents

The objective of Data Science is to enable society, companies and citizens to understand and use the ever-increasing amount of collected data in ways that make it possible to detect potential problems or improvements to the current state of affairs. Data Science should also empower humans to estimate and understand the potential result of different actions. There's a saying about "lies, damned lies, and statistics", which expresses the fact that data-based statistics can be presented in very convincing ways even when the conclusions are false. This course attempts to teach how to detect such false information and ensure more ethical use of Data Science.

One example of the practical use of Data Science is analyzing and presenting epidemic-related data and statistics in correct and human-understandable ways, so that decisions and actions can be based on rational information. Data Science methods are also used for estimating the effects of actions to reduce global warming, dimensioning road networks, choosing where to locate new shopping centers or restaurants, optimizing the energy usage of buildings, and more. In short, Data Science is one of the most crucial domains for deciding how our current and future society is to be built. More and more companies are also coming to realize the importance of Data Science. Regardless of industry or size, organizations that wish to remain competitive in the age of big data need to efficiently develop and implement Data Science capabilities or risk being left behind.

Module 1, theory, 4.0 credits.
This course on data preprocessing and visualization provides an introduction to the domain of Data Science. Students will learn how to import, manipulate and preprocess data from various real-world data sources, with the aim of presenting it in ways that give insight into the underlying systems or phenomena. Preprocessing may itself improve insight into the meaning of the data through statistical measurements, presented as numerical tables that summarize the data in various ways. However, in most cases humans understand visual presentations of data better than purely numerical ones. The course teaches how to use basic data visualizations such as point and line plots, bar charts, histograms, boxplots and violin plots. 3D visualization techniques are also taught, as well as how to use maps and images for data visualization.

Various data analysis and machine learning methods will be used but the underlying theory is beyond the scope of this course. The intention is to make the students proficient with how those methods can be applied in real-world settings encountered in industry and society in general. This is why lectures are accompanied by exercises where students practice applying some of the methods treated during lectures.

The course mainly uses the R programming language, so students will learn the basics of R. A "bonus lecture" provides an overview of how data preprocessing and visualization methods can be used in the Python programming language.

Topics covered are:

• Introduction to "R" programming language and tools
• Import and export of data from text files, databases and other sources
• Data visualization in R, in 2D and 3D
• Map visualizations
• Displaying and working with images in R
• Introduction to other useful data preprocessing and visualization packages
• Linear regression, BLUE, RMSE, shrinkage methods (Lasso, ridge regression)
• Linear classification (logistic regression, LDA)
• Principal Components Analysis (PCA) for identifying linear correlations between variables
• Robust PCA and low-rank matrix completion for outliers and missing data
• K-means clustering
• Nonlinear or nonparametric methods (k-NN, kernel methods, etc.)
• Preparation of data for machine learning, introduction to "caret" machine learning package
• Basic notions of Explainable Artificial Intelligence (XAI)
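To give a feel for the regression topics listed above, the following is a minimal sketch of ordinary least squares for a single predictor, together with the root mean squared error (RMSE) of the fit. It is written in Python rather than R, it is not taken from the course material, and the function names `fit_line` and `rmse` as well as the data values are made up for illustration.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(xs, ys, intercept, slope):
    """Root mean squared error of the fitted line on the given points."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sqrt(sum(r * r for r in residuals) / len(residuals))


# Hypothetical measurements, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
error = rmse(xs, ys, intercept, slope)
print(f"intercept={intercept:.3f} slope={slope:.3f} rmse={error:.3f}")
```

The shrinkage methods in the list (Lasso, ridge regression) modify this fit by adding a penalty on the size of the coefficients; they are not shown here.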

Module 2, proficiency training, 3.5 credits.
Module 2 consists of a practical project that requires the combined use of methods learned in Module 1. Project topics and data sets will be provided by the course personnel, but student-proposed topics are encouraged. The project is performed in groups of 1-4 students. Each group presents its progress, plans and open questions to course personnel and fellow students in two "mentoring sessions" and in one final presentation session. The purpose of the mentoring sessions is to provide constructive feedback and guidance to the students in their learning project. Mentoring sessions do NOT directly influence the grading of this Module.

## Expected learning outcomes

Knowledge and understanding
After having completed the course the student should be able to:

• Understand what is meant by Data Science as a concept: where and when Data Science is needed, what types of problems Data Science can solve and what the main methodologies and tools of Data Science are. (ELO 1)
• Understand the meaning of various data-based measurements and visualizations commonly used in society, and know how to read and interpret them. (ELO 2)

Skills and abilities
After having completed the course the student should be able to:

• Understand data structures in the R programming language and have basic notions of data manipulation and programming in R. (ELO 3)
• Perform manual as well as automated pre-processing of data (cleaning, normalization, centering, scaling, …). (ELO 4)
• Extract and understand statistical indicators from data, and detect and eliminate missing values. (ELO 5)
• Perform regression analysis and clustering of data. (ELO 6)
• Visualize data and results of analyses using line plots, scatter plots, bar plots, maps, etc., both in 2D and 3D. (ELO 7)
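As an illustration of the pre-processing skills listed above (eliminating missing values, centering, scaling and extracting statistical indicators), here is a minimal sketch using only the Python standard library. It is not taken from the course material, which works mainly in R; the helper names `drop_missing` and `standardize` and the data values are hypothetical.

```python
from statistics import mean, median, stdev


def drop_missing(values):
    """Remove missing entries, here encoded as None or NaN."""
    # NaN is the only float that is not equal to itself.
    return [v for v in values if v is not None and v == v]


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical raw measurements with two missing entries.
raw = [12.0, None, 15.0, float("nan"), 9.0, 14.0, 10.0]

clean = drop_missing(raw)
summary = {
    "n": len(clean),
    "mean": mean(clean),
    "median": median(clean),
    "sd": stdev(clean),
}
scaled = standardize(clean)

print(summary)
print([round(v, 3) for v in scaled])
```

Dropping incomplete observations is only one way to handle missing data; the choice made here is an assumption of the sketch, not a recommendation from the syllabus.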

Values and attitudes
After having completed the course the student should be able to:

• Assess the correctness and significance of data-based measurements and visualizations encountered in various media. (ELO 8)

## Required Knowledge

At least 7.5 credits in Mathematical Statistics at university level.
Proficiency in English equivalent to Swedish upper secondary course English A/5. Where the language of instruction is Swedish, applicants must prove proficiency in Swedish to the level required for basic eligibility for higher studies.

## Form of instruction

The course consists of lectures, practical exercises performed individually, and a project performed in groups of up to four students. In addition to scheduled activities, the course also requires individual work with the material.

## Examination modes

The assessment of Module 1 (ELO 1-7) is done through a written Learning Diary, which includes written lab reports. The grades given in this module are Fail (U), Pass (G) or Pass with distinction (VG).
The assessment of Module 2 (ELO 3-8) is done through a written project report. The grades given in this module are Fail (U), Pass (G) or Pass with distinction (VG).

A student who has failed one of the Modules of the course but has regularly attended a majority of the project activities can be given a re-exam covering the parts that the student has missed. A student who has not participated in the project activities (or has missed a majority of them) can be assessed the next time the course is given.

For the course as a whole, one of the grades Fail (U), Pass (G) or Pass with distinction (VG) is given. At least the grade Pass must be achieved on each module in order to get a grade for the whole course. The grade given on the course is Pass with distinction (VG) if both Modules have the grade Pass with distinction (VG).

A student who has passed an examination may not be re-examined. A student who has taken two tests for a course or segment of a course, without passing, has the right to have another examiner appointed, unless there exist special reasons (Higher Education Ordinance Chapter 6, section 22). Requests for new examiners are made to the head of the Department of Computing Science.

Examination based on this syllabus is guaranteed for two years after the first registration on the course. This applies even if the course is closed down and this syllabus has ceased to be valid.

Deviations from the examination forms mentioned in this syllabus can be made for a student who has a decision on pedagogical support due to disability. Individual adaptation of the examination forms should be considered based on the student's needs. The examination form is adapted within the framework of the expected learning outcomes of the course syllabus. At the request of the student, the course responsible teacher, in consultation with the examiner, must promptly decide on the adapted examination form. The decision must then be communicated to the student.

Transfer of credits
Students have the right to have prior education, or equivalent knowledge and skills acquired through professional experience, assessed for credit towards the corresponding education at Umeå University. Applications for credit transfer are submitted to Student Services / Degree. More information on credit transfer is available on Umeå University's student web, www.student.umu.se, and in the Higher Education Ordinance (Chapter 6). A refusal of credit transfer can be appealed (Higher Education Ordinance, Chapter 12) to the University Appeals Board. This applies whether the application for credit transfer is rejected in whole or in part.

## Literature

### Valid from: 2021 week 1

All needed course literature is freely available on the web. The list will be presented on the course site.