Data Profiling and Data Cleansing (WS 2014/15) - tele-TASK
https://www.tele-task.de/series/1027/

Lecturer: Prof. Dr. Felix Naumann

Data profiling is the set of activities and processes to determine the metadata about a given dataset. Profiling data is an important and frequent activity of any IT professional and researcher. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies between columns. More advanced techniques detect approximate properties or conditional properties of the dataset at hand. The first part of the lecture examines efficient detection methods for these properties.

Data profiling is relevant as a preparatory step to many use cases, such as query optimization, data mining, data integration, and data cleansing. Many of the insights gained during data profiling point to deficiencies of the data. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance by determining the number of records that do not conform to previously established constraints. The second part of the lecture examines various methods and algorithms to improve the quality of data, with an emphasis on the many existing duplicate detection approaches.

High-quality e-learning content created with tele-TASK - more than video! Powered by Hasso Plattner Institute (HPI)

Publisher: tele-TASK (tele-task@hpi.de)
Feed language: de
Copyright: ℗ © tele-TASK
Last updated: Mon, 20 Jan 2020 21:07:04 GMT
Keywords: tele-TASK, HPI, computer science, technology, Germany, Potsdam

Lectures (newest first)

Profiling Linked Data
  Speaker: Anja Jentzsch
  Duration: 01:13:08
  Published: Thu, 29 Jan 2015 11:00:00 GMT
  Link: https://www.tele-task.de/lecture/video/5022/

Generic Entity Resolution with Swoosh
  Speaker: Prof. Dr. Felix Naumann
  Duration: 00:44:04
  Published: Mon, 26 Jan 2015 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/5010/

Sorted Neighborhood Methods & Generic Entity Resolution with Swoosh
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:25:48
  Published: Mon, 19 Jan 2015 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/258/

Sorted Neighborhood Methods
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:25:58
  Published: Thu, 15 Jan 2015 11:00:00 GMT
  Link: https://www.tele-task.de/lecture/video/4987/

Similarity Measures & Generic Entity Resolution with Swoosh
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:26:54
  Published: Thu, 08 Jan 2015 11:00:00 GMT
  Link: https://www.tele-task.de/lecture/video/4971/

Similarity Measures
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:29:06
  Published: Mon, 05 Jan 2015 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/4963/

Duplicate Detection
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:30:07
  Published: Thu, 18 Dec 2014 11:00:00 GMT
  Link: https://www.tele-task.de/lecture/video/4957/

Data Quality and Data Cleansing
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:21:07
  Published: Mon, 15 Dec 2014 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/4936/

Dependency Checking, Approximate FDs, FD_Mine and DFD
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:29:33
  Published: Thu, 11 Dec 2014 11:00:00 GMT
  Link: https://www.tele-task.de/lecture/video/4934/

IND Detection on very many Tables
  Speaker: Fabian Tschirschnitz
  Duration: 00:41:02
  Published: Thu, 04 Dec 2014 11:30:00 GMT
  Link: https://www.tele-task.de/lecture/video/4903/

Discovery of Conditional Unique Column Combination
  Speaker: Jens Ehrlich
  Duration: 00:24:04
  Published: Thu, 04 Dec 2014 11:00:00 GMT
  Link: https://www.tele-task.de/lecture/video/4902/

TANE
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:28:46
  Published: Mon, 01 Dec 2014 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/4889/

The Apriori Algorithm, Discovering cINDs & Detecting Functional Dependencies
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:24:57
  Published: Mon, 24 Nov 2014 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/4873/

SPIDER, Foreign Key Extraction & Conditional Inclusion Dependencies
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:27:04
  Published: Thu, 13 Nov 2014 11:00:00 GMT
  Link: https://www.tele-task.de/lecture/video/4844/

Detecting Inclusion Dependencies
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:20:03
  Published: Mon, 10 Nov 2014 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/4835/

Unique Column Combinations
  Speaker: Arvid Heise
  Duration: 01:02:12
  Published: Mon, 27 Oct 2014 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/4790/

Visualization, Next Generation Profiling & Profiling Challenges
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:24:32
  Published: Thu, 23 Oct 2014 11:00:00 GMT
  Link: https://www.tele-task.de/lecture/video/4784/

An Introduction to Data Profiling
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:31:44
  Published: Mon, 20 Oct 2014 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/4773/

Introduction
  Speaker: Prof. Dr. Felix Naumann
  Duration: 01:29:33
  Published: Mon, 13 Oct 2014 15:15:00 GMT
  Link: https://www.tele-task.de/lecture/video/4757/
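
Illustration (not part of the feed): the series description names single-column statistics (null count, distinct count, most frequent value patterns) and functional dependencies between columns. The following is a minimal Python sketch of what such checks might look like. The function names `profile_column`, `pattern_of`, and `fd_holds` are hypothetical, as is the pattern convention (digits become `9`, letters become `A`). The naive dependency check only validates one given candidate; it is not one of the efficient discovery algorithms covered in the lectures, such as TANE.

```python
import re
from collections import Counter


def pattern_of(value: str) -> str:
    """Generalize a value to a character-class pattern (assumed convention:
    every digit becomes '9', every ASCII letter becomes 'A')."""
    digits_masked = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", digits_masked)


def profile_column(values):
    """Compute basic single-column metadata; None stands for a null value."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(pattern_of(v) for v in non_null)
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "top_patterns": patterns.most_common(2),
    }


def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows
    (a list of dicts): equal lhs values must imply equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(key, val) != val:
            return False
    return True
```

For example, `profile_column(["14482", "14482", None, "D-14482", "10115"])` on this made-up postal-code column would expose one null value and a minority pattern `A-99999`, the kind of inconsistent formatting the description says profiling reveals.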