Lecture 1: Data Literacy: understanding & effectively communicating data
Analytics Process:
*Making money is never the business problem
Step 1: Business problem – investigate one thing at a time
Step 2: Data collection – data sourcing or samples, storytelling and ethics
Step 3: Descriptive data – what has happened?
Step 4: Predictive data – what could happen?
Step 5: Prescriptive data – what will/should happen?
Step 6: The goal of the process – actionable business decisions
Step 7: Ask the next question and go back to Step 2

Lecture 2:
Quantitative Variables: numerical data that can be measured and must have units
  EX: weight, height, temp
  Must also support mathematical operations (must be a number)
  Continuous: fraction/decimal
  Discrete: whole number
Qualitative Variables: non-numerical in nature – can't do calculations
  Describes things in words – categorized by traits/characteristics that can be counted
  Responses to open-ended questions
  Ordinal: order matters – meaningful impact of order
  Nominal: no meaningful order – order does not matter

Lecture 3: Electronic Data Sourcing
Falls under data management in the analytics process
1.) Point of Sale: accepts payments from customers and keeps track of inventory & sales
  Receipts: provide price, date, ID, item popularity data
2.) Click Stream Data: pathway a user takes through an online journey
  Business provides links to increase “clicks”, which increases overall conversion rate
3.) Social Media Data: computer-based sharing of info – user provides data, not company
  Confidential vs anonymous data: how data is stored
  Confidential: identification is coded but known by someone in charge
  Anonymous: no one knows identity
  ***Internet and clickstream data (CSD) are NEVER anonymous – IP address
4.) Sensors: device that detects change in environment and converts it into data
  Small and cheap – help us predict

Lecture 4: Data sourcing: collecting ALL the data
Why is sampling needed?
Data observations (step 2 in analytics process)
To perform this step you either use an appropriate data sourcing method to collect ALL data from the entire population OR you sample from the population
Sample: foundational descriptive data needed so that everyone has an equal chance of being represented
Population: entire group of individuals from which we want info
***Sample is a subset of the population from which we draw conclusions to apply to the entire population
Random Sampling: everyone in the population has a chance to be chosen – when not random it is biased (participants chosen to favor certain outcomes)
Most common biased sampling:
  Convenience: select readily available data
  Voluntary: self-selected, usually those who participate only do so because they have strong opinions
  These result in under-coverage bias: incomplete sampling
Simple Random Sampling (SRS): foundational --> costly and time consuming and can be biased if done incorrectly
Alternative SRS methods: stratified, systematic, and cluster sampling
Stratified Random Sampling: divide population into non-overlapping strata (characteristics), then select a random sample from each stratum
  Requirements: identify strata, use weights or the same number from each characteristic in the sample
  Adv: prevents the bias possible with SRS
  Disadv: may have missed possible strata, overlap may occur
Systematic Sampling: “skip method” - list population, select RANDOM starting point, select every nth person
  Requirements: randomly selected starting point, computer calculates the nth skip number
  Adv: finding defects, detecting when people/things appear
  Disadv: population list must have no repeating pattern (a pattern that lines up with the skip interval biases the sample)
Multistage Cluster Sampling: divide population into clusters & randomly select clusters; sample everyone/everything in each selected cluster
  Requirements: large geographic area, large random sample, and must follow up with stratified
  Adv: samples large area
  Disadv: lose precision by randomly choosing clusters

Lecture 5: Sampling Biases
1.) Sample Error: wrong sampling method used
2.) Coverage Biases: sample does not have adequate representation of entire population – common when sampling nationwide or worldwide
  Under-coverage: not whole population
  Non-response: most national surveys get a response rate of 40% or less and are therefore not representative of the entire population
3.) Measurement Error: sampling procedures result in collecting data that does not answer the business problem
  Response bias: using leading questions or poorly worded questions that misguide responses
4.) Errors of Observation: data entered incorrectly

Lecture 6: Data sourcing for random samples: surveys
Common Surveys:
1.) Phone Survey: inexpensive but low response rate
2.) Mail Survey: inexpensive, low response rate, requires multiple mailings
3.) Web Surveys: cheaper still, same issues as mail
4.) Personal Interview: more expensive, more control, higher response rate
Advantages of surveys:
  Relatively easy to administer
  Can be developed by user
  Cost effective
  Can be administered online and collect a larger number of responses
  Broad range of questions
  Standardized surveys are relatively free and alleviate measurement error
Disadvantages of surveys:
  Low or no responses ---> under-coverage
  Costly and time consuming
  Can lead to measurement error
  Must get participant consent to use data
  Small pilot test can help alleviate disadvantages
4 Essential Parts to Participant Consent:
1.) Purpose: why data is collected
2.) What will participation entail/what will participant be asked to do?
3.) Will the data be confidential? Anonymous?
4.) Are there any risks involved with participation?
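
The Lecture 4 sampling methods (simple random, stratified, systematic, and cluster) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the course's own procedure: the population of 100 numbered customers, the two regions, the ten stores, and all function names are made up for the example, and it relies only on the standard library random module.

```python
import random


def simple_random_sample(population, n, rng):
    """SRS: every member has the same chance of being chosen."""
    return rng.sample(population, n)


def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Split into non-overlapping strata, then take an SRS from each stratum."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


def systematic_sample(population, skip, rng):
    """Skip method: random starting point, then every skip-th member."""
    start = rng.randrange(skip)
    return population[start::skip]


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep everyone in the chosen ones."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


def region(customer):
    """Hypothetical stratum: customers 0-59 are 'north', the rest 'south'."""
    return "north" if customer < 60 else "south"


# Hypothetical population: customers numbered 0-99, grouped into 10 stores.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = list(range(100))
clusters = {f"store_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, region, 5, rng)
syst = systematic_sample(population, 10, rng)
clus = cluster_sample(clusters, 2, rng)
```

The stratified sample always contains 5 customers from each region, while the plain SRS may over- or under-represent a region by chance, which is the bias stratification is meant to prevent.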

Lecture 7: Graphing Qualitative Data: Bar, pie, and Pareto
Bar graph: used to compare qualitative data frequencies ***bars do not touch
Pie chart: used to compare between two executive officers (class example) – best chart to show part of the whole using relative frequencies
Pareto chart: bar chart with the different kinds of defects listed on the horizontal axis
  Bar height represents the frequency of occurrence – bars arranged in decreasing height from left to right
  Includes cumulative % on right side of graph
  Combination of bar and line graph
  Useful for finding and prioritizing defects in business production

Lecture 8: Graphing Quantitative Data: Histograms, scatterplots
Histograms: graph quantitative data and describe the shape of the dataset – bars DO touch in histograms because the data is continuous
  Often confused with bar graphs but they are different
  Normal distribution/bell curve: symmetric (mirror image down the middle), but data is not always normal
  Right skew: when data isn't symmetrical and has a tail to the right --> caused by outliers in upper portion of dataset
  Left skew is the opposite
  X axis is a number range
Scatterplots: association between two quantitative variables
  Two variables measured on the same cases are associated if knowing the value of one of the variables tells you something that you would not otherwise know about the value of the other
  Frequency does not equal relationship
  Form: points in linear pattern
  Direction: can be positive or negative (+ variables work together, - variables work inversely)
  Strength: how related are the two variables – numerically (r) or graphically (how close points are to the line)
    0 - 0.39: weak
    0.4 - 0.69: moderate
    0.7 - 1: strong

Lecture 9: Storytelling and Misleading Graphs
Cherry picking: picking certain datasets to support a narrative, doesn't show the big picture
5 key points to storytelling with graphs:
  Accuracy
  Understanding audience
  Appropriate graph for data
  Minimize clutter
  Focus attention on important parts
Avoid misleading graphs:
  Choose the appropriate graph
  Start the axis at zero or use slashes to mark a break
  Graph and axis titles should be self-explanatory
  Use a key if it isn't self-explanatory
  Use appropriate and precise frequency intervals
  Scale must include highest value
  Use caution when rounding data
***Difference between data visualization and visual analytics

Lecture 10: Data Ethics: how to store and deal with data – process of examining, interpreting, & applying moral principles
Data Ethics Principles:
  Ownership: who owns the data? Recall clickstream and PoS
  Informed consent: permission given to collect or use data for stated purpose
  Privacy: how is data collected and how is it stored
  Currency: unethical to sell info
  Openness: data is public vs protected (copyright)
Consent:
  Purpose of data collection
  Confidential vs anonymous
  What participant is expected to do
  Risk or discomfort
Strategies companies use to distract from privacy policies:
  Placation: assurances that consumer trust is important
  Diversion: privacy link is not easy to find, or no link provided
  Misnaming: privacy policy vs data collection and storage policy
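
As a numeric companion to Lectures 7 and 8, the sketch below computes the cumulative percentages behind a Pareto chart and labels a correlation r with the weak/moderate/strong cutoffs from the notes. The defect counts and function names are hypothetical, and applying the cutoffs to the absolute value of r is an assumption made here so that negative correlations get a label too.

```python
def pareto_table(counts):
    """Sort categories by frequency (largest first) and add cumulative %."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, count in ordered:
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


def correlation_strength(r):
    """Label r with the cutoffs from the notes, applied to |r| (assumption)."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size >= 0.4:
        return "moderate"
    return "weak"


# Hypothetical defect counts from a production line.
defects = {"scratch": 30, "dent": 50, "misprint": 15, "other": 5}
for row in pareto_table(defects):
    print(row)  # (category, frequency, cumulative %)

print(correlation_strength(-0.82))  # sign gives direction, size gives strength
```

In the table, the first two rows already account for 80% of all defects, which is the kind of prioritization a Pareto chart is used for.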
