In this lesson, you'll process the data you collected in the previous lesson and implement algorithms to analyse it. This builds on your hypothesis and prepares you for building the full analytics artefact.
We'll guide you through reviewing your data, cleaning it, exploring patterns, selecting and coding algorithms, evaluating results, and documenting your work. By the end, you'll have initial insights from your data analysis.
Begin by examining the dataset from your previous lesson. Check for completeness, structure, and relevance to your hypothesis.
Key Checks:
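import csv
# Example: load data from a CSV file
data = []
# newline='' is what the csv module recommends when opening files
with open('your_data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        data.append(row)
print(data[:5])  # Print the first 5 rows (the header row, if any, is included)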
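The checks for completeness and structure can be sketched in a few lines. This is a minimal example rather than a required method: it assumes the first row is a header, and `review_rows` is a hypothetical helper name that does not appear elsewhere in the lesson. It counts the data rows, lists rows whose column count differs from the header, and counts empty cells per column.

```python
def review_rows(rows):
    """Summarise completeness and structure for a list of CSV rows.

    Assumes rows[0] is a header row (adjust if your file has none).
    """
    if not rows:
        return {'row_count': 0, 'bad_length_rows': [], 'missing_per_column': {}}
    header, body = rows[0], rows[1:]
    # Row numbers (1-based, with the header as row 1) that have the wrong column count
    bad_length_rows = [i for i, row in enumerate(body, start=2)
                       if len(row) != len(header)]
    # Count empty or whitespace-only cells under each header name
    missing_per_column = {name: 0 for name in header}
    for row in body:
        for name, cell in zip(header, row):
            if cell.strip() == '':
                missing_per_column[name] += 1
    return {
        'row_count': len(body),
        'bad_length_rows': bad_length_rows,
        'missing_per_column': missing_per_column,
    }

# Small in-memory example so the sketch runs without a file
sample_rows = [
    ['name', 'score'],
    ['Ana', '12'],
    ['Ben', ''],
    ['Cy'],
]
report = review_rows(sample_rows)
print(report)
```

With your own file, you could pass the `data` list loaded above to `review_rows` and note anything it flags before cleaning.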
Handle any issues found in the review, then clean the data so it is ready for analysis. This step structures and transforms the raw data into a form your algorithms can use (outcome 3.5).
Steps:
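The right steps depend on what your review found, so treat the following as one possible cleaning pass rather than the required one. It assumes a header row, a numeric column at index 1 (as the later examples assume), and that rows with a missing or non-numeric value there, as well as exact duplicate rows, should be dropped. `clean_rows` is a hypothetical helper name. Cells are kept as strings, so convert them with `float()` when you analyse them.

```python
def clean_rows(rows, numeric_index=1):
    """Return (header, cleaned_data) from raw CSV rows.

    Assumptions: rows[0] is a header, and the column at
    numeric_index must hold something float() can parse.
    """
    if not rows:
        return [], []
    header, body = rows[0], rows[1:]
    cleaned = []
    seen = set()
    for row in body:
        # Strip stray whitespace from every cell
        row = [cell.strip() for cell in row]
        # Skip rows with the wrong number of columns
        if len(row) != len(header):
            continue
        # Skip rows whose numeric column is empty or not a number
        try:
            float(row[numeric_index])
        except ValueError:
            continue
        # Skip exact duplicates (an assumption: repeats may be valid in your data)
        key = tuple(row)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return header, cleaned

# Small in-memory example with typical problems
raw_rows = [
    ['name', 'score'],
    [' Ana ', '12'],
    ['Ben', ''],
    ['Cy', 'n/a'],
    ['Dee', '7.5'],
    ['Dee', '7.5'],
    ['Eli'],
]
header, cleaned_data = clean_rows(raw_rows)
print(cleaned_data)
```

Whether to drop, correct, or fill in a problem row is a judgement call, so record each choice and the reason for it in your documentation.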
Use statistics and visualisations to understand your data's patterns, trends, and relationships.
Approach:
# Assume cleaned_data has non-negative numeric values in index 1
# csv.reader returns strings, so convert each value before doing arithmetic
values = [float(row[1]) for row in cleaned_data]
mean = sum(values) / len(values)
print('Mean:', mean)
# Simple text visualisation: one star per value, in bins of width 5
print('Histogram:')
for i in range(0, int(max(values)) + 1, 5):
    count = sum(1 for v in values if i <= v < i + 5)
    print(f'{i}-{i+4}: ' + '*' * count)
Code the algorithms to calculate frequency, mean, median, and mode, and apply them to your data.
Approach:
from collections import Counter

# Function to calculate mean (returns 0 for an empty list)
def calculate_mean(data):
    return sum(data) / len(data) if data else 0

# Function to calculate median (middle value of the sorted data; 0 if empty)
def calculate_median(data):
    if not data:
        return 0
    sorted_data = sorted(data)
    n = len(sorted_data)
    if n % 2 == 0:
        # Even count: average the two middle values
        return (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
    else:
        return sorted_data[n//2]

# Function to calculate mode (returns a list, because several values can tie)
def calculate_mode(data):
    if not data:
        return []
    count = Counter(data)
    max_count = max(count.values())
    return [k for k, v in count.items() if v == max_count]

# Function to calculate frequency (maps each value to how often it occurs)
def calculate_frequency(data):
    return dict(Counter(data))
# Example usage
sample_data = [1, 2, 2, 3, 4, 4, 4]
print('Mean:', calculate_mean(sample_data))
print('Median:', calculate_median(sample_data))
print('Mode:', calculate_mode(sample_data))
print('Frequency:', calculate_frequency(sample_data))
# Apply to your data (replace with your numeric column)
your_data = [float(row[1]) for row in cleaned_data if row[1]] # Assuming numeric in index 1
print('Your Data Mean:', calculate_mean(your_data))
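One way to evaluate your results is to cross-check hand-written functions against Python's standard `statistics` module. The sketch below is self-contained: it repeats compact versions of the mean, median, and mode under hypothetical names (`mean_by_hand`, `median_by_hand`, `mode_by_hand`) and compares them on the sample data only. `statistics.multimode` requires Python 3.8 or later.

```python
import math
import statistics
from collections import Counter

sample_data = [1, 2, 2, 3, 4, 4, 4]

# Hand-written versions, repeated here so the block runs on its own
def mean_by_hand(data):
    return sum(data) / len(data)

def median_by_hand(data):
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def mode_by_hand(data):
    counts = Counter(data)
    top = max(counts.values())
    return [value for value, c in counts.items() if c == top]

# Compare floats with a tolerance, and lists of modes exactly
mean_ok = math.isclose(mean_by_hand(sample_data), statistics.mean(sample_data))
median_ok = math.isclose(median_by_hand(sample_data), statistics.median(sample_data))
mode_ok = mode_by_hand(sample_data) == statistics.multimode(sample_data)

print('Mean matches:', mean_ok)
print('Median matches:', median_ok)
print('Mode matches:', mode_ok)
```

If a check fails on your own data, test the function on a small list whose answer you can work out by hand before trusting either result.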