Smartphone Data Analysis
Purpose: This notebook analyzes a dataset of smartphones (smartphones_data.csv.csv). The goal is to explore the specifications, features, and pricing trends across different brands. We will perform Exploratory Data Analysis (EDA) to visualize distributions, identify relationships between features, uncover potential insights or anomalies in the data, and summarize the findings. Predictive modeling is not the primary focus, but potential challenges for it will be noted.
# Import Libraries
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Ignore warnings for cleaner output
warnings.filterwarnings('ignore')
# Load Data
df = pd.read_csv('smartphones_data.csv.csv')
Data Overview and Initial Cleaning
# Display basic information
print("Data Info:")
df.info()
# Display first few rows
print("\nData Head:")
df.head()
Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3260 entries, 0 to 3259
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   brand_name            3260 non-null   object
 1   Name                  3260 non-null   object
 2   Price                 3260 non-null   int64
 3   RAM                   3260 non-null   float64
 4   OS                    3260 non-null   object
 5   storage               3260 non-null   float64
 6   Battery_cap           3260 non-null   int64
 7   has_fast_charging     3260 non-null   object
 8   has_fingerprints      2534 non-null   object
 9   has_nfc               2534 non-null   object
 10  has_5g                2534 non-null   object
 11  processor_brand       3260 non-null   object
 12  num_core              3085 non-null   float64
 13  primery_rear_camera   3260 non-null   float64
 14  Num_Rear_Cameras      3260 non-null   int64
 15  primery_front_camera  3260 non-null   float64
 16  num_front_camera      3260 non-null   int64
 17  display_size(inch)    3260 non-null   float64
 18  refresh_rate(hz)      1529 non-null   float64
 19  display_types         3260 non-null   object
dtypes: float64(7), int64(4), object(9)
memory usage: 509.5+ KB

Data Head:
| | brand_name | Name | Price | RAM | OS | storage | Battery_cap | has_fast_charging | has_fingerprints | has_nfc | has_5g | processor_brand | num_core | primery_rear_camera | Num_Rear_Cameras | primery_front_camera | num_front_camera | display_size(inch) | refresh_rate(hz) | display_types |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | vivo | vivo v50 | 34999 | 8.0 | android | 128.0 | 6000 | Yes | Yes | No | Yes | snapdragon | 8.0 | 50.0 | 2 | 50.0 | 1 | 6.77 | 120.0 | amoled display |
1 | realme | realme p3 pro | 21999 | 8.0 | android | 128.0 | 6000 | Yes | Yes | No | Yes | snapdragon | 8.0 | 50.0 | 2 | 16.0 | 1 | 6.83 | 120.0 | amoled display |
2 | realme | realme 14 pro plus | 27999 | 8.0 | android | 128.0 | 6000 | Yes | Yes | No | Yes | snapdragon | 8.0 | 50.0 | 3 | 32.0 | 1 | 6.83 | 120.0 | oled display |
3 | samsung | samsung galaxy s25 ultra | 129999 | 12.0 | android | 256.0 | 5000 | Yes | Yes | Yes | Yes | snapdragon | 8.0 | 200.0 | 4 | 12.0 | 1 | 6.90 | 120.0 | amoled display |
4 | vivo | vivo t3 pro | 22999 | 8.0 | android | 128.0 | 5500 | Yes | Yes | No | Yes | snapdragon | 8.0 | 50.0 | 2 | 16.0 | 1 | 6.77 | 120.0 | amoled display |
# Clean column names (remove special chars, spaces)
df.columns = df.columns.str.replace(r'[()%,]', '', regex=True) # Remove parentheses, % and commas, so 'display_size(inch)' becomes 'display_sizeinch'
df.columns = df.columns.str.replace(' ', '_', regex=False) # Replace space with underscore
df.columns = df.columns.str.lower() # Convert to lowercase
# Display cleaned column names
print("Cleaned Column Names:")
print(df.columns)
Cleaned Column Names:
Index(['brand_name', 'name', 'price', 'ram', 'os', 'storage', 'battery_cap',
       'has_fast_charging', 'has_fingerprints', 'has_nfc', 'has_5g',
       'processor_brand', 'num_core', 'primery_rear_camera', 'num_rear_cameras',
       'primery_front_camera', 'num_front_camera', 'display_sizeinch',
       'refresh_ratehz', 'display_types'],
      dtype='object')
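The cleaned names above end in `display_sizeinch` and `refresh_ratehz` because the parentheses are deleted rather than turned into separators. As a minimal sketch of an alternative (the helper `normalize_column` is hypothetical and not part of the notebook), every run of non-alphanumeric characters can be collapsed into a single underscore:

```python
import re
import pandas as pd

def normalize_column(name: str) -> str:
    """Lowercase a label and turn each run of non-alphanumerics into one underscore."""
    return re.sub(r'[^0-9a-z]+', '_', name.lower()).strip('_')

# Empty toy frame that reuses three of the original column labels.
demo = pd.DataFrame(columns=['Battery_cap', 'display_size(inch)', 'refresh_rate(hz)'])
demo.columns = [normalize_column(c) for c in demo.columns]
print(list(demo.columns))
```

With names built this way, the later references to `display_size_inch` and `refresh_rate_hz` would match real columns.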
# Handle inconsistent brand names
df['brand_name'] = df['brand_name'].replace('moto', 'motorola')
# Convert Yes/No columns to numeric (1/0)
yes_no_cols = ['has_fast_charging', 'has_fingerprints', 'has_nfc', 'has_5g']
for col in yes_no_cols:
    if col in df.columns:
        df[col] = df[col].map({'Yes': 1, 'No': 0})
# Check missing values
print("\nMissing Values Before Handling:")
print(df.isnull().sum())
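# Handle missing numerical values with median
# Note: the cleaned names are 'display_sizeinch' and 'refresh_ratehz', so the
# 'display_size_inch' and 'refresh_rate_hz' entries below match no column and
# are skipped by the guard; refresh_ratehz therefore keeps its NaNs.
num_cols_with_na = ['ram', 'storage', 'num_core', 'primery_rear_camera', 'num_rear_cameras',
                    'primery_front_camera', 'num_front_camera', 'display_size_inch', 'refresh_rate_hz', 'battery_cap']
for col in num_cols_with_na:
    if col in df.columns:
        # Convert column to numeric, coercing errors
        df[col] = pd.to_numeric(df[col], errors='coerce')
        median_val = df[col].median()
        # Assign back instead of a chained inplace fillna, which newer pandas deprecates
        df[col] = df[col].fillna(median_val)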
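# Special handling for Google Pixel processor/cores
# Note: num_core was already median-filled above and processor_brand has no nulls,
# so these two lines match no rows at this point.
df.loc[(df['brand_name'] == 'google') & (df['processor_brand'].isnull()), 'processor_brand'] = 'google'
df.loc[(df['brand_name'] == 'google') & (df['num_core'].isnull()), 'num_core'] = 8 # Assume 8 cores for Tensor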
# Handle remaining missing categorical values with 'Unknown'
cat_cols_with_na = ['os', 'processor_brand', 'display_types']
for col in cat_cols_with_na:
    if col in df.columns:
        df[col] = df[col].fillna('Unknown')
# Fill remaining boolean-like NaNs with 0 (assuming 'No' or feature absent)
for col in yes_no_cols:
    if col in df.columns:
        df[col] = df[col].fillna(0)
# Convert relevant columns to appropriate types after filling NaNs
for col in num_cols_with_na + yes_no_cols + ['price']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce') # Ensure numeric
# Re-check missing values
print("\nMissing Values After Handling:")
print(df.isnull().sum())
# Display cleaned data head
print("\nCleaned Data Head:")
df.head()
Missing Values Before Handling:
brand_name                 0
name                       0
price                      0
ram                        0
os                         0
storage                    0
battery_cap                0
has_fast_charging          0
has_fingerprints         726
has_nfc                  726
has_5g                   726
processor_brand            0
num_core                 175
primery_rear_camera        0
num_rear_cameras           0
primery_front_camera       0
num_front_camera           0
display_sizeinch           0
refresh_ratehz          1731
display_types              0
dtype: int64

Missing Values After Handling:
brand_name                 0
name                       0
price                      0
ram                        0
os                         0
storage                    0
battery_cap                0
has_fast_charging          0
has_fingerprints           0
has_nfc                    0
has_5g                     0
processor_brand            0
num_core                   0
primery_rear_camera        0
num_rear_cameras           0
primery_front_camera       0
num_front_camera           0
display_sizeinch           0
refresh_ratehz          1731
display_types              0
dtype: int64

Cleaned Data Head:
| | brand_name | name | price | ram | os | storage | battery_cap | has_fast_charging | has_fingerprints | has_nfc | has_5g | processor_brand | num_core | primery_rear_camera | num_rear_cameras | primery_front_camera | num_front_camera | display_sizeinch | refresh_ratehz | display_types |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | vivo | vivo v50 | 34999 | 8.0 | android | 128.0 | 6000 | 1 | 1.0 | 0.0 | 1.0 | snapdragon | 8.0 | 50.0 | 2 | 50.0 | 1 | 6.77 | 120.0 | amoled display |
1 | realme | realme p3 pro | 21999 | 8.0 | android | 128.0 | 6000 | 1 | 1.0 | 0.0 | 1.0 | snapdragon | 8.0 | 50.0 | 2 | 16.0 | 1 | 6.83 | 120.0 | amoled display |
2 | realme | realme 14 pro plus | 27999 | 8.0 | android | 128.0 | 6000 | 1 | 1.0 | 0.0 | 1.0 | snapdragon | 8.0 | 50.0 | 3 | 32.0 | 1 | 6.83 | 120.0 | oled display |
3 | samsung | samsung galaxy s25 ultra | 129999 | 12.0 | android | 256.0 | 5000 | 1 | 1.0 | 1.0 | 1.0 | snapdragon | 8.0 | 200.0 | 4 | 12.0 | 1 | 6.90 | 120.0 | amoled display |
4 | vivo | vivo t3 pro | 22999 | 8.0 | android | 128.0 | 5500 | 1 | 1.0 | 0.0 | 1.0 | snapdragon | 8.0 | 50.0 | 2 | 16.0 | 1 | 6.77 | 120.0 | amoled display |
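The `if col in df.columns` guard in the cleaning cell skips unknown names silently, which is why `refresh_ratehz` still reports 1731 missing values after handling. A stricter variant could raise instead. The sketch below uses a hypothetical helper `fill_with_median` and made-up toy values, not rows from the dataset:

```python
import numpy as np
import pandas as pd

def fill_with_median(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with NaNs in `columns` replaced by each column's median.

    Raises KeyError when a requested column does not exist, so a misspelled
    name cannot be skipped unnoticed.
    """
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    out = frame.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors='coerce')
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy values chosen only to exercise the helper.
toy = pd.DataFrame({'num_core': [8.0, np.nan, 6.0, 8.0],
                    'refresh_ratehz': [120.0, np.nan, 60.0, np.nan]})
filled = fill_with_median(toy, ['num_core', 'refresh_ratehz'])
print(filled.isnull().sum().to_dict())

try:
    fill_with_median(toy, ['refresh_rate_hz'])  # name with the extra underscore
except KeyError as err:
    print("caught:", err)
```

Whether a median is a sensible fill for a column that is more than half empty is a separate question; the sketch only addresses the silent skip.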
Descriptive Statistics
# Display descriptive statistics for numerical features
print("Descriptive Statistics (Numerical):")
df.describe()
Descriptive Statistics (Numerical):
| | price | ram | storage | battery_cap | has_fast_charging | has_fingerprints | has_nfc | has_5g | num_core | primery_rear_camera | num_rear_cameras | primery_front_camera | num_front_camera | display_sizeinch | refresh_ratehz |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 3260.000000 | 1529.000000 |
mean | 20181.384356 | 5.065874 | 112.040893 | 4163.485583 | 0.473313 | 0.733436 | 0.256135 | 0.307669 | 7.138037 | 32.655828 | 2.076994 | 12.555767 | 1.026994 | 6.097110 | 100.375409 |
std | 24145.388368 | 3.256896 | 126.893532 | 1312.404904 | 0.499364 | 0.442231 | 0.436564 | 0.461599 | 1.649559 | 29.397695 | 0.990856 | 10.564795 | 0.162090 | 0.741478 | 24.920299 |
min | 2500.000000 | 0.250000 | 0.310000 | 1100.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.300000 | 1.000000 | 0.300000 | 1.000000 | 2.400000 | 60.000000 |
25% | 7490.000000 | 3.000000 | 32.000000 | 3007.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 12.000000 | 1.000000 | 5.000000 | 1.000000 | 5.500000 | 90.000000 |
50% | 11999.000000 | 4.000000 | 64.000000 | 4500.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 8.000000 | 16.000000 | 2.000000 | 8.000000 | 1.000000 | 6.455000 | 120.000000 |
75% | 21999.000000 | 8.000000 | 128.000000 | 5000.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 8.000000 | 50.000000 | 3.000000 | 16.000000 | 1.000000 | 6.670000 | 120.000000 |
max | 200999.000000 | 24.000000 | 1024.000000 | 22000.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 10.000000 | 200.000000 | 5.000000 | 60.000000 | 2.000000 | 8.030000 | 165.000000 |
# Display descriptive statistics for categorical features
print("\nDescriptive Statistics (Categorical):")
df.describe(include='object')
Descriptive Statistics (Categorical):
| | brand_name | name | os | processor_brand | display_types |
---|---|---|---|---|---|
count | 3260 | 3260 | 3260 | 3260 | 3260 |
unique | 32 | 3260 | 3 | 15 | 5 |
top | samsung | vivo v50 | android | mediatek | lcd display |
freq | 315 | 1 | 3130 | 1384 | 2010 |
Exploratory Data Analysis (EDA) - Visualizations
Price Distribution
plt.figure(figsize=(12, 6))
sns.histplot(df['price'], kde=True, bins=50)
plt.title('Distribution of Smartphone Prices')
plt.xlabel('Price (INR)')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(12, 4))
sns.boxplot(x=df['price'])
plt.title('Box Plot of Smartphone Prices')
plt.xlabel('Price (INR)')
plt.show()
The price distribution is heavily right-skewed: the median is 11,999 INR while the mean is about 20,181 INR and the maximum is 200,999 INR. Most phones sit in the lower to mid-range, with a few high-priced outliers (likely premium flagships).
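One way to put a number on the skew is the sample skewness before and after a log transform. The sketch below uses a handful of illustrative prices rather than the dataset itself:

```python
import numpy as np
import pandas as pd

# Illustrative prices only: several budget phones and one flagship.
prices = pd.Series([4999, 7490, 9999, 11999, 14999, 21999, 34999, 129999], name='price')

raw_skew = prices.skew()            # skewness on the original scale
log_skew = np.log10(prices).skew()  # skewness after a log10 transform
print(f"skew (raw): {raw_skew:.2f}, skew (log10): {log_skew:.2f}")
```

A lower skewness on the log scale is one reason the later box plots use a logarithmic price axis.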
Brand Distribution
plt.figure(figsize=(14, 8))
sns.countplot(y=df['brand_name'], order = df['brand_name'].value_counts().index)
plt.title('Number of Smartphones per Brand')
plt.xlabel('Count')
plt.ylabel('Brand Name')
plt.tight_layout()
plt.show()
Samsung, Realme, Vivo, Xiaomi, and Oppo are the most represented brands in the dataset.
RAM and Storage Distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.countplot(x=df['ram'], ax=axes[0], order = sorted(df['ram'].unique()))
axes[0].set_title('Distribution of RAM (GB)')
axes[0].set_xlabel('RAM (GB)')
axes[0].set_ylabel('Count')
sns.countplot(x=df['storage'], ax=axes[1], order = sorted(df['storage'].unique()))
axes[1].set_title('Distribution of Storage (GB)')
axes[1].set_xlabel('Storage (GB)')
axes[1].set_ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
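8GB RAM and 128GB storage are the most common configurations, followed by 6GB/128GB and 12GB/256GB. The count plots show RAM and storage separately, and the dataset medians are lower (4GB RAM, 64GB storage), so many entries sit below these configurations.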
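The two count plots show RAM and storage on their own, so a claim about combined configurations is easier to check by counting (RAM, storage) pairs. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; in the notebook this would be df[['ram', 'storage']].
toy = pd.DataFrame({
    'ram':     [8, 8, 8, 8, 6, 6, 12, 4],
    'storage': [128, 128, 128, 256, 128, 128, 256, 64],
})

# Count each (ram, storage) pair and list the most frequent first.
config_counts = (toy.groupby(['ram', 'storage'])
                    .size()
                    .sort_values(ascending=False))
print(config_counts.head(3))
```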
Battery Capacity Distribution
plt.figure(figsize=(12, 6))
sns.histplot(df['battery_cap'], kde=True, bins=30)
plt.title('Distribution of Battery Capacity (mAh)')
plt.xlabel('Battery Capacity (mAh)')
plt.ylabel('Frequency')
plt.show()
Half of the phones have batteries between about 3000mAh and 5000mAh (median 4500mAh), with 5000mAh a frequent value. There are some outliers with very high capacity (> 10000mAh, up to 22000mAh).
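High-capacity outliers can be flagged with the usual 1.5 × IQR rule. The helper `iqr_outliers` below is a hypothetical name, and the capacities are illustrative rather than taken from the dataset:

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative capacities in mAh.
battery = pd.Series([3500, 4000, 4500, 5000, 5000, 5000, 6000, 22000])
mask = iqr_outliers(battery)
print(battery[mask].tolist())
```

In the notebook the same mask could be applied to `df['battery_cap']` to list the affected models before deciding whether they are errors or rugged phones.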
Processor Brand Distribution
plt.figure(figsize=(10, 6))
sns.countplot(y=df['processor_brand'], order = df['processor_brand'].value_counts().index)
plt.title('Distribution of Processor Brands')
plt.xlabel('Count')
plt.ylabel('Processor Brand')
plt.tight_layout()
plt.show()
Snapdragon and Mediatek dominate the processor market in this dataset.
Feature Presence (5G, NFC, Fast Charging, Fingerprints)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Presence of Key Features (1 = Yes, 0 = No)')
sns.countplot(x=df['has_5g'], ax=axes[0, 0])
axes[0, 0].set_title('5G Support')
axes[0, 0].set_xlabel('Has 5G')
sns.countplot(x=df['has_nfc'], ax=axes[0, 1])
axes[0, 1].set_title('NFC Support')
axes[0, 1].set_xlabel('Has NFC')
sns.countplot(x=df['has_fast_charging'], ax=axes[1, 0])
axes[1, 0].set_title('Fast Charging Support')
axes[1, 0].set_xlabel('Has Fast Charging')
sns.countplot(x=df['has_fingerprints'], ax=axes[1, 1])
axes[1, 1].set_title('Fingerprint Sensor')
axes[1, 1].set_xlabel('Has Fingerprints')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
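Fingerprint sensors are the only feature present on most phones (about 73%). Fast charging appears on about 47%, 5G on about 31%, and NFC on about 26%. For fingerprints, NFC, and 5G these shares treat the 726 originally missing values as "No", so they may understate the true prevalence.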
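Because the flags are coded 0/1, the share of phones with each feature is simply the column mean. A short sketch on made-up flags:

```python
import pandas as pd

# Made-up 0/1 flags; in the notebook this would be df[yes_no_cols].
toy = pd.DataFrame({
    'has_5g':            [1, 0, 0, 0],
    'has_nfc':           [0, 0, 0, 0],
    'has_fast_charging': [1, 0, 1, 0],
    'has_fingerprints':  [1, 1, 1, 0],
})

# Mean of a 0/1 column is the fraction of ones; convert to percent.
share = toy.mean().mul(100).round(1).sort_values(ascending=False)
print(share)
```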
Price vs. Brand
# Filter out potential extreme low price errors for better visualization
df_filtered_price = df[df['price'] > 3000]
plt.figure(figsize=(15, 8))
sns.boxplot(y=df_filtered_price['brand_name'], x=df_filtered_price['price'], order = df_filtered_price.groupby('brand_name')['price'].median().sort_values().index)
plt.title('Price Distribution by Brand (Price > 3000 INR)')
plt.xlabel('Price (INR)')
plt.ylabel('Brand Name')
plt.xscale('log') # Use log scale due to wide price range
plt.tight_layout()
plt.show()
Apple clearly occupies the highest price segment. Brands like Google, Samsung, OnePlus, and Asus cover mid-range to high-end, while Realme, Poco, Xiaomi, Vivo, Oppo, Motorola cover budget to high-end. Brands like Itel, Lava, Tecno are primarily in the budget segment.
Price vs. RAM and Storage
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.boxplot(x=df_filtered_price['ram'], y=df_filtered_price['price'], ax=axes[0])
axes[0].set_title('Price vs. RAM (GB)')
axes[0].set_xlabel('RAM (GB)')
axes[0].set_ylabel('Price (INR)')
axes[0].set_yscale('log')
sns.boxplot(x=df_filtered_price['storage'], y=df_filtered_price['price'], ax=axes[1], order=sorted(df_filtered_price['storage'].unique()))
axes[1].set_title('Price vs. Storage (GB)')
axes[1].set_xlabel('Storage (GB)')
axes[1].set_ylabel('Price (INR)')
axes[1].set_yscale('log')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
As expected, higher RAM and Storage configurations generally correspond to higher prices.
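The trend in the box plots can be summarized with a grouped median. The sketch below uses made-up prices and checks whether the median rises with each RAM tier:

```python
import pandas as pd

# Made-up rows; in the notebook this would be df_filtered_price[['ram', 'price']].
toy = pd.DataFrame({
    'ram':   [4, 4, 8, 8, 12, 12],
    'price': [7000, 9000, 18000, 26000, 60000, 90000],
})

median_by_ram = toy.groupby('ram')['price'].median()
print(median_by_ram)
print("median rises with RAM:", median_by_ram.is_monotonic_increasing)
```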
Correlation Analysis
# Select numerical columns for correlation analysis
# The display and refresh-rate columns are named 'display_sizeinch' and 'refresh_ratehz' after cleaning
corr_cols = ['price', 'ram', 'storage', 'battery_cap', 'has_fast_charging', 'has_fingerprints',
             'has_nfc', 'has_5g', 'num_core', 'primery_rear_camera', 'num_rear_cameras',
             'primery_front_camera', 'num_front_camera', 'display_sizeinch', 'refresh_ratehz']
# Ensure all selected columns exist and are numeric
valid_corr_cols = [col for col in corr_cols if col in df.columns and pd.api.types.is_numeric_dtype(df[col])]
correlation_matrix = df[valid_corr_cols].corr()
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Observations from Correlation Matrix:
- `price` shows positive correlation with `ram`, `storage`, `has_nfc`, `has_5g`, `primery_front_camera`, and `refresh_ratehz`.
- `ram` and `storage` are positively correlated with each other and with price.
- `has_5g` is positively correlated with `price`, `ram`, `storage`, and `has_nfc`.
- `refresh_ratehz` is positively correlated with `price`, `ram`, and `storage`.
- `battery_cap` doesn't show a strong correlation with price, suggesting high capacity batteries are available across different price points.
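To read these relationships off as numbers rather than colors, the price column of the correlation matrix can be sorted. The sketch below uses Spearman rank correlation, which is an assumption on my part (the heatmap uses the pandas default, Pearson) chosen because price is strongly skewed. The rows are made up, and `NaN` entries are excluded pair by pair:

```python
import numpy as np
import pandas as pd

# Made-up rows; refresh_ratehz has gaps to mirror the real column.
toy = pd.DataFrame({
    'price':          [7000, 9000, 12000, 20000, 35000, 80000],
    'ram':            [2, 3, 4, 6, 8, 12],
    'battery_cap':    [5000, 4000, 6000, 4500, 5000, 4400],
    'refresh_ratehz': [np.nan, 60, 90, np.nan, 120, 120],
})

# Rank correlation of every feature with price, strongest first.
price_corr = (toy.corr(method='spearman')['price']
                 .drop('price')
                 .sort_values(ascending=False))
print(price_corr.round(2))
```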
Analysis Results Summary
Insights
- The dataset contains a wide range of smartphones from various brands, dominated by Android OS, with iOS (Apple) representing the high-end segment.
- Major brands represented include Samsung, Xiaomi, Realme, Vivo, OnePlus, Oppo, Motorola, Apple, Poco, and iQOO.
- There is a strong trend towards 8GB RAM and 128GB/256GB storage configurations across mid-range and high-end phones.
- 5G connectivity is flagged on roughly 31% of phones once missing values are counted as absent, so it is far from universal; budget and older models most often lack it.
- Fast charging ('has_fast_charging' = 1) is present on roughly 47% of phones and is missing primarily in very low-end or older devices.
- AMOLED and OLED display types dominate the mid-range and premium segments, often paired with a 120Hz refresh rate. LCD displays are common in budget phones.
- Snapdragon and Mediatek are the most common processor brands, followed by Samsung (Exynos), Apple, and Google (Tensor). Unisoc appears in budget devices.
- Most processors are Octa-core (8 cores). Apple uses Hexa-core (6 cores). Some Samsung models list 10 cores.
- Primary rear camera resolution varies significantly, with 50MP being very common. High-end models feature 108MP or 200MP sensors.
- Battery capacity for the middle half of phones ranges from about 3000mAh to 5000mAh (median 4500mAh), with 5000mAh being a frequent value.
- NFC ('has_nfc' = 1) is common in mid-range to high-end phones but often absent in budget models.
- Fingerprint sensors ('has_fingerprints' = 1) are standard, except notably on Apple iPhones which use Face ID.
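Anomalies
- Missing values were observed in 'has_fingerprints', 'has_nfc', and 'has_5g' (726 each), 'num_core' (175), and the refresh-rate column (1731); 'display_types' and 'processor_brand' had none. The feature flags were filled with 0 and 'num_core' with its median. The refresh rate stayed unfilled because the cleaning list names 'refresh_rate_hz' while the cleaned column is 'refresh_ratehz'.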
- Inconsistent brand naming ('moto' vs 'motorola') was standardized to 'motorola'.
- Presence of very low-spec phones (e.g., RAM < 4GB, Storage < 64GB, no 5G/Fast Charging) like Reliance JioPhone Prima 2 (0.5GB RAM), iKall models, indicating a mix of modern and older/budget devices.
- Outlier battery capacities: Oukitel phones listed with >10000mAh (e.g., 11000mAh, 15600mAh, 22000mAh).
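- Potential data entry errors in price: Google Pixel 6a listed at 4399, Nothing Phone 2a Plus at 2599, Tecno Spark Go 1 at 72908. These seem incorrect. The 'price > 3000' filter used for the brand and RAM/storage plots removes only the 2599 entry, so the other two remain in those charts.
- Missing 'num_core' values were filled with the column median of 8 before the Google-specific rule ran, so that rule changed nothing; Pixel phones end up with the 8 cores assumed for the Tensor chip either way.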
- Samsung Galaxy S24 FE, S24 Plus 5G, S24 5G listed with 10 cores, which is unusual for mainstream mobile processors.
- Apple iPhones consistently listed with 'has_fingerprints' = 0 (No).
- Several phones lack 'has_fast_charging', mostly in the budget segment (e.g., Poco C61, iQOO Z9 lite, Itel models, Lava models, some Samsung/Realme budget models).
- OnePlus 11R listed with 18GB RAM, potentially a typo or special edition.
- Many entries have no refresh rate, often older models (e.g., Realme 3 Pro, Samsung Galaxy Note 10 Plus, Xiaomi Redmi Note 9 Pro). These gaps were not filled during cleaning: 1731 values remain missing, while 'display_types' had no nulls in the loaded data.
Predictions
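Suspicious prices like the ones listed above could be surfaced automatically by comparing each phone with the median price of its brand. The sketch below is one possible rule under stated assumptions: the brands `a` and `b` are placeholders, the rows are made up (two of them reuse the prices 4399 and 72908 mentioned above), and the factor of 5 is an arbitrary threshold:

```python
import numpy as np
import pandas as pd

# Made-up rows: brand 'a' is a premium brand, brand 'b' a budget brand.
toy = pd.DataFrame({
    'brand_name': ['a'] * 5 + ['b'] * 5,
    'price': [60000, 65000, 70000, 75000, 4399,
              7000, 7500, 8000, 8500, 72908],
})

# Median price of each row's brand, aligned with the original rows.
brand_median = toy.groupby('brand_name')['price'].transform('median')

# Flag prices more than 5x above or below the brand median (log scale is symmetric).
log_ratio = np.log10(toy['price'] / brand_median)
toy['price_suspect'] = log_ratio.abs() > np.log10(5)
print(toy.loc[toy['price_suspect'], ['brand_name', 'price']])
```

A flag of this kind only nominates rows for manual review; a genuinely cheap or expensive model from a brand would be flagged as well.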
- Status: Not Performed
- Reason: Predictive analysis (e.g., price prediction based on specs) was not performed. The dataset contains inconsistencies (like the 10-core Samsung entries), potential outliers/errors (extreme prices, battery capacities), and requires significant feature engineering (handling numerous brands, processor types, cleaning display types, potentially creating feature interactions) for meaningful and accurate predictions. Such detailed cleaning and modeling are beyond the scope of this automated exploratory analysis. Exploratory Data Analysis provides more direct insights from the current data state.
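As an illustration of the feature engineering mentioned above for handling numerous brands, rare brands can be lumped into a single category before one-hot encoding. This is only a sketch on made-up rows; the threshold of 2 and the column name `brand_grouped` are assumptions:

```python
import pandas as pd

# Made-up rows; in the notebook this would be df['brand_name'].
toy = pd.DataFrame({'brand_name': ['samsung', 'samsung', 'samsung',
                                   'vivo', 'vivo', 'itel', 'lava']})

# Keep brands that appear at least twice; everything else becomes 'other'.
counts = toy['brand_name'].value_counts()
keep = counts[counts >= 2].index
toy['brand_grouped'] = toy['brand_name'].where(toy['brand_name'].isin(keep), 'other')

# One indicator column per remaining brand category.
encoded = pd.get_dummies(toy['brand_grouped'], prefix='brand')
print(sorted(encoded.columns))
```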
Overall Summary
The smartphone dataset provides a comprehensive overview of the market, showcasing a wide variety of brands, specifications, and price points. Android phones dominate, with major players like Samsung, Xiaomi, Realme, and Vivo offering devices across segments. Key trends include the standardization of 8GB RAM/128GB+ storage, 5G connectivity, fast charging, and high-refresh-rate AMOLED/OLED displays in mid-to-high-end models. Apple maintains its position in the premium segment with iOS. Budget phones often feature LCD screens, lower RAM/storage, lack 5G/NFC, and sometimes fast charging. The data contains anomalies like missing values, potential typos in pricing/specs, and inconsistencies needing further cleaning for robust predictive modeling, but offers valuable insights into current smartphone features and market segmentation.