On this page
Estimated Reading Time is 3 minutes and 35 seconds
Introduction
Hey there! I'm Johann Maiga, thrilled to share insights from my data science journey, especially through my class at MIT IDSS. The realms of Data Science and AI/ML are not just academic interests; they're the frontiers where I challenge myself with complex datasets, leveraging tools like Numpy and Pandas to unearth patterns and insights.
My Career Journey
I started as a Help Desk Technician in 2016 and evolved into a Senior Cloud Engineer, and my career has been a testament to the power of continuous learning. The transition to cloud technologies and earning my first AWS certification in 2018 were pivotal. Now, with my eyes set on AI/ML, I find the synergies between cloud computing and data science fascinating, driving my exploration deeper into this integrated technological landscape.
Diving into AI/ML
The essence of AI and Machine Learning lies in understanding and manipulating data. Here, Numpy and Pandas are indispensable. A case study analyzing Uber's pickup data in NYC, conducted during my MIT IDSS class, showcased their capabilities vividly. Below I share a few code snippets showing how to use Numpy and Pandas.
The Magic of Numpy:
What It Is: A high-performance library for numerical operations in Python. Numpy introduces support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
Why It Matters: Its speed and functionality make it indispensable for scientific computing. Numpy's array operations are both faster and more efficient than standard Python lists, crucial for processing large datasets and serving as the backbone for more complex data science and machine learning libraries.
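To make that speed claim concrete, here is a minimal sketch timing a vectorized Numpy operation against a plain Python list comprehension; the million-element array size is an arbitrary choice for illustration:
import timeit
import numpy as np

# One million elements -- an arbitrary size chosen for illustration
size = 1_000_000
py_list = list(range(size))
np_array = np.arange(size)

# Squaring every element: Python list comprehension vs. vectorized Numpy
list_time = timeit.timeit(lambda: [x * x for x in py_list], number=10)
array_time = timeit.timeit(lambda: np_array * np_array, number=10)

print(f"List comprehension: {list_time:.3f}s")
print(f"Numpy vectorized:   {array_time:.3f}s")
# On a typical machine, the vectorized version is dramatically faster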
Use Cases:
1- Data Transformation: Quickly performing operations like normalization and standardization on large datasets, essential for pre-processing data in machine learning.
2- Simulation: Generating random data or simulating real-world scenarios for testing algorithms, from finance models to scientific experiments. Both use cases are sketched in the snippet below.
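Here is a minimal sketch of both use cases; the feature values and the random-draw parameters are invented for illustration:
import numpy as np

# 1- Data Transformation: standardizing a feature (zero mean, unit variance)
feature = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # invented sample values
standardized = (feature - feature.mean()) / feature.std()
print("Standardized:", standardized)

# 2- Simulation: drawing random samples, here standing in for daily returns
rng = np.random.default_rng(seed=42)  # seeded so the run is reproducible
simulated_returns = rng.normal(loc=0.0, scale=0.01, size=5)
print("Simulated returns:", simulated_returns)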
Numpy simplifies complex numerical operations. Here's an example where we calculate the average and standard deviation, alongside other operations:
import numpy as np
# Creating a numpy array
data = np.array([10, 20, 30, 40, 50])
# Calculating the average
print("Average:", np.mean(data))
# Output: Average: 30.0
# Calculating the standard deviation
print("Standard Deviation:", np.std(data))
# Output: Standard Deviation: 14.142135623730951
# Calculating variance
print("Variance:", np.var(data))
# Output: Variance: 200.0
# Square root of each element
print("Square Roots:", np.sqrt(data))
# Output: Square Roots: [3.16227766 4.47213595 5.47722558 6.32455532 7.07106781]
# Performing element-wise addition
print("Data + 5:", data + 5)
# Output: Data + 5: [15 25 35 45 55]
The Wonders of Pandas:
What It Is: A library designed to make data cleaning, manipulation, and analysis straightforward in Python. It introduces DataFrames and Series, which provide a rich set of methods and functionalities for working with structured data.
Why It Matters: Pandas simplifies tasks that are tedious and complex in raw Python, such as data filtering, aggregation, and visualization. It's a pivotal tool for data scientists who need to extract insights from data, offering a balance of performance and ease of use.
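As a quick illustration of those two structures, here is a minimal sketch built from scratch; the borough names and pickup counts are invented for the example:
import pandas as pd

# A Series is a one-dimensional labeled array
daily_pickups = pd.Series([152, 87, 203], index=['Mon', 'Tue', 'Wed'])
print(daily_pickups)

# A DataFrame is a table of labeled columns; each column is itself a Series
demo_df = pd.DataFrame({
    'borough': ['Manhattan', 'Brooklyn', 'Queens'],  # invented sample data
    'pickups': [152, 87, 203],
})
print(demo_df)
print("Mean pickups:", demo_df['pickups'].mean())  # a column is a Series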
Use Cases:
1- Time Series Analysis: Managing and analyzing time-series data, useful in financial or environmental data analysis, with built-in methods for resampling, filling gaps, and moving window statistics.
2- Data Wrangling: Cleaning and preparing real-world messy data for analysis, including handling missing values, merging datasets, and transforming data formats. Both are sketched in code below.
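Here is a minimal sketch of both use cases on a small invented dataset; the timestamps, hourly counts, and company names are assumptions for illustration:
import pandas as pd

# 1- Time Series Analysis: resample hourly pickup counts into daily totals
hourly = pd.DataFrame(
    {'pickups': [5, 3, 8, 2, 7, 4]},  # invented hourly counts
    index=pd.date_range('2014-04-01', periods=6, freq='h'),
)
print(hourly.resample('D').sum())

# 2- Data Wrangling: fill a missing value, then merge two datasets
trips = pd.DataFrame({'Base': ['B02512', 'B02598'], 'trips': [100, None]})
trips['trips'] = trips['trips'].fillna(0)  # handle the missing value
bases = pd.DataFrame({'Base': ['B02512', 'B02598'],
                      'company': ['Alpha', 'Beta']})  # invented names
print(trips.merge(bases, on='Base'))  # combine on the shared 'Base' column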
Pandas excels in data manipulation, offering intuitive methods for analyzing datasets:
import pandas as pd
# Loading data from a CSV file
df = pd.read_csv('uber_data.csv')
# Displaying the first 5 rows of the dataframe
print(df.head())
# Outputs the first 5 rows of your dataset
# Summary statistics for numerical columns
print(df.describe())
# Outputs summary statistics like count, mean, std, etc., for each column
# Counting the number of pickups by the 'Base' column
pickup_counts = df['Base'].value_counts()
print(pickup_counts)
# Outputs the count of pickups per base, illustrating data aggregation
# Filtering data for pickups in a specific borough
manhattan_pickups = df[df['borough'] == 'Manhattan']
# Displaying the first 5 rows of filtered data
print(manhattan_pickups.head())
# Outputs the first 5 rows of pickups in Manhattan
Conclusion:
As I continue to explore the intersections of AI/ML and cloud engineering, tools like Numpy and Pandas have become my compasses in the vast sea of data. From performing mathematical operations with Numpy to visualizing complex datasets with Pandas, Matplotlib, and Seaborn, the journey is as rewarding as it is challenging.
Fun Fact:
You may be wondering why the estimated reading time of this post is exactly 3 minutes and 35 seconds. Well, here is the explanation in code:
# Calculate the estimated reading time based on 895 words and convert to minutes and seconds with rounding
words = 895
average_reading_speed_per_minute = 250 # Average words read per minute
# Calculate reading time in minutes
estimated_reading_time_minutes = words / average_reading_speed_per_minute
# Convert fractional part of minutes into seconds and round to nearest whole number
minutes = int(estimated_reading_time_minutes)
seconds = round((estimated_reading_time_minutes - minutes) * 60)
print(f"{minutes} min and {seconds} seconds")
# Output is: 3 min and 35 seconds
Stay tuned for more from my adventures in cloud and data science. Whether you're just starting or are well on your journey, there's always something new to learn and discover. Let's dive into the data together!