Seth Barrett

Daily Blog Post: June 11th, 2023

Harnessing Julia for Data Science: Working with Data, Visualization, and Basic Machine Learning

Welcome back to our series on Julia, the high-performance programming language designed for scientific computing. In previous posts, we covered setting up a coding environment and discussed Julia's syntax and unique features. In this post, we'll explore how to use Julia for data science tasks, including data manipulation, visualization, and basic machine learning.

Working with Data: DataFrames.jl

DataFrames.jl provides a DataFrame type for handling and manipulating tabular data, much like a spreadsheet or a SQL table held in memory. To get started, install it by running the following in the Julia REPL (you only need to do this once per environment):

using Pkg
Pkg.add("DataFrames")

Next, import the package by running:

using DataFrames

Now you can create a DataFrame, read data from a file, and perform various operations on the data:

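# Create a DataFrame (the Department values are illustrative)
data = DataFrame(Name=["Alice", "Bob", "Charlie"], Department=["Engineering", "Sales", "Engineering"], Age=[30, 25, 22], Salary=[50000, 45000, 55000])

# Read data from a CSV file (install CSV.jl first, then load it)
# "data.csv" is a placeholder path; point it at your own file
Pkg.add("CSV")
using CSV
csv_data = CSV.read("data.csv", DataFrame)

# Select a column (returns a copy of the column as a vector)
ages = data[:, :Age]

# Filter rows with a boolean mask
high_salary = data[data.Salary .> 50000, :]

# Sort by a column, highest salary first
sorted_data = sort(data, :Salary, rev=true)

# Group by a column and compute the mean
# (Statistics ships with Julia, so no Pkg.add is needed)
using Statistics
grouped_data = groupby(data, :Department)
mean_salaries = combine(grouped_data, :Salary => mean)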
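
The group-and-summarize step above can compute several statistics in one call. The following is a minimal sketch, assuming a small in-memory table; the employees table, its Department values, and the output column names (MeanSalary, MaxSalary, Headcount) are made up for illustration and are not part of the original dataset.

```julia
using DataFrames, Statistics

# Hypothetical in-memory data; names and departments are illustrative only
employees = DataFrame(
    Name = ["Alice", "Bob", "Charlie", "Dana"],
    Department = ["Engineering", "Sales", "Engineering", "Sales"],
    Salary = [50000, 45000, 55000, 47000],
)

# Split by Department (sort=true makes the group order deterministic),
# then compute several summaries per group in a single combine call
dept_summary = combine(
    groupby(employees, :Department, sort=true),
    :Salary => mean => :MeanSalary,
    :Salary => maximum => :MaxSalary,
    nrow => :Headcount,
)

println(dept_summary)
```

Each `source => function => target` pair names the input column, the function applied per group, and the name of the resulting column, while `nrow` counts the rows in each group.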

Data Visualization: Plots.jl

Visualizing your data is an essential part of data analysis. Plots.jl is a versatile package that provides a simple and consistent interface for creating various types of plots. Install Plots.jl using the following command:

using Pkg
Pkg.add("Plots")

Then, import the package:

using Plots

With Plots.jl, you can create line plots, scatter plots, bar plots, and more:

# Line plot
x = 1:10
y = x .^ 2
plot(x, y, label="Line")

# Scatter plot
scatter(x, y, label="Scatter")

# Bar plot
bar(["Category A", "Category B", "Category C"], [10, 20, 30], label="Bar")

# Multiple plots in a single figure
p1 = plot(x, y, label="Line")
p2 = scatter(x, y, label="Scatter")
p3 = bar(["Category A", "Category B", "Category C"], [10, 20, 30], label="Bar")
plot(p1, p2, p3, layout=(1, 3), legend=false)
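
To illustrate what a consistent interface means in practice, here is a small sketch in which the same keyword attributes (title, xlabel, ylabel, label) are passed to two different plot types, and the combined figure is written to disk with savefig. The output file name "squares.png" is an assumption chosen for this example.

```julia
using Plots

x = 1:10
y = x .^ 2

# The same keyword attributes apply to both plot types
p_line = plot(x, y, title="Line", xlabel="x", ylabel="x squared", label="y = x^2")
p_scatter = scatter(x, y, title="Scatter", xlabel="x", ylabel="x squared", label="y = x^2")

# Combine both panels side by side and save the figure
# ("squares.png" is a hypothetical output file name)
fig = plot(p_line, p_scatter, layout=(1, 2))
savefig(fig, "squares.png")
```

The file format is inferred from the extension, so changing the name to end in .pdf or .svg should produce that format instead, depending on what the active backend supports.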

Basic Machine Learning with MLJ.jl

MLJ (Machine Learning in Julia) is a toolbox for machine learning in Julia, providing tools for loading, preprocessing, and training a wide variety of models. To get started, install MLJ and some supporting packages:

using Pkg
Pkg.add("MLJ")
Pkg.add("MLJLinearModels")
Pkg.add("MLJModels")

Now, let's build a small dataset, split it into training and testing sets, and train a simple linear regression model:

using MLJ, DataFrames, Statistics

# Build a small toy dataset (Float64 so MLJ treats the columns as Continuous)
data = DataFrame(X=[1.0, 2.0, 3.0, 4.0, 5.0], Y=[2.0, 4.0, 6.0, 8.0, 10.0])

# Split the row indices into training and testing sets
train, test = partition(eachindex(data.Y), 0.7, shuffle=true, rng=1234)

# MLJ expects the features as a table and the target as a vector,
# so keep X as a one-column DataFrame by selecting with [:X]
X_train = data[train, [:X]]
y_train = data[train, :Y]
X_test = data[test, [:X]]
y_test = data[test, :Y]

# @load returns the model type; call it to create a model instance
LinearRegressor = @load LinearRegressor pkg=MLJLinearModels verbosity=0
model = LinearRegressor()

# Bind the model to the training data and train it
mach = machine(model, X_train, y_train)
fit!(mach, verbosity=0)

# Make predictions on the test set
y_pred = predict(mach, X_test)

# Evaluate the model using mean squared error
mse = mean(abs2, y_pred .- y_test)
println("Mean squared error: $mse")

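This example demonstrates how to build a small dataset, split it into training and testing sets, train a linear regression model using MLJ, and evaluate the model using mean squared error. Because the toy data lies exactly on the line Y = 2X, the error should be very close to zero.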
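
As a sanity check on what a linear regression is expected to learn from this toy data, the single-feature case can be worked out by hand. The sketch below uses only the Statistics standard library and the textbook closed-form least-squares formulas (slope = cov(x, y) / var(x)); it is not how MLJLinearModels fits the model internally, and it uses all five points rather than a train/test split.

```julia
using Statistics

# Same toy data as above: y is exactly 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

# Closed-form ordinary least squares for a single feature
slope = cov(x, y) / var(x)
intercept = mean(y) - slope * mean(x)

# Predictions from the fitted line, and their mean squared error
y_hat = intercept .+ slope .* x
mse = mean(abs2, y_hat .- y)

println("slope = $slope, intercept = $intercept, mse = $mse")
```

A slope near 2 and an intercept near 0 are what the trained model should recover as well, which is why the reported error on held-out points stays close to zero.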

Conclusion

In this post, we explored how to use Julia for data science tasks, covering data manipulation with DataFrames.jl, data visualization with Plots.jl, and basic machine learning with MLJ.jl. With these tools, you can start analyzing and modeling your data using Julia.

In the next post, we will dive deeper into more advanced machine learning topics, such as deep learning with Flux.jl and natural language processing with TextAnalysis.jl. Stay tuned and happy coding!