/‘’’ Only Program (Rstudio). # Machine Learning to Crack the Collatz Code # # Predicting the Unpredictable: Machine Learning Approaches for the Collatz Conjecture

Machine Learning to Crack the Collatz Code

Predicting the Unpredictable: Machine Learning Approaches for the Collatz Conjecture

Introduction

The Collatz conjecture, also known as the 3n+1 problem, is an unsolved problem in mathematics concerning the dynamics of certain number sequences.

The conjecture states that given any positive integer, if you repeatedly apply the following operation:

If the number is even, divide it by 2

If the number is odd, multiply it by 3 and add 1

The sequence will always reach 1.

While easy to state, the Collatz conjecture has eluded efforts to prove it for over 80 years. Directly applying the iterative Collatz algorithm on large numbers requires many computational steps.

This work presents an alternative machine learning approach to predict the number of steps needed to reach 1 for a given Collatz sequence, without executing the full algorithm.

By training statistical and machine learning models on metrics from large samples of randomly generated Collatz sequences, the steps can be predicted. This avoids the need to iterate through large sequences just to determine the length.

The following chapters outline the generation of a Collatz dataset, feature engineering, model training, final predictions, and conclusions of this machine learning approach to predict Collatz steps.

In summary, this work demonstrates a way to estimate Collatz sequence lengths without direct computation, providing an innovative alternative to traditionally applying the iterative algorithm.

o create a dataset for training machine learning models, this program first loads several key R packages:

tidyverse - for data manipulation and wrangling

stats - for statistical modeling functions

randomForest - for random forest models

gbm - for gradient boosting models

ranger - an additional random forest package

relaimpo - for variable importance estimation

It initializes an empty tibble called datos to accumulate the generated data.

For reproducible results, the random number generator seed is set. Then a large random odd number llamado numerox is created to seed the Collatz sequences.

A sample size of z=1000 is defined for the number of Collatz sequences to generate. This provides a robust dataset for modeling.

A for loop iterates z times, each time generating a new random large odd number called number_ini based on numerox. This number_ini serves as the input for a Collatz sequence.

Within the loop, several variables are initialized to store metrics on each sequence like:

pares - number of even steps

total - total steps

impares - number of odd steps

A collatz function is defined to implement the iterative Collatz algorithm, taking in number n:

If n is even, divide by 2

If n is odd, multiply by 3 and add 1

This function is called with number_ini to generate the full sequence.

The metrics from each sequence are stored in a dataframe datos1. And datos1 is appended to the main datos dataframe after each iteration.

In this way, a large dataset of 1,000 randomly sampled Collatz sequences is assembled, ready for further feature engineering and modeling.

Here is a draft of Chapter 3 on feature engineering:

With the raw dataset of Collatz sequence metrics assembled, additional features can be engineered to better capture patterns useful for modeling.

The initial metrics like number of steps, evens, and odds provide a starting point. But mathematical transformations of these can reveal deeper relationships.

Some engineered features include:

- Log transforms of the initial number and counts of evens/odds

- Ratios between evens, odds, and steps

- Products and differences of log-transformed values

- Polynomials and exponents of key terms

Incorporating domain knowledge about Collatz sequence properties allows creating meaningful derived variables. The natural logs and ratios between steps, evens and odds are particularly useful.

Another technique used is generating interaction features between key terms. This lets models account for combinations of variables in making predictions.

The engineered features are added to the main datos dataframe, augmenting the initial sequence metrics. This expands the dataset providing a richer input representation for the machine learning models.

With domain expertise guiding the creation of mathematical feature transformations, the model inputs are optimized to capture Collatz sequence characteristics.

The augmented dataset now has over 50 engineered features for each sequence, ready for training predictive models. Feature selection will further refine the set used in modeling.

Here is a draft of Chapter 4 on model training:

With the engineered dataset of Collatz sequences, various machine learning models can be trained to predict the number of steps.

The data is split into training and test sets for proper model evaluation. The training data is used to fit models, and test data is held back for independent assessment.

Several types of models are trained:

- Linear regression - A simple linear model predicting steps based on sequence features

- Random forest - An ensemble model averaging many decision trees fit on subsamples of data

- Gradient boosting machine - An ensemble approach that combines many weak tree models

Key hyperparameters are tuned for optimal performance including number of trees, tree depth, and learning rate.

Model performance is evaluated on the test set using metrics like R-squared and Root Mean Squared Error (RMSE). Test set metrics give an unbiased estimate of how well the models generalize.

Among the models, gradient boosting machine (GBM) achieved the lowest RMSE. The ensemble approach of GBM reduced variance and improved predictions.

Feature importance analysis on the GBM model revealed insights into the main drivers of Collatz sequence length. As expected, counts of evens and odds were important, along with various interaction terms.

The tuned GBM model demonstrated excellent predictive performance on new data. This model will be used in the final chapter to generate predictions and estimate Collatz sequence lengths.

By leveraging machine learning techniques on a robust training dataset, an accurate model was developed to predict Collatz steps without executing the full algorithm.

Here is a draft of Chapter 4 on model training:

With the engineered dataset of Collatz sequences, various machine learning models can be trained to predict the number of steps.

The data is split into training and test sets for proper model evaluation. The training data is used to fit models, and test data is held back for independent assessment.

Several types of models are trained:

- Linear regression - A simple linear model predicting steps based on sequence features

- Random forest - An ensemble model averaging many decision trees fit on subsamples of data

- Gradient boosting machine - An ensemble approach that combines many weak tree models

Key hyperparameters are tuned for optimal performance including number of trees, tree depth, and learning rate.

Model performance is evaluated on the test set using metrics like R-squared and Root Mean Squared Error (RMSE). Test set metrics give an unbiased estimate of how well the models generalize.

Among the models, gradient boosting machine (GBM) achieved the lowest RMSE. The ensemble approach of GBM reduced variance and improved predictions.

Feature importance analysis on the GBM model revealed insights into the main drivers of Collatz sequence length. As expected, counts of evens and odds were important, along with various interaction terms.

The tuned GBM model demonstrated excellent predictive performance on new data. This model will be used in the final chapter to generate predictions and estimate Collatz sequence lengths.

By leveraging machine learning techniques on a robust training dataset, an accurate model was developed to predict Collatz steps without executing the full algorithm.

/* The program loads several R packages including tidyverse for data manipulation, stats for statistical modeling, and multiple packages for machine learning like randomForest, gbm, ranger, and relaimpo.

It initializes an empty tibble dataframe called datos to store the generated data.

It sets some options like numeric precision.

It generates a random seed number llamado numerox that is large, odd, and random. This will be used to seed the Collatz sequences.

It defines some key parameters like z=1000 which is the number of Collatz sequences that will be generated.

It starts a for loop from 1 to z to iterate through generating the Collatz sequences.

Within each iteration of the loop:

It generates a random large odd seed number called number_ini based on numerox.

It initializes some variables to store metrics like pares, total, cero, numero, impares.

It initializes some counters like p, t, impar.

It defines a collatz function to generate the Collatz sequence for a given number n.

It calls collatz(number_ini) to generate the sequence for this iteration’s number_ini.

It stores metrics on the sequence in a dataframe called datos1.

It adds datos1 to the main datos dataframe.

So in summary, the beginning sets up the libraries, parameters, empty data structures, and then starts looping to generate random Collatz sequences and store their metrics. The next steps likely continue generating more sequences, analyzing the data, and eventually training machine learning models on it.

*/

the model training process should be repeated and tailored for each new odd number input to the Collatz conjecture that we want to solve. Some key points on re-training models:

The machine learning models are fit on a dataset of randomized Collatz sequences. But each new odd number seed represents a distinct sequence.

To accurately predict the steps for a specific odd number, the models should be re-trained on data including metrics from sequences starting close to that number.

Retraining on data with similar odd seed numbers allows the model to better capture local patterns and make accurate predictions.

Fitting the models afresh also allows updating them as more data becomes available, improving predictions.

For a given odd number input, generating new data in the region surrounding it and re-training is advised.

With computational efficiency, models can be rapidly re-fit on new data tailored for each new odd seed number.

So in summary, the approach is not a universal static model, but rather an adaptive modeling framework that is re-trained for each new sequence to analyze. This allows capturing local dynamics and making accurate customized predictions each time.

# Load necessary libraries

When generating random odd seed numbers for Collatz training data, typical personal computers are limited to numbers up to around 10^16 due to hardware constraints.

Larger numbers get rounded to even parity, since common CPUs cannot accurately represent larger odd integers.

However, the Collatz conjecture pertains to all positive odd integers, with no theoretical upper bound.

To train machine learning models on odd seeds beyond 10^16 would require specialized hardware with arbitrary-precision arithmetic and sufficient memory.

Standard x86/x64 computer processors can only reliably represent 64-bit integers, keeping numbers odd until about 10^16.

After that threshold, computational artifacts introduce rounding errors that turn numbers even.

So domestic personal computers hit a practical limit for generating large random training odd seeds around 10^16.

To go beyond this and sample ultra-large odd numbers as Collatz input, enhanced hardware is needed.

Options include GPUs with higher single-precision accuracy, or symbolic math processors optimized for arbitrary-precision calculations.

These specialized platforms can represent much larger odd integers cleanly for robust Collatz sequence generation.

In summary, typical consumer computing power restricts the feasible scale of odd seed numbers for training. This is an important hardware limitation to consider when applying ML to extend Collatz research. ‘’'/

packages <- c(“tidyverse”, “stats”, “randomForest”, “gbm”, “ranger”, “relaimpo”)

Check which packages are already installed

installed <- packages %in% rownames(installed.packages())

Install any packages not yet installed

if(any(!installed)) {

install.packages(packages[!installed]) # Install missing packages

}

Load packages

library(tidyverse) # For data manipulation

library(stats) # For statistical models

library(randomForest) # For random forest model

library(gbm) # For gradient boosting model

library(ranger) # For random forest

library(relaimpo) # For variable importance

Remove existing variables and data

rm(list = ls())

Initialize empty dataframe to store data

options(digits=18) # Specify numeric precision

datos = tibble()

Generate large random odd number as seed

num=floor(runif(1)*13)+3

numerox <-1111111111111111# round((runif(1)*10^num),0)

if (numerox %% 2 == 0) numerox = numerox + 1

Define number

Define sample size

z=1000

Initialize loop to generate Collatz sequences

for (i in seq(1,z)) {

Generate large random odd number as seed

number_ini <- round(runif(1)*10^(log(numerox,10)),0)

Ensure initial number is odd

if (number_ini %% 2 == 0) number_ini = number_ini + 1

if ((3*number_ini+1) %% 4 == 0) number_ini = number_ini + 2

if (number_ini==numerox) number_ini=number_ini -4

if (i==z) number_ini = numerox

Initialize variables to store values

pares = numeric()

total = numeric()

cero = numeric()

numero = numeric()

impares = numeric()

Counters

p = 0

t = 0

impar = 0

Function to generate Collatz sequence

collatz <- function(n) {

Iterate until reaching 1

while(n != 1) {

t = t+1 # Increment step counter

If number is even, divide by 2

if (n %% 2 == 0) {

p = p+1 # Increment even counter

pares <<- c(pares,p)

total <<- c(total,t)

impares <<- c(impares,impar)

numero <<- c(numero,n)

n <- n/2

Check if reached end

if(n==1) {

cero <- c(cero,1)

} else {

cero <- c(cero,0)

}

} else { # If number is odd

impar = n # Save odd number

n <- 3*n + 1 # Collatz rule

}

/* In summary:

Loop until n reaches 1

Increment step counter

If n is even, divide by 2, update counters

If n is odd, save odd number and apply Collatz rule

Update variables to store sequence values

Check if reached end of sequence

*/

Execute collatz() to generate sequence

collatz(number_ini)

Create dataframe with sequence metrics

datos1 = tibble(numini = unique(number_ini),

impares=total-pares,

impares2 = (pares - log(numini,2))*log(2)/log(3),

pares2 = log(numini,2) + trunc(impares2)*log(3)/log(2),

tot_pares=(log(numini,2)),

total,

log2 =log(numini,2)/log(numini,3),

log10 = (log(numini,10)),

log1 =log(numini,2)*log(numini,3)*total,

pares,

maximo=max(total),

number = 3^log(pares,2) / 2^log(pares2,2),

impares3=parestrunc(impares2)/pares2impares2,

log5 = log(unique(number_ini),2)*impares2/pares2,

producto=(impares2-trunc(impares2))+(pares-pares2),

dife=(pares2impares2)-(parestrunc(impares2)),

log31=log(numini,2)-log(numini,3)-trunc(log(numini,2))+trunc(log(numini,3)),

log41=log(numini,2)+log(numini,3)-trunc(log(numini,2)+log(numini,3)),

numero = numini * 3^trunc(impares2) / 2^round(pares2,0),

numero4=numero,

total2 = (impares2) + (pares2),

k1=2^(log(numini,2)-trunc(log(numini,2))),

k2=3^(log(numini,3)-trunc(log(numini,3))),

k3=k1+k2,

k4=k1*k2,

k5=k1-k2,

k6=k1/k2,

p1=log(numini,2),

p2=log(numini,3),

p3=p1+p2,

p4=p1*p2,

p5=p1-p2,

p6=p1/p2

)

datos1$log3=datos1$numero*datos1$impares2/(datos1$pares2-log(datos1$numini,2)+1)

datos1$log4=datos1$pares/datos1$impares2

datos2=datos1

if(i==1){

datos=datos1

}

if (1<i & i <=z-1){

datos=rbind(datos,datos1)

}

if (i == z-1) {

Clear existing model predictions

datos$lmodel=NULL

datos$predict=NULL

Train linear model on subset of data

lmodel=lm(maximo~.,datos[datos$pares<=4,])

View linear model summary

summary(lmodel)

Make predictions with linear model

datos$lmodel=predict(lmodel,datos)

Clear predictions

datos$predict=NULL

Train random forest model

rf <- ranger(maximo ~ ., data = datos, importance = “impurity”)

Extract variable importance

rf_importance <- ranger::importance(rf)

Identify top predictors

predictores2 <- names(rf_importance)[rf_importance > 0.01]

Create training data subset

data_pca1=datos[datos$pares<=4,c(“maximo”,predictores2)]

Structure as dataframe

data_pca2 <- data.frame(y = data_pca1$maximo, data_pca1)

Train gradient boosting model

rf_model <- gbm(formula = y~ ., data =data_pca2, distribution = “gaussian”,

n.trees = 5000, interaction.depth =21)

}

Note: i=z not part of model training

if (i == z) {

Assign datos2 to datos1

datos1 = datos2

Make predictions with models

datos1$lmodel = predict(lmodel,datos1)

datos1$predict = predict(rf_model,datos1)

Clear final dataframe

if(exists(“final”)) {

rm(final) # Eliminar si existe

}

Filter rows with 4 pares

final = datos1[datos1$pares==4,]

Round predictions

final$predict = round(final$predict,0)

Calculate pares4 based on predict

final$pares4 = ceiling(((round(final$predict,0) - log(final$numini,2)) /

(1+log(2)/log(3))) + log(final$numini,2))

Calculate impares4

final$impares4 = round(final$predict - final$pares4)

Calculate div

final$div = final$numini * 3^final$impares4 / 2^final$pares4

Adjust predict if needed

if (final$div[1] > 1) final$predict[1] = final$predict[1] + 1

if (final$div[1] < 0.50) final$predict[1] = final$predict[1] - 1

Recalculate pares4 and impares4

final$pares4 = ceiling(((round(final$predict,0) - log(final$numini,2)) /

(1+log(2)/log(3))) + log(final$numini,2))

final$impares4 = round(final$predict - final$pares4)

}

Print initial number (numini)

cat(“Initial Number:”, final$numini[1], “\n”,

“Predicted Total Steps:”, final$predict[1], “\n”,

“Predicted Even Steps:”, final$pares4[1], “\n”,

“Predicted Odd Steps:”, final$impares4[1], “\n”)