Skip to main content

Final Report

1. Introduction

The goal of this analysis is to understand the factors which impact the compressive strength of concrete, a critical property in construction.
Predicting compressive concrete strength based on component proportions can help manufacturers optimize material usage and ensure structural reliability.
So via this analysis, we aim to provide practical recommendations for real-world applications, by applying data mining techniques, particularly linear regression, including both custom-built via gradient descent and statistically validated using software "statsmodels".

2. Results

Model from Part B - Q2.3

Multivariate Linear Regression with Gradient Descent (Pre-processed Predictors, Raw Response)

  • Parameters:

    • m: [ 1.03198275e+01 2.78827073e+01 7.45159185e+00 -5.12468027e+00 3.11466814e+00 3.57357621e-01 7.71318550e-03 1.02909491e+01 ]
    • b: 28.0044317376294
    • α\alpha: 0.01
    • Epochs: 5000
  • Peformance:

    • MSE on Testing Data: 67.87198233457188
    • VE (R2R^2) on Testing Data: 0.6894830160041516
  • Why Trustworthy?:
    High R-squared value (0.69), explaining 69% of the variance in compressive strength, as the pre-processed predictors enhances its reliability.

  • Recommendations:

    • Increase Cement and Blast Furnace Slag proportions, while decreasing Water content, to improve strength significantly.
    • Fine Aggregate has a smallest impact on strength. So we may reduce its use to lower costs.
  • Figure:
    PartB Set1 Model

Model from Part C - Q2.3

Multivariate Linear Regression with Statsmodels (Pre-processed Predictors, Raw Response)

  • Parameters:

    • m: [ 15.3584 50.0728 22.4310 -2.0918 0.1704 3.5054 4.1012 10.5111 ]
    • b: 19.9179
  • Peformance:

    • MSE on Testing Data: 80.73582880174308
    • VE (R2R^2) on Testing Data: 0.6306304133516871
  • Why Trustworthy?: Fitted with statsmodels (which means this model optimizes using Closed Form Solution), showing a high R-squared value (0.63), and the use of pre-processed predictors ensures its reliability.

  • Recommendations:

    • Components suggestion for Cement, Blast Furnace Slag and water are consistent with the previous model.
    • The fly ash's weight of this model has become greater than the cement. So it could also be considered as one of the most significant factors.
    • In this model, Superplasticizer turns to be the minimal impact feature. So we may reduce its use to lower costs.
  • Figure:
    PartC Pre-pro Model

3. Methods

Model from Part B - Q2.3

I used a multivariate linear regression model optimized with gradient descent. The algorithm iteratively adjusts coefficients (m) and intercept (b) to minimize the mean squared error (MSE) between predictions and true values.

  • Pre-processed?
    Yes. As following:

    FeatureTransformation Method
    CementStandardization
    Blast Furnace SlagNormalization
    Fly AshNormalization
    WaterLog and Standardization
    SuperplasticizerNormalization
    Coarse AggregateStandardization
    Fine AggregateStandardization
    AgeLog and Standardization
  • Hyperparameters:

    • Learning Rate (α\alpha): 0.01
    • Epochs: 5000
    • These were chosen by error values, starting with the bigger α\alpha (α=0.1\alpha = 0.1 , $ Epochs = 1000) and then trying to decrease until convergence was achieved without overshooting. And Epochs should be set high enough to stabilize MSE.

Model from Part C - Q2.3

I used the t-test to obtain p-values for each feature parameter in a multivariate linear regression model fitted with statsmodels.OLS, which helps to identify statistically significant features.
The T-test assesses whether the coefficient of each feature is significantly different from zero, indicating its importance in predicting compressive strength. And the P-value indicates the probability of observing the data if the null hypothesis is true -- p<0.05p < 0.05 is considered statistically significant.

  • Pre-processed? How?
    Yes. As following:

    FeatureTransformation Method
    WaterLog and Standardization
    Coarse AggregateStandardization
    Fine AggregateStandardization

    I utilize Log transformation & Standardization to Water feature, and Standardization to Coarse Aggregate, Fine Aggregate features. And all of their P-value turned into 0.000.

    Standardization: Standardization can help reduce multicollinearity.
    Log transformation: Log transformation can reduce skewness and make distributions more normal, compressing the range of high values, which can reduce the variance.