Why SQL for machine learning?
Most ML tutorials start with import pandas and end with 200 lines of Python. That's fine if you're a data scientist, but what about the analyst who understands the business problem better than anyone and already thinks in SQL?
DataLAB lets you train production-grade ML models using SnapQL commands you can learn quickly.
Step 1: Load your data
Start by loading or connecting your dataset. DataLAB supports CSV, Excel, Parquet, JSON, and direct SQL sources.
In practice, this is often a business analyst or data lead pulling in a customer extract, support snapshot, or billing export before a retention or forecasting discussion.
SELECT * FROM customer_data LIMIT 5;

You can also connect directly to SQL Server, PostgreSQL, MySQL, Oracle, or SQLite.
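For comparison, the same first-five-rows peek in a conventional scripting workflow takes noticeably more setup. Here is a minimal sketch using Python's standard-library sqlite3 module; the in-memory table and its columns are illustrative stand-ins, not DataLAB internals:

```python
import sqlite3

# Build a small illustrative customer_data table in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_data (customer_id INTEGER, tenure INTEGER, "
    "monthly_charges REAL, churned INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_data VALUES (?, ?, ?, ?)",
    [(i, i * 3, 50.0 + i, i % 2) for i in range(1, 11)],
)

# Equivalent of: SELECT * FROM customer_data LIMIT 5;
first_five = conn.execute("SELECT * FROM customer_data LIMIT 5").fetchall()
for row in first_five:
    print(row)
```

The query itself is the same; the boilerplate around it is what DataLAB's direct connectors remove.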
Step 2: Explore
Understand your data before modelling:
This is the point where a team usually wants quick answers before it commits to a full modelling run: how many rows are here, how many unique customers do we have, and how imbalanced is the target?
SELECT
COUNT(*) AS rows,
COUNT(DISTINCT customer_id) AS unique_customers,
AVG(monthly_charges) AS avg_charges,
SUM(CASE WHEN churned = 1 THEN 1 ELSE 0 END) AS churned_count
FROM customer_data;

Step 3: Train the model
Here's where the core SnapQL workflow starts:
Think of a churn analyst or growth team lead who already knows the likely drivers and wants a first serious model without leaving the SQL environment they are already using for data prep.
CREATE MODEL churn_predictor
USING RandomForest
ON customer_data
PREDICT churned
FEATURES tenure, monthly_charges, contract_type,
total_charges, num_support_tickets;

DataLAB automatically:
- Encodes categorical features like contract_type
- Splits data using the configured training and evaluation defaults
- Trains the model and evaluates performance
- Stores the model with full metadata for reuse
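To make that automation concrete, here is a rough standard-library Python sketch of two of those steps, one-hot encoding a categorical column and holding out an evaluation split. The rows and the 80/20 ratio are illustrative assumptions, not DataLAB's actual internals:

```python
import random

# Illustrative rows: (tenure, contract_type, churned).
rows = [(12, "monthly", 1), (48, "annual", 0),
        (6, "monthly", 1), (30, "two_year", 0)] * 5

# One-hot encode the categorical contract_type column.
categories = sorted({r[1] for r in rows})

def encode(row):
    tenure, contract, churned = row
    one_hot = [1 if contract == c else 0 for c in categories]
    return [tenure] + one_hot + [churned]

encoded = [encode(r) for r in rows]

# Hold out 20% of rows for evaluation (seeded shuffle for repeatability).
random.seed(42)
random.shuffle(encoded)
split = int(len(encoded) * 0.8)
train, evaluation = encoded[:split], encoded[split:]
print(len(train), len(evaluation))  # 16 4
```

Every step like this is code someone has to write, test, and maintain in a scripted workflow; in SnapQL it is folded into the one CREATE MODEL statement.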
Step 4: Check model status
MODEL STATUS churn_predictor;

This returns the saved model's status and the tracked training output for the run. That is useful when a team is deciding whether the first model is good enough to share internally or whether it needs another pass.
If you want to try a stronger algorithm, create another model:
CREATE MODEL churn_v2
USING XGBoost
ON customer_data
PREDICT churned
FEATURES tenure, monthly_charges, contract_type,
total_charges, num_support_tickets;

Step 5: Make predictions
Apply your model to new data:
This is where the work becomes operational. A team can take a fresh customer list, score it, and hand the resulting population to retention, sales, or customer-success teams for action.
PREDICT USING MODEL churn_predictor
ON new_customers
AS churn_predictions;

The result is a new dataset you can query, export, or feed into a larger workflow.
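The shape of that step can be sketched in plain Python. The "model" below is a stand-in threshold rule, not what DataLAB actually trains, and the customer rows are invented; the point is that scoring produces a new, queryable result set alongside the input columns:

```python
# Stand-in "model": a simple threshold rule, not a trained DataLAB model.
def churn_predictor(row):
    # Flag short-tenure, high-charge customers as churn risks.
    return 1 if row["tenure"] < 12 and row["monthly_charges"] > 70 else 0

new_customers = [
    {"customer_id": 101, "tenure": 5, "monthly_charges": 89.5},
    {"customer_id": 102, "tenure": 40, "monthly_charges": 55.0},
    {"customer_id": 103, "tenure": 2, "monthly_charges": 99.0},
]

# Equivalent of PREDICT ... AS churn_predictions: input rows plus a score column.
churn_predictions = [
    {**row, "churned_pred": churn_predictor(row)} for row in new_customers
]
flagged = [r["customer_id"] for r in churn_predictions if r["churned_pred"] == 1]
print(flagged)  # [101, 103]
```

The flagged list is exactly the population a retention team would pick up and work through.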
Step 6: Compare multiple runs
DataLAB's experiment tracking lets you compare models side by side:
This matters most when the work stops being a one-person exercise and becomes a team discussion about tradeoffs between speed, accuracy, explainability, and deployment fit.
CREATE EXPERIMENT churn_comparison;
USE EXPERIMENT churn_comparison;
CREATE MODEL rf_model USING RandomForest ON customer_data PREDICT churned;
CREATE MODEL xgb_model USING XGBoost ON customer_data PREDICT churned;
CREATE MODEL lr_model USING LogisticRegression ON customer_data PREDICT churned;
LIST RUNS FROM churn_comparison LIMIT 10;

That gives your team a consistent way to review how different algorithms behaved inside the same experiment context.
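The comparison itself is conceptually simple, and a small Python sketch shows the shape of it. The three "algorithms" here are toy rules standing in for Random Forest, XGBoost, and Logistic Regression, and the labeled rows are invented; what matters is that every run is scored on the same data and tracked in one place:

```python
# Tiny labeled set: (tenure, churned). Values are illustrative.
data = [(3, 1), (5, 1), (8, 1), (24, 0), (36, 0), (48, 0), (10, 1), (60, 0)]

# Stand-in "algorithms": simple rules playing the role of RF/XGB/LR.
models = {
    "rf_model": lambda t: 1 if t < 12 else 0,
    "xgb_model": lambda t: 1 if t < 9 else 0,
    "lr_model": lambda t: 1,  # always predicts churn
}

# One tracked run per model, all evaluated on the same data,
# like runs recorded inside a single experiment.
runs = []
for name, predict in models.items():
    accuracy = sum(predict(t) == y for t, y in data) / len(data)
    runs.append({"model": name, "accuracy": accuracy})

runs.sort(key=lambda r: r["accuracy"], reverse=True)
for run in runs:
    print(run["model"], run["accuracy"])
```

Holding the data and evaluation fixed across runs is what makes the side-by-side numbers comparable, which is the job the experiment context does for you.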
Beyond the basics
DataLAB supports a wide model surface, including:
- Classification: Random Forest, XGBoost, SVM, Logistic Regression, KNN, Naive Bayes, and more
- Regression: Linear, Ridge, Lasso, Random Forest, XGBoost, SVR, and related variants
- Clustering: K-Means, DBSCAN, Gaussian Mixture
- Time series and advanced workflows: via the broader SnapQL and pipeline surface
Plus AutoML for automated model selection:
That is useful when a team wants a fast tournament across candidate models without spending the first week manually tuning every option.
AUTOML churn_auto
FROM customer_data
PREDICT churned
MAX_TIME 300;

AutoML tries multiple algorithms and parameter combinations, then returns the strongest candidate within your time budget.
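The time-budgeted tournament idea can be sketched in a few lines of Python. The candidate names, fit costs, and scores below are mock values, and real AutoML would actually train and evaluate each candidate; the sketch only shows the budget-and-keep-the-best loop:

```python
import time

# Stand-in candidates: (name, fit cost in seconds, mock evaluation score).
candidates = [("logreg", 0.01, 0.78), ("rf", 0.02, 0.84), ("xgb", 0.02, 0.86)]

def automl(candidates, max_time):
    """Try candidates in order, keep the best, stop when the budget runs out."""
    deadline = time.monotonic() + max_time
    best_name, best_score = None, float("-inf")
    for name, cost, score in candidates:
        if time.monotonic() + cost > deadline:
            break  # not enough budget left to fit this candidate
        time.sleep(cost)  # stand-in for training time
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

print(automl(candidates, max_time=1.0))  # ('xgb', 0.86)
```

With a generous budget every candidate gets a turn; with a tight one, the loop returns the best model it managed to finish, which is the behaviour MAX_TIME buys you.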
Try it yourself
All of the syntax above is aligned to the current SnapQL language reference. Request early access and we can show you how DataLAB fits your analytics or predictive workflow.