Introduction to Geographical Time Series Prediction with Crime Data in R, SQL, and Tableau


By Jason Wittenauer, Lead Data Scientist at Huron Consulting Group


In this tutorial you will learn how to prepare geographical data for time series predictions.


When reviewing geographical data, it can be difficult to prepare the data for analysis. There are specialized spatial algorithms that can be used, but they are limited in what they do. If we spend some time splitting the data into grid squares and then adding a time point for each square, it opens up more possibilities for modeling algorithms and for additional features. However, it does create an issue with the size of the data set.

A five-year crime data set can easily consist of 250,000 records. Once that is extrapolated into a time series grid of an entire city, it can easily hit 75 million data points (a 200 by 200 grid is 40,000 squares, and 40,000 squares multiplied by the roughly 1,800 days in five years is over 70 million rows). When dealing with data of this size, it is helpful to use a database to cleanse the data before sending it to a modeling script. The steps we will follow are listed below:

  • Importing data into a SQL Server database.
  • Cleansing and grouping data into a map grid.
  • Adding time data points to the set of grid data and filling in the gaps where no crimes occurred.
  • Importing the data into R.
  • Running an XGBoost model to determine where crimes will occur on a specific day.

At the end, we will discuss the next steps for making the predictions more usable to end users in BI tools like Tableau.

Prerequisites


Before beginning this tutorial, you will need:

  • SQL Server Express installed
  • SQL Management Studio or similar IDE to interface with SQL Server
  • R installed
  • RStudio, Jupyter Notebook, or another IDE to interface with R
  • A general working knowledge of SQL and R

Data Flow Overview


We will be keeping the database and data flow in a simple structure for now. The database itself will be very flat, and the final data handoff into the Tableau dashboard will be executed via a text file. In a more production-ready version, we would save the predictions back into the database and pull from that database into the reporting dashboard. For this example, we are not trying to architect everything perfectly, just to understand the basics of doing geographical time series predictions.
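For reference, the handoff table in that production-ready version could be as simple as one row per grid square per predicted day. The table below is only a hypothetical sketch (in this tutorial we write the predictions to a CSV file instead):

CREATE TABLE CrimePrediction
(
    CrimePredictionId INT IDENTITY(1, 1) PRIMARY KEY,
    GridSmallId       INT           NOT NULL, -- map square the prediction applies to
    PredictionDate    DATE          NOT NULL, -- day being predicted
    PredictedScore    DECIMAL(9, 8) NOT NULL, -- model probability between 0 and 1
    LoadDateTime      DATETIME      NOT NULL DEFAULT (GETDATE())
);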

Our data flow is below:

[Figure: Data flow diagram]

Set Up the Database


Our prediction model will use crime data for the Baltimore area from 2012 to 2017. The data is located in the "Data" folder within this repo, in a file named "Baltimore Incident Data.zip". Before importing this data, you will need to follow one of the options below to set up the database:

Option 1

  • Restore a SQL database using the backup file located in the "Database Objects\Clean Backup\" folder.

Option 2

  • Use the scripts located in the "Database Objects" folder under "Tables" and "Procedures" to manually create all of the objects.

Import the Data


Once the database has been successfully created, you can now import the data. This will require you to do the following:

  • Unzip the "Baltimore Incident Data.zip" file found in the "Data" folder.
  • Run the "Insert_StagingCrime" procedure, making sure it points to the correct import file and to the correct format file (found in the "Data" folder and named "FormatFile.fmt").
EXEC Insert_StagingCrime

This procedure truncates the Staging_Crime table and inserts the data from the file directly into it using a BULK INSERT. The staging table's columns are all VARCHAR(MAX), which we will convert to proper data types in the next phase of the import process. A code snippet of the procedure is below.

BULK INSERT Staging_Crime
FROM 'C:\Projects\Crime Prediction\Data\Baltimore Incident Data.csv'
WITH (FIRSTROW = 2, FORMATFILE = 'C:\Projects\Crime Prediction\Data\FormatFile.fmt')

Review the Data


Now that the data has been imported into the staging table, you can view it by running the below code in SQL Management Studio:

SELECT TOP 10 *
FROM [dbo].[Staging_Crime]

This gives you the following results:

CrimeDate | CrimeTime | CrimeCode | Address | Description | InsideOutside | Weapon | Post | District | Neighborhood | Location | Premise | TotalIncidents
02/28/2017 | 23:50:00 | 6D | 2400 KEYWORTH AVE | LARCENY FROM AUTO | O | NULL | 533 | NORTHERN | Greenspring | (39.3348700000, -76.6590200000) | STREET | 1
02/28/2017 | 23:36:00 | 4D | 200 DIENER PL | AGG. ASSAULT | I | HANDS | 843 | SOUTHWESTERN | Irvington | (39.2830300000, -76.6878200000) | APT/CONDO | 1
02/28/2017 | 23:02:00 | 4E | 1800 N MOUNT ST | COMMON ASSAULT | I | HANDS | 742 | WESTERN | Sandtown-Winchester | (39.3092400000, -76.6449800000) | ROW/TOWNHO | 1
02/28/2017 | 23:00:00 | 6D | 200 S CLINTON ST | LARCENY FROM AUTO | O | NULL | 231 | SOUTHEASTERN | Highlandtown | (39.2894600000, -76.5701900000) | STREET | 1
02/28/2017 | 22:00:00 | 6E | 1300 TOWSON ST | LARCENY | O | NULL | 943 | SOUTHERN | Locust Point | (39.2707100000, -76.5911800000) | STREET | 1
02/28/2017 | 21:40:00 | 6J | 1000 WILMOT CT | LARCENY | O | NULL | 312 | EASTERN | Oldtown | (39.2993600000, -76.6034100000) | STREET | 1
02/28/2017 | 21:40:00 | 6J | 2400 PENNSYLVANIA AVE | LARCENY | O | NULL | 733 | WESTERN | Penn North | (39.3094200000, -76.6417700000) | STREET | 1
02/28/2017 | 21:30:00 | 5D | 1500 STACK ST | BURGLARY | I | NULL | 943 | SOUTHERN | Riverside | (39.2721500000, -76.6033600000) | OTHER/RESI | 1
02/28/2017 | 21:30:00 | 6D | 2100 KOKO LN | LARCENY FROM AUTO | O | NULL | 731 | WESTERN | Panway/Braddish Avenue | (39.3117800000, -76.6633200000) | STREET | 1
02/28/2017 | 21:10:00 | 3CF | 800 W LEXINGTON ST | ROBBERY - COMMERCIAL | O | FIREARM | 712 | WESTERN | Poppleton | (39.2910500000, -76.6310600000) | STREET | 1

Notice how this data has the point in time the crime occurred, the latitude/longitude, and even crime categories. Those categories can be very useful for more advanced modeling techniques and analytics.

Cleanse the Data and Create Grids of the Map


Next, we can move our data into a "Crime" table that has the correct data types. During this step, we will also split the location field into separate longitude and latitude fields. All of the logic to complete these steps runs with the below execution statement:

EXEC Insert_Crime
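The interesting part of "Insert_Crime" is turning the single "Location" text field, such as "(39.3348700000, -76.6590200000)", into separate latitude and longitude columns. The actual procedure ships with the repo; a minimal sketch of that parsing logic, assuming the staging columns shown above, might look like this:

SELECT
    CAST(CrimeDate AS DATE) AS CrimeDate,
    CAST(CrimeTime AS TIME) AS CrimeTime,
    -- Strip the opening parenthesis and split on the comma to get each coordinate
    CAST(SUBSTRING(Location, 2, CHARINDEX(',', Location) - 2) AS DECIMAL(14, 10)) AS Latitude,
    CAST(LTRIM(SUBSTRING(Location, CHARINDEX(',', Location) + 1,
                         CHARINDEX(')', Location) - CHARINDEX(',', Location) - 1)) AS DECIMAL(14, 10)) AS Longitude
FROM Staging_Crime
WHERE Location IS NOT NULL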

Now we have data populating the "Crime" table, which allows us to complete the next step: creating a grid over the city and assigning each crime to one of the grid squares. You will notice that we create two grids (small and large). This allows us to create features both at the crime location and a little further away from it, essentially giving us hotspots of crime throughout the city.

[Figure: Small and large grid squares overlaid on the city map]

Run the below code to create the grids and assign a GridSmallId and GridLargeId to each record in the "Crime" table:

EXEC Update_CrimeCoordinates

This procedure is doing three separate tasks:

  1. Creating a small grid of squares on the map within the GridSmall table (Procedure: Insert_GridSmall).
  2. Creating a large grid of squares on the map within the GridLarge table (Procedure: Insert_GridLarge).
  3. Assigning all the crime records to a small and large square on the map.

The two procedures that create the grid squares take in parameters that determine the corners of the map and how many squares we want on it. The defaults are a 200 by 200 small grid and a 100 by 100 large grid, whose squares are twice as big.
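To make the mechanics concrete, here is a rough sketch of how a crime's coordinates can be mapped to a row and column on one of those grids. The corner coordinates below are hypothetical placeholders; the real procedures receive the map corners and grid size as parameters:

DECLARE @BotLeftLat     DECIMAL(14, 10) = 39.19,  -- hypothetical map corners for Baltimore
        @TopRightLat    DECIMAL(14, 10) = 39.38,
        @BotLeftLong    DECIMAL(14, 10) = -76.72,
        @TopRightLong   DECIMAL(14, 10) = -76.52,
        @SquaresPerSide INT = 200;                -- 200 by 200 small grid

SELECT c.CrimeId,
       -- Zero-based row and column of the square that contains this crime
       FLOOR((c.Latitude  - @BotLeftLat)  / (@TopRightLat  - @BotLeftLat)  * @SquaresPerSide) AS GridRow,
       FLOOR((c.Longitude - @BotLeftLong) / (@TopRightLong - @BotLeftLong) * @SquaresPerSide) AS GridColumn
FROM Crime c;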

You can view some of the grid data by running the following:

SELECT TOP 10 c.CrimeId, gs.*
FROM [dbo].[Crime] c
JOIN GridSmall gs ON gs.GridSmallId = c.GridSmallId

CrimeId | GridSmallId | BotLeftLatitude | TopRightLatitude | BotLeftLongitude | TopRightLongitude
1 | 31012 | 39.3341282000001 | 39.3349965000001 | -76.6594308000001 | -76.6585152000001
2 | 19121 | 39.2828985 | 39.2837668 | -76.68873 | -76.6878144
3 | 25198 | 39.3089475000001 | 39.3098158000001 | -76.6456968000001 | -76.6447812000001
4 | 20657 | 39.2889766 | 39.2898449 | -76.5706176000001 | -76.5697020000001
5 | 16212 | 39.269874 | 39.2707423 | -76.5916764000001 | -76.5907608000001
6 | 22832 | 39.2985279000001 | 39.2993962000001 | -76.6035792000001 | -76.6026636000001
7 | 25202 | 39.3089475000001 | 39.3098158000001 | -76.6420344000001 | -76.6411188000001
8 | 16601 | 39.2716106 | 39.2724789 | -76.6035792000001 | -76.6026636000001
9 | 25781 | 39.3115524000001 | 39.3124207000001 | -76.6640088 | -76.6630932
10 | 20992 | 39.2907132 | 39.2915815 | -76.6319628000001 | -76.6310472000001

Notice how only two points are calculated for each square: the bottom-left and top-right corners. We don't have to calculate all four corners, even though longitude and latitude lines technically curve on a real map. At a small enough scale, we can treat the sides of the squares as straight lines.

Create Crime Grid and Lag Features


The last step we need to complete is creating the entire grid of the map for all time periods we want to evaluate. In our case, that is one grid point per day, used to determine whether a crime is going to occur. To complete this step, the below procedure needs to be executed:

EXEC Insert_CrimeGrid

During this step, each crime will be grouped onto its map square so we can determine when and how many crimes occurred on each date in our data set. We also need to generate filler dates for each square on which no crime occurred, to make a complete data set.

The procedure takes quite a while to run. On my laptop, it ran for about 1 hour, and the resulting table ("CrimeGrid") contains about 75 million records. The nice part is that the output is now saved into a table, so we will not have to rebuild it in our R script, where large data operations might not run as efficiently as inside a database.

Also during this step we will be creating "lag features". These are columns that tell us how many times within the last day, two days, week, month, etc. a crime occurred on the grid square. They essentially give us a "hotspot" analysis of the data, which can be used to look at nearby grid squares to see whether the crime is localized to our single square or clustered across all nearby squares, similar to aftershocks from an earthquake. These features may or may not be necessary depending on the type of modeling you do. A simplified sketch of the idea is shown below.
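Conceptually, the procedure cross joins every grid square with every date in the range, flags whether at least one crime occurred, and then looks back over prior rows for the lag features. The sketch below assumes a helper Calendar table of dates, which the real procedure may build differently:

-- One row per small grid square per day, with a 0/1 incident flag
SELECT gs.GridSmallId,
       cal.CalendarDate,
       CASE WHEN COUNT(c.CrimeId) > 0 THEN 1 ELSE 0 END AS IncidentOccurred
INTO #CrimeGridSketch
FROM GridSmall gs
CROSS JOIN Calendar cal
LEFT JOIN Crime c ON c.GridSmallId = gs.GridSmallId
                 AND c.CrimeDate = cal.CalendarDate
GROUP BY gs.GridSmallId, cal.CalendarDate;

-- Example lag feature: did an incident occur in this square on the prior day?
SELECT GridSmallId,
       CalendarDate,
       IncidentOccurred,
       LAG(IncidentOccurred, 1, 0) OVER (PARTITION BY GridSmallId
                                         ORDER BY CalendarDate) AS PriorIncident1Day
FROM #CrimeGridSketch;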

[Figure: Crime hotspot illustration]

Prediction Setup


With all of the data cleansing and feature engineering done on the database side, the code for making predictions is quite simple. In our example, we will use XGBoost in R to analyze five years of training data and predict future crimes.

To start, we load our libraries.

# Load required libraries
library(RODBC)
library(xgboost)
library(ROCR)
library(caret)

Loading required package: gplots

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

Loading required package: lattice
Loading required package: ggplot2
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang

Then import the data into R directly from the database. You could also export the data to text files and read in CSV data if that is your preferred method. The queries to pull the data are just written as SQL statements with some date limiters. These could be re-written to pull directly from reporting stored procedures with date parameters for a more productionized version of the code.

# Set seed
set.seed(1001)

# Read in data
dbhandle <- odbcDriverConnect('driver={SQL Server};server=DESKTOP-VLN71V7\\SQLEXPRESS;database=crime;trusted_connection=true')

train <- sqlQuery(dbhandle, 'select IncidentOccurred as target, GridSmallId, GridLargeId, DayOfWeek, MonthOfYear, DayOfYear, Year, PriorIncident1Day, PriorIncident2Days, PriorIncident3Days, PriorIncident7Days, PriorIncident14Days, PriorIncident30Days, PriorIncident1Day_Large, PriorIncident2Days_Large, PriorIncident3Days_Large, PriorIncident7Days_Large, PriorIncident14Days_Large, PriorIncident30Days_Large from crimegrid where crimedate <= \'2/20/2017\' and crimedate >= \'6/1/2012\'')

test <- sqlQuery(dbhandle, 'select IncidentOccurred as target, GridSmallId, GridLargeId, DayOfWeek, MonthOfYear, DayOfYear, Year, PriorIncident1Day, PriorIncident2Days, PriorIncident3Days, PriorIncident7Days, PriorIncident14Days, PriorIncident30Days, PriorIncident1Day_Large, PriorIncident2Days_Large, PriorIncident3Days_Large, PriorIncident7Days_Large, PriorIncident14Days_Large, PriorIncident30Days_Large from crimegrid where crimedate >= \'2/21/2017\' and crimedate <= \'2/27/2017\'')

# Convert integers to numeric for DMatrix
train[] <- lapply(train, as.numeric)
test[] <- lapply(test, as.numeric)

head(train)
target | GridSmallId | GridLargeId | DayOfWeek | MonthOfYear | DayOfYear | Year | PriorIncident1Day | PriorIncident2Days | PriorIncident3Days | PriorIncident7Days | PriorIncident14Days | PriorIncident30Days | PriorIncident1Day_Large | PriorIncident2Days_Large | PriorIncident3Days_Large | PriorIncident7Days_Large | PriorIncident14Days_Large | PriorIncident30Days_Large
0 | 3780 | 990 | 5 | 9 | 262 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 3781 | 991 | 5 | 9 | 262 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 3782 | 991 | 5 | 9 | 262 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 3783 | 992 | 5 | 9 | 262 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 3784 | 992 | 5 | 9 | 262 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 3785 | 993 | 5 | 9 | 262 | 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

Notice that in the test data set we are pulling 7 days' worth of data to test at once, while still keeping the features coming in as if we knew what happened the prior day. This is just so we can test 7 days at once and see how multiple days' worth of predictions look in our reporting at the very end.

Another item to note is that the features labeled "_Large" are for the larger grid vs. the smaller grid.

Create Feature Set and Model Parameters


We define the features as all of the columns that come after the first "target" column in the SQL query. They are passed in when building the train and test matrices, with the target column supplied as the label.

# Get feature names (all but first column, which is the target)
feature.names <- names(train)[2:ncol(train)]
print(feature.names)

# Make train and test matrices
dtrain <- xgb.DMatrix(data.matrix(train[,feature.names]), label=train$target)
dtest <- xgb.DMatrix(data.matrix(test[,feature.names]), label=test$target)

 [1] "GridSmallId"               "GridLargeId"              
 [3] "DayOfWeek"                 "MonthOfYear"              
 [5] "DayOfYear"                 "Year"                     
 [7] "PriorIncident1Day"         "PriorIncident2Days"       
 [9] "PriorIncident3Days"        "PriorIncident7Days"       
[11] "PriorIncident14Days"       "PriorIncident30Days"      
[13] "PriorIncident1Day_Large"   "PriorIncident2Days_Large" 
[15] "PriorIncident3Days_Large"  "PriorIncident7Days_Large" 
[17] "PriorIncident14Days_Large" "PriorIncident30Days_Large"

Next, set up the parameters for model training. We are keeping them pretty basic for now. Make sure the evaluation metric is AUC so we balance the true positive rate against the false positive rate; the goal is to catch as many incidents as possible without sending police officers to patrol areas unnecessarily. There could be additional work to cover all predicted areas with clever patrolling, but we are not going to get into that during this tutorial.

# Training parameters
watchlist <- list(eval = dtest, train = dtrain)

param <- list(
  objective = "binary:logistic",
  booster = "gbtree",
  eta = 0.01,
  max_depth = 10,
  eval_metric = "auc"
)

Run the Model


Now that everything is set up, we can run the model and see how well it predicts. Keep in mind that this uses a pretty basic set of features just to demonstrate one way to deal with geographic time series data. While the data sets can get quite large, they are easy to understand and can run fairly quickly when using cloud-based services like AWS.

This particular training data set had 70 million rows and took about 30 minutes to complete 67 rounds of evaluation on my laptop.

# Run model
clf <- xgb.train(
  params = param,
  data = dtrain,
  nrounds = 100,
  verbose = 2,
  early_stopping_rounds = 10,
  watchlist = watchlist,
  maximize = TRUE
)
[14:30:32] WARNING: amalgamation/../src/learner.cc:686: Tree method is automatically selected to be 'approx' for faster speed. To use old behavior (exact greedy algorithm on single machine), set tree_method to 'exact'.
[1]   eval-auc:0.858208   train-auc:0.858555
Multiple eval metrics are present. Will use train_auc for early stopping.
Will train until train_auc hasn't improved in 10 rounds.
[2]   eval-auc:0.858208   train-auc:0.858555
[3]   eval-auc:0.858208   train-auc:0.858555
[4]   eval-auc:0.858208   train-auc:0.858556
[5]   eval-auc:0.858208   train-auc:0.858556
[6]   eval-auc:0.858208   train-auc:0.858556
[7]   eval-auc:0.858311   train-auc:0.858997
[8]   eval-auc:0.858311   train-auc:0.858997
[9]   eval-auc:0.858311   train-auc:0.858997
[10]  eval-auc:0.858315   train-auc:0.859000
[11]  eval-auc:0.858436   train-auc:0.859110
[12]  eval-auc:0.858436   train-auc:0.859110
[13]  eval-auc:0.858512   train-auc:0.859157
[14]  eval-auc:0.858493   train-auc:0.859157
[15]  eval-auc:0.858496   train-auc:0.859160
[16]  eval-auc:0.858498   train-auc:0.859160
[17]  eval-auc:0.858498   train-auc:0.859160
[18]  eval-auc:0.858342   train-auc:0.859851
[19]  eval-auc:0.858177   train-auc:0.859907
[20]  eval-auc:0.858228   train-auc:0.859971
[21]  eval-auc:0.858231   train-auc:0.859971
[22]  eval-auc:0.858206   train-auc:0.860695
[23]  eval-auc:0.858207   train-auc:0.860695
[24]  eval-auc:0.858731   train-auc:0.860894
[25]  eval-auc:0.858702   train-auc:0.860844
[26]  eval-auc:0.858607   train-auc:0.860844
[27]  eval-auc:0.858574   train-auc:0.860842
[28]  eval-auc:0.858602   train-auc:0.860892
[29]  eval-auc:0.858576   train-auc:0.860843
[30]  eval-auc:0.858574   train-auc:0.860841
[31]  eval-auc:0.858607   train-auc:0.860893
[32]  eval-auc:0.858578   train-auc:0.860843
[33]  eval-auc:0.858611   train-auc:0.860894
[34]  eval-auc:0.858612   train-auc:0.860895
[35]  eval-auc:0.858614   train-auc:0.860898
[36]  eval-auc:0.858615   train-auc:0.860899
[37]  eval-auc:0.858616   train-auc:0.860897
[38]  eval-auc:0.858573   train-auc:0.860870
[39]  eval-auc:0.858546   train-auc:0.860822
[40]  eval-auc:0.858575   train-auc:0.860872
[41]  eval-auc:0.858622   train-auc:0.860898
[42]  eval-auc:0.858578   train-auc:0.860875
[43]  eval-auc:0.858583   train-auc:0.860870
[44]  eval-auc:0.859223   train-auc:0.861768
[45]  eval-auc:0.859220   train-auc:0.861760
[46]  eval-auc:0.859221   train-auc:0.861760
[47]  eval-auc:0.859099   train-auc:0.861719
[48]  eval-auc:0.859112   train-auc:0.861735
[49]  eval-auc:0.859112   train-auc:0.861735
[50]  eval-auc:0.859094   train-auc:0.861734
[51]  eval-auc:0.859125   train-auc:0.861785
[52]  eval-auc:0.859021   train-auc:0.861771
[53]  eval-auc:0.859028   train-auc:0.861784
[54]  eval-auc:0.859029   train-auc:0.861781
[55]  eval-auc:0.859028   train-auc:0.861784
[56]  eval-auc:0.859035   train-auc:0.861788
[57]  eval-auc:0.859037   train-auc:0.861789
[58]  eval-auc:0.859035   train-auc:0.861775
[59]  eval-auc:0.859035   train-auc:0.861774
[60]  eval-auc:0.859010   train-auc:0.861738
[61]  eval-auc:0.859011   train-auc:0.861739
[62]  eval-auc:0.859039   train-auc:0.861778
[63]  eval-auc:0.859016   train-auc:0.861739
[64]  eval-auc:0.859017   train-auc:0.861741
[65]  eval-auc:0.859018   train-auc:0.861746
[66]  eval-auc:0.859019   train-auc:0.861747
[67]  eval-auc:0.859024   train-auc:0.861755
Stopping. Best iteration:
[57]  eval-auc:0.859037   train-auc:0.861789

The model found the patterns quickly and didn't improve much from round to round. This could be enhanced with better model parameters and more features.

Review Importance Matrix


Time to analyze our features and see which ones rank highly in the model by looking at the importance matrix. This can help us determine whether the new features we add are really worth it on a large data set (a larger data set means extra processing time per feature added).

# Compute feature importance matrix
importance_matrix <- xgb.importance(feature.names, model = clf)

# Graph important features
xgb.plot.importance(importance_matrix[1:10,])

[Figure: Feature importance plot for the top 10 features]

It looks like the model is using a combination of small and large grid features to determine whether an incident will occur on the current day. It is interesting that the longer-term features seem to be more important, indicating that a history of criminal activity in the area and/or surrounding areas carries the most signal.

Predict on Test and Check ROC Curve


The last step is to check our predictions against the test data set and see how well they did. We always hope that the ROC curve will spike up high and fast, but that is not always the case. In our example, we have a decent score with only basic features included in the model. This shows that the predictions are workable and that more time should be invested to make them better.

# Predict on test data
preds <- predict(clf, dtest)

# Graph AUC curve
xgb.pred <- prediction(preds, test$target)
xgb.perf <- performance(xgb.pred, "tpr", "fpr")

plot(xgb.perf,
     avg="threshold",
     colorize=TRUE,
     lwd=1,
     main="ROC Curve w/ Thresholds",
     print.cutoffs.at=seq(0, 1, by=0.05),
     text.adj=c(-0.5, 0.5),
     text.cex=0.5)
grid(col="lightgray")
axis(1, at=seq(0, 1, by=0.1))
axis(2, at=seq(0, 1, by=0.1))
abline(v=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
abline(h=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
lines(x=c(0, 1), y=c(0, 1), col="black", lty="dotted")

[Figure: ROC curve with thresholds]

Review the Confusion Matrix


We know there is a decent AUC score, but let's look at the actual output of what we predicted vs. what actually happened. The easiest way to do this is to review the confusion matrix. We want the top-left and bottom-right boxes (good predictions) to be big and the other two (bad predictions) to be small.

# Set our cutoff threshold
preds.resp <- ifelse(preds >= 0.5, 1, 0)

# Create the confusion matrix
confusionMatrix(as.factor(preds.resp), as.factor(test$target), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction      0      1
         0 280581    454
         1     62    367

               Accuracy : 0.9982
                 95% CI : (0.998, 0.9983)
    No Information Rate : 0.9971
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.5864

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.447016
            Specificity : 0.999779
         Pos Pred Value : 0.855478
         Neg Pred Value : 0.998385
             Prevalence : 0.002917
         Detection Rate : 0.001304
   Detection Prevalence : 0.001524
      Balanced Accuracy : 0.723397

       'Positive' Class : 1

Over the course of 7 days, there were 821 incidents (367 caught plus 454 missed), and our model correctly predicted 367 of them, a sensitivity of roughly 45%. The other key point is that we predicted 62 incidents that never happened, essentially sending police officers to areas where we thought a crime would occur but none did. At face value this doesn't seem too bad, but we would need to see how it affects response times to criminal activity compared with current response times. The idea is that police officers end up close enough to a crime that they can either prevent it with their presence or respond quickly enough to limit the harm.

Prepare Data for Reporting


Next we can add in the longitude and latitude coordinates for the test data set and export it to a CSV file. This will allow us to take a look at the actual predictions on a map in a BI tool like Tableau.

# Read in the grid coordinates
gridsmall <- sqlQuery(dbhandle, 'select * from gridsmall')

# Merge the predictions with the test data
results <- cbind(test, preds)

# Merge the grid coordinates with the test data
results <- merge(results, gridsmall, by="GridSmallId")
head(results)

# Save to file
write.csv(results, "Data\\CrimePredictions.csv", row.names = TRUE)
GridSmallId | target | GridLargeId.x | DayOfWeek | MonthOfYear | DayOfYear | Year | PriorIncident1Day | PriorIncident2Days | PriorIncident3Days | ... | PriorIncident3Days_Large | PriorIncident7Days_Large | PriorIncident14Days_Large | PriorIncident30Days_Large | preds | BotLeftLatitude | TopRightLatitude | BotLeftLongitude | TopRightLongitude | GridLargeId.y
1 | 0 | 1 | 3 | 2 | 52 | 2017 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.2822689 | 39.20041 | 39.20128 | -76.71162 | -76.71071 |
1 | 0 | 1 | 5 | 2 | 54 | 2017 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.2822689 | 39.20041 | 39.20128 | -76.71162 | -76.71071 |
1 | 0 | 1 | 7 | 2 | 56 | 2017 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.2822689 | 39.20041 | 39.20128 | -76.71162 | -76.71071 |
1 | 0 | 1 | 4 | 2 | 53 | 2017 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.2822689 | 39.20041 | 39.20128 | -76.71162 | -76.71071 |
1 | 0 | 1 | 2 | 2 | 58 | 2017 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.2822689 | 39.20041 | 39.20128 | -76.71162 | -76.71071 |
1 | 0 | 1 | 1 | 2 | 57 | 2017 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.2822689 | 39.20041 | 39.20128 | -76.71162 | -76.71071 |

Visualizing the Predictions


It is time to see what these predictions actually look like over the course of 7 days. The Tableau dashboard itself can be found here:

Looking at the predictions that were incorrect (red dots), they appear to be pretty close to the predictions that were correct. This would still put police officers in the general vicinity of a crime occurring, so the incorrect predictions may not hurt the overall results very much at all. Example screenshots are shown below with the correct predictions in blue and incorrect predictions in red.

Incorrect Predictions

[Figure: Map of incorrect predictions]

Incorrect and Correct Predictions

[Figure: Map of incorrect and correct predictions]

Overall it does not seem too bad, but we will need more features and/or more data to capture the predictions we are missing. There is also probably a lot more we can do to focus on the specific types of crimes occurring and build targeted prediction models for each type.

What Next?


This tutorial got us started with doing geographical time series predictions using crime data. We can see that the predictions are definitely working, but there is more work to be done on creating features. We might want to add other features that check a larger area for prior crime occurrences. Another useful step would be to change our predictions to run hourly and to map squad car patrol routes by time of day. Even if the predictions are not perfect, as long as we put a police officer in the general vicinity of a crime and do better than current patrol methodologies, officers can respond much faster or even prevent the crime from occurring in the first place with just their presence.

Other ideas to think about:

  • Remove crime data in the vicinity of police stations, fire stations, hospitals, etc., since those locations could be biased by where people submit reports.
  • Add in demographic features related to census information.
  • Map crimes to neighborhoods instead of a square grid to predict against.

Hope you enjoyed the tutorial!


Bio: Jason Wittenauer is a data scientist focused on improving hospital earnings with a background in R, Python, Microsoft SQL Server, Tableau, TIBCO Spotfire, and .NET web programming. His areas of focus in healthcare are business operations, revenue improvement, and expense reduction.

Original. Reposted with permission.

