AWS Machine Learning Blog

Plan the locations of green car charging stations with an Amazon SageMaker built-in algorithm

While the fuel economy of new gasoline or diesel-powered vehicles improves every year, green vehicles are considered even more environmentally friendly because they’re powered by alternative fuel or electricity. Hybrid electric vehicles (HEVs), battery only electric vehicles (BEVs), fuel cell electric vehicles (FCEVs), hydrogen cars, and solar cars are all considered types of green vehicles.

Charging stations for green vehicles are similar to the gas pump in a gas station. They can be fixed on the ground or wall and installed in public buildings (shopping malls, public parking lots, and so on), residential district parking lots, or charging stations. They can be based on different voltage levels and charge various types of electric vehicles.

As a charging station vendor, you should consider many factors when building a charging station. The location of charging stations is a complicated problem. Customer convenience, urban setting, and other infrastructure needs are all important considerations.

In this post, we use machine learning (ML) with Amazon SageMaker and Amazon Location Service to provide guidance for charging station vendors looking to choose optimal charging station locations.

Solution overview

In this solution, we use SageMaker training jobs to train the clustering model and a SageMaker endpoint to deploy it. We use Amazon Location Service to display the map and the clustering results.

We also use Amazon Simple Storage Service (Amazon S3) to store the training data and model artifacts.

The following figure illustrates the architecture of the solution.

Data preparation

GPS data is highly sensitive information because it can be used to track the historical movement of an individual. In this post, we use the tool trip-simulator to generate GPS data that simulates a taxi driver’s driving behavior.

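We choose Nashville, Tennessee, as our location. The following script simulates 1,000 agents and generates 14 hours (50,400 seconds) of driving data. The --start value, 1600128000000, is a Unix timestamp in milliseconds that corresponds to September 15, 2020, 00:00 UTC (8:00 AM in UTC+8):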

trip-simulator \
  --config scooter \
  --pbf nash.osm.pbf \
  --graph nash.osrm \
  --agents 1000 \
  --start 1600128000000 \
  --seconds 50400 \
  --traces ./traces.json \
  --probes ./probes.json \
  --changes ./changes.json \
  --trips ./trips.json
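
As a quick check of the --start and --seconds values, the following sketch derives both with the Python standard library. It assumes that the 8:00 AM start time is expressed in UTC+8; that offset is inferred from the timestamp itself, and the helper name to_epoch_millis is ours.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 8:00 AM start time is local time in UTC+8.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to a Unix timestamp in milliseconds."""
    return int(moment.timestamp() * 1000)


start = datetime(2020, 9, 15, 8, 0, tzinfo=UTC_PLUS_8)
start_ms = to_epoch_millis(start)
duration_s = int(timedelta(hours=14).total_seconds())

print(start_ms)    # value passed to --start
print(duration_s)  # value passed to --seconds
```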

The trip-simulator command writes four output files (traces.json, probes.json, changes.json, and trips.json). We use changes.json. It includes the cars’ GPS driving data as well as pickup and drop-off information. The file format looks like the following:

{
    "vehicle_id": "PLC-4375",
    "event_time": 1600128001000,
    "event_type": "available",
    "event_type_reason": "service_start",
    "event_location": {
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Point",
            "coordinates": [
                -86.7967066040155,
                36.17115028383999
            ]
        }
    }
}

The field event_type_reason has four main values:

  • service_start – The driver receives a ride request, and drives to the designated location
  • user_pick_up – The driver picks up a passenger
  • user_drop_off – The driver reaches the destination and drops off the passenger
  • maintenance – The driver is not in service mode and doesn’t receive the request

In this post, we only collect the location data with the status user_pick_up and user_drop_off as the algorithm’s input. In real-life situations, you should also consider features such as the passenger’s information and business district information.
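
The filtering described above can be sketched with the Python standard library as follows. This is a minimal illustration, not the code we ran: it assumes one JSON record per line in the changes.json layout, and the helper names and sample records are made up for the example.

```python
import json

# Event reasons kept as clustering input, as described above.
KEPT_REASONS = {"user_pick_up", "user_drop_off"}


def extract_points(lines):
    """Return (longitude, latitude) pairs for pickup and drop-off events.

    Each element of `lines` is one JSON record in the changes.json layout.
    """
    points = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("event_type_reason") in KEPT_REASONS:
            lon, lat = record["event_location"]["geometry"]["coordinates"]
            points.append((lon, lat))
    return points


def make_record(reason, lon, lat):
    """Build a made-up record that mimics the file format (illustration only)."""
    return json.dumps({
        "vehicle_id": "PLC-0000",
        "event_type_reason": reason,
        "event_location": {
            "type": "Feature",
            "properties": {},
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
        },
    })


sample_lines = [
    make_record("service_start", -86.80, 36.16),
    make_record("user_pick_up", -86.79, 36.17),
    make_record("user_drop_off", -86.77, 36.18),
    make_record("maintenance", -86.75, 36.19),
]
points = extract_points(sample_lines)
print(points)
```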

pandas is a Python library for data analysis. The following script converts the data from JSON format to CSV format with pandas:

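import pandas as pd

# Each line of changes.json is one JSON record
df = pd.read_json('./data/changes.json', lines=True)
# Expand the nested event_location -> geometry -> coordinates fields
df_event = df.event_location.apply(pd.Series)
df_geo = df_event.geometry.apply(pd.Series)
df_coord = df_geo.coordinates.apply(pd.Series)
# Append the longitude and latitude columns and drop the nested column
result = pd.concat([df, df_coord], axis=1)
result = result.drop("event_location", axis=1)
result.columns = ["vehicle_id", "event_time", "event_type", "event_reason", "longitude", "latitude"]
result.to_csv('./data/result.csv', index=False, sep=',')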

The following table shows our results.

The original GPS data contains noise. For example, some pickup and drop-off coordinates fall in a lake. The generated GPS data also follows a uniform distribution and doesn’t account for business districts, no-stop areas, or depopulated zones. In practice, there is no standard process for data preprocessing. You can simplify data preprocessing and feature engineering with Amazon SageMaker Data Wrangler.
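
One simple way to remove obvious noise is a bounding-box filter. The following sketch keeps points inside a service area and drops points inside excluded rectangles, such as a rough box around a lake. All coordinates below are hypothetical placeholders, not surveyed boundaries, and a real pipeline would more likely use polygon geometries.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float]              # (longitude, latitude)
Box = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)

# Hypothetical service-area bounds and lake box, for illustration only.
SERVICE_AREA: Box = (-87.05, 35.97, -86.52, 36.41)
LAKE_BOX: Box = (-86.65, 36.05, -86.55, 36.15)


def in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def drop_noise(points: Iterable[Point], keep_box: Box,
               exclude_boxes: Sequence[Box] = ()) -> List[Point]:
    """Keep points inside keep_box that are not inside any excluded box."""
    return [
        p for p in points
        if in_box(p, keep_box) and not any(in_box(p, b) for b in exclude_boxes)
    ]


raw_points = [(-86.78, 36.16), (-86.60, 36.10), (-80.0, 36.0)]
clean_points = drop_noise(raw_points, SERVICE_AREA, [LAKE_BOX])
print(clean_points)
```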

Data exploration

To better observe and analyze the simulated track data, we use Amazon Location for data visualization. Amazon Location provides frontend SDKs for Android, iOS, and the web. For more information about Amazon Location, see the Developer Guide.

We start by creating a map on the Amazon Location console.

We use the MapLibre GL JS SDK for our map display. The following script displays a map of Nashville, Tennessee, and renders a specific car’s driving route (or trace) line:

async function initializeMap() {
  // Load credentials and set them up to refresh
  await credentials.getPromise();

  // Initialize the map
  map = new maplibregl.Map({
    container: "map",
    center: [-86.792845, 36.16378], // initial map center point
    zoom: 10, // initial map zoom
    style: mapName,
    transformRequest,
  });
}

// Add the car's trace as a GeoJSON LineString source
map.addSource('route', {
  'type': 'geojson',
  'data': {
    'type': 'Feature',
    'properties': {},
    'geometry': {
      'type': 'LineString',
      'coordinates': [
        [-86.85009051679292, 36.144774042081494],
        [-86.85001827659116, 36.14473133061205],
        [-86.85004741661184, 36.1446756197635],
        [-86.85007975396945, 36.14465452846737],
        [-86.85005249508677, 36.14469518290888],
        // ......
      ]
    }
  }
});

The following map displays a taxi’s 14-hour driving route.

The following script displays the car’s route distribution:

map.addSource('car-location', {
  'type': 'geojson',
  'data': {
    'type': 'FeatureCollection',
    'features': [
      {'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [-86.79417828985571, 36.1742558685242]}},
      {'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [-86.76932509874324, 36.18006513143749]}},
      // ......
      {'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [-86.84082991448976, 36.14558741886923]}}
    ]
  }
});

The following map visualization shows our results.

Algorithm selection

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.

SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared to the original version of the algorithm, the version SageMaker uses is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, it streams mini-batches (small, random subsets) of the training data.
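
To make the mini-batch idea concrete, the following is a minimal NumPy sketch of a mini-batch k-means update loop in the style of web-scale k-means. It is not SageMaker’s implementation, which is a modified version as noted above. The function name, its parameters, and the synthetic two-blob data are our own assumptions for illustration.

```python
import numpy as np


def mini_batch_kmeans(X, k, batch_size=100, n_iters=200, seed=0):
    """Cluster X (n_samples x n_features) with a basic mini-batch k-means loop."""
    rng = np.random.default_rng(seed)
    # Initialize the centers from k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of points assigned to each center so far
    size = min(batch_size, len(X))
    for _ in range(n_iters):
        # Draw a small random subset of the training data.
        batch = X[rng.choice(len(X), size=size, replace=False)]
        # Distance from every batch point to every center.
        dists = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # Move each center toward its assigned points with a per-center
        # learning rate that shrinks as the center sees more points.
        for x, c in zip(batch, nearest):
            counts[c] += 1
            lr = 1.0 / counts[c]
            centers[c] = (1.0 - lr) * centers[c] + lr * x
    return centers


# Synthetic data: two well-separated blobs (illustration only).
data_rng = np.random.default_rng(42)
blob_a = data_rng.normal(loc=(0.0, 0.0), scale=0.1, size=(100, 2))
blob_b = data_rng.normal(loc=(10.0, 10.0), scale=0.1, size=(100, 2))
X_demo = np.vstack([blob_a, blob_b])

centers = mini_batch_kmeans(X_demo, k=2)
print(centers)
```

Because each update touches only one mini-batch, memory use stays bounded by the batch size regardless of the dataset size, which is the property that lets this family of algorithms stream large datasets.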

The k-means algorithm expects tabular data. In this solution, the GPS coordinate data (longitude, latitude) is the input training data. See the following code:

import io
import os

import boto3
import pandas as pd
import sagemaker.amazon.common as smac

# bucket and prefix are assumed to be defined earlier (S3 bucket name and key prefix)
df = pd.read_csv('./data/result.csv', sep=',', header=0, usecols=['longitude', 'latitude'])

# Convert the training data into the protobuf format required by SageMaker k-means
def write_to_s3(bucket, prefix, channel, file_prefix, X):
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, X.astype('float32'))
    buf.seek(0)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, channel, file_prefix + '.data')).upload_fileobj(buf)

# Prepare the training data and save it to Amazon S3
def prepare_train_data(bucket, prefix, file_prefix, save_to_s3=True):
    train_data = df.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
    if save_to_s3:
        write_to_s3(bucket, prefix, 'train', file_prefix, train_data)
    return train_data

# Use the dataset
train_data = prepare_train_data(bucket, prefix, 'train', save_to_s3=True)

# SageMaker k-means Amazon ECR image URIs
images = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/kmeans:latest',
          'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:latest',
          'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/kmeans:latest',
          'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/kmeans:latest'}

image = images[boto3.Session().region_name]

Train the model

Before you train your model, consider the following:

  • Data format – Both protobuf recordIO and CSV formats are supported for training. In this solution, we use protobuf format and File mode as the training data input.
  • EC2 instance selection – AWS suggests using an Amazon Elastic Compute Cloud (Amazon EC2) CPU instance when selecting the k-means algorithm. We use two ml.c4.xlarge instances for each training job, as configured in the training code that follows.
  • Hyperparameters – Hyperparameters are closely related to the dataset; you can adjust them according to the actual situation to get the best results:
    • k – The number of required clusters. Because we don’t know the number of clusters in advance, we train many models with different values of k.
    • init_method – The method by which the algorithm chooses the initial cluster centers. Valid values are random and kmeans++.
    • epochs – The number of passes done over the training data. We set this to 100.
    • mini_batch_size – The number of observations per mini-batch for the data iterator. We tried 50, 100, 200, 500, 800, and 1,000 in our dataset.

We train our model with the following code. To get results faster, we start the SageMaker training jobs concurrently, and each training job uses two instances. The value of k ranges from 3 to 15. Each training job generates one model, and the model artifacts are saved in an S3 bucket.

K = range(3, 16, 1)  # candidate values of k: 3 through 15, in steps of 1
INSTANCE_COUNT = 2  # use two CPU instances per training job
# Set to False to run jobs one at a time, for example to avoid creating
# too many EC2 instances at once and hitting account limits.
run_parallel_jobs = True
job_names = []

# launching jobs for all k
for k in K:
    print('starting train job:' + str(k))
    output_location = 's3://{}/kmeans_example/output/'.format(bucket) + output_folder
    print('training artifacts will be uploaded to: {}'.format(output_location))
    job_name = output_folder + str(k)

    create_training_params = \
    {
        "AlgorithmSpecification": {
            "TrainingImage": image,
            "TrainingInputMode": "File"
        },
        "RoleArn": role,
        "OutputDataConfig": {
            "S3OutputPath": output_location
        },
        "ResourceConfig": {
            "InstanceCount": INSTANCE_COUNT,
            "InstanceType": "ml.c4.xlarge",
            "VolumeSizeInGB": 20
        },
        "TrainingJobName": job_name,
        "HyperParameters": {
            "k": str(k),
            "feature_dim": "2",
            "epochs": "100",
            "init_method": "kmeans++",
            "mini_batch_size": "800"
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 60 * 60
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },

                "CompressionType": "None",
                "RecordWrapperType": "None"
            }
        ]
    }

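    sagemaker = boto3.client('sagemaker')

    sagemaker.create_training_job(**create_training_params)
    job_names.append(job_name)

    # When jobs run one at a time, wait for the current job to finish
    if not run_parallel_jobs:
        sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)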

Evaluate the model

The number of clusters (k) is the most important hyperparameter in k-means clustering. Because we don’t know the value of k, we can use various methods to find the optimal value of k. In this section, we discuss two methods.

Elbow method

The elbow method is an empirical method to find the optimal number of clusters for a dataset. In this method, we select a range of candidate values of k, then apply k-means clustering using each value. For each k, we compute the average distance from each point to its nearest centroid (the distortion) and plot it against k. We select the value of k at which the curve bends: the average distance falls sharply up to that point and only slowly after it. See the following code:

import boto3
import matplotlib.pyplot as plt
import mxnet as mx
import numpy as np
from scipy.spatial.distance import cdist

s3_client = boto3.client('s3')
models = {}
distortions = []
for k in K:
    key = 'kmeans_example/output/' + output_folder + '/' + output_folder + str(k) + '/output/model.tar.gz'
    s3_client.download_file(bucket, key, 'model.tar.gz')
    print("Model for k={} ({})".format(k, key))
    # Notebook shell escape: extract the model file model_algo-1
    !tar -xvf model.tar.gz
    kmeans_model = mx.ndarray.load('model_algo-1')
    kmeans_numpy = kmeans_model[0].asnumpy()  # cluster centroids
    print(kmeans_numpy)
    # Average distance from each training point to its nearest centroid
    distortions.append(sum(np.min(cdist(train_data, kmeans_numpy, 'euclidean'), axis=1)) / train_data.shape[0])
    models[k] = kmeans_numpy

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('distortion')
plt.title('Elbow graph')
plt.show()

We select values of k from 3 to 15 and train a model for each with the built-in k-means clustering algorithm. The graph shows an elbow shape when the model is fit with 10 clusters, which suggests that 10 is an optimal number of clusters.
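
In this post, we read the elbow off the graph by eye. If you want a rough programmatic hint, one common heuristic is to pick the k with the largest second difference of the distortion curve, that is, the sharpest bend. The following sketch is an assumption-laden illustration: the function name is ours, and the distortion values are made up, not our experiment results.

```python
def find_elbow(ks, distortions):
    """Return the k where the distortion curve bends most sharply.

    The bend at position i is the discrete second difference
    distortions[i - 1] - 2 * distortions[i] + distortions[i + 1].
    """
    if len(ks) != len(distortions) or len(ks) < 3:
        raise ValueError("need at least three (k, distortion) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = distortions[i - 1] - 2 * distortions[i] + distortions[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up distortion values (arbitrary units) for illustration only.
demo_ks = [3, 4, 5, 6, 7]
demo_distortions = [50.0, 40.0, 30.0, 28.0, 27.0]
elbow_k = find_elbow(demo_ks, demo_distortions)
print(elbow_k)
```

The heuristic is sensitive to noise in the curve, so treat its output as a starting point to confirm against the plot.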

Silhouette method

The silhouette method is another way to find the optimal number of clusters, and to interpret and validate the consistency within clusters of data. It computes a silhouette coefficient for each point, which measures how similar the point is to its own cluster compared to other clusters, and provides a succinct graphical representation of how well each object has been classified.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
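
To show how the coefficient is computed, the following sketch implements the standard definition s = (b - a) / max(a, b) with the Python standard library, where a is the mean distance from a point to the other points in its own cluster and b is the smallest mean distance from the point to the points of any other cluster. The function name and the four sample points are ours. In the rest of this post we use the scikit-learn functions instead.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def silhouette_values(points, labels):
    """Return the silhouette coefficient of each point."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # convention for single-point clusters
            continue
        # Cohesion: mean distance to the other members of the same cluster
        # (the point's distance to itself contributes zero to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


demo_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
demo_labels = [0, 0, 1, 1]
demo_values = silhouette_values(demo_points, demo_labels)
print(demo_values)
```

In this toy example the two clusters are far apart relative to their size, so every coefficient is close to 1.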

First, we deploy the model to an endpoint and predict the cluster label (y) of each point, which serves as the silhouette input:

import json

import boto3

runtime = boto3.Session().client('runtime.sagemaker')
endpointName = "kmeans-30-2021-08-06-00-48-38-963"
# Send one "longitude,latitude" row per point
response = runtime.invoke_endpoint(EndpointName=endpointName,
                                   ContentType='text/csv',
                                   Body=b"-86.77971153,36.16336978\n-86.77971153,36.16336978")
r = response['Body'].read()
response_json = json.loads(r)
# Collect the predicted cluster label of each point
y_km = []
for item in response_json['predictions']:
    y_km.append(int(item['closest_cluster']))
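
The request above sends only two points. To label a whole dataset, you would typically build the CSV body from your coordinate list and send it in batches. The following helpers are a sketch of that bookkeeping only and make no network calls. The helper names, the batch size, and the sample response are our own assumptions; the response fields mirror the ones used in the code above.

```python
import json


def to_csv_payload(points):
    """Serialize (longitude, latitude) pairs as one CSV row per point."""
    return "\n".join(f"{lon},{lat}" for lon, lat in points).encode("utf-8")


def parse_cluster_labels(response_body):
    """Extract the closest_cluster value of each prediction as an int."""
    payload = json.loads(response_body)
    return [int(item["closest_cluster"]) for item in payload["predictions"]]


def chunked(points, size):
    """Yield consecutive slices of at most `size` points."""
    for start in range(0, len(points), size):
        yield points[start:start + size]


demo_points = [(-86.77971153, 36.16336978), (-86.77971153, 36.16336978)]
demo_body = to_csv_payload(demo_points)

# Made-up response body, shaped like the one parsed in the code above.
demo_response = b'{"predictions": [{"closest_cluster": 3.0}, {"closest_cluster": 7.0}]}'
demo_labels = parse_cluster_labels(demo_response)

# Hypothetical batch size of 2 for illustration.
demo_batches = list(chunked(list(range(5)), 2))
print(demo_body, demo_labels, demo_batches)
```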

Next, we compute the silhouette coefficients and plot them:

import numpy as np
from matplotlib import cm
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score, silhouette_samples

# X is assumed to be the array of (longitude, latitude) points that were sent
# to the endpoint, in the same order as the predicted labels in y_km.
y_km = np.asarray(y_km)  # a NumPy array is required for the boolean mask below
cluster_labels = np.unique(y_km)
print(cluster_labels)
n_clusters = cluster_labels.shape[0]
silhouette_score_cluster_10 = silhouette_score(X, y_km)
print("Silhouette Score When Cluster Number Set to 10: %.3f" % silhouette_score_cluster_10)
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
# Draw one horizontal band of sorted coefficients per cluster
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper),
             c_silhouette_vals,
             height=1.0,
             edgecolor='none',
             color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.0)
    y_ax_lower += len(c_silhouette_vals)

# Mark the average silhouette coefficient with a dashed red line
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg,
            color='red',
            linestyle='--')
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel("Cluster")
plt.xlabel("Silhouette Coefficients k=10,Score=%.3f" % silhouette_score_cluster_10)
plt.savefig('./figure.png')
plt.show()

A silhouette score closer to 1 means that the clusters are well separated from each other. In our experiment, the clusters are well separated when k is set to 8.

Different model evaluation methods can yield different values for the best k. In our experiment, we choose k=10 as the number of clusters.

Now we can display the k-means clustering result via Amazon Location. The following code marks selected locations on the map:

new maplibregl.Marker().setLngLat([-86.755974, 36.19235]).addTo(map);
new maplibregl.Marker().setLngLat([-86.710972, 36.203389]).addTo(map);
new maplibregl.Marker().setLngLat([-86.733895, 36.150209]).addTo(map);
new maplibregl.Marker().setLngLat([-86.795974, 36.165639]).addTo(map);
new maplibregl.Marker().setLngLat([-86.786743, 36.222799]).addTo(map);
new maplibregl.Marker().setLngLat([-86.701209, 36.267679]).addTo(map);
new maplibregl.Marker().setLngLat([-86.820134, 36.209863]).addTo(map);
new maplibregl.Marker().setLngLat([-86.769743, 36.131246]).addTo(map);
new maplibregl.Marker().setLngLat([-86.803346, 36.142358]).addTo(map);
new maplibregl.Marker().setLngLat([-86.833890, 36.113466]).addTo(map);

The following map visualization shows our results, with 10 clusters.

We also need to consider the scale of each charging station. Here, we divide the number of points around the center of each cluster by a coefficient. For example, a coefficient of 100 means that every 100 cars share one charging pile. The following visualization includes the charging station scale.
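
The scale calculation above can be sketched as follows. The function name and the sample labels are ours, and rounding up to a whole charging pile is an assumption for the example rather than a rule stated in this post.

```python
import math
from collections import Counter


def chargers_per_station(labels, cars_per_charger=100):
    """Map each cluster label to a number of charging piles.

    `labels` holds the cluster label of every point. The count for each
    cluster is divided by the coefficient and rounded up (assumption).
    """
    if cars_per_charger <= 0:
        raise ValueError("cars_per_charger must be positive")
    counts = Counter(labels)
    return {
        cluster: math.ceil(count / cars_per_charger)
        for cluster, count in sorted(counts.items())
    }


# Made-up cluster labels: 250 points in cluster 0, 90 in cluster 1, 100 in cluster 2.
demo_labels = [0] * 250 + [1] * 90 + [2] * 100
station_scale = chargers_per_station(demo_labels, cars_per_charger=100)
print(station_scale)
```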

Conclusion

In this post, we explained an end-to-end scenario for creating a clustering model in SageMaker based on simulated driving data. The solution includes training an MXNet model and creating an endpoint for real-time model hosting. We also explained how you can display the clustering results via the Amazon Location SDK.

You should also consider charging type and quantity. Plug-in charging is categorized by voltage and power levels, leading to different charging times. Slow charging usually takes several hours to charge, whereas fast charging can achieve a 50% charge in 10–15 minutes. We cover these factors in a later post.

Many other industries are also affected by location planning problems, including retail stores and warehouses. If you have feedback about this post, submit comments in the Comments section below.


About the Author

Zhang Zheng is a Sr. Partner Solutions Architect with AWS, helping industry partners on their journey to well-architected machine learning solutions at scale.