Executing the PySpark Job
To run the PySpark job, you must use the spark-submit script with the files provided in the distribution's PySpark driver directory.
Jar File:
The Java jar file in the distribution should be provided in the --jars option.
/precisely/addressing/software/pyspark/driver/spectrum-bigdata-addressing-sdk-spark2_2.12-sdk_version.jar

Python Files:
The Python SDK ZIP file in the distribution should be provided in the --py-files option.
/precisely/addressing/software/pyspark/driver/spectrum-bigdata-addressing-sdk-pyspark-sdk_version.zip

Python Driver File:
The Python driver file containing the main function should be provided in the spark-submit command.
/precisely/addressing/software/pyspark/driver/spark-submit/AddressingDriver.py
For example:
```
spark-submit \
    --py-files /precisely/addressing/software/pyspark/driver/spectrum-bigdata-addressing-sdk-pyspark-sdk_version.zip \
    --master yarn --deploy-mode cluster \
    --jars /precisely/addressing/software/pyspark/driver/spectrum-bigdata-addressing-sdk-spark2_2.12-sdk_version.jar \
    /precisely/addressing/software/pyspark/driver/spark-submit/AddressingDriver.py \
    --operation geocode \
    --resources-location hdfs:///precisely/addressing/software/resources/ \
    --data-location hdfs:///precisely/geo_addr/data/ \
    --download-location /precisely/downloads \
    --preferences-filepath hdfs:///precisely/addressing/software/resources/config/preferences.yaml \
    --input /user/sdkuser/customers/addresses.csv \
    --input-format=csv \
    --csv header=false \
    --output /user/sdkuser/customers_addresses \
    --output-format=parquet \
    --parquet compression=gzip \
    --input-fields addressLines[0]=0 addressLines[1]=1 \
    --output-fields address.formattedStreetAddress address.formattedLocationAddress location.feature.geometry.coordinates.x location.feature.geometry.coordinates.y \
    --combine \
    --limit 20
```
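Once the job finishes, the geocoded output can be inspected with a short PySpark session. The sketch below is illustrative only: it assumes the output location and Parquet format from the example above, and the exact column names depend on the --output-fields expressions (and any "as" aliases) you supplied.

```python
# Illustrative sketch: inspect the Parquet output written by the example job above.
# The output path and format are taken from the example; column names will vary with
# the --output-fields expressions used.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InspectAddressingOutput").getOrCreate()

# Read the job output (written as Parquet with gzip compression in the example).
results = spark.read.parquet("/user/sdkuser/customers_addresses")

# List the columns that were actually produced, then preview a few rows.
results.printSchema()
results.show(10, truncate=False)
```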
Job Parameters
All parameters are declared with a double dash. The required fields are bolded.
| Parameter | Example |
|---|---|
| --input: The location of the input file. | --input /user/sdkuser/customers/addresses.csv |
| --output: The location of the output directory, which will include all input columns along with the fields requested in the --output-fields parameter. | --output /user/sdkuser/customers_geocoded |
| --output-fields: The requested fields to be included in the output. Multiple output field expressions should be separated by a space, and each individual expression should be surrounded by double quotes. For more information, see Output Fields. | --output-fields "location.feature.geometry.coordinates.x as x" "location.feature.geometry.coordinates.y as y" |
| --error-field: Add the error field to your output to see any error information. | --error-field error |
| --json-output-field: Add the JSON output field to your output to see the JSON response. | --json-output-field jsonOutput |
| --resources-location: Location of the resources directory, which contains the configurations and libraries. If using a remote path, e.g. HDFS or S3, then --download-location must also be set. | --resources-location hdfs:///precisely/addressing/software/resources/ |
| --data-location: File path(s) to one or more geocoding datasets. A path may be a single dataset (extracted or an unextracted SPD) or a directory of datasets. Multiple paths must be separated with a space. If using a remote path, e.g. HDFS or S3, then you must set --download-location. | --data-location hdfs:///precisely/geo_addr/data/ |
| --operation: The operation to be performed, for example geocode or verify. | --operation verify |
| --preferences-filepath: File path of the addressing preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder. If using a remote path, e.g. HDFS or S3, then --download-location must also be set. | --preferences-filepath hdfs:///precisely/addressing/software/resources/config/preferences.yaml |
| --input-fields: Input fields as address field mappings, using mixed or camelCase form; a usage sketch follows this table. For more information, see Input Fields. | --input-fields addressLines[0]=0 addressLines[1]=1 |
| --download-location: Location of the directory where reference data will be downloaded. This path must exist on every data node. Note: This parameter is required if the reference data is distributed remotely via HDFS or S3. | --download-location /precisely/downloads |
| --download-group: This property is only used on POSIX-compliant platforms such as Linux. It specifies the operating system group that should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Hadoop service should be a member of this group. For more information, see Download Permissions. Note: Use only if reference data is distributed remotely via HDFS or S3. | --download-group dm_users |
| --extraction-location: File path to where the geocoding datasets will be extracted. If not specified, the default location is the same directory as the SPD. | --extraction-location /precisely/geo_addr/data/extractionDirectory |
| --country: If your input data does not have country information, you can specify the country as a parameter. Alternatively, you can use a column reference in --input-fields. | --country USA |
| --overwrite: Including this parameter tells the job to overwrite the output directory; otherwise the job will fail if the directory already has content. This parameter does not take a value. | --overwrite |
| --num-partitions: The minimum number of partitions used to split up the input file. | --num-partitions=15 |
| --combine: Including this parameter tells the job to combine all output files into a single output file; otherwise the job creates multiple output files, and the number of output files depends on the number of partitions specified. Note: Using this parameter may increase your job's execution time, since the entire output dataset must be collected on a single node. As the size of the data to be combined grows, especially past the space available on a single node, errors become more likely. | --combine |
| --input-format: The input format. Valid values: csv or parquet. If not specified, the default is csv. | --input-format=parquet |
| --output-format: The output format. Valid values: csv or parquet. If not specified, the default is the input format. | --output-format=csv |
| --csv: Specify the options to be used when reading and writing CSV input and output files. | --csv header=false |
| --parquet: Specify the options to be used when reading and writing parquet input and output files. | --parquet compression=gzip |
| --limit: The maximum number of records to be processed in the job. | --limit 5000 |
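As a worked illustration of the input-related parameters above, the sketch below stages a headerless, two-column CSV that lines up with the example job's --csv header=false and --input-fields addressLines[0]=0 addressLines[1]=1 settings. The sample addresses and the staging path are placeholder assumptions for demonstration, not part of the SDK.

```python
# Illustrative sketch: stage a headerless CSV whose column positions match the
# positional mappings --input-fields addressLines[0]=0 addressLines[1]=1 used with
# --csv header=false. The sample rows and output path are placeholder assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PrepareAddressingInput").getOrCreate()

rows = [
    ("123 Main St", "Anytown CA 90210"),        # column 0 -> addressLines[0], column 1 -> addressLines[1]
    ("456 Oak Ave Apt 2", "Sometown NY 10001"),
]
df = spark.createDataFrame(rows, ["line0", "line1"])

# Write without a header so the positional --input-fields mapping applies as-is.
df.write.mode("overwrite").csv("/user/sdkuser/customers/addresses_csv", header=False)
```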