Executing the PySpark Job
To run the PySpark job, you must use the spark-submit script with the files provided in the distribution's PySpark driver directory.
Jar File:
The Java jar file in the distribution should be provided in the --jars option.
/precisely/addressing/software/pyspark/driver/spectrum-bigdata-addressing-sdk-spark2_2.12-sdk_version.jar

Python Files:
The Python SDK ZIP file in the distribution should be provided in the --py-files option.
/precisely/addressing/software/pyspark/driver/spectrum-bigdata-addressing-sdk-pyspark-sdk_version.zip

Python Driver File:
The Python driver file containing the main function should be provided in the spark-submit command.
/precisely/addressing/software/pyspark/driver/spark-submit/AddressingDriver.py
For example:
```
spark-submit \
    --py-files /precisely/addressing/software/pyspark/driver/spectrum-bigdata-addressing-sdk-pyspark-sdk_version.zip \
    --master yarn --deploy-mode cluster \
    --jars /precisely/addressing/software/pyspark/driver/spectrum-bigdata-addressing-sdk-spark2_2.12-sdk_version.jar \
    /precisely/addressing/software/pyspark/driver/spark-submit/AddressingDriver.py \
    --operation geocode \
    --resources-location hdfs:///precisely/addressing/software/resources/ \
    --data-location hdfs:///precisely/geo_addr/data/ \
    --download-location /precisely/downloads \
    --preferences-filepath hdfs:///precisely/addressing/software/resources/config/preferences.yaml \
    --input /user/sdkuser/customers/addresses.csv \
    --input-format=csv \
    --csv header=false \
    --output /user/sdkuser/customers_addresses \
    --output-format=parquet \
    --parquet compression=gzip \
    --input-fields addressLines[0]=0 addressLines[1]=1 \
    --output-fields address.formattedStreetAddress address.formattedLocationAddress location.feature.geometry.coordinates.x location.feature.geometry.coordinates.y \
    --combine \
    --limit 20
```
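Once the job finishes, the geocoded output can be inspected with a short PySpark session. The sketch below is illustrative only: it assumes the output location and Parquet format from the example above, and the exact column names depend on the --output-fields expressions (and any "as" aliases) you supplied.

```python
# Illustrative sketch: inspect the Parquet output written by the example job above.
# The output path and format are taken from the example; column names will vary with
# the --output-fields expressions used.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InspectAddressingOutput").getOrCreate()

# Read the job output (written as Parquet with gzip compression in the example).
results = spark.read.parquet("/user/sdkuser/customers_addresses")

# List the columns that were actually produced, then preview a few rows.
results.printSchema()
results.show(10, truncate=False)
```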
Job Parameters
All parameters are declared with a double dash. The required fields are bolded.
| Parameter | Example |
|---|---|
| --input: The location of the input file. | --input /user/sdkuser/customers/addresses.csv |
| --output: The location of the output directory, which will include all input columns along with the fields requested in the --output-fields parameter. | --output /user/sdkuser/customers_geocoded |
| --output-fields: The requested fields to be included in the output. Multiple output field expressions should be separated by a space, and each individual expression should be surrounded by double quotes. For more information, see Output Fields. | --output-fields "location.feature.geometry.coordinates.x as x" "location.feature.geometry.coordinates.y as y" |
| --error-field: Add the error field to your output to see any error information. | --error-field error |
| --json-output-field: Add the JSON output field to your output to see the JSON response. | --json-output-field jsonOutput |
| --resources-location: Location of the resources directory, which contains the configurations and libraries. If using a remote path, e.g. HDFS or S3, then --download-location must also be set. | --resources-location hdfs:///precisely/addressing/software/resources/ |
| --data-location: File path(s) to one or more geocoding datasets. A path may be a single dataset (extracted or an unextracted SPD) or a directory of datasets. Multiple paths must be separated with a space. If using a remote path, e.g. HDFS or S3, then you must set --download-location. | --data-location hdfs:///precisely/geo_addr/data/ |
| --operation: The operation to be performed, for example geocode or verify. | --operation verify |
| --preferences-filepath: File path of the addressing preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder. If using a remote path, e.g. HDFS or S3, then --download-location must also be set. | --preferences-filepath hdfs:///precisely/addressing/software/resources/config/preferences.yaml |
| --input-fields: Input fields as address field mappings, using mixed or camelCase form; a usage sketch follows this table. For more information, see Input Fields. | --input-fields addressLines[0]=0 addressLines[1]=1 |
| --download-location: Location of the directory where reference data will be downloaded. This path must exist on every data node. Note: This parameter is required if the reference data is distributed remotely via HDFS or S3. | --download-location /precisely/downloads |
| --download-group: This property is only used on POSIX-compliant platforms such as Linux. It specifies the operating system group that should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Hadoop service should be a member of this group. For more information, see Download Permissions. Note: Use only if reference data is distributed remotely via HDFS or S3. | --download-group dm_users |
| --extraction-location: File path to where the geocoding datasets will be extracted. If not specified, the default location is the same directory as the SPD. | --extraction-location /precisely/geo_addr/data/extractionDirectory |
| --country: If your input data does not have country information, you can specify the country as a parameter. Alternatively, you can use a column reference in --input-fields. | --country USA |
| --overwrite: Including this parameter tells the job to overwrite the output directory; otherwise the job will fail if the directory already has content. This parameter does not take a value. | --overwrite |
| --num-partitions: The minimum number of partitions used to split up the input file. | --num-partitions=15 |
| --combine: Including this parameter tells the job to combine all output files into a single output file; otherwise the job creates multiple output files, and the number of output files depends on the number of partitions specified. Note: Using this parameter may increase your job's execution time, since the entire output dataset must be collected on a single node. As the size of the data to be combined grows, especially past the space available on a single node, errors become more likely. | --combine |
| --input-format: The input format. Valid values: csv or parquet. If not specified, the default is csv. | --input-format=parquet |
| --output-format: The output format. Valid values: csv or parquet. If not specified, the default is the input format. | --output-format=csv |
| --csv: Specify the options to be used when reading and writing CSV input and output files. | --csv header=false |
| --parquet: Specify the options to be used when reading and writing parquet input and output files. | --parquet compression=gzip |
| --limit: The maximum number of records to be processed in the job. | --limit 5000 |
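As a worked illustration of the input-related parameters above, the sketch below stages a headerless, two-column CSV that lines up with the example job's --csv header=false and --input-fields addressLines[0]=0 addressLines[1]=1 settings. The sample addresses and the staging path are placeholder assumptions for demonstration, not part of the SDK.

```python
# Illustrative sketch: stage a headerless CSV whose column positions match the
# positional mappings --input-fields addressLines[0]=0 addressLines[1]=1 used with
# --csv header=false. The sample rows and output path are placeholder assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PrepareAddressingInput").getOrCreate()

rows = [
    ("123 Main St", "Anytown CA 90210"),        # column 0 -> addressLines[0], column 1 -> addressLines[1]
    ("456 Oak Ave Apt 2", "Sometown NY 10001"),
]
df = spark.createDataFrame(rows, ["line0", "line1"])

# Write without a header so the positional --input-fields mapping applies as-is.
df.write.mode("overwrite").csv("/user/sdkuser/customers/addresses_csv", header=False)
```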