I want to create a JSON array of EMR steps. I have managed to build the array for a single step. Here is my Bash code:

export source="s3a://sourcebucket"
export destination="s3a://destinationbucket"

EMR_DISTCP_STEPS=$( jq -n \
                  --arg source "$source" \
                  --arg destination "$destination" \
                  '[{
                    "Name":"S3DistCp step",
                    "HadoopJarStep": {
                    "Args":["s3-dist-cp","--s3Endpoint=s3.amazonaws.com", "'"--src=${source}"'" ,"'"--dest=${destination}"'"],
                    "Jar":"command-runner.jar"
                    },
                     "ActionOnFailure":"CONTINUE"
                   }]' )

Output:

echo $EMR_DISTCP_STEPS

[{ "Name": "S3DistCp step", "HadoopJarStep": { "Args": [ "s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3a://sourcebucket", "--dest=s3a://destinationbucket" ], "Jar": "command-runner.jar" }, "ActionOnFailure": "CONTINUE" }]

Now I want to create a JSON array with multiple sources and destinations. The desired output is:

[{ "Name": "S3DistCp step", "HadoopJarStep": { "Args": [ "s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3a://sourcebucket1", "--dest=s3a://destinationbucket1" ], "Jar": "command-runner.jar" }, "ActionOnFailure": "CONTINUE" },
{ "Name": "S3DistCp step", "HadoopJarStep": { "Args": [ "s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3a://sourcebucket2", "--dest=s3a://destinationbucket2" ], "Jar": "command-runner.jar" }, "ActionOnFailure": "CONTINUE" },
{ "Name": "S3DistCp step", "HadoopJarStep": { "Args": [ "s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3a://sourcebucket3", "--dest=s3a://destinationbucket3" ], "Jar": "command-runner.jar" }, "ActionOnFailure": "CONTINUE" }]

How can I generate a JSON array (as a JSON string) with multiple sources and destinations in Bash?

5 Comments
  • Are the three items supposed to be different from each other in any way? If so, where is the data used to distinguish them supposed to come from? Commented Sep 23, 2019 at 15:28
  • BTW, note that you don't want to conflate syntactic quotes with literal quotes. That is, in a shell command that contains an argument 'foo', the quote ' is an instruction to your shell, not part of the argument foo, and you want to leave it out of any JSON or other higher-level representation of the data. Commented Sep 23, 2019 at 15:31
  • Consider converting your bash script into python - you will avoid so many pitfalls down the road. Commented Sep 23, 2019 at 15:53
  • @mvp, ...I fully agree that languages that are able to represent the structures you're trying to manipulate in-memory (and with parsing/generation facilities compliant with the format spec in question) are the right tool for manipulating structured data. That said, jq is such a language, just as much as Python is, and also happens to be already tagged in the question. :) Commented Sep 23, 2019 at 15:54
  • @lucy, ...by the way, a note about variable names -- POSIX specifies that all-caps names are reserved for variables that modify behavior of POSIX-defined utilities (and/or the shell itself), whereas lowercase names are reserved for application use and guaranteed not to modify compliant tools' behavior. As non-exported shell variables modify any preexisting environment variable under the same name, this guidance applies to both types; see pubs.opengroup.org/onlinepubs/9699919799/basedefs/…, fourth paragraph. Commented Sep 23, 2019 at 16:04
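
To illustrate the quoting point raised in the comments above, here is a minimal Bash sketch. The variable names `syntactic` and `literal` are hypothetical and chosen only for this example. It shows that quotes around a shell word are consumed by the shell, whereas quotes nested inside another quoted string become part of the data:

```shell
#!/usr/bin/env bash
# Syntactic quotes: the shell strips them, so the stored value is just foo.
syntactic='foo'

# Literal quotes: inside double quotes, the single quotes are ordinary data.
literal="'foo'"

printf '%s has %d characters\n' "$syntactic" "${#syntactic}"
printf '%s has %d characters\n' "$literal" "${#literal}"
```

Only the second value carries quote characters, which is why they should not be copied into a JSON representation of the first.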

1 Answer

One way to do this is to provide a jq function that generates your repeated structure, given the specific inputs you want to modify. Consider the following:

# generate this however you want to -- hardcoded, built by a loop, whatever.
source_dest_pairs=(
  sourcebucket1:destinationbucket1
  sourcebucket2:destinationbucket2
  sourcebucket3:destinationbucket3
)

# -R accepts plain text, not JSON, as input; -n doesn't read any input automatically
# ...but instead lets "inputs" or "input" be used later in your jq code.
jq -Rn '
  def instructionsForPair($source; $dest): {
    "Name":"S3DistCp step",
    "HadoopJarStep": {
      "Args":[
        "s3-dist-cp",
        "--s3Endpoint=s3.amazonaws.com",
        "--src=\($source)",
        "--dest=\($dest)"
      ],
      "Jar":"command-runner.jar"
    }
  };

  [ inputs 
  | capture("^(?<source>[^:]+):(?<dest>.*)$"; "")
  | select(.)
  | instructionsForPair(.source; .dest) ]
' < <(printf '%s\n' "${source_dest_pairs[@]}")

...correctly emits as output:

[
  {
    "Name": "S3DistCp step",
    "HadoopJarStep": {
      "Args": [
        "s3-dist-cp",
        "--s3Endpoint=s3.amazonaws.com",
        "--src=sourcebucket1",
        "--dest=destinationbucket1"
      ],
      "Jar": "command-runner.jar"
    }
  },
  {
    "Name": "S3DistCp step",
    "HadoopJarStep": {
      "Args": [
        "s3-dist-cp",
        "--s3Endpoint=s3.amazonaws.com",
        "--src=sourcebucket2",
        "--dest=destinationbucket2"
      ],
      "Jar": "command-runner.jar"
    }
  },
  {
    "Name": "S3DistCp step",
    "HadoopJarStep": {
      "Args": [
        "s3-dist-cp",
        "--s3Endpoint=s3.amazonaws.com",
        "--src=sourcebucket3",
        "--dest=destinationbucket3"
      ],
      "Jar": "command-runner.jar"
    }
  }
]
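
The output above omits the `s3a://` scheme and the `ActionOnFailure` key that appear in the question's desired output. The following is a hedged variant sketch, not part of the original answer. It assumes each pair is written as `source destination` separated by a single space (so the colon in `s3a://` cannot clash with the delimiter), that no bucket path contains a space, and that `emr_distcp_steps` is an acceptable variable name:

```shell
#!/usr/bin/env bash
# Assumed input format: "source destination", separated by one space.
source_dest_pairs=(
  's3a://sourcebucket1 s3a://destinationbucket1'
  's3a://sourcebucket2 s3a://destinationbucket2'
  's3a://sourcebucket3 s3a://destinationbucket3'
)

# -R reads raw lines, -n defers reading to "inputs", -c prints compact JSON.
emr_distcp_steps=$(
  jq -Rnc '
    [ inputs
    | split(" ")
    | select(length == 2)   # skip lines that are not exactly one pair
    | { "Name": "S3DistCp step",
        "HadoopJarStep": {
          "Args": [ "s3-dist-cp",
                    "--s3Endpoint=s3.amazonaws.com",
                    "--src=\(.[0])",
                    "--dest=\(.[1])" ],
          "Jar": "command-runner.jar" },
        "ActionOnFailure": "CONTINUE" } ]
  ' < <(printf '%s\n' "${source_dest_pairs[@]}")
)

printf '%s\n' "$emr_distcp_steps"
```

Capturing the result in a lowercase variable follows the naming advice given in the comments on the question.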

5 Comments

I got the error "syntax error near unexpected token `<'" when running the above script.
@DucTran, make sure you run it with bash not sh
I tried changing the order of the command then it worked. Something like: printf ... | jq -Rn ...
@DucTran, even though it isn't part of POSIX sh (and thus can only be used with bash and other shells supporting process substitution), the <(...) syntax is better for the reasons described in BashFAQ #24.
Thank you, I ran the original script with bash successfully.
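
For the jq command in the answer, the pipe form and the `< <(...)` form produce the same output. The difference discussed in BashFAQ #24 matters once a loop sets variables. The sketch below is a minimal illustration with hypothetical names (`pairs`, `count_pipe`, `count_procsub`), assuming Bash with default options (no `lastpipe`):

```shell
#!/usr/bin/env bash
pairs=(a:b c:d e:f)

# Pipe form: the while loop runs in a subshell, so its increments are lost.
count_pipe=0
printf '%s\n' "${pairs[@]}" | while IFS=: read -r src dst; do
  count_pipe=$((count_pipe + 1))
done

# Process substitution: the loop runs in the current shell and keeps its state.
count_procsub=0
while IFS=: read -r src dst; do
  count_procsub=$((count_procsub + 1))
done < <(printf '%s\n' "${pairs[@]}")

echo "pipe: $count_pipe, process substitution: $count_procsub"
# prints: pipe: 0, process substitution: 3
```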
