I worked with a customer recently who had a need to share CSVs for an auditing situation. They had a lot of external customers that they needed to collect CSVs from for the audit process. There were a lot of discussions happening on how to best do it, whether we’d pull data from their environment or have them push them into theirs. Folks weren’t sure on that so I tried to come up with something that would work for both.
What I came up with is something that will grab CSVs in an Azure storage account, no matter where it lives. It then is pulled into their data lake in a consolidated ADLS Gen 2 location in a parquet format so they can do all the fun big data things on it in Azure Synapse in an aggregated data lake. It setups up an Azure Key Vault where you can store the connection string for the Azure storage account that contains the CSV files. NOTE – YOU NEED ELEVATED ACCESS TO STORAGE ACCOUNTS. You may need approval or separate storage accounts both on the sending side with the CSVs and the receiving end in your Azure subscription. GET PROPER APPROVAL/BLESSINGS before setting this up.
If you want to setup the full solution including the components in Azure. I’ve published the code to build out everything at https://aka.ms/DataSharePipeline. All you need to do is store the files located in the DataShare Full Sample folder locally. You’ll then run some or all the PowerShell scripts below.
- Update the paramfile03.json file with the values of what you want to build or use. You can use existing pieces if you’d like, the script checks if it exists before creating all the pieces.
- Load the CSVs to a storage account that you can get the connection string with the storage keys. Note the connection string as you’ll get prompted for it during the run.
- Run ./”01 – Create Resources DataShare.ps1″ -filepath ./paramfile03.json to create all the Azure pieces.
- Run ./”02 – GrantRightsDataShareCLI.ps1″ -filepath ./paramfile03.json to grant rights needed to admin and MSI for Synapse to storage and Azure Key Vault.
- Run ./”03 – Create Pipeline Parts DataShare.ps1″ -filepath ./paramfile03.json to build the Synapse pipeline components.
- Run the pipeline with the parameters filled in below and verify your CSVs have been processed into your data lake as parquet files. This pattern allows you to setup separate key vault secrets to pull CSV files from as many storage accounts as you need. There are a couple sample CSVs within the folder as well if you’d like to use them. They contain data for hospital capacity that were extracted from the site https://healthdata.gov/dataset/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/4cnb-m4rz.
|Parameter Name||Value details|
|ParamSecret||Name of the secret containing connection string|
|Container||Name of container holding CSV files|
|FilePrefix||All CSVs that begin with this text will be processed|
|DirectoryPath||directory path (i.e. foldername1/foldername2/)|
Let me know if this helps you or you have suggestions to make it better/clearer.