Difference between revisions of "How to Transfer Data"

From DeepSense Docs
(Mac OSX)
(From the World Wide Web)
 
(23 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
== To and From Your Personal Computer ==
 
  
=== Login Nodes ===
+
=== Small Transfers ===
  
Since the two login nodes are the primary point of access for the platform, they may be in heavy use.  We do not want to overload them unnecessarily for data transfer.  Please only use this for small amounts of data.
+
For small transfers (<5Gb), you can use the two login nodes.  Since they are the primary point of access for the platform, they may be in heavy use.  We do not want to overload them unnecessarily for data transfer.  Please only use this for small amounts of data.
  
The most common method for transferring data securely between machines will be <code>scp</code>.  This is pretty straightforward to use.
+
The most common method for transferring data securely between machines will be <code>scp</code>.  This is pretty straightforward to use, however the destination files will have the wrong permissions set. It will remove group permissions, so while you will be able to access the data, no one else in your group will be.  This is fine if you are the only one working on the project.
 +
 
 +
'''Example''': scp -r /path/to/files/ username@login2.deepsense.ca:/data/projectdir/
  
 
One can also use <code>rsync</code> (see the [https://linux.die.net/man/1/rsync man page]).  This has more options than <code>scp</code>, and can be used to sync files
 
between two machines.  
  
  '''Note''': do not use the <code>-p</code>, or <code>-a</code> options, as they preserve file permissions.  This could cause problems with user quotas, as they are based on the owner/group of files. Suggested options are:<br>rsync -rzvh sourceDirectory destinationDirectory<br>
+
  '''Example''': rsync -azvhP /path/to/files/ username@login2.deepsense.ca:/data/projectdir/
 +
 
 +
The rsync options above are:
 +
* a - archive mode, equal to rlptgoD (recursive, preserve links, times, permissions, group, owner, etc)
 +
* z - use compression when copying
 +
* v - verbose: list files copied
 +
* h - human readable: output numbers in human readable format
 +
* P - same as --partial --progress.  Show progress while transferring, and keep partial files.
 +
 
 +
'''Note''': We recommend always using the option <code>-p</code> (using <code>-a</code> also invokes <code>-p</code>).  This ensures that everyone in your group should have the same permissions to the file as you do.
  
=== Protocol Nodes ===
+
=== Medium Size ===
  
The protocol nodes (<code>protocol1.deepsense.ca</code>, <code>protocol2.deepsense.ca</code>) are specifically meant for large data transfers.  However, they are only accessible via samba.   
+
For medium sized transfers (between 5Gb and 100Gb), you should use the protocol nodes.  They (<code>protocol1.deepsense.ca</code>, <code>protocol2.deepsense.ca</code>) are specifically meant for data transfers.  However, they are only accessible via samba.   
  
 
==== Mac OSX ====
 
On a Mac, open finder and hit ⌘-K, or use the menu ''Go -> Connect to Server''.  In the dialog box (see below), type the address for either protocol node, and you can login.  This will connect you to the <code>/data</code> filesystem.
 
  
[[File:macSambaConnect.png]]
+
[[File:macSambaConnect.png|thumb|Connect via samba on OSX]]
  
 +
On a Mac, open finder and hit ⌘-K, or use the menu ''Go -> Connect to Server''.  In the dialog box (see image), type the address for either protocol node, and you can login.  This will connect you to the <code>/data</code> filesystem.
  
 
If you want to use <code>rsync</code> to transfer data via the protocol nodes, you have to mount one.  On a Mac, the easiest way is to connect to the protocol node as in the previous paragraph.  This will mount it at <code>/Volumes/data/</code>.  You can now use rsync to copy files to your project's subdirectory.
 
 +
 +
'''Example''': rsync&nbsp;&#8209;rzvh&nbsp;/path/to/files/&nbsp;/Volumes/data/projectdir/
  
 
==== Windows ====
 
  
On windows computer, you should connect to <code>//protocol1.deepsense.ca/data</code> or <code>//protocol2.deepsense.ca/data</code>.  You may also have to change a SMB security level setting as follows (this was necessary in Windows 10):
+
On windows computer, you should connect to <code>\\protocol1.deepsense.ca\data</code> or <code>\\protocol2.deepsense.ca\data</code>.  To do this the first time, open a file explorer window. 
 +
Right-click on This PC, and select "add a network location".  In the wizard, click next and then select "Choose a custom network location" (this was the only option I saw).  Highlight it, and click next.  On the following screen, enter one of the addresses above, and click next.  You may now enter a name for this location.  Do so, and click next again.  On the last screen, you should be able to look over your selections, and then click Finish.  The name you chose should now be available under "This PC" in your file explorer. 
 +
 
 +
You may also have to change a SMB security level setting as follows (this was necessary in Windows 10):
  
 
Control Panel&nbsp;>&nbsp;System and Security&nbsp;>&nbsp;Administrative tools&nbsp;>&nbsp;Local Security Policy&nbsp;>&nbsp;expand Local Policies&nbsp;>&nbsp;Security options&nbsp;>&nbsp;click on Network security: Lan Manager authentication level&nbsp;>&nbsp;Then in the field choose&nbsp;>&nbsp;Send NTLMv2 responses only&nbsp;>&nbsp;click on Apply, then ok and close all.
 
  
 +
==== File Permissions ====
  
 +
Unfortunately, samba won't preserve the proper file permissions.  We find it strips the executable bit from any file that has it switched on.  You can change an individual file by using <code>chmod ug+x filename</code>.  If you want to change many files at once, and are unsure of how to, please send us an email at ([mailto:support@deepsense.ca support@deepsense.ca]).
 +
 +
== Large Transfers ==
 +
 +
For large transfers (>100Gb), We generally find it is best to put the data on an external drive.  To make such arrangements, please email [mailto:support@deepsense.ca support@deepsense.ca].  We can then plug it in directly in our server room, and transfer the data for you.
  
 
== From the World Wide Web ==
 
Line 38: Line 60:
 
The standard tool for downloading data from websites is [https://en.wikipedia.org/wiki/Wget wget]. Also available is [https://curl.haxx.se/ curl]. The two are compared in this [https://unix.stackexchange.com/questions/47434/what-is-the-difference-between-curl-and-wget StackExchange article].
 
  
== Between DeepSense Filesystems ==
 
  
You may want to transfer data from your home directory to your data or scratch directories.  To do this, you should '''not''' use the <code>mv</code> command.  Please instead use the <code>cp</code> command (you can delete them from the original filesystem after).  When files are copied to a new filesystem, new files are created with the proper group name.  Using the <code>mv</code> command will keep the original group name, and can affect the quota reporting.
+
If you have URLs for multiple datasets, you can also use python code (or others) to download the files you need.  You can write a script that will look like this:
 +
 
 +
<nowiki>import urllib
 +
urls=[ "url1", "url2", ...]
 +
 
 +
...
 +
 
 +
for url in urls:
 +
  urllib.request.urlretrieve( url, filename=destination)</nowiki>
 +
 
 +
Of course, you'll have to properly specify the filename <code>destination</code>.

Latest revision as of 16:04, 22 June 2021

There are different methods for transferring data to and from the DeepSense platform. Which method you use will depend on where you are transferring the data from, as well as the size of the data.

To and From Your Personal Computer

Small Transfers

For small transfers (<5 GB), you can use the two login nodes. Because they are the primary point of access for the platform, they may be in heavy use, and we do not want to overload them with data transfers. Please use them only for small amounts of data.

The most common method for transferring data securely between machines is scp. It is straightforward to use; however, the destination files will have the wrong permissions set. scp removes group permissions, so while you will be able to access the data, no one else in your group will. This is fine if you are the only one working on the project.

Example: scp -r /path/to/files/ username@login2.deepsense.ca:/data/projectdir/

One can also use rsync (see the man page). This has more options than scp, and can be used to sync files between two machines.

Example: rsync -azvhP /path/to/files/ username@login2.deepsense.ca:/data/projectdir/

The rsync options above are:

  • a - archive mode, equal to rlptgoD (recursive, preserve links, times, permissions, group, owner, etc)
  • z - use compression when copying
  • v - verbose: list files copied
  • h - human readable: output numbers in human readable format
  • P - same as --partial --progress. Show progress while transferring, and keep partial files.

Note: We recommend always using the option -p (using -a also invokes -p). This ensures that everyone in your group should have the same permissions to the file as you do.
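If you launch transfers from a script, the rsync example above can be assembled programmatically. The following is only a sketch: the helper name build_rsync_command is hypothetical, and the --dry-run flag (a standard rsync option that lists what would be copied without copying it) is an addition not mentioned on this page.

```python
import shlex
import subprocess


def build_rsync_command(source, destination, dry_run=False):
    """Return the rsync argument list used in the example above.

    Hypothetical helper: -azvhP are the options described on this page;
    --dry-run is an extra rsync flag that previews the transfer.
    """
    cmd = ["rsync", "-azvhP"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [source, destination]
    return cmd


if __name__ == "__main__":
    # The paths and username below are the placeholders from the example above.
    cmd = build_rsync_command(
        "/path/to/files/",
        "username@login2.deepsense.ca:/data/projectdir/",
        dry_run=True,
    )
    print(shlex.join(cmd))
    # Uncomment to actually run the command:
    # subprocess.run(cmd, check=True)
```

Running it with dry_run=True first lets you check the file list before any data moves over the login node.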

Medium Size

For medium-sized transfers (between 5 GB and 100 GB), you should use the protocol nodes (protocol1.deepsense.ca, protocol2.deepsense.ca), which are specifically meant for data transfers. However, they are only accessible via Samba.

Mac OSX

Figure: Connect via Samba on OSX

On a Mac, open Finder and press ⌘-K, or use the menu Go -> Connect to Server. In the dialog box (see image), type the address of either protocol node and log in. This will connect you to the /data filesystem.

If you want to use rsync to transfer data via the protocol nodes, you have to mount one. On a Mac, the easiest way is to connect to the protocol node as in the previous paragraph. This will mount it at /Volumes/data/. You can now use rsync to copy files to your project's subdirectory.

Example: rsync -rzvh /path/to/files/ /Volumes/data/projectdir/

Windows

On a Windows computer, you should connect to \\protocol1.deepsense.ca\data or \\protocol2.deepsense.ca\data. To do this the first time, open a File Explorer window. Right-click on This PC, and select "Add a network location". In the wizard, click Next and then select "Choose a custom network location" (this was the only option we saw). Highlight it, and click Next. On the following screen, enter one of the addresses above, and click Next. You may now enter a name for this location. Do so, and click Next again. On the last screen, you can review your selections, and then click Finish. The name you chose should now be available under "This PC" in File Explorer.

You may also have to change an SMB security level setting as follows (this was necessary in Windows 10):

Go to Control Panel > System and Security > Administrative Tools > Local Security Policy > Local Policies > Security Options. Click on "Network security: LAN Manager authentication level", choose "Send NTLMv2 responses only" in the field, then click Apply, then OK, and close all windows.

File Permissions

Unfortunately, Samba won't preserve the proper file permissions. We find it strips the executable bit from any file that has it switched on. You can restore it on an individual file with chmod ug+x filename. If you want to change many files at once and are unsure how to, please send us an email at support@deepsense.ca.
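For many files at once, one possible approach is a short Python script run on the platform after the transfer. This is a sketch, not an official DeepSense tool: the function name add_exec_bits is hypothetical, and selecting files by suffix (here ".sh") is an assumption, since only you know which of your files should be executable.

```python
import os
import stat


def add_exec_bits(root, suffixes=(".sh",)):
    """Add user and group execute bits (like chmod ug+x) under root.

    Hypothetical helper: only files whose names end with one of
    `suffixes` are changed; adjust the suffixes to match your project.
    Returns the list of paths that were modified.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(tuple(suffixes)):
                path = os.path.join(dirpath, name)
                # Keep the existing permission bits and add u+x and g+x.
                mode = stat.S_IMODE(os.stat(path).st_mode)
                os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP)
                changed.append(path)
    return changed
```

For example, add_exec_bits("/data/projectdir/", suffixes=(".sh", ".py")) would mark every shell and Python script under the project directory as executable for you and your group, leaving other files untouched.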

Large Transfers

For large transfers (>100 GB), we generally find it is best to put the data on an external drive. To make such arrangements, please email support@deepsense.ca. We can then plug the drive in directly in our server room and transfer the data for you.
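The size thresholds described in this section can be summarised in a small Python sketch that measures a local directory and suggests a method. The function names are hypothetical, and treating the thresholds as decimal gigabytes (10^9 bytes) is an assumption; the page does not specify the exact unit.

```python
import os

GB = 10**9  # assumption: decimal gigabytes


def directory_size(path):
    """Total size in bytes of the regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def suggest_transfer_method(size_bytes):
    """Map a transfer size to the method recommended on this page."""
    if size_bytes < 5 * GB:
        return "login nodes (scp or rsync)"
    if size_bytes <= 100 * GB:
        return "protocol nodes (Samba)"
    return "external drive (email support@deepsense.ca)"


if __name__ == "__main__":
    # "/path/to/files/" is the placeholder path used in the examples above.
    size = directory_size("/path/to/files/")
    print(f"{size / GB:.1f} GB: {suggest_transfer_method(size)}")
```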

From the World Wide Web

The standard tool for downloading data from websites is wget. Also available is curl. The two are compared in this StackExchange article.


If you have URLs for multiple datasets, you can also use Python (or another language) to download the files you need. You can write a script that will look like this:

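import urllib.request  # "import urllib" alone does not load the request submodule

urls = ["url1", "url2", ...]  # replace with your dataset URLs

...

for url in urls:
  # destination must be set to the local filename for this url
  urllib.request.urlretrieve(url, filename=destination)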

Of course, you'll have to properly specify the filename destination.
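One way to specify destination is to derive it from the last component of each URL's path. The sketch below assumes every URL ends in a usable filename; the helper names destination_for and download_all, and the example URL in the comments, are hypothetical.

```python
import os
import posixpath
import urllib.parse
import urllib.request


def destination_for(url, dest_dir):
    """Build a local path from the last component of the URL path."""
    url_path = urllib.parse.urlparse(url).path
    name = posixpath.basename(urllib.parse.unquote(url_path))
    if not name:
        raise ValueError(f"cannot derive a filename from {url!r}")
    return os.path.join(dest_dir, name)


def download_all(urls, dest_dir):
    """Download each URL into dest_dir, creating the directory if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        destination = destination_for(url, dest_dir)
        urllib.request.urlretrieve(url, filename=destination)
        saved.append(destination)
    return saved
```

For instance, destination_for("https://example.org/data/set1.csv", "downloads") returns the path of set1.csv inside the downloads directory. URLs that share a final component would overwrite one another, so check your list for duplicates first.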