Discussion:
Multithreading
Nikolaus Rath
2010-06-20 15:39:45 UTC
Hello,

I would like to use an SFTPClient instance concurrently with several
threads, but I couldn't find any information about thread safety in the
API documentation.

- Can I just share the SFTPClient instance between several threads?
- Or can I share the SSHClient object, but each thread needs its own
SFTPClient?
- Or do I need a separate SSHClient for each thread?
- Or is there another layer in between that I can share between threads?
- Or do I have to avoid multithreading at all when using paramiko?


Thanks!

-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
Nikolaus Rath
2010-06-28 02:29:33 UTC
Hi,

Really no one around who knows anything about this?

-Nikolaus
Post by Nikolaus Rath
Hello,
I would like to use an SFTPClient instance concurrently with several
threads, but I couldn't find any information about thread safety in the
API documentation.
- Can I just share the SFTPClient instance between several threads?
- Or can I share the SSHClient object, but each thread needs its own
SFTPClient?
- Or do I need a separate SSHClient for each thread?
- Or is there another layer in between that I can share between threads?
- Or do I have to avoid multithreading at all when using paramiko?
Thanks!
-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«
PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
Nikolaus Rath
2010-06-30 13:33:14 UTC
Post by Nikolaus Rath
Hello,
I would like to use an SFTPClient instance concurrently with several
threads, but I couldn't find any information about thread safety in the
API documentation.
- Can I just share the SFTPClient instance between several threads?
Post by Marcin Krol
Why use SFTPClient? Its performance might not be very good, plus
compatibility issues with some SSH servers might crop up. I know I had
compatibility issues even with some pretty standard SSH servers on some
platforms (esp. Solaris).
I would avoid SFTPClient if I were you.
I wrote my own multithreaded SCP class (handling both upload and
download) which I can post if you're interested. It's been in use for a
while by several users and I think it's pretty well debugged by now.
All I need is a Python API for uploading, downloading and renaming files
over SSH. I chose SFTPClient since it seemed to be the simplest
solution, and I don't remember seeing any warnings about performance or
compatibility. Can you tell me what exactly the problem with SFTPClient
is? Are there any better options within paramiko? In any case, I am
certainly interested in taking a look at your solution.
Post by Marcin Krol
for instance, what if you need to close down the thread in the middle of
an operation but SFTPClient doesn't allow that?
When I'm in the middle of an operation, then I am in the middle of an
SFTPClient method. Obviously I can't shut down the thread while the
interpreter is not executing my code. This doesn't seem to be an
SFTPClient or even multithreading specific problem to me.
Post by Marcin Krol
What if you're shutting down an interpreter and SFTPClient throws an
exception which is visible to the end user?
The interpreter should keep running while at least one thread is alive.
It seems to me that if SFTPClient throws an exception, obviously
something went wrong and it is a good thing to know about it.
Post by Marcin Krol
What if there's no SCP/sftp on the other end (and this
does happen from time to time)?
If the user tries to establish an SFTP connection to a server that does
not support SFTP, then things will obviously break. But that's not a bug
in the program.
Post by Marcin Krol
I have done quite a lot of work on getting my class to work reasonably
under such circumstances: for instance, the threads' file
sending/downloading methods watch the value of a thread.abort flag and,
if it's set to True by an external class, they shut down gracefully.
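
A minimal sketch of that abort-flag pattern (hypothetical names, not the
actual class):

import threading
import time

class Sender(threading.Thread):
    """Worker that checks an abort flag between chunks (sketch only)."""

    def __init__(self):
        threading.Thread.__init__(self)
        self.abort = False            # set to True by the controlling class

    def run(self):
        for chunk in range(1000):     # stands in for per-chunk file I/O
            if self.abort:            # cooperative shutdown point
                return                # clean up and exit gracefully
            time.sleep(0.01)          # stands in for sending one chunk

t = Sender()
t.start()
t.abort = True                        # ask the worker to stop...
t.join()                              # ...and wait until it has done so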
I think you are using multithreading in a different way. Let me guess:
you have a main thread that has to stay responsive and therefore
delegates time-consuming operations to individual worker threads. These
worker threads must shut down when the main thread asks them to do so.
In this situation you have to deal with the problems you describe, but
they are not specific to SFTPClient and they do not always arise when
using multiple threads.

For example, my application is much simpler. I have several threads
which work independently of each other. There is no main controlling
thread. The application terminates when all the threads have finished
their work. (I am essentially programming a server with individual
threads handling client requests).
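
In rough outline it is something like this (a toy sketch; handle() and
the request list are made up):

import threading

def handle(request):
    # stands in for serving one client request; no controlling thread,
    # no shared state between workers
    print("handled", request)

requests = ["req-%d" % i for i in range(5)]     # hypothetical input
threads = [threading.Thread(target=handle, args=(r,)) for r in requests]
for t in threads:
    t.start()
for t in threads:
    t.join()    # the program ends once every worker has finished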
Post by Marcin Krol
remembering to sleep just in case after releasing locks to prevent
starvation
I never heard of that. Could you explain in more detail what you mean?



Best,

-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
james bardin
2010-06-30 18:41:49 UTC
Post by Marcin Krol
Post by Nikolaus Rath
Obviously, all the normal caveats about multithreading apply: remembering
to sleep just in case after releasing locks to prevent
starvation
I never heard of that. Could you explain in more detail what you mean?
http://linuxgazette.net/107/pai.html
Caveat: I don't know if this has been improved in Python thread handling
since that article has been written, but I add a bit of sleeping after lock
release anyway just to be safe.
That's a poor example of python coding. It's highlighting a GIL issue,
but if you have that sort of contention, you shouldn't be using
multiple threads. The author may have encountered a problem, but he
didn't fully understand it or his own solution, which is why adding a
"magic" sleep seems to work. Note that you will get slightly different
results on a multicore system.

The main loop is a busy loop! It's hogging the GIL itself, and pegging
a cpu core at 100%. Because the threads only do work while holding the
lock, there is no time for the GIL to switch threads. The sleep() simply
allows a few cycles for the GIL to be released.
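
A toy reconstruction of the failure mode (not the article's code): the
polling loop below pegs a core, and a short sleep inside it yields the
GIL so the worker gets scheduled promptly.

import threading
import time

lock = threading.Lock()
done = False

def worker():
    global done
    with lock:               # the thread only does work while holding the lock
        time.sleep(0.1)      # stands in for the real work
        done = True

threading.Thread(target=worker).start()
while not done:              # with "pass" here this is a pure busy loop:
    time.sleep(0.01)         # it pegs a core and competes for the GIL;
                             # the sleep yields so the worker can run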


-jim
Marcin Krol
2010-07-01 13:22:28 UTC
Post by james bardin
The main loop is a busy loop! It's hogging the GIL itself, and pegging
a cpu core at %100.
Because the threads only do work with a lock, there is no time for the
GIL to switch threads., The sleep() simply allows a few cycles for the
GIL to be released.
Oops! I didn't analyze the code in question (just skimmed the article)
and accepted the conclusions in good faith. So much for the quality of
internet texts. I assumed, apparently incorrectly, that there was some
critical peer review for stuff published in Linux Gazette. :-(

Anyway, how would you design the thing I will describe shortly:
a multithreaded network server for copying files onto remote machines,
controlled by a web application? (For reasons of potential high system
load and for security reasons, the web application server cannot do it
itself.)

I used SocketServer.TCPServer as the basis, with 2 global queues and 1
global lock. On an incoming request, the handler class for TCPServer
acquires the lock, adds an item to the queue and releases the lock.

The "main" thread that handles items in the queue periodically acquires
lock, processes items in the queue (spawning sending SSH threads),
updates item statuses (if e.g. sending SSH thread finished working) etc.
and releases lock and sleeps for relatively long time (like 0.5 second).
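
Stripped of the SocketServer plumbing, the skeleton is roughly this
(hypothetical names, heavily simplified):

import threading
import time
from collections import deque

pending = deque()                  # requests added by the handler class
status = {}                        # request -> 'started' / 'done'
state_lock = threading.Lock()

def send_over_ssh(item):           # stand-in for the real SSH sending thread
    time.sleep(1)                  # stands in for the transfer itself
    with state_lock:
        status[item] = 'done'

def on_request(item):              # called from the TCPServer handler
    with state_lock:
        pending.append(item)

def main_loop():
    while True:
        with state_lock:           # held only while touching shared state
            while pending:
                item = pending.popleft()
                status[item] = 'started'
                threading.Thread(target=send_over_ssh,
                                 args=(item,)).start()
        time.sleep(0.5)            # the long sleep keeps polling cheap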

My implementation of this design works really well so far (no contention
issues, capable of handling many simultaneous requests and transfers),
but if this is a bad design, I would like to know.
--
Regards,
mk

--
Premature optimization is the root of all fun.
Eric S. Johansson
2010-07-01 17:24:32 UTC
Post by Marcin Krol
The "main" thread that handles items in the queue periodically
acquires lock, processes items in the queue (spawning sending SSH
threads), updates item statuses (if e.g. sending SSH thread finished
working) etc. and releases lock and sleeps for relatively long time
(like 0.5 second).
My implementation of this design works really well so far (no
contention issues, capable of handling many simultaneous requests and
transfers), but if this is a bad design, I would like to know.
Personally, I find threading to be more trouble than it is worth. I
would use the python multiprocessing module and distribute the load
across multiple processes rather than threads. While it's never as
simple as a single-thread implementation, it's far less complex and
easier to debug multiple-process applications (in my opinion) than
threaded applications.
Marcin Krol
2010-07-01 19:12:48 UTC
Post by Eric S. Johansson
Personally, I find threading to be more trouble than it is worth. I
would use the python multiprocessing module and distribute the load
across multiple processes rather than threads. While it's never as
simple as a single-thread implementation, it's far less complex and
easier to debug multiple-process applications (in my opinion) than
threaded applications.
Thanks, Eric. I also considered using the multiprocessing module, but
my main problem is that it seems to be less frequently used and there
are far fewer good practices and books around on doing stuff with it (I
plan to learn it when time allows). Nevertheless, it might be an option
worth using, esp. given that it might be scaled to multiple CPUs/cores.
--
Regards,
mk

--
Premature optimization is the root of all fun.
james bardin
2010-07-01 19:23:41 UTC
Post by Eric S. Johansson
Personally, I find threading to be more trouble than it is worth. I
would use the python multiprocessing module and distribute the load
across multiple processes rather than threads. While it's never as
simple as a single-thread implementation, it's far less complex and
easier to debug multiple-process applications (in my opinion) than
threaded applications.
Post by Marcin Krol
Thanks, Eric. I also considered using the multiprocessing module, but
my main problem is that it seems to be less frequently used and there
are far fewer good practices and books around on doing stuff with it (I
plan to learn it when time allows). Nevertheless, it might be an option
worth using, esp. given that it might be scaled to multiple CPUs/cores.
Due to the current nature of python, I like to think of it this way
(when I don't feel like dealing with asynchronous programming):

Need to get around something blocking - threading.py
Need to get work done on multiple cores - multiprocessing.py

It usually makes the choice pretty cut-and-dry for me.
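
A small illustration of that rule of thumb; multiprocessing.dummy
provides a thread-backed pool with the same API as the process pool:

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # threads, same API

def blocking_io(n):
    time.sleep(0.1)               # stands in for waiting on the network
    return n

def cpu_work(n):
    return sum(i * i for i in range(n))   # stands in for real computation

if __name__ == '__main__':
    print(ThreadPool(4).map(blocking_io, range(8)))  # threads overlap the waits
    print(Pool(4).map(cpu_work, [100000] * 8))       # processes use the cores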
Eric S. Johansson
2010-07-01 19:48:12 UTC
Post by james bardin
Post by Marcin Krol
Thanks, Eric. I also considered using the multiprocessing module, but
my main problem is that it seems to be less frequently used and there
are far fewer good practices and books around on doing stuff with it (I
plan to learn it when time allows). Nevertheless, it might be an option
worth using, esp. given that it might be scaled to multiple CPUs/cores.
Due to the current nature of python, I like to think of it this way
Need to get around something blocking - threading.py
Need to get work done on multiple cores - multiprocessing.py
I will admit it's been a while since I've used it. My focus has been on
working on techniques for disabled programmers. It's hard enough to make
that work in a single process, let alone across multiple processes on
multiple machines. :-)

I think the suggestions given are good ones. I would add that, like
threading, multiprocessing is useful when you have a queue of work that
comes in at a different rate than the consumption of the work elements.
In a traditional environment, you just put a simple queue between the
producer and consumer (e.g. a print queue) and everything is happy.
Where we get into trouble is when we have multiple producers and
multiple consumers and either side has critical regions. It looks like
the multiprocessing module handles critical regions on both queue data
injection and consumption. It's nice to have one less thing to worry
about. It also provides locking across processes so that consumers can
coordinate. Will the locks also work on the producer side?
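
For example, a minimal producer/consumer sketch; multiprocessing.Queue
does its own locking on both put() and get():

from multiprocessing import Process, Queue

def consumer(q):
    while True:
        item = q.get()            # safe with several consumers at once
        if item is None:          # sentinel: no more work
            return
        print('consumed', item)

if __name__ == '__main__':
    q = Queue()                   # locking is built into put() and get()
    workers = [Process(target=consumer, args=(q,)) for _ in range(2)]
    for w in workers:
        w.start()
    for i in range(10):           # the producer side is just as safe
        q.put(i)
    for _ in workers:
        q.put(None)               # one sentinel per consumer
    for w in workers:
        w.join()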

Python has a big blind spot on locking. It could become more
sophisticated without being much more complex. The current model is okay
for simple things and will work for single-threaded critical regions;
multithreaded (or should I say multiprocess) critical regions are much
more difficult to handle. I'll elaborate only if people insist.

And like I said earlier, I choose multiprocessing because it's much
easier to debug. The separate process code can be debugged standalone or
attached remotely with a debugger. winpdb rocks, albeit very slowly.
Andrew Bennetts
2010-07-02 03:28:39 UTC
Marcin Krol wrote:
[...]
Post by Marcin Krol
Anyway, how would you design the thing I will describe shortly:
a multithreaded network server for copying files onto remote
machines, controlled by a web application? (For reasons of potential
high system load and for security reasons, the web application server
cannot do it itself.)
Why is “multithreaded” a requirement? Threads are usually a means, not
an end.

If you meant “can handle many concurrent connections” instead, I'd
suggest Twisted, it tends to excel at that sort of task (and without
threads, usually). Personally, even if threads are required I'd probably
lean towards using it anyway :)

-Andrew.
Marcin Krol
2010-07-02 10:25:54 UTC
Post by Andrew Bennetts
If you meant “can handle many concurrent connections” instead, I'd
suggest Twisted, it tends to excel at that sort of task (and without
threads, usually). Personally, even if threads are required I'd probably
lean towards using it anyway :)
Threads are not a hard requirement, and Twisted would certainly be
interesting to learn, but there are a few cons against it:

- first of all, event-driven programming is still a bit exotic and many
more people are familiar with threads; the code I'm writing will
probably be in use for a long time, and I won't be the only one working
on it. I can't realistically expect others to learn Twisted just to deal
with my stuff (even being allowed to use Python was a bit of a
challenge; my environment is almost all Java).

- in the long run, thread support in core Python is much more certain
to still be there in a few years' time, but the future of Twisted,
however good it is, is not so certain.

Joys of corporate pressures and conformity for you.
--
Regards,
mk

--
Premature optimization is the root of all fun.
james bardin
2010-06-30 15:12:19 UTC
Post by Nikolaus Rath
Post by Nikolaus Rath
Hello,
I would like to use an SFTPClient instance concurrently with several
threads, but I couldn't find any information about thread safety in the
API documentation.
 - Can I just share the SFTPClient instance between several threads?
Yes. Though you may have a use case, there's no performance benefit,
as the client can only handle one operation at a time.
Post by Nikolaus Rath
Post by Marcin Krol
Why use SFTPClient? Its performance might not be very good, plus
compatibility issues with some SSH servers might crop up. I know I had
compatibility issues even with some pretty standard SSH servers on some
platforms (esp. Solaris).
There are throughput performance issues with older ssh servers that
don't support prefetch and pipelining. On slower machines with a fast
network, crypto and transport overhead will be the limiting factor.
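
If I read the paramiko API right, get() already pipelines reads by
prefetching, and you can request it explicitly on an open handle; a
sketch (host and paths made up):

import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('host.example.com', username='user')
sftp = ssh.open_sftp()

# get() prefetches: read requests are pipelined instead of paying one
# round trip per 32 KB block
sftp.get('/remote/big.file', '/tmp/big.file')

# the same thing on an open file handle
f = sftp.open('/remote/big.file', 'rb')
f.prefetch()
data = f.read()
f.close()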
Post by Nikolaus Rath
Post by Marcin Krol
for instance, what if you need to close down the thread in the middle of
an operation but SFTPClient doesn't allow that?
When I'm in the middle of an operation, then I am in the middle of an
SFTPClient method. Obviously I can't shut down the thread while the
interpreter is not executing my code. This doesn't seem to be an
SFTPClient or even multithreading specific problem to me.
The same thing goes for SSHClient and SFTPClient. In a way, they are
"convenience" classes that bundle a bunch of the library together. If
you do things out of the ordinary, you may need to bypass their
abstractions and work with the lower-level pieces yourself. The
*Client code is fairly straightforward to use as an example if needed.
Post by Nikolaus Rath
Post by Marcin Krol
remembering to sleep just in case after releasing locks to prevent
starvation
If you need to sleep after releasing a lock, there's something wrong
(and arbitrary delays to solve contention are poor programming
practice). As far as the python language is concerned, lock operations
are atomic.


-jim
Nikolaus Rath
2010-06-30 21:49:09 UTC
Post by james bardin
Post by Nikolaus Rath
Hello,
I would like to use an SFTPClient instance concurrently with several
threads, but I couldn't find any information about thread safety in the
API documentation.
 - Can I just share the SFTPClient instance between several threads?
Yes. Though you may have a use case, there's no performance benefit,
as the client can only handle one operation at a time.
I am mostly concerned about reducing the network latency. Suppose I want
to create 100 1-bit files, then I hope that it is going to be faster to
send 100 requests at once from 100 threads rather than having one thread
that works through the files sequentially. That should still work even
with a single threaded client, right?
Post by james bardin
Post by Marcin Krol
Why use SFTPClient? Its performance might not be very good, plus
compatibility issues with some SSH servers might crop up. I know I had
compatibility issues even with some pretty standard SSH servers on some
platforms (esp. Solaris).
There are throughput performance issues with older ssh servers that
don't support prefetch and pipelining.
Ah, so the problem is with the server and not with the SFTPClient class?
My servers are given and only speak SFTP, so I have to live with them in
any case.


Best,

-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
Nikolaus Rath
2010-07-01 13:31:42 UTC
Post by james bardin
Post by Nikolaus Rath
Hello,
I would like to use an SFTPClient instance concurrently with several
threads, but I couldn't find any information about thread safety in the
API documentation.
 - Can I just share the SFTPClient instance between several threads?
Yes. Though you may have a use case, there's no performance benefit,
as the client can only handle one operation at a time.
Hmm. As soon as I start to share an SFTPClient instance, I get the
following error:


File "/home/nikratio/projekte/s3ql/src/s3ql/backends/sftp.py", line 52, in __contains__
self.sftp.stat(entry)
File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 337, in stat
t, msg = self._request(CMD_STAT, path)
File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 628, in _request
return self._read_response(num)
File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 658, in _read_response
t, data = self._read_packet()
File "/usr/lib/pymodules/python2.6/paramiko/sftp.py", line 179, in _read_packet
raise SFTPError('Garbage packet received')


Any idea?


Bests,

-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
james bardin
2010-07-01 13:36:08 UTC
Post by Nikolaus Rath
Post by james bardin
Yes. Though you may have a use case, there's no performance benefit,
as the client can only handle one operation at a time.
Hmm. As soon as I start to share an SFTPClient instance, I get the
following error:
  File "/home/nikratio/projekte/s3ql/src/s3ql/backends/sftp.py", line 52, in __contains__
    self.sftp.stat(entry)
  File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 337, in stat
    t, msg = self._request(CMD_STAT, path)
  File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 628, in _request
    return self._read_response(num)
  File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 658, in _read_response
    t, data = self._read_packet()
  File "/usr/lib/pymodules/python2.6/paramiko/sftp.py", line 179, in _read_packet
    raise SFTPError('Garbage packet received')
Any idea?
You can share it, like you can share anything you want between threads,
but you need proper locking, as the client only has one channel for
communication.
Post by Nikolaus Rath
I am mostly concerned about reducing the network latency. Suppose I want
to create 100 1-bit files, then I hope that it is going to be faster to
send 100 requests at once from 100 threads rather than having one thread
that works through the files sequentially. That should still work even
with a single threaded client, right?
If the client is single-threaded and communicating over a single
channel, you *can't* send multiple requests at once; it just doesn't
work that way. Each operation has to complete before the next one can
start, so doing this from multiple threads is only adding unnecessary
complication. If you want to do file operations in parallel, you will
need multiple sftp sessions.
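
To make that concrete, here is a sketch of one session per thread over a
shared transport (host and paths made up; note that some servers cap the
number of open channels):

import threading
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('host.example.com', username='user')
transport = ssh.get_transport()

def worker(paths):
    # one SFTP session (one channel) per thread, so operations in
    # different sessions can be in flight at the same time
    sftp = paramiko.SFTPClient.from_transport(transport)
    for p in paths:
        sftp.stat(p)
    sftp.close()

chunks = [['/etc/hosts'], ['/etc/passwd']]       # hypothetical work split
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()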
james bardin
2010-07-01 16:43:08 UTC
Post by Nikolaus Rath
Post by james bardin
You can share it, like you can share anything you want between threads,
but you need proper locking, as the client only has one channel for
communication.
Since I can share anything I want if I synchronize access to it myself,
my question was meant as "can I share it without explicit locking".
Generally in python, the only objects you can share "without explicit
locking" are single instances of core data types - basically lists and
dicts.
Nikolaus Rath
2010-07-01 17:00:06 UTC
Post by james bardin
Post by james bardin
You can share it, like you can share anything you want between threads,
but you need proper locking, as the client only has one channel for
communication.
Since I can share anything I want if I synchronize access to it myself,
my question was meant as "can I share it without explicit locking".
Generally in python, the only objects you can share "without explicit
locking" are single instances of core data types - basically lists and
dicts.
As well as third-party modules that have been designed to be threadsafe.


Best,

-Niko
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
james bardin
2010-07-01 17:28:25 UTC
Post by Nikolaus Rath
Post by james bardin
Generally in python, the only objects you can share "without explicit
locking" are single instances of core data types - basically lists and
dicts.
As well as third-party modules that have been designed to be threadsafe.
*Modules* can be thread-safe, but what's an example of a module that
advertises its classes as thread-safe? Any class that does that would
need to self-lock on all non-atomic operations. It's normally just
easier to expect locking to be handled outside of the class.
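
For what it's worth, "self-locking" would look something like this toy
class:

import threading

class Counter(object):
    """Toy self-locking class: every non-atomic operation takes the
    instance's own lock, so callers need no external locking."""

    def __init__(self):
        self._lock = threading.RLock()
        self._value = 0

    def increment(self):
        with self._lock:          # read-modify-write is not atomic
            self._value += 1

    def value(self):
        with self._lock:
            return self._value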
Nikolaus Rath
2010-07-02 02:49:21 UTC
Post by james bardin
Post by Nikolaus Rath
Post by james bardin
Generally in python, the only objects you can share "without explicit
locking" are single instances of core data types - basically lists and
dicts.
As well as third-party modules that have been designed to be threadsafe.
*Modules* can be thread-safe, but what's an example of a module that
advertises its classes as thread-safe? Any class that does that would
need to self-lock on all non-atomic operations. It's normally just
easier to expect locking to be handled outside of the class.
In most cases you are probably right. But I think there are also good
cases where locking is better done in the class itself. The class does
not need to self-lock all non-atomic operations, only those that
actually operate on instance (or global) variables. And even when
locking is required, the method is able to lock just the one variable it
is working with rather than the entire method.

Example: I am working with a class that uploads data to Amazon S3
(basically an online storage service with a simple HTTP API). The class
provides methods like put_from_fh(key, fh) and get_to_fh(key, fh) which
are designed to be thread-safe. When called, they first compress and
encrypt the data, then briefly obtain a lock to get an HTTP connection
from a pool, release the lock and upload the data.

The methods have to be usable from several threads at once because they
provide a file system backend (and it would be rather annoying if you
had to wait for your 100 MB write into file1 to complete before you
could read 10 bytes from file2).
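
The pool part looks roughly like this (a simplified sketch, not the
actual code):

import threading

class ConnectionPool(object):
    """Hands out connections; the lock is held only for the list ops."""

    def __init__(self, factory):
        self._lock = threading.Lock()
        self._idle = []
        self._factory = factory      # stands in for "open HTTP connection"

    def get(self):
        with self._lock:             # brief: just pop from the free list
            if self._idle:
                return self._idle.pop()
        return self._factory()       # connect outside the lock

    def put(self, conn):
        with self._lock:
            self._idle.append(conn)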

There are of course other possible implementations, like giving every
thread its own S3 storage instance or managing the storage instances in
a pool (instead of the HTTP connections), but I consider those to be
less elegant. The S3 storage class has semantics like a dict (only the
amount of stored data is larger and stored elsewhere), so I would
consider it quite awkward if I had to bother with locking or pooling
when using it.


Best,

-Nikolaus


Btw, I am about to write a class that provides the same functionality
over SFTP, thus my initial question.
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
Nikolaus Rath
2010-07-01 16:06:21 UTC
Post by james bardin
Post by Nikolaus Rath
Post by james bardin
Yes. Though you may have a use case, there's no performance benefit,
as the client can only handle one operation at a time.
Hmm. As soon as I start to share an SFTPClient instance, I get the
following error:
  File "/home/nikratio/projekte/s3ql/src/s3ql/backends/sftp.py", line 52, in __contains__
    self.sftp.stat(entry)
  File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 337, in stat
    t, msg = self._request(CMD_STAT, path)
  File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 628, in _request
    return self._read_response(num)
  File "/usr/lib/pymodules/python2.6/paramiko/sftp_client.py", line 658, in _read_response
    t, data = self._read_packet()
  File "/usr/lib/pymodules/python2.6/paramiko/sftp.py", line 179, in _read_packet
    raise SFTPError('Garbage packet received')
Any idea?
You can share it, like you can share anything you want between threads,
but you need proper locking, as the client only has one channel for
communication.
Since I can share anything I want if I synchronize access to it myself,
my question was meant as "can I share it without explicit locking".
Post by james bardin
Post by Nikolaus Rath
I am mostly concerned about reducing the network latency. Suppose I want
to create 100 1-bit files, then I hope that it is going to be faster to
send 100 requests at once from 100 threads rather than having one thread
that works through the files sequentially. That should still work even
with a single threaded client, right?
If the client is single-threaded and communicating over a single
channel, you *can't* send multiple requests at once; it just doesn't
work that way.
I didn't know the internals of the SFTP protocol, thanks for clarifying!


Best,

-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C